Detection problem with Mobile Net on video from host

pierreia

Hello,

I'm trying to use my OAK-1 camera to run inference on a video from the host. However, there seems to be a problem with detection: nothing is detected correctly in the video, even though we tried different hosts and different models. I'm using this sample code from the docs. It looks as if MobileNet is not synced with the frames, or as if something goes wrong while the image is transferred to the model. We also tried setting the model's input queue to blocking, but nothing changed. Here is an example of what we get on this YouTube video:

Does anyone have an idea what the problem could be? We are really hoping to get more than 30 fps with satisfactory detection results on video from the host...

Thank you,
Pierre

erik

Hi @pierreia,
This is a fairly common issue. The culprit is that the NN expects an input with a different aspect ratio (1:1?) than the frame has (16:9?). We have docs on this here; hopefully they will be useful!
Thanks, Erik
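
For readers who want to see the aspect-ratio mismatch in numbers, here is a minimal sketch of letterboxing, which is one common way to fit a 16:9 frame into a square NN input without stretching it. It is not taken from the linked docs. The function names `letterbox_params` and `bbox_to_source` are hypothetical, and the sketch only covers the geometry (scale, padding, and mapping a normalized box back to source pixels), not the image resize itself.

```python
def letterbox_params(src_w, src_h, dst_w, dst_h):
    """Return (scale, pad_x, pad_y) for fitting a source frame into the
    NN input while keeping its aspect ratio (hypothetical helper)."""
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    # Remaining space is split evenly as padding on both sides
    pad_x = (dst_w - new_w) // 2
    pad_y = (dst_h - new_h) // 2
    return scale, pad_x, pad_y


def bbox_to_source(bbox, src_w, src_h, dst_w, dst_h):
    """Map a normalized (xmin, ymin, xmax, ymax) box on the letterboxed
    NN input back to pixel coordinates on the source frame."""
    scale, pad_x, pad_y = letterbox_params(src_w, src_h, dst_w, dst_h)

    def to_src(value, dst_size, pad, src_size):
        # Undo normalization, remove the padding, undo the scaling, clamp
        px = (value * dst_size - pad) / scale
        return int(min(max(px, 0), src_size))

    xmin, ymin, xmax, ymax = bbox
    return (
        to_src(xmin, dst_w, pad_x, src_w),
        to_src(ymin, dst_h, pad_y, src_h),
        to_src(xmax, dst_w, pad_x, src_w),
        to_src(ymax, dst_h, pad_y, src_h),
    )


if __name__ == "__main__":
    # A 1920x1080 video frame going into a 300x300 MobileNet-SSD input
    print(letterbox_params(1920, 1080, 300, 300))
    print(bbox_to_source((0.25, 0.4, 0.75, 0.6), 1920, 1080, 300, 300))
```

Plain stretching to 300x300 (what `cv2.resize(frame, (300, 300))` does) keeps the full field of view but distorts objects. Letterboxing keeps their shape at the cost of padded rows.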

pierreia

Thank you for your answer. I just realized that I put the wrong link in the main topic, so it may not have been clear that I'm trying to use MobileNet on a video file from my computer. I'm using this code:

#!/usr/bin/env python3

from pathlib import Path
import sys
import cv2
import depthai as dai
import numpy as np
from time import monotonic

# Get argument first
parentDir = Path(__file__).parent
nnPath = str((parentDir / Path('../models/mobilenet-ssd_openvino_2021.4_8shave.blob')).resolve().absolute())
videoPath = str((parentDir / Path('../models/construction_vest.mp4')).resolve().absolute())
if len(sys.argv) > 2:
    nnPath = sys.argv[1]
    videoPath = sys.argv[2]

if not Path(nnPath).exists() or not Path(videoPath).exists():
    import sys
    raise FileNotFoundError(f'Required file/s not found, please run "{sys.executable} install_requirements.py"')

# MobilenetSSD label texts
labelMap = ["background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow",
            "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]

# Create pipeline
pipeline = dai.Pipeline()

# Define sources and outputs
nn = pipeline.create(dai.node.MobileNetDetectionNetwork)

xinFrame = pipeline.create(dai.node.XLinkIn)
nnOut = pipeline.create(dai.node.XLinkOut)

xinFrame.setStreamName("inFrame")
nnOut.setStreamName("nn")

# Properties
nn.setConfidenceThreshold(0.5)
nn.setBlobPath(nnPath)
nn.setNumInferenceThreads(2)
nn.input.setBlocking(False)

# Linking
xinFrame.out.link(nn.input)
nn.out.link(nnOut.input)

# Connect to device and start pipeline
with dai.Device(pipeline) as device:

    # Input queue will be used to send video frames to the device.
    qIn = device.getInputQueue(name="inFrame")
    # Output queue will be used to get nn data from the video frames.
    qDet = device.getOutputQueue(name="nn", maxSize=4, blocking=False)

    frame = None
    detections = []

    # nn data, being the bounding box locations, are in <0..1> range - they need to be normalized with frame width/height
    def frameNorm(frame, bbox):
        normVals = np.full(len(bbox), frame.shape[0])
        normVals[::2] = frame.shape[1]
        return (np.clip(np.array(bbox), 0, 1) * normVals).astype(int)

    def to_planar(arr: np.ndarray, shape: tuple) -> np.ndarray:
        return cv2.resize(arr, shape).transpose(2, 0, 1).flatten()

    def displayFrame(name, frame):
        for detection in detections:
            bbox = frameNorm(frame, (detection.xmin, detection.ymin, detection.xmax, detection.ymax))
            cv2.putText(frame, labelMap[detection.label], (bbox[0] + 10, bbox[1] + 20), cv2.FONT_HERSHEY_TRIPLEX, 0.5, 255)
            cv2.putText(frame, f"{int(detection.confidence * 100)}%", (bbox[0] + 10, bbox[1] + 40), cv2.FONT_HERSHEY_TRIPLEX, 0.5, 255)
            cv2.rectangle(frame, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (255, 0, 0), 2)
        # Show the frame
        cv2.imshow(name, frame)

    cap = cv2.VideoCapture(videoPath)
    while cap.isOpened():
        read_correctly, frame = cap.read()
        if not read_correctly:
            break

        img = dai.ImgFrame()
        img.setData(to_planar(frame, (300, 300)))
        img.setTimestamp(monotonic())
        img.setWidth(300)
        img.setHeight(300)
        qIn.send(img)

        inDet = qDet.tryGet()

        if inDet is not None:
            detections = inDet.detections

        if frame is not None:
            displayFrame("rgb", frame)

        if cv2.waitKey(1) == ord('q'):
            break

So I thought that these lines would rescale the frame correctly:

        img = dai.ImgFrame()
        img.setData(to_planar(frame, (300, 300)))
        img.setTimestamp(monotonic())
        img.setWidth(300)
        img.setHeight(300)
        qIn.send(img)

Should I rescale the video before using this script?
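
For reference, the `to_planar` helper in the script above resizes the frame and then reorders the pixel data from OpenCV's interleaved layout (BGRBGR...) to a planar layout (all B, then all G, then all R), which is what `transpose(2, 0, 1).flatten()` does on an (H, W, 3) array. The sketch below shows the same reordering on a raw byte buffer using only the standard library. The function name `interleaved_to_planar` is hypothetical and is only meant to make the layout visible; it is not a replacement for the NumPy version.

```python
def interleaved_to_planar(data: bytes, channels: int = 3) -> bytes:
    """Reorder interleaved pixel bytes (BGRBGR...) into planar order
    (BB...GG...RR...). Hypothetical helper for illustration only."""
    if len(data) % channels:
        raise ValueError("buffer length is not a multiple of the channel count")
    # Every `channels`-th byte starting at offset c belongs to channel c
    return b"".join(data[c::channels] for c in range(channels))


# A 2x2 image: four pixels, each stored as (B, G, R)
pixels = bytes([10, 20, 30,  11, 21, 31,  12, 22, 32,  13, 23, 33])
planar = interleaved_to_planar(pixels)

# Size of the buffer a 300x300 three-channel input has to carry
expected_size = 300 * 300 * 3

if __name__ == "__main__":
    print(list(planar))
    print(expected_size)
```

If the buffer sent to the device is planar but the frame type says otherwise (or the reverse), the network sees scrambled colors, so it is worth checking that the declared frame type matches the layout.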

pierreia

I've added some lines to check the frame that was going into the NN, and to ensure the image is resized:

        img = dai.ImgFrame()
        resized_planar = to_planar(frame, (300, 300))
        img.setType(dai.RawImgFrame.Type.BGR888p)
        img.setSize(300, 300)
        img.setTimestamp(monotonic())
        img.setFrame(resized_planar)
        cv2.imshow("Img Frame Output", img.getCvFrame())
        qIn.send(img)

It seems that the frame is properly sized, and the model should be able to detect cars in the image. However, I still get no detections:

Any idea on what could be the problem?

erik

Hi pierreia
Interesting. Could you provide a full MRE please?
Thanks, Erik

pierreia

Hi erik, here is my MRE.

https://filetransfer.io/data-package/oCBNoI32#link

I've opened an issue on GitHub: https://github.com/luxonis/depthai/issues/787

Thank you,

Pierre

erik

Hi pierreia,
The code (normalization of bounding boxes) looks good to me. I think the problem is with the model and/or decoding (so anchors/masks). I would also check for incorrect color order and/or incorrect channel layout (both mentioned here).
Thanks, Erik

pierreia

erik

We tried with the latest MobileNet "from_zoo" and with a YOLO model, and we got the same results. So I'm not sure the problem comes from the model itself.
Regarding the channel layout: since we run the frame through an OpenCV manipulation before passing it to the model, the frame should be in BGR channel order, which is what we set in the ImgFrame.

Update: it seems to be working with YOLO tiny, although it is very slow.

erik

Hi pierreia, could you share the code with the MobileNet from blobconverter?
Thanks, Erik

pierreia

Hi erik, you can download the code and the MobileNet here: https://filetransfer.io/data-package/oCBNoI32#link

erik

Hi pierreia,
The only problem in the code above is that it's not synced. You could either sync frames and detections with host-side syncing, or just use the passthrough frame like I did below:

#!/usr/bin/env python3

from pathlib import Path
import sys
import cv2
import depthai as dai
import numpy as np
from time import monotonic

import blobconverter

# Get argument first
nnPath = 'mobilenet-ssd_openvino_2021.4_8shave.blob'
videoPath = 'traffic_5mn.mp4'
if len(sys.argv) > 2:
    nnPath = sys.argv[1]
    videoPath = sys.argv[2]

if not Path(nnPath).exists() or not Path(videoPath).exists():
    import sys
    raise FileNotFoundError(f'Required file/s not found, please run "{sys.executable} install_requirements.py"')

# MobilenetSSD label texts
labelMap = ["background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow",
            "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]

# Create pipeline
pipeline = dai.Pipeline()

# Define sources and outputs
nn = pipeline.create(dai.node.MobileNetDetectionNetwork)

xinFrame = pipeline.create(dai.node.XLinkIn)
xinFrame.setStreamName("inFrame")
xinFrame.out.link(nn.input)
# Properties
nn.setConfidenceThreshold(0.5)
nn.setBlobPath(nnPath)
nn.setNumInferenceThreads(2)
nn.input.setBlocking(True)

# Linking
nnOut = pipeline.create(dai.node.XLinkOut)
nnOut.setStreamName("nn")
nn.out.link(nnOut.input)

nnPass = pipeline.create(dai.node.XLinkOut)
nnPass.setStreamName("pass")
nn.passthrough.link(nnPass.input)


# Connect to device and start pipeline
with dai.Device(pipeline) as device:

    # Input queue will be used to send video frames to the device.
    qIn = device.getInputQueue(name="inFrame")
    # Output queue will be used to get nn data from the video frames.
    qDet = device.getOutputQueue(name="nn", maxSize=6, blocking=True)
    qPass = device.getOutputQueue("pass")

    frame = None
    detections = []

    # nn data, being the bounding box locations, are in <0..1> range - they need to be normalized with frame width/height
    def frameNorm(frame, bbox):
        normVals = np.full(len(bbox), frame.shape[0])
        normVals[::2] = frame.shape[1]
        return (np.clip(np.array(bbox), 0, 1) * normVals).astype(int)

    def to_planar(arr: np.ndarray, shape: tuple) -> np.ndarray:
        return cv2.resize(arr, shape).transpose(2, 0, 1).flatten()

    def displayFrame(name, frame):
        for detection in detections:
            bbox = frameNorm(frame, (detection.xmin, detection.ymin, detection.xmax, detection.ymax))
            cv2.putText(frame, labelMap[detection.label], (bbox[0] + 10, bbox[1] + 20), cv2.FONT_HERSHEY_TRIPLEX, 0.5, 255)
            cv2.putText(frame, f"{int(detection.confidence * 100)}%", (bbox[0] + 10, bbox[1] + 40), cv2.FONT_HERSHEY_TRIPLEX, 0.5, 255)
            cv2.rectangle(frame, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (255, 0, 0), 2)
        # Show the frame
        cv2.imshow(name, frame)

    cap = cv2.VideoCapture(videoPath)
    while cap.isOpened():
        read_correctly, frame = cap.read()
        if not read_correctly:
            break

        img = dai.ImgFrame()
        resized = to_planar(frame, (300, 300))
        img.setTimestamp(monotonic())
        img.setType(dai.RawImgFrame.Type.BGR888p)
        img.setSize(300, 300)
        img.setData(resized)
        qIn.send(img)

        inDet = qDet.tryGet()

        if inDet is not None:
            detections = inDet.detections
            frame = qPass.get().getCvFrame()
            displayFrame("passthrough", frame)

        if cv2.waitKey(1) == ord('q'):
            break
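
The other option mentioned above, host-side syncing, could look roughly like the sketch below. It is an illustration under assumptions, not the approach used in the script: each frame sent to the device gets a sequence number, the host keeps the original frame, and a detection result is drawn on the frame with the same number. The class name `FrameDetectionSync` is hypothetical. With depthai, the number would presumably be set on the outgoing `ImgFrame` with `setSequenceNum()` and read from the detection message with `getSequenceNum()`; please check this against your library version.

```python
from collections import OrderedDict


class FrameDetectionSync:
    """Pair host frames with detection results by sequence number
    (hypothetical helper, not part of depthai)."""

    def __init__(self, max_pending: int = 8):
        self._frames = OrderedDict()
        self._max_pending = max_pending

    def add_frame(self, seq: int, frame) -> None:
        """Remember a frame that was just sent to the device."""
        self._frames[seq] = frame
        # Drop the oldest frames if their detections never arrived
        while len(self._frames) > self._max_pending:
            self._frames.popitem(last=False)

    def match(self, seq: int):
        """Return the frame belonging to a detection result, or None
        if that frame is no longer (or was never) stored."""
        if seq not in self._frames:
            return None
        # Frames older than the matched one can no longer be matched
        while True:
            stored_seq, frame = self._frames.popitem(last=False)
            if stored_seq == seq:
                return frame


if __name__ == "__main__":
    sync = FrameDetectionSync(max_pending=3)
    for i in range(5):
        sync.add_frame(i, f"frame{i}")
    print(sync.match(3))
```

Compared with the passthrough approach, this keeps the full-resolution host frame for display instead of the 300x300 frame the network saw, at the cost of some bookkeeping on the host.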

pierreia

Thank you very much erik, now it works!

krishnashravan

erik
I'm trying to run live inference and video inference with a YOLOv8 nano pipeline:

krishnashravan

def init_pipeline():
    pipeline = depthai.Pipeline()
    cam_rgb = pipeline.createColorCamera()
    detection_nn = pipeline.createYoloDetectionNetwork()
    cam_rgb.setResolution(
        depthai.ColorCameraProperties.SensorResolution.THE_4_K)
    cam_rgb.setPreviewSize(640, 640)
    cam_rgb.setInterleaved(True)

    xout_rgb = pipeline.createXLinkOut()
    xout_rgb.setStreamName("rgb")
    cam_rgb.preview.link(xout_rgb.input)
    cam_rgb.setPreviewKeepAspectRatio(False)

    manip1 = pipeline.createImageManip()
    manip1.setMaxOutputFrameSize(1244160)
    manip1.initialConfig.setResize(sizeX, sizeY)
    cam_rgb.preview.link(manip1.inputImage)
    manip1.initialConfig.setFrameType(depthai.ImgFrame.Type.BGR888p)
    manip1.inputImage.setBlocking(True)

    if args.videoPath is not None:
        xinFrame = pipeline.create(depthai.node.XLinkIn)
        xinFrame.setStreamName("inFrame")
        xinFrame.out.link(manip1.inputImage)
        xinFrame.setMaxDataSize(1920*1080*3)
        nnPass = pipeline.create(depthai.node.XLinkOut)
        nnPass.setStreamName("pass")
        detection_nn.passthrough.link(xout_rgb.input)
    else:
        xinFrame = None

    # Extract the values from the JSON
    num_classes = config['nn_config']['NN_specific_metadata']['classes']
    coordinates = config['nn_config']['NN_specific_metadata']['coordinates']
    anchors = config['nn_config']['NN_specific_metadata']['anchors']
    anchor_masks = config['nn_config']['NN_specific_metadata']['anchor_masks']
    iou_threshold = config['nn_config']['NN_specific_metadata']['iou_threshold']

    # Set the values
    detection_nn.setNumClasses(num_classes)
    detection_nn.setCoordinateSize(coordinates)
    detection_nn.setAnchors(anchors)
    detection_nn.setAnchorMasks(anchor_masks)
    detection_nn.setIouThreshold(iou_threshold)
    detection_nn.setConfidenceThreshold(0.5)
    # detection_nn.setNumInferenceThreads(2)
    detection_nn.input.setBlocking(True)

    # Blob is the Neural Network file, compiled for MyriadX. It contains both the definition and weights of the model
    # We're using a blobconverter tool to retrieve the MobileNetSSD blob automatically from OpenVINO Model Zoo
    # detection_nn.setBlobPath(blobconverter.from_zoo(name='mobilenet-ssd', shaves=6))
    # Next, we filter out the detections that are below a confidence threshold. Confidence can be anywhere between <0..1>
    # Next, we link the camera 'preview' output to the neural network detection input, so that it can produce detections
    manip1.out.link(detection_nn.input)

    if customModel is True:
        nnPath = str(
            (parentDir / Path('../../data/' + model)).resolve().absolute())
        # print(nnPath)
        detection_nn.setBlobPath(nnPath)
        print("Custom Model" + nnPath + "Size: " +
              str(sizeX) + "x" + str(sizeY))
    else:
        detection_nn.setBlobPath(blobconverter.from_zoo(
            name='person-detection-0106', shaves=6))
        print("Model from OpenVINO Zoo" + "Size: " +
              str(sizeX) + "x" + str(sizeY))

    xout_nn = pipeline.createXLinkOut()
    xout_nn.setStreamName("nn")
    detection_nn.out.link(xout_nn.input)

    return pipeline

def detect_and_count():
    global outputFrame, lock, zones_current_count, listeners, loop

    pipeline = init_pipeline()
    inputFrameShape = (sizeX, sizeY)

    with depthai.Device(pipeline) as device:
        q_rgb = device.getOutputQueue("rgb")
        q_nn = device.getOutputQueue("nn")
        qPass = device.getOutputQueue("pass")
        # q_manip = device.getInputQueue("")

        baseTs = time.monotonic()
        simulatedFps = 30
        frame = None
        detections = []
        timestamp = datetime.utcnow()
        zone_data = []

        def to_planar(arr: np.ndarray, shape: tuple) -> np.ndarray:
            return cv2.resize(arr, shape).transpose(2, 0, 1).flatten()

        if args.videoPath is not None:
            videoPath = str(
                (parentDir / Path('../../data/' + video_source)).resolve().absolute())
            cap = cv2.VideoCapture(videoPath, cv2.CAP_FFMPEG)

        # loop over frames from the video stream
        while True:
            if args.videoPath is not None:
                read_correctly, frame = cap.read()

                if not read_correctly:
                    break

                if args.videoPath is not None:
                    q_vid = device.getInputQueue(name="inFrame")
                    img = depthai.ImgFrame()
                    img.setType(depthai.RawImgFrame.Type.BGR888p)
                    img.setData(to_planar(frame, inputFrameShape))
                    img.setTimestamp(baseTs)
                    baseTs += 1/simulatedFps
                    img.setWidth(inputFrameShape[0])
                    img.setHeight(inputFrameShape[1])
                    q_vid.send(img)
                    # in_vid = q_vid.tryGet()
                    print("hello", timestamp)

                    if args.videoPath is not None:
                        print("video")
                        frame = qPass.get().getCvFrame()

            in_rgb = q_rgb.tryGet()
            in_nn = q_nn.tryGet()

            if in_rgb is not None and args.videoPath is None:
                print("live")
                frame = in_rgb.getCvFrame()

            if in_nn is not None:
                print("detect")
                detections = in_nn.detections
                zone_data += check_overlap(frame, detections)
                print("done",timestamp)
                now = datetime.utcnow()

                if now.second != timestamp.second:
                    t = threading.Thread(
                        target=insert_data, args=(zone_data, ))
                    t.daemon = True
                    t.start()
                    zone_data = []

                timestamp = now

                with lock:
                    outputFrame = frame.copy()
                    print("finish")

            if args.videoPath is not None:
                ret, frame = cap.read()

                if not ret:
                    print("video over", timestamp)
                    cap.release()
                    break
                    # at any time, you can press "q" and exit the main loop, therefore exiting the program itself

            if cv2.waitKey(1) == ord('q'):
                break

parser = argparse.ArgumentParser()
parser.add_argument('-v', '--videoPath',
                    help="Path to video frame", default=None)
args = parser.parse_args()
video_source = args.videoPath

What happens is that live inference and video inference both run at the same time, and everything stops after 30 seconds.
Any idea what I'm doing wrong?

erik

krishnashravan Please provide a minimal repro example.

krishnashravan

erik
Hi erik, I have attached the files here.
The test4 file has an error where the video keeps changing the size of the preview.

The test5 file has an error where the video switches to live inference in between frames.
The common problem I found is that both stop working after 30 seconds.
Live inference on its own works fine, but video inference stops after 30 seconds.
Let me know if you need anything else.

erik

This isn't reproducible.

 python .\test4.py
Traceback (most recent call last):
  File "D:\Downloads\yolov8-testing-pt-files-New%20folder\test4.py", line 175, in <module>
    parser = argparse.ArgumentParser()
NameError: name 'argparse' is not defined

krishnashravan

erik
I have made the changes. Use the yolov8n.json file for the metadata.
The files are here.

erik

Again, this is not reproducible. Did you even try running the test4.py/test5.py?

krishnashravan

erik

Sorry erik,
but I fixed it.