• DepthAI
  • Oak-1-Lite, 4K RGB demo, bounding boxes aren't scaled correctly.

Hi,
I'm trying to scale detections from the 300x300 preview up to the 640x360 video output for additional image processing, but I can't seem to get the bounding boxes to scale appropriately.

I tried the 4K RGB demo from https://docs.luxonis.com/projects/api/en/v2.3.0.0/samples/15_rgb_mobilenet_4k/#k-rgb-mobilenetssd with my custom model and realized I'm having the same issue there. On the 300x300 preview the bottles are marked correctly, but at either THE_4_K or THE_1080_P the boxes drift away from their objects. I also tried it with the provided mobilenet-ssd_openvino_2021.2_5shave.blob and the same issue occurs.

Do you have any pointers on how to correct this? Keeping the location accurate is of some importance.

This is what my code looks like.
`
# Create pipeline
pipeline = dai.Pipeline()

# Define sources and outputs
## Color camera input
self.oaklogging.info(self.camlogident+"Configuring color camera")
camRgb = pipeline.create(dai.node.ColorCamera)
camRgb.setPreviewSize(300,300) # Neural net image resolution.
camRgb.setInterleaved(False)

camRgb.setResolution(dai.ColorCameraProperties.SensorResolution.THE_1080_P)
camRgb.setIspScale(1,3) # Downscale the 1080P ISP output (1920x1080 -> 640x360) for the video stream
camRgb.setVideoSize(640,360)

## Color camera output
xoutPreview = pipeline.create(dai.node.XLinkOut)
xoutPreview.setStreamName("preview")
#camRgb.preview.link(xoutPreview.input)
camRgb.video.link(xoutPreview.input) # note: the 'preview' stream actually carries the 640x360 video output

if self.icobjects:
  ## Neural network input
  self.oaklogging.info(self.camlogident+"Configuring neural network")
  nn = pipeline.create(dai.node.MobileNetDetectionNetwork)
  nn.setConfidenceThreshold(self.minconfi)
  nn.setBlobPath(nnPath)
  nn.setNumInferenceThreads(2)
  nn.input.setBlocking(False)
  # Linking camera input to nn
  camRgb.preview.link(nn.input)
  ## Neural network output
  nnOut = pipeline.create(dai.node.XLinkOut)
  nnOut.setStreamName("nn")
  nn.out.link(nnOut.input)
`
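
On the host I'm essentially doing the obvious thing and multiplying the normalized detections by the video size before drawing; roughly like this (a generic sketch with a made-up helper name, not my exact display code):

`
# Simplified sketch: scale the 0..1 detection coordinates straight onto the 640x360 video frame
def bboxToVideo(detection, frameW=640, frameH=360):
    x1 = int(detection.xmin * frameW)
    y1 = int(detection.ymin * frameH)
    x2 = int(detection.xmax * frameW)
    y2 = int(detection.ymax * frameH)
    return x1, y1, x2, y2
`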

    theDesertMoon

    You need to set:

    stereo.setDepthAlign(dai.CameraBoardSocket.RGB)
    stereo.setOutputSize(monoLeft.getResolutionWidth(), monoLeft.getResolutionHeight())

    Like here:
    https://github.com/luxonis/depthai-python/blob/main/examples/SpatialDetection/spatial_mobilenet.py#L64

    Make sure to update to the latest library, 2.15.4.0; we recently fixed some bugs related to stability when RGB alignment is enabled.
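
    For context, those two calls sit on a StereoDepth node in a spatial pipeline; roughly like this (a trimmed sketch, only relevant on devices that have the mono cameras):

    import depthai as dai

    pipeline = dai.Pipeline()
    monoLeft = pipeline.create(dai.node.MonoCamera)
    monoRight = pipeline.create(dai.node.MonoCamera)
    stereo = pipeline.create(dai.node.StereoDepth)
    monoLeft.setBoardSocket(dai.CameraBoardSocket.LEFT)
    monoRight.setBoardSocket(dai.CameraBoardSocket.RIGHT)
    monoLeft.out.link(stereo.left)
    monoRight.out.link(stereo.right)
    # Align the depth map to the RGB camera and match its output size
    stereo.setDepthAlign(dai.CameraBoardSocket.RGB)
    stereo.setOutputSize(monoLeft.getResolutionWidth(), monoLeft.getResolutionHeight())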

      GergelySzabolcs

      Would this apply to the Oak-1-Lite? I don't have the two mono cameras for disparity, and no stereo pipeline is created. I did update depthai using the force-reinstall option, but it didn't help in this specific case.

        GergelySzabolcs
        Thank you for your reply.

        While I see several differences between this script and your '15_rgb_mobilenet_4k' example, I'm still seeing the same results.

        Is there some inherent limitation to the Oak-1-Lite?

        The only change I made to the script you just pointed me to was to change the location of the blob file.

        I have an Oak-D-Lite connected to the same system (Raspberry Pi 4B, if that makes a difference) and it exhibits the same behavior.

        GergelySzabolcs
        I just realized that I lied. 🙂 I did make one other change.

        With 'camRgb.setPreviewKeepAspectRatio(False)' the model seems a lot less accurate, so I've habitually set it to 'True'. When I change it back to 'False' the model loses some accuracy, but when it does recognize an item it maps the bounding box correctly onto the larger image.
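
        To spell out what I'm toggling between (as I understand what each setting does to the 300x300 preview):

        `
        # True (what I normally use): the preview keeps the aspect ratio, so the NN sees an
        # undistorted center crop of the frame, but only part of the field of view.
        camRgb.setPreviewKeepAspectRatio(True)

        # False: the full 16:9 frame is squeezed into 300x300, so normalized boxes map straight
        # onto the full frame, but the model sees horizontally squashed objects.
        camRgb.setPreviewKeepAspectRatio(False)
        `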

        Is there a way to hit a sweet spot that keeps the model accurate while still allowing a correct mapping to the larger image?

        Thank you!

        @theDesertMoon
        Check this one out, without camRgb.setPreviewKeepAspectRatio(False).
        The 1920x1080 frame is cropped to 1080x1080, which is then scaled to 300x300 to preserve the aspect ratio, so you need to add (1920-1080)/2 = 420 to the x coordinates when displaying bounding boxes on the full-resolution frame.

        #!/usr/bin/env python3
        
        from pathlib import Path
        import sys
        import cv2
        import depthai as dai
        import numpy as np
        
        # Get argument first
        nnPath = str((Path(__file__).parent / Path('../models/mobilenet-ssd_openvino_2021.4_5shave.blob')).resolve().absolute())
        if len(sys.argv) > 1:
            nnPath = sys.argv[1]
        
        if not Path(nnPath).exists():
            raise FileNotFoundError(f'Required file/s not found, please run "{sys.executable} install_requirements.py"')
        
        # MobilenetSSD label texts
        labelMap = ["background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow",
                    "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]
        
        # Create pipeline
        pipeline = dai.Pipeline()
        
        # Define sources and outputs
        camRgb = pipeline.create(dai.node.ColorCamera)
        nn = pipeline.create(dai.node.MobileNetDetectionNetwork)
        
        xoutVideo = pipeline.create(dai.node.XLinkOut)
        xoutPreview = pipeline.create(dai.node.XLinkOut)
        nnOut = pipeline.create(dai.node.XLinkOut)
        
        xoutVideo.setStreamName("video")
        xoutPreview.setStreamName("preview")
        nnOut.setStreamName("nn")
        
        # Properties
        camRgb.setPreviewSize(300, 300)    # NN input
        camRgb.setResolution(dai.ColorCameraProperties.SensorResolution.THE_1080_P)
        camRgb.setInterleaved(False)
        # camRgb.setPreviewKeepAspectRatio(False)
        # Define a neural network that will make predictions based on the source frames
        nn.setConfidenceThreshold(0.5)
        nn.setBlobPath(nnPath)
        nn.setNumInferenceThreads(2)
        nn.input.setBlocking(False)
        
        # Linking
        camRgb.video.link(xoutVideo.input)
        camRgb.preview.link(xoutPreview.input)
        camRgb.preview.link(nn.input)
        nn.out.link(nnOut.input)
        
        videoSize = camRgb.getVideoSize()
        
        # Horizontal offset of the centered square crop inside the video frame (420 for 1920x1080)
        videoOffset = (videoSize[0] - videoSize[1]) // 2
        print(videoOffset)
        # Connect to device and start pipeline
        with dai.Device(pipeline) as device:
        
            # Output queues will be used to get the frames and nn data from the outputs defined above
            qVideo = device.getOutputQueue(name="video", maxSize=4, blocking=False)
            qPreview = device.getOutputQueue(name="preview", maxSize=4, blocking=False)
            qDet = device.getOutputQueue(name="nn", maxSize=4, blocking=False)
        
            previewFrame = None
            videoFrame = None
            detections = []
        
            # nn data, being the bounding box locations, are in <0..1> range - they need to be normalized with frame width/height
            def frameNorm(frame, bbox):
                normVals = np.full(len(bbox), frame.shape[0])
                normVals[::2] = frame.shape[1]
                return (np.clip(np.array(bbox), 0, 1) * normVals).astype(int)
        
            def displayFrame(name, frame, offsetX = 0):
                color = (255, 0, 0)
                for detection in detections:
                    bbox = frameNorm(frame, (detection.xmin, detection.ymin, detection.xmax, detection.ymax))
                    bbox[0] += offsetX
                    bbox[2] -= offsetX
                    cv2.putText(frame, labelMap[detection.label], (bbox[0] + 10, bbox[1] + 20), cv2.FONT_HERSHEY_TRIPLEX, 0.5, color)
                    cv2.putText(frame, f"{int(detection.confidence * 100)}%", (bbox[0] + 10, bbox[1] + 40), cv2.FONT_HERSHEY_TRIPLEX, 0.5, color)
                    cv2.rectangle(frame, (bbox[0], bbox[1]), (bbox[2], bbox[3]), color, 2)
                # Show the frame
                cv2.imshow(name, frame)
        
            cv2.namedWindow("video", cv2.WINDOW_NORMAL)
            cv2.resizeWindow("video", 1280, 720)
            print("Resize video window with mouse drag!")
        
            while True:
                # Instead of get (blocking), we use tryGet (non-blocking) which will return the available data or None otherwise
                inVideo = qVideo.tryGet()
                inPreview = qPreview.tryGet()
                inDet = qDet.tryGet()
        
                if inVideo is not None:
                    videoFrame = inVideo.getCvFrame()
        
                if inPreview is not None:
                    previewFrame = inPreview.getCvFrame()
        
                if inDet is not None:
                    detections = inDet.detections
        
                if videoFrame is not None:
                    displayFrame("video", videoFrame, videoOffset)
        
                if previewFrame is not None:
                    displayFrame("preview", previewFrame)
        
                if cv2.waitKey(1) == ord('q'):
                    break

          GergelySzabolcs
          Thank you for your efforts.

          It's still the same, just different. 🙂
          The 'train' seems to transition mostly correctly, but it is also centered. I'm playing with the numbers on my side some too, however.

          GergelySzabolcs
          FWIW, I played with the numbers a bit more!

          I realized that 'train' tended to keep itself correct, and the further a box was from center the more out of proportion its dimensions were, so I added the code below to compensate. I used your 'offset' value to determine whether I needed to alter the dimensions.

          I still need to redo the math, but it's almost on the right path for anyone with a similar issue.

          `

          def frameNorm(frame, bbox):
              normVals = np.full(len(bbox), frame.shape[0])
              normVals[::2] = frame.shape[1]
              return (np.clip(np.array(bbox), 0, 1) * normVals).astype(int)
          
          def displayFrame(name, frame, offsetX = 0):
              color = (255, 0, 0)
              #print(frame.shape)
              for detection in detections:
                  bbox = frameNorm(frame, (detection.xmin, detection.ymin, detection.xmax, detection.ymax))
                  if(offsetX == 1):
                    #print("*** *** ***")
                    #print(bbox)
                    bbox[0] = bbox[0]+(((frame.shape[1]/2)-bbox[0])*(300/frame.shape[0]))
                    bbox[2] = bbox[2]+(((frame.shape[1]/2)-bbox[2])*(300/frame.shape[0]))
                    bbox[1] = bbox[1]+(((frame.shape[0]/2)-bbox[1])*(300/frame.shape[1]))
                    bbox[3] = bbox[3]+(((frame.shape[0]/2)-bbox[3])*(300/frame.shape[1]))
                    #print(bbox)
                    #print("*** *** ***")
                  cv2.putText(frame, labelMap[detection.label], (bbox[0] + 10, bbox[1] + 20), cv2.FONT_HERSHEY_TRIPLEX, 0.5, color)
                  cv2.putText(frame, f"{int(detection.confidence * 100)}%", (bbox[0] + 10, bbox[1] + 40), cv2.FONT_HERSHEY_TRIPLEX, 0.5, color)
                  cv2.rectangle(frame, (bbox[0], bbox[1]), (bbox[2], bbox[3]), color, 2)
              # Show the frame
              cv2.imshow(name, frame)

          `
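
          For anyone else hitting this, I think the math reduces to scaling by the side of the cropped square (the frame height) and only then adding the crop offset to x; roughly this (a sketch assuming the centered 1080x1080 crop, not fully verified on my side):

          `
          import numpy as np

          def frameNormCropped(frame, bbox):
              # The NN only saw the centered square crop, whose side equals the frame height
              crop = frame.shape[0]                              # 1080 for a 1920x1080 frame
              offsetX = (frame.shape[1] - frame.shape[0]) // 2   # 420 for 1920x1080
              xmin, ymin, xmax, ymax = np.clip(np.array(bbox), 0, 1)
              return (int(xmin * crop) + offsetX, int(ymin * crop),
                      int(xmax * crop) + offsetX, int(ymax * crop))
          `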

          • erik replied to this.

            erik
            Wow, Erik! My face is red. Definitely a case of RTFM on my part. 🙂 Thank you; your maths are far superior to my own and, as you expected, the results are spot on.

            @GergelySzabolcs, my apologies also. I will slap my hand and dig a little further in the future.

            • erik replied to this.