DepthAI
Maintaining Z-Coordinate Reliability in Cropped Images

Hello DepthAI Community,

I have been working on a project that uses the DepthAI framework, and we have recently run into an issue that we hope you can help us resolve.

In our main.py file, our objective is to perform object detection on a cropped image acquired from the Image Signal Processor (ISP). To give you a better idea, here is the part of our code where the image is cropped:
`
def create_pipeline(stereo):
    pipeline = dai.Pipeline()

    print("Creating Color Camera...")
    cam = pipeline.create(dai.node.ColorCamera)
    cam.setResolution(dai.ColorCameraProperties.SensorResolution.THE_12_MP)
    cam.setIspScale(1,2)
    cam.setPreviewSize(2028, 1520)
    cam.setInterleaved(False)
    cam.setBoardSocket(dai.CameraBoardSocket.RGB)

    # Workaround: remove in 2.18, use `cam.setPreviewNumFramesPool(10)`
    # This manip uses 15*3.5 MB => 52 MB of RAM.
    copy_manip = pipeline.create(dai.node.ImageManip)
    copy_manip.setNumFramesPool(15)
    # copy_manip.setMaxOutputFrameSize(3499200)
    copy_manip.setMaxOutputFrameSize(2028*1520*3)

    # Crop range
    topLeft = dai.Point2f(0.3, 0.3)
    bottomRight = dai.Point2f(0.7, 0.7)
    copy_manip.initialConfig.setCropRect(topLeft.x, topLeft.y, bottomRight.x, bottomRight.y)

    cam.preview.link(copy_manip.inputImage)

    cam_xout = pipeline.create(dai.node.XLinkOut)
    cam_xout.setStreamName("color")
    copy_manip.out.link(cam_xout.input)

    # ImageManip will resize the frame before sending it to the Face detection NN node
    face_det_manip = pipeline.create(dai.node.ImageManip)
    face_det_manip.initialConfig.setResize(300, 300)
    face_det_manip.initialConfig.setKeepAspectRatio(False)
    face_det_manip.initialConfig.setFrameType(dai.RawImgFrame.Type.RGB888p)
    copy_manip.out.link(face_det_manip.inputImage)

    if stereo:
        monoLeft = pipeline.create(dai.node.MonoCamera)
        monoLeft.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)
        monoLeft.setBoardSocket(dai.CameraBoardSocket.LEFT)

        monoRight = pipeline.create(dai.node.MonoCamera)
        monoRight.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)
        monoRight.setBoardSocket(dai.CameraBoardSocket.RIGHT)

        stereo = pipeline.create(dai.node.StereoDepth)
        stereo.setDefaultProfilePreset(dai.node.StereoDepth.PresetMode.HIGH_DENSITY)
        stereo.setDepthAlign(dai.CameraBoardSocket.RGB)

        stereo.setOutputSize(monoLeft.getResolutionWidth(), monoLeft.getResolutionHeight())
        monoLeft.out.link(stereo.left)
        monoRight.out.link(stereo.right)

        # Spatial Detection network if OAK-D
        print("OAK-D detected, app will display spatial coordinates")
        face_det_nn = pipeline.create(dai.node.MobileNetSpatialDetectionNetwork)
        face_det_nn.setBoundingBoxScaleFactor(0.8)
        face_det_nn.setDepthLowerThreshold(100)
        face_det_nn.setDepthUpperThreshold(5000)
        stereo.depth.link(face_det_nn.inputDepth)
    else:  # Detection network if OAK-1
        print("OAK-1 detected, app won't display spatial coordinates")
        face_det_nn = pipeline.create(dai.node.MobileNetDetectionNetwork)

    face_det_nn.setConfidenceThreshold(0.5)
    face_det_nn.setBlobPath(blobconverter.from_zoo(name="face-detection-retail-0004", shaves=6))
    face_det_nn.input.setQueueSize(1)
    face_det_manip.out.link(face_det_nn.input)

    # Send face detections to the host (for bounding boxes)
    face_det_xout = pipeline.create(dai.node.XLinkOut)
    face_det_xout.setStreamName("detection")
    face_det_nn.out.link(face_det_xout.input)

`
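
For reference, a quick back-of-the-envelope check (plain Python, independent of DepthAI) of what that normalized crop rect means in pixels on the 2028x1520 preview; the numbers are just arithmetic, not values read back from the device:

`
# Rough pixel extent of the normalized crop rect (0.3, 0.3)-(0.7, 0.7)
# applied to the 2028x1520 preview frame. Illustrative arithmetic only.
preview_w, preview_h = 2028, 1520
xmin, ymin, xmax, ymax = 0.3, 0.3, 0.7, 0.7

crop_w = int((xmax - xmin) * preview_w)   # ~811 px
crop_h = int((ymax - ymin) * preview_h)   # 608 px
print(f"Crop starts at ({int(xmin * preview_w)}, {int(ymin * preview_h)}) "
      f"and is roughly {crop_w}x{crop_h} px")
`
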
After applying this crop, we have noticed that the z-coordinate reported for the detections becomes unreliable:

coords = "Z: {:.2f} m".format(detection.spatialCoordinates.z/1000)

We would greatly appreciate any guidance on how to properly crop the image while maintaining the accuracy and reliability of the z-coordinate. Any insights, suggestions, or examples you can share would be incredibly helpful.

Thank you in advance for your assistance, and we look forward to hearing from you soon.

Best regards,

Henry

    Hi Henry,
    Would you mind sending a Minimal Reproducible Example (MRE) so we can run and debug the code?

    Thanks,
    Jaka

    The code consists of two files. Do you want me to copy and paste them here?

    Best

    Henry

      main.py

      `
      from MultiMsgSync import TwoStageHostSeqSync
      import blobconverter
      import cv2
      import depthai as dai
      import numpy as np
      import datetime
      import time

      print("DepthAI version", dai.version)
      def frame_norm(frame, bbox):
      normVals = np.full(len(bbox), frame.shape[0])
      normVals[::2] = frame.shape[1]
      return (np.clip(np.array(bbox), 0, 1) * normVals).astype(int)

      def create_pipeline(stereo):
      pipeline = dai.Pipeline()

      print("Creating Color Camera...")
      cam = pipeline.create(dai.node.ColorCamera)
      cam.setResolution(dai.ColorCameraProperties.SensorResolution.THE_12_MP)
      cam.setIspScale(1,5)
      cam.setPreviewSize(676, 506)
      cam.setInterleaved(False)
      cam.setBoardSocket(dai.CameraBoardSocket.RGB)
      cam.setFps(10)
      
      # Workaround: remove in 2.18, use `cam.setPreviewNumFramesPool(10)`
      # This manip uses 15*3.5 MB => 52 MB of RAM.
      copy_manip = pipeline.create(dai.node.ImageManip)
      copy_manip.setNumFramesPool(15)
      # copy_manip.setMaxOutputFrameSize(3499200)
      copy_manip.setMaxOutputFrameSize(676*506*3)
      
      # Crop range
      topLeft = dai.Point2f(0.3, 0.3)
      bottomRight = dai.Point2f(0.7, 0.7)
      # copy_manip.initialConfig.setCropRect(topLeft.x, topLeft.y, bottomRight.x, bottomRight.y)
      
      cam.preview.link(copy_manip.inputImage)
      
      cam_xout = pipeline.create(dai.node.XLinkOut)
      cam_xout.setStreamName("color")
      copy_manip.out.link(cam_xout.input)
      
      # ImageManip will resize the frame before sending it to the Face detection NN node
      face_det_manip = pipeline.create(dai.node.ImageManip)
      # face_det_manip.initialConfig.setResize(300, 300)
      face_det_manip.initialConfig.setResize(672, 384)
      face_det_manip.initialConfig.setKeepAspectRatio(False)
      face_det_manip.initialConfig.setFrameType(dai.RawImgFrame.Type.RGB888p)
      copy_manip.out.link(face_det_manip.inputImage)
      
      if stereo:
          monoLeft = pipeline.create(dai.node.MonoCamera)
          monoLeft.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)
          monoLeft.setBoardSocket(dai.CameraBoardSocket.LEFT)
      
          monoRight = pipeline.create(dai.node.MonoCamera)
          monoRight.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)
          monoRight.setBoardSocket(dai.CameraBoardSocket.RIGHT)
      
          stereo = pipeline.create(dai.node.StereoDepth)
          stereo.setDefaultProfilePreset(dai.node.StereoDepth.PresetMode.HIGH_DENSITY)
          stereo.setDepthAlign(dai.CameraBoardSocket.RGB)
          
          stereo.setOutputSize(monoLeft.getResolutionWidth(), monoLeft.getResolutionHeight())
          monoLeft.out.link(stereo.left)
          monoRight.out.link(stereo.right)
      
          # Spatial Detection network if OAK-D
          print("OAK-D detected, app will display spatial coordiantes")
          face_det_nn = pipeline.create(dai.node.MobileNetSpatialDetectionNetwork)
          face_det_nn.setBoundingBoxScaleFactor(0.8)
          face_det_nn.setDepthLowerThreshold(100)
          face_det_nn.setDepthUpperThreshold(5000)
          stereo.depth.link(face_det_nn.inputDepth)
      else: # Detection network if OAK-1
          print("OAK-1 detected, app won't display spatial coordiantes")
          face_det_nn = pipeline.create(dai.node.MobileNetDetectionNetwork)
      
      face_det_nn.setConfidenceThreshold(0.5)
      # face_det_nn.setBlobPath(blobconverter.from_zoo(name="face-detection-retail-0004", shaves=6))
      face_det_nn.setBlobPath(blobconverter.from_zoo(name="face-detection-adas-0001", shaves=6))
      face_det_nn.input.setQueueSize(1)
      face_det_manip.out.link(face_det_nn.input)
      
      # Send face detections to the host (for bounding boxes)
      face_det_xout = pipeline.create(dai.node.XLinkOut)
      face_det_xout.setStreamName("detection")
      face_det_nn.out.link(face_det_xout.input)
      
      # Script node will take the output from the face detection NN as an input and set ImageManipConfig
      # to the 'recognition_manip' to crop the initial frame
      image_manip_script = pipeline.create(dai.node.Script)
      face_det_nn.out.link(image_manip_script.inputs['face_det_in'])
      
      # Remove in 2.18 and use `imgFrame.getSequenceNum()` in Script node
      face_det_nn.passthrough.link(image_manip_script.inputs['passthrough'])
      
      copy_manip.out.link(image_manip_script.inputs['preview'])
      
      image_manip_script.setScript("""
      import time
      msgs = dict()
      
      def add_msg(msg, name, seq = None):
          global msgs
          if seq is None:
              seq = msg.getSequenceNum()
          seq = str(seq)
          # node.warn(f"New msg {name}, seq {seq}")
      
          # Each seq number has its own dict of msgs
          if seq not in msgs:
              msgs[seq] = dict()
          msgs[seq][name] = msg
      
          # To avoid freezing (not necessary for this ObjDet model)
          if 15 < len(msgs):
              node.warn(f"Removing first element! len {len(msgs)}")
              msgs.popitem() # Remove first element
      
      def get_msgs():
          global msgs
          seq_remove = [] # Arr of sequence numbers to get deleted
          for seq, syncMsgs in msgs.items():
              seq_remove.append(seq) # Will get removed from dict if we find synced msgs pair
              # node.warn(f"Checking sync {seq}")
      
              # Check if we have both detections and color frame with this sequence number
              if len(syncMsgs) == 2: # 1 frame, 1 detection
                  for rm in seq_remove:
                      del msgs[rm]
                  # node.warn(f"synced {seq}. Removed older sync values. len {len(msgs)}")
                  return syncMsgs # Returned synced msgs
          return None
      
      def correct_bb(bb):
          if bb.xmin < 0: bb.xmin = 0.001
          if bb.ymin < 0: bb.ymin = 0.001
          if bb.xmax > 1: bb.xmax = 0.999
          if bb.ymax > 1: bb.ymax = 0.999
          return bb
      
      while True:
          time.sleep(0.001) # Avoid lazy looping
      
          preview = node.io['preview'].tryGet()
          if preview is not None:
              add_msg(preview, 'preview')
      
          face_dets = node.io['face_det_in'].tryGet()
          if face_dets is not None:
              # TODO: in 2.18.0.0 use face_dets.getSequenceNum()
              passthrough = node.io['passthrough'].get()
              seq = passthrough.getSequenceNum()
              add_msg(face_dets, 'dets', seq)
      
          sync_msgs = get_msgs()
          if sync_msgs is not None:
              img = sync_msgs['preview']
              dets = sync_msgs['dets']
              for i, det in enumerate(dets.detections):
                  cfg = ImageManipConfig()
                  correct_bb(det)
                  cfg.setCropRect(det.xmin, det.ymin, det.xmax, det.ymax)
                  # node.warn(f"Sending {i + 1}. det. Seq {seq}. Det {det.xmin}, {det.ymin}, {det.xmax}, {det.ymax}")
                  cfg.setResize(62, 62)
                  cfg.setKeepAspectRatio(False)
                  node.io['manip_cfg'].send(cfg)
                  node.io['manip_img'].send(img)
      """)
      
      recognition_manip = pipeline.create(dai.node.ImageManip)
      recognition_manip.initialConfig.setResize(62, 62)
      recognition_manip.setWaitForConfigInput(True)
      image_manip_script.outputs['manip_cfg'].link(recognition_manip.inputConfig)
      image_manip_script.outputs['manip_img'].link(recognition_manip.inputImage)
      
      # Second-stage recognition NN
      print("Creating recognition Neural Network...")
      recognition_nn = pipeline.create(dai.node.NeuralNetwork)
      recognition_nn.setBlobPath(blobconverter.from_zoo(name="age-gender-recognition-retail-0013", shaves=6))
      recognition_manip.out.link(recognition_nn.input)
      
      recognition_xout = pipeline.create(dai.node.XLinkOut)
      recognition_xout.setStreamName("recognition")
      recognition_nn.out.link(recognition_xout.input)
      
      return pipeline

      prefix = './images/' + datetime.datetime.now().strftime("%Y-%m-%d %H%M%S_")
      startTime = time.time()
      count = 0
      wait_time = 200
      max_images = 10
      with dai.Device() as device:
      stereo = 1 < len(device.getConnectedCameras())
      device.startPipeline(create_pipeline(stereo))

      sync = TwoStageHostSeqSync()
      queues = {}
      # Create output queues
      for name in ["color", "detection", "recognition"]:
          queues[name] = device.getOutputQueue(name)
      
      while True:
          for name, q in queues.items():
              # Add all msgs (color frames, object detections and recognitions) to the Sync class.
              if q.has():
                  sync.add_msg(q.get(), name)
      
          msgs = sync.get_msgs()
      
          if msgs is not None:
              frame = msgs["color"].getCvFrame()
      
              if time.time() - startTime > wait_time:
                  count += 1
                  if count < max_images:
                      cv2.imwrite(prefix + str(count) + '.jpg', frame)
                  else:
                      break
      
              detections = msgs["detection"].detections
              recognitions = msgs["recognition"]
      
              for i, detection in enumerate(detections):
                  bbox = frame_norm(frame, (detection.xmin, detection.ymin, detection.xmax, detection.ymax))
      
                  # Decoding of recognition results
                  rec = recognitions[i]
                  age = int(float(np.squeeze(np.array(rec.getLayerFp16('age_conv3')))) * 100)
                  gender = np.squeeze(np.array(rec.getLayerFp16('prob')))
                  gender_str = "female" if gender[0] > gender[1] else "male"
      
                  cv2.rectangle(frame, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (10, 245, 10), 2)
                  y = (bbox[1] + bbox[3]) // 2
                  cv2.putText(frame, str(age), (bbox[0], y), cv2.FONT_HERSHEY_TRIPLEX, 1.5, (0, 0, 0), 8)
                  cv2.putText(frame, str(age), (bbox[0], y), cv2.FONT_HERSHEY_TRIPLEX, 1.5, (255, 255, 255), 2)
                  cv2.putText(frame, gender_str, (bbox[0], y + 30), cv2.FONT_HERSHEY_TRIPLEX, 1.5, (0, 0, 0), 8)
                  cv2.putText(frame, gender_str, (bbox[0], y + 30), cv2.FONT_HERSHEY_TRIPLEX, 1.5, (255, 255, 255), 2)
                  if stereo:
                      # You could also get detection.spatialCoordinates.x and detection.spatialCoordinates.y coordinates
                      coords = "Z: {:.2f} m".format(detection.spatialCoordinates.z/1000)
                      cv2.putText(frame, coords, (bbox[0], y + 60), cv2.FONT_HERSHEY_TRIPLEX, 1, (0, 0, 0), 8)
                      cv2.putText(frame, coords, (bbox[0], y + 60), cv2.FONT_HERSHEY_TRIPLEX, 1, (255, 255, 255), 2)
      
              if time.time() - startTime < wait_time:
                  cv2.putText(frame, "Wait for " + str(wait_time - int(time.time() - startTime)) + " seconds.", (frame.shape[1]-400, 50), cv2.FONT_HERSHEY_TRIPLEX, 1, (255, 255, 255), 2)
      
              cv2.imshow("Camera", frame)
          if cv2.waitKey(1) == ord('q'):
              break

      `

      Hi Henry
      Yes, but only paste the minimal code you need for the example to work.
      Thanks,
      Jaka

      MultiMsgSync.py

      `
      # Color frames (ImgFrame), object detection (ImgDetections) and recognition (NNData)
      # messages arrive to the host all with some additional delay.
      # For each ImgFrame there's one ImgDetections msg, which has multiple detections, and for each
      # detection there's a NNData msg which contains recognition results.
      #
      # How it works:
      # Every ImgFrame, ImgDetections and NNData message has its own sequence number, by which we can sync messages.

      class TwoStageHostSeqSync:
          def __init__(self):
              self.msgs = {}

          # name: color, detection, or recognition
          def add_msg(self, msg, name):
              seq = str(msg.getSequenceNum())
              if seq not in self.msgs:
                  self.msgs[seq] = {}  # Create dictionary for msgs
              if "recognition" not in self.msgs[seq]:
                  self.msgs[seq]["recognition"] = []  # Create recognition array

              if name == "recognition":
                  # Append recognition msgs to an array
                  self.msgs[seq]["recognition"].append(msg)
                  # print(f'Added recognition seq {seq}, total len {len(self.msgs[seq]["recognition"])}')

              elif name == "detection":
                  # Save detection msg in the dictionary
                  self.msgs[seq][name] = msg
                  self.msgs[seq]["len"] = len(msg.detections)
                  # print(f'Added detection seq {seq}')

              elif name == "color":
                  # Save color frame in the dictionary
                  self.msgs[seq][name] = msg
                  # print(f'Added frame seq {seq}')

          def get_msgs(self):
              seq_remove = []  # Arr of sequence numbers to get deleted

              for seq, msgs in self.msgs.items():
                  seq_remove.append(seq)  # Will get removed from dict if we find synced msgs pair

                  # Check if we have both detections and color frame with this sequence number
                  if "color" in msgs and "len" in msgs:

                      # Check if all detected objects (faces) have finished recognition inference
                      if msgs["len"] == len(msgs["recognition"]):
                          # print(f"Synced msgs with sequence number {seq}", msgs)

                          # We have synced msgs, remove previous msgs (memory cleaning)
                          for rm in seq_remove:
                              del self.msgs[rm]

                          return msgs  # Returned synced msgs

              return None  # No synced msgs
      `

      These are the two Python files needed to run the code.

      @Henry I believe this is far from minimal; please strip it down to the smallest amount of code that still reproduces the issue.

      Most of the code is from the depthai_gen2_age_gender repository. The only thing I modified is the crop:
      # Crop range
      topLeft = dai.Point2f(0.3, 0.3)
      bottomRight = dai.Point2f(0.7, 0.7)
      copy_manip.initialConfig.setCropRect(topLeft.x, topLeft.y, bottomRight.x, bottomRight.y)

        Hi Henry,
        Please post only the code that is absolutely needed to reproduce the issue. I can see you have also modified the camera resolution and applied resizing, neither of which is present in the gen2_age_gender example.
        Thanks,
        Jaka

        There are several places in the code where the resolution can be changed, each for a different purpose (a small sketch for checking the resulting sizes follows the list):

        1. In the ColorCamera:

        cam = pipeline.create(dai.node.ColorCamera)
        cam.setResolution(dai.ColorCameraProperties.SensorResolution.THE_12_MP)
        cam.setIspScale(1,5)
        cam.setPreviewSize(676, 506)

        2. In the face_det_manip:

        face_det_manip = pipeline.create(dai.node.ImageManip)
        face_det_manip.initialConfig.setResize(672, 384)
        face_det_manip.initialConfig.setKeepAspectRatio(False)

        3. In monoLeft, monoRight, and StereoDepth:
          monoLeft = pipeline.create(dai.node.MonoCamera)
          monoLeft.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)
          monoLeft.setBoardSocket(dai.CameraBoardSocket.LEFT)
          monoRight = pipeline.create(dai.node.MonoCamera)
          monoRight.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)
          monoRight.setBoardSocket(dai.CameraBoardSocket.RIGHT)
          stereo = pipeline.create(dai.node.StereoDepth)
          stereo.setDefaultProfilePreset(dai.node.StereoDepth.PresetMode.HIGH_DENSITY)
          stereo.setDepthAlign(dai.CameraBoardSocket.RGB)
          stereo.setOutputSize(monoLeft.getResolutionWidth(), monoLeft.getResolutionHeight())
          monoLeft.out.link(stereo.left)
          monoRight.out.link(stereo.right)
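
        To sanity-check which sizes actually end up being used (the check mentioned before the list), a small sketch using size getters from the DepthAI Python API, printed right after the nodes are configured:

        `
        # Sketch: print the sizes configured above
        print("ISP output size:", cam.getIspSize())      # 12 MP scaled by setIspScale(1, 5)
        print("Preview size:", cam.getPreviewSize())      # (676, 506)
        print("Stereo/depth output size:", monoLeft.getResolutionWidth(),
              monoLeft.getResolutionHeight())             # 640 x 400, see stereo.setOutputSize()
        `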

        I understand that the ColorCamera sends data to copy_manip, which feeds the object detection, while monoLeft and monoRight feed StereoDepth, which calculates the disparity map and ultimately the depth map. To compute spatial coordinates for the detected bounding boxes, the detections have to be combined with that depth map. How does DepthAI handle this? Thanks.
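
        For what it's worth, my rough mental model is that the spatial detection node scales each detection's bounding box (setBoundingBoxScaleFactor), takes the depth pixels inside that ROI on the RGB-aligned depth frame, discards values outside the lower/upper thresholds, and reduces the remaining pixels to a single Z value; please correct me if that is wrong. A NumPy sketch of that idea, purely for illustration (the helper name is mine, and this is not the exact on-device algorithm):

        `
        import numpy as np

        def approximate_z(depth_mm, bbox, scale=0.8, lower=100, upper=5000):
            """Illustrative only: derive a Z value from an RGB-aligned depth frame
            (uint16, millimetres) and a normalized bbox (xmin, ymin, xmax, ymax)."""
            h, w = depth_mm.shape
            xmin, ymin, xmax, ymax = bbox
            # Shrink the bbox around its centre, like setBoundingBoxScaleFactor
            cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
            half_w, half_h = (xmax - xmin) / 2 * scale, (ymax - ymin) / 2 * scale
            x0, x1 = int((cx - half_w) * w), int((cx + half_w) * w)
            y0, y1 = int((cy - half_h) * h), int((cy + half_h) * h)
            roi = depth_mm[y0:y1, x0:x1]
            valid = roi[(roi > lower) & (roi < upper)]  # apply depth thresholds
            if valid.size == 0:
                return 0.0
            return float(np.median(valid))  # median (or average) of the remaining pixels
        `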

        Best

        Henry