DepthAI
Maintaining Z-Coordinate Reliability in Cropped Images

Hello DepthAI Community,

I have been working on a project that uses the DepthAI framework, and we have recently run into an issue that we hope you can help us resolve.

In our main.py file, our objective is to perform object detection on a cropped image acquired from the Image Signal Processor (ISP). To give you a better idea, here is the part of our code where the image is cropped:
`
def create_pipeline(stereo):
    pipeline = dai.Pipeline()

    print("Creating Color Camera...")
    cam = pipeline.create(dai.node.ColorCamera)
    cam.setResolution(dai.ColorCameraProperties.SensorResolution.THE_12_MP)
    cam.setIspScale(1,2)
    cam.setPreviewSize(2028, 1520)
    cam.setInterleaved(False)
    cam.setBoardSocket(dai.CameraBoardSocket.RGB)

    # Workaround: remove in 2.18, use `cam.setPreviewNumFramesPool(10)`
    # This manip uses 15*3.5 MB => 52 MB of RAM.
    copy_manip = pipeline.create(dai.node.ImageManip)
    copy_manip.setNumFramesPool(15)
    # copy_manip.setMaxOutputFrameSize(3499200)
    copy_manip.setMaxOutputFrameSize(2028*1520*3)

    # Crop range
    topLeft = dai.Point2f(0.3, 0.3)
    bottomRight = dai.Point2f(0.7, 0.7)
    copy_manip.initialConfig.setCropRect(topLeft.x, topLeft.y, bottomRight.x, bottomRight.y)

    cam.preview.link(copy_manip.inputImage)

    cam_xout = pipeline.create(dai.node.XLinkOut)
    cam_xout.setStreamName("color")
    copy_manip.out.link(cam_xout.input)

    # ImageManip will resize the frame before sending it to the Face detection NN node
    face_det_manip = pipeline.create(dai.node.ImageManip)
    face_det_manip.initialConfig.setResize(300, 300)
    face_det_manip.initialConfig.setKeepAspectRatio(False)
    face_det_manip.initialConfig.setFrameType(dai.RawImgFrame.Type.RGB888p)
    copy_manip.out.link(face_det_manip.inputImage)

    if stereo:
        monoLeft = pipeline.create(dai.node.MonoCamera)
        monoLeft.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)
        monoLeft.setBoardSocket(dai.CameraBoardSocket.LEFT)

        monoRight = pipeline.create(dai.node.MonoCamera)
        monoRight.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)
        monoRight.setBoardSocket(dai.CameraBoardSocket.RIGHT)

        stereo = pipeline.create(dai.node.StereoDepth)
        stereo.setDefaultProfilePreset(dai.node.StereoDepth.PresetMode.HIGH_DENSITY)
        stereo.setDepthAlign(dai.CameraBoardSocket.RGB)

        stereo.setOutputSize(monoLeft.getResolutionWidth(), monoLeft.getResolutionHeight())
        monoLeft.out.link(stereo.left)
        monoRight.out.link(stereo.right)

        # Spatial Detection network if OAK-D
        print("OAK-D detected, app will display spatial coordinates")
        face_det_nn = pipeline.create(dai.node.MobileNetSpatialDetectionNetwork)
        face_det_nn.setBoundingBoxScaleFactor(0.8)
        face_det_nn.setDepthLowerThreshold(100)
        face_det_nn.setDepthUpperThreshold(5000)
        stereo.depth.link(face_det_nn.inputDepth)
    else:  # Detection network if OAK-1
        print("OAK-1 detected, app won't display spatial coordinates")
        face_det_nn = pipeline.create(dai.node.MobileNetDetectionNetwork)

    face_det_nn.setConfidenceThreshold(0.5)
    face_det_nn.setBlobPath(blobconverter.from_zoo(name="face-detection-retail-0004", shaves=6))
    face_det_nn.input.setQueueSize(1)
    face_det_manip.out.link(face_det_nn.input)

    # Send face detections to the host (for bounding boxes)
    face_det_xout = pipeline.create(dai.node.XLinkOut)
    face_det_xout.setStreamName("detection")
    face_det_nn.out.link(face_det_xout.input)

`
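
For reference, a quick back-of-the-envelope check (plain Python, independent of DepthAI) of what that normalized crop rect means in pixels on the 2028x1520 preview; the numbers are just arithmetic, not values read back from the device:

`
# Rough pixel extent of the normalized crop rect (0.3, 0.3)-(0.7, 0.7)
# applied to the 2028x1520 preview frame. Illustrative arithmetic only.
preview_w, preview_h = 2028, 1520
xmin, ymin, xmax, ymax = 0.3, 0.3, 0.7, 0.7

crop_w = int((xmax - xmin) * preview_w)   # ~811 px
crop_h = int((ymax - ymin) * preview_h)   # 608 px
print(f"Crop starts at ({int(xmin * preview_w)}, {int(ymin * preview_h)}) "
      f"and is roughly {crop_w}x{crop_h} px")
`
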
After applying this crop, we have noticed that the z-coordinate reported for the detections becomes unreliable:

coords = "Z: {:.2f} m".format(detection.spatialCoordinates.z/1000)

We would greatly appreciate any guidance on how to properly crop the image while maintaining the accuracy and reliability of the z-coordinate. Any insights, suggestions, or examples you can share would be incredibly helpful.

Thank you in advance for your assistance, and we look forward to hearing from you soon.

Best regards,

Henry

    Hi Henry,
    Would you mind sending a Minimal Reproducible Example (MRE) so we can run and debug the code?

    Thanks,
    Jaka

    The code consists of two files. Do you want me to copy and paste them here?

    Best

    Henry

      main.py

      `
      from MultiMsgSync import TwoStageHostSeqSync
      import blobconverter
      import cv2
      import depthai as dai
      import numpy as np
      import datetime
      import time

      print("DepthAI version", dai.version)
      def frame_norm(frame, bbox):
      normVals = np.full(len(bbox), frame.shape[0])
      normVals[::2] = frame.shape[1]
      return (np.clip(np.array(bbox), 0, 1) * normVals).astype(int)

      def create_pipeline(stereo):
      pipeline = dai.Pipeline()

      print("Creating Color Camera...")
      cam = pipeline.create(dai.node.ColorCamera)
      cam.setResolution(dai.ColorCameraProperties.SensorResolution.THE_12_MP)
      cam.setIspScale(1,5)
      cam.setPreviewSize(676, 506)
      cam.setInterleaved(False)
      cam.setBoardSocket(dai.CameraBoardSocket.RGB)
      cam.setFps(10)
      
      # Workaround: remove in 2.18, use `cam.setPreviewNumFramesPool(10)`
      # This manip uses 15*3.5 MB => 52 MB of RAM.
      copy_manip = pipeline.create(dai.node.ImageManip)
      copy_manip.setNumFramesPool(15)
      # copy_manip.setMaxOutputFrameSize(3499200)
      copy_manip.setMaxOutputFrameSize(676*506*3)
      
      # Crop range
      topLeft = dai.Point2f(0.3, 0.3)
      bottomRight = dai.Point2f(0.7, 0.7)
      # copy_manip.initialConfig.setCropRect(topLeft.x, topLeft.y, bottomRight.x, bottomRight.y)
      
      cam.preview.link(copy_manip.inputImage)
      
      cam_xout = pipeline.create(dai.node.XLinkOut)
      cam_xout.setStreamName("color")
      copy_manip.out.link(cam_xout.input)
      
      # ImageManip will resize the frame before sending it to the Face detection NN node
      face_det_manip = pipeline.create(dai.node.ImageManip)
      # face_det_manip.initialConfig.setResize(300, 300)
      face_det_manip.initialConfig.setResize(672, 384)
      face_det_manip.initialConfig.setKeepAspectRatio(False)
      face_det_manip.initialConfig.setFrameType(dai.RawImgFrame.Type.RGB888p)
      copy_manip.out.link(face_det_manip.inputImage)
      
      if stereo:
          monoLeft = pipeline.create(dai.node.MonoCamera)
          monoLeft.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)
          monoLeft.setBoardSocket(dai.CameraBoardSocket.LEFT)
      
          monoRight = pipeline.create(dai.node.MonoCamera)
          monoRight.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)
          monoRight.setBoardSocket(dai.CameraBoardSocket.RIGHT)
      
          stereo = pipeline.create(dai.node.StereoDepth)
          stereo.setDefaultProfilePreset(dai.node.StereoDepth.PresetMode.HIGH_DENSITY)
          stereo.setDepthAlign(dai.CameraBoardSocket.RGB)
          
          stereo.setOutputSize(monoLeft.getResolutionWidth(), monoLeft.getResolutionHeight())
          monoLeft.out.link(stereo.left)
          monoRight.out.link(stereo.right)
      
          # Spatial Detection network if OAK-D
          print("OAK-D detected, app will display spatial coordiantes")
          face_det_nn = pipeline.create(dai.node.MobileNetSpatialDetectionNetwork)
          face_det_nn.setBoundingBoxScaleFactor(0.8)
          face_det_nn.setDepthLowerThreshold(100)
          face_det_nn.setDepthUpperThreshold(5000)
          stereo.depth.link(face_det_nn.inputDepth)
      else: # Detection network if OAK-1
          print("OAK-1 detected, app won't display spatial coordiantes")
          face_det_nn = pipeline.create(dai.node.MobileNetDetectionNetwork)
      
      face_det_nn.setConfidenceThreshold(0.5)
      # face_det_nn.setBlobPath(blobconverter.from_zoo(name="face-detection-retail-0004", shaves=6))
      face_det_nn.setBlobPath(blobconverter.from_zoo(name="face-detection-adas-0001", shaves=6))
      face_det_nn.input.setQueueSize(1)
      face_det_manip.out.link(face_det_nn.input)
      
      # Send face detections to the host (for bounding boxes)
      face_det_xout = pipeline.create(dai.node.XLinkOut)
      face_det_xout.setStreamName("detection")
      face_det_nn.out.link(face_det_xout.input)
      
      # Script node will take the output from the face detection NN as an input and set ImageManipConfig
      # to the 'recognition_manip' to crop the initial frame
      image_manip_script = pipeline.create(dai.node.Script)
      face_det_nn.out.link(image_manip_script.inputs['face_det_in'])
      
      # Remove in 2.18 and use `imgFrame.getSequenceNum()` in Script node
      face_det_nn.passthrough.link(image_manip_script.inputs['passthrough'])
      
      copy_manip.out.link(image_manip_script.inputs['preview'])
      
      image_manip_script.setScript("""
      import time
      msgs = dict()
      
      def add_msg(msg, name, seq = None):
          global msgs
          if seq is None:
              seq = msg.getSequenceNum()
          seq = str(seq)
          # node.warn(f"New msg {name}, seq {seq}")
      
          # Each seq number has its own dict of msgs
          if seq not in msgs:
              msgs[seq] = dict()
          msgs[seq][name] = msg
      
          # To avoid freezing (not necessary for this ObjDet model)
          if 15 < len(msgs):
              node.warn(f"Removing first element! len {len(msgs)}")
              msgs.popitem() # Remove first element
      
      def get_msgs():
          global msgs
          seq_remove = [] # Arr of sequence numbers to get deleted
          for seq, syncMsgs in msgs.items():
              seq_remove.append(seq) # Will get removed from dict if we find synced msgs pair
              # node.warn(f"Checking sync {seq}")
      
              # Check if we have both detections and color frame with this sequence number
              if len(syncMsgs) == 2: # 1 frame, 1 detection
                  for rm in seq_remove:
                      del msgs[rm]
                  # node.warn(f"synced {seq}. Removed older sync values. len {len(msgs)}")
                  return syncMsgs # Returned synced msgs
          return None
      
      def correct_bb(bb):
          if bb.xmin < 0: bb.xmin = 0.001
          if bb.ymin < 0: bb.ymin = 0.001
          if bb.xmax > 1: bb.xmax = 0.999
          if bb.ymax > 1: bb.ymax = 0.999
          return bb
      
      while True:
          time.sleep(0.001) # Avoid lazy looping
      
          preview = node.io['preview'].tryGet()
          if preview is not None:
              add_msg(preview, 'preview')
      
          face_dets = node.io['face_det_in'].tryGet()
          if face_dets is not None:
              # TODO: in 2.18.0.0 use face_dets.getSequenceNum()
              passthrough = node.io['passthrough'].get()
              seq = passthrough.getSequenceNum()
              add_msg(face_dets, 'dets', seq)
      
          sync_msgs = get_msgs()
          if sync_msgs is not None:
              img = sync_msgs['preview']
              dets = sync_msgs['dets']
              for i, det in enumerate(dets.detections):
                  cfg = ImageManipConfig()
                  correct_bb(det)
                  cfg.setCropRect(det.xmin, det.ymin, det.xmax, det.ymax)
                  # node.warn(f"Sending {i + 1}. det. Seq {seq}. Det {det.xmin}, {det.ymin}, {det.xmax}, {det.ymax}")
                  cfg.setResize(62, 62)
                  cfg.setKeepAspectRatio(False)
                  node.io['manip_cfg'].send(cfg)
                  node.io['manip_img'].send(img)
      """)
      
      recognition_manip = pipeline.create(dai.node.ImageManip)
      recognition_manip.initialConfig.setResize(62, 62)
      recognition_manip.setWaitForConfigInput(True)
      image_manip_script.outputs['manip_cfg'].link(recognition_manip.inputConfig)
      image_manip_script.outputs['manip_img'].link(recognition_manip.inputImage)
      
      # Second-stage recognition NN
      print("Creating recognition Neural Network...")
      recognition_nn = pipeline.create(dai.node.NeuralNetwork)
      recognition_nn.setBlobPath(blobconverter.from_zoo(name="age-gender-recognition-retail-0013", shaves=6))
      recognition_manip.out.link(recognition_nn.input)
      
      recognition_xout = pipeline.create(dai.node.XLinkOut)
      recognition_xout.setStreamName("recognition")
      recognition_nn.out.link(recognition_xout.input)
      
      return pipeline

      prefix = './images/' + datetime.datetime.now().strftime("%Y-%m-%d %H%M%S_")
      startTime = time.time()
      count = 0
      wait_time = 200
      max_images = 10
      with dai.Device() as device:
      stereo = 1 < len(device.getConnectedCameras())
      device.startPipeline(create_pipeline(stereo))

      sync = TwoStageHostSeqSync()
      queues = {}
      # Create output queues
      for name in ["color", "detection", "recognition"]:
          queues[name] = device.getOutputQueue(name)
      
      while True:
          for name, q in queues.items():
              # Add all msgs (color frames, object detections and recognitions) to the Sync class.
              if q.has():
                  sync.add_msg(q.get(), name)
      
          msgs = sync.get_msgs()
      
          if msgs is not None:
              frame = msgs["color"].getCvFrame()
      
              if time.time() - startTime > wait_time:
                  count += 1
                  if count < max_images:
                      cv2.imwrite(prefix + str(count) + '.jpg', frame)
                  else:
                      break
      
              detections = msgs["detection"].detections
              recognitions = msgs["recognition"]
      
              for i, detection in enumerate(detections):
                  bbox = frame_norm(frame, (detection.xmin, detection.ymin, detection.xmax, detection.ymax))
      
                  # Decoding of recognition results
                  rec = recognitions[i]
                  age = int(float(np.squeeze(np.array(rec.getLayerFp16('age_conv3')))) * 100)
                  gender = np.squeeze(np.array(rec.getLayerFp16('prob')))
                  gender_str = "female" if gender[0] > gender[1] else "male"
      
                  cv2.rectangle(frame, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (10, 245, 10), 2)
                  y = (bbox[1] + bbox[3]) // 2
                  cv2.putText(frame, str(age), (bbox[0], y), cv2.FONT_HERSHEY_TRIPLEX, 1.5, (0, 0, 0), 8)
                  cv2.putText(frame, str(age), (bbox[0], y), cv2.FONT_HERSHEY_TRIPLEX, 1.5, (255, 255, 255), 2)
                  cv2.putText(frame, gender_str, (bbox[0], y + 30), cv2.FONT_HERSHEY_TRIPLEX, 1.5, (0, 0, 0), 8)
                  cv2.putText(frame, gender_str, (bbox[0], y + 30), cv2.FONT_HERSHEY_TRIPLEX, 1.5, (255, 255, 255), 2)
                  if stereo:
                      # You could also get detection.spatialCoordinates.x and detection.spatialCoordinates.y coordinates
                      coords = "Z: {:.2f} m".format(detection.spatialCoordinates.z/1000)
                      cv2.putText(frame, coords, (bbox[0], y + 60), cv2.FONT_HERSHEY_TRIPLEX, 1, (0, 0, 0), 8)
                      cv2.putText(frame, coords, (bbox[0], y + 60), cv2.FONT_HERSHEY_TRIPLEX, 1, (255, 255, 255), 2)
      
              if time.time() - startTime < wait_time:
                  cv2.putText(frame, "Wait for " + str(wait_time - int(time.time() - startTime)) + " seconds.", (frame.shape[1]-400, 50), cv2.FONT_HERSHEY_TRIPLEX, 1, (255, 255, 255), 2)
      
              cv2.imshow("Camera", frame)
          if cv2.waitKey(1) == ord('q'):
              break

      `

      Hi Henry
      Yes, but only paste the minimal code you need for the example to work.
      Thanks,
      Jaka

      MultiMsgSync.py

      `
      # Color frames (ImgFrame), object detection (ImgDetections) and recognition (NNData)
      # messages arrive to the host all with some additional delay.
      # For each ImgFrame there's one ImgDetections msg, which has multiple detections, and for each
      # detection there's a NNData msg which contains recognition results.
      #
      # How it works:
      # Every ImgFrame, ImgDetections and NNData message has its own sequence number, by which we can sync messages.

      class TwoStageHostSeqSync:
          def __init__(self):
              self.msgs = {}

          # name: color, detection, or recognition
          def add_msg(self, msg, name):
              seq = str(msg.getSequenceNum())
              if seq not in self.msgs:
                  self.msgs[seq] = {}  # Create dictionary for msgs
              if "recognition" not in self.msgs[seq]:
                  self.msgs[seq]["recognition"] = []  # Create recognition array

              if name == "recognition":
                  # Append recognition msgs to an array
                  self.msgs[seq]["recognition"].append(msg)
                  # print(f'Added recognition seq {seq}, total len {len(self.msgs[seq]["recognition"])}')

              elif name == "detection":
                  # Save detection msg in the dictionary
                  self.msgs[seq][name] = msg
                  self.msgs[seq]["len"] = len(msg.detections)
                  # print(f'Added detection seq {seq}')

              elif name == "color":
                  # Save color frame in the dictionary
                  self.msgs[seq][name] = msg
                  # print(f'Added frame seq {seq}')

          def get_msgs(self):
              seq_remove = []  # Arr of sequence numbers to get deleted

              for seq, msgs in self.msgs.items():
                  seq_remove.append(seq)  # Will get removed from dict if we find synced msgs pair

                  # Check if we have both detections and color frame with this sequence number
                  if "color" in msgs and "len" in msgs:

                      # Check if all detected objects (faces) have finished recognition inference
                      if msgs["len"] == len(msgs["recognition"]):
                          # print(f"Synced msgs with sequence number {seq}", msgs)

                          # We have synced msgs, remove previous msgs (memory cleaning)
                          for rm in seq_remove:
                              del self.msgs[rm]

                          return msgs  # Returned synced msgs

              return None  # No synced msgs
      `

      These are the two Python files needed to run the code.

      @Henry I believe this is far from minimal; please strip it down to the smallest amount of code that still reproduces the issue.

      Most of the code is from the depthai_gen2_age_gender repository. The only thing I modified is the crop:
      # Crop range
      topLeft = dai.Point2f(0.3, 0.3)
      bottomRight = dai.Point2f(0.7, 0.7)
      copy_manip.initialConfig.setCropRect(topLeft.x, topLeft.y, bottomRight.x, bottomRight.y)

        Hi Henry,
        Please post only the code that is absolutely needed to reproduce the issue. I can see you have also modified the camera resolution and applied resizing, neither of which is present in the gen2_age_gender example.
        Thanks,
        Jaka

        There are several places in the code where the resolution can be changed, each for a different purpose (a small sketch for checking the resulting sizes follows the list):

        1. In the ColorCamera:

        cam = pipeline.create(dai.node.ColorCamera)
        cam.setResolution(dai.ColorCameraProperties.SensorResolution.THE_12_MP)
        cam.setIspScale(1,5)
        cam.setPreviewSize(676, 506)

        2. In the face_det_manip:

        face_det_manip = pipeline.create(dai.node.ImageManip)
        face_det_manip.initialConfig.setResize(672, 384)
        face_det_manip.initialConfig.setKeepAspectRatio(False)

        3. In monoLeft, monoRight, and StereoDepth:
          monoLeft = pipeline.create(dai.node.MonoCamera)
          monoLeft.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)
          monoLeft.setBoardSocket(dai.CameraBoardSocket.LEFT)
          monoRight = pipeline.create(dai.node.MonoCamera)
          monoRight.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)
          monoRight.setBoardSocket(dai.CameraBoardSocket.RIGHT)
          stereo = pipeline.create(dai.node.StereoDepth)
          stereo.setDefaultProfilePreset(dai.node.StereoDepth.PresetMode.HIGH_DENSITY)
          stereo.setDepthAlign(dai.CameraBoardSocket.RGB)
          stereo.setOutputSize(monoLeft.getResolutionWidth(), monoLeft.getResolutionHeight())
          monoLeft.out.link(stereo.left)
          monoRight.out.link(stereo.right)
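
        To sanity-check which sizes actually end up being used (the check mentioned before the list), a small sketch using size getters from the DepthAI Python API, printed right after the nodes are configured:

        `
        # Sketch: print the sizes configured above
        print("ISP output size:", cam.getIspSize())      # 12 MP scaled by setIspScale(1, 5)
        print("Preview size:", cam.getPreviewSize())      # (676, 506)
        print("Stereo/depth output size:", monoLeft.getResolutionWidth(),
              monoLeft.getResolutionHeight())             # 640 x 400, see stereo.setOutputSize()
        `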

        I understand that the ColorCamera sends data to copy_manip, which feeds the object detection, while monoLeft and monoRight feed StereoDepth, which calculates the disparity map and ultimately the depth map. To compute spatial coordinates for the detected bounding boxes, the detections have to be combined with that depth map. How does DepthAI handle this? Thanks.
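
        For what it's worth, my rough mental model is that the spatial detection node scales each detection's bounding box (setBoundingBoxScaleFactor), takes the depth pixels inside that ROI on the RGB-aligned depth frame, discards values outside the lower/upper thresholds, and reduces the remaining pixels to a single Z value; please correct me if that is wrong. A NumPy sketch of that idea, purely for illustration (the helper name is mine, and this is not the exact on-device algorithm):

        `
        import numpy as np

        def approximate_z(depth_mm, bbox, scale=0.8, lower=100, upper=5000):
            """Illustrative only: derive a Z value from an RGB-aligned depth frame
            (uint16, millimetres) and a normalized bbox (xmin, ymin, xmax, ymax)."""
            h, w = depth_mm.shape
            xmin, ymin, xmax, ymax = bbox
            # Shrink the bbox around its centre, like setBoundingBoxScaleFactor
            cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
            half_w, half_h = (xmax - xmin) / 2 * scale, (ymax - ymin) / 2 * scale
            x0, x1 = int((cx - half_w) * w), int((cx + half_w) * w)
            y0, y1 = int((cy - half_h) * h), int((cy + half_h) * h)
            roi = depth_mm[y0:y1, x0:x1]
            valid = roi[(roi > lower) & (roi < upper)]  # apply depth thresholds
            if valid.size == 0:
                return 0.0
            return float(np.median(valid))  # median (or average) of the remaining pixels
        `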

        Best

        Henry