Hi,

I am trying to use Oak-D to get spatial data of the pose landmarks from mediapipe. For example, XYZ data of all the points that mark the joints.

My understanding of how the Oak-D does spatial detection on objects is that it draws a bounding box around an object and then averages the XYZ data of all the points inside the box from a depth map (created using disparity matching). Is this correct?

If so, what happens when a box is drawn around a person and some of the points in the box belong to objects behind the person (such as the wall)?

In the case of getting spatial data for the body landmarks, do you think it's wiser to draw boxes around the landmarks and then average out the depth data? Or just get the depth data precisely at the landmark points?

Thanks in advance,
Jae

    Hi jsiic
    For SpatialYolo and SpatialMBnetSSD, a bounding box is drawn around the detected person (this includes some of the surrounding area) and a smaller ROI is created that should lie completely on the person (see the example demo). This ensures the calculated average is based only on pixels that are actually part of the detected person.

    If you wish to get a more accurate location, you could use the pixels around the keypoints (mainly shoulders and hips) to average out the position. In that case you would likely need your own code for calculating spatials, since multiple ROIs are not possible with the spatial calculator as of now.
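
    A minimal sketch of that averaging step (untested; it assumes you already fetch the uint16 depth frame to the host as a numpy array and have the keypoint's pixel coordinates):

    import numpy as np

    def average_depth_around_keypoint(depth_frame, x, y, delta=5):
        # depth_frame: uint16 depth output of the StereoDepth node, fetched to
        # the host as a numpy array (values in millimeters)
        # (x, y): pixel coordinates of the keypoint; delta: half-size of the ROI
        roi = depth_frame[max(y - delta, 0):y + delta + 1,
                          max(x - delta, 0):x + delta + 1]
        valid = roi[roi != 0]  # 0 means no depth was measured for that pixel
        return float(np.mean(valid)) if valid.size else None

    You would then still need to convert the averaged depth into XYZ using the camera intrinsics, the same way the on-device spatial calculator does.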

    Thoughts?
    Jaka

      Hey jakaskerl

      It's fine to write my own code for calculating spatials. But I am wondering if there is a more straightforward way of accessing the xyz spatial data?

      For example, let's say:
      Let L = (h, v), where h and v refer to the (horizontal, vertical) pixel location. So on a 720p resolution image, (1280, 720) would be the bottom-right corner.

      Using the Oak-D and Google MediaPipe (running on my computer), my code will locate the 33 landmarks: L1, ..., L33 = (h1, v1), ..., (h33, v33).

      Then I just want to look those coordinates up on the Oak-D spatial map (depth map?), read out the XYZ spatial coordinates, and return those values. So if L15 = (920, 385), I would get back something like (x=1m, y=2.5m, z=2m).

      If I can accomplish this, then I think I can also write the code to test out different ROIs and average them on the computer.

      What would be the most direct way of just getting those spatial values?
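
      Just to make the question concrete, this is roughly the kind of lookup I imagine (untested sketch; I'm assuming the depth frame is aligned to the frame mediapipe runs on, and the HFOV value below is only a placeholder that should really come from the device calibration):

      import math

      def pixel_to_xyz(depth_frame, px, py, hfov_deg=72.0):
          # depth_frame: uint16 depth in millimeters, aligned to the frame the
          # landmarks were detected on; hfov_deg is a placeholder value
          h, w = depth_frame.shape
          z = depth_frame[py, px] / 1000.0  # mm -> m
          if z == 0:
              return None  # no depth measured at this pixel
          focal_px = w / (2.0 * math.tan(math.radians(hfov_deg) / 2.0))
          x = (px - w / 2.0) * z / focal_px  # pinhole model, camera-centered
          y = (py - h / 2.0) * z / focal_px
          return x, y, z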

        Hey jakaskerl,

        So I read through the host-side spatials material and it makes a lot of sense. But now I am having issues when I combine it with my mediapipe pose estimation script running on the Oak-D mono feed. It runs for a few frames, prints out the data, and then freezes. I think the issue is coming from the way I set up the pipeline using the API. I am setting up a left mono camera and a stereo depth node, but maybe I am doing this the wrong way. If you could take a look at my code, that would be great!

        import cv2
        import depthai as dai
        import mediapipe as mp
        from calc import HostSpatialsCalc
        from utility import *
        
        # Create pipeline
        pipeline = dai.Pipeline()
        
        # Define sources and outputs
        monoLeft = pipeline.create(dai.node.MonoCamera)
        monoRight = pipeline.create(dai.node.MonoCamera)
        stereo = pipeline.create(dai.node.StereoDepth)
        
        # Properties
        monoLeft.setResolution(dai.MonoCameraProperties.SensorResolution.THE_720_P)
        monoLeft.setBoardSocket(dai.CameraBoardSocket.LEFT)
        monoRight.setResolution(dai.MonoCameraProperties.SensorResolution.THE_720_P)
        monoRight.setBoardSocket(dai.CameraBoardSocket.RIGHT)
        
        stereo.initialConfig.setConfidenceThreshold(255)
        stereo.setLeftRightCheck(True)
        stereo.setSubpixel(False)
        
        # Linking left
        xoutLeft = pipeline.create(dai.node.XLinkOut)
        xoutLeft.setStreamName('left')
        monoLeft.out.link(xoutLeft.input)
        
        # Linking depth
        monoLeft.out.link(stereo.left)
        monoRight.out.link(stereo.right)
        
        xoutDepth = pipeline.create(dai.node.XLinkOut)
        xoutDepth.setStreamName("depth")
        stereo.depth.link(xoutDepth.input)
        
        xoutDepth = pipeline.create(dai.node.XLinkOut)
        xoutDepth.setStreamName("disp")
        stereo.disparity.link(xoutDepth.input)
        
        # Pose detection (mediapipe)
        mpDraw = mp.solutions.drawing_utils
        mpPose = mp.solutions.pose
        pose = mpPose.Pose()
        
        
        # Connect to device and start pipeline
        with dai.Device(pipeline) as device:
        
            # Output queues will be used to get the grayscale frames from the outputs defined above
            qLeft = device.getOutputQueue(name="left", maxSize=4, blocking=False)
            depthQueue = device.getOutputQueue(name="depth")
        
            device.setIrLaserDotProjectorBrightness(200)  # in mA, 0..1200
            device.setIrFloodLightBrightness(1000)  # in mA, 0..1500
        
            # Host-side spatials calculator
            hostSpatials = HostSpatialsCalc(device)
        
        
            while True:
                # Instead of get (blocking), we use tryGet (non-blocking) which will return the available data or None otherwise
                inLeft = qLeft.tryGet()
                depthData = depthQueue.get()
        
                if inLeft is not None:
                    rgb_img = cv2.cvtColor(inLeft.getCvFrame(), cv2.COLOR_GRAY2RGB)
                    results = pose.process(rgb_img)
        
                    if results.pose_landmarks:
        
                        mpDraw.draw_landmarks(rgb_img, results.pose_landmarks, mpPose.POSE_CONNECTIONS)
                        cv2.imshow("Image", rgb_img)
                        for id, lm in enumerate(results.pose_landmarks.landmark):
                            h, w, c = rgb_img.shape
                            cx, cy = int(lm.x * w), int(lm.y * h)
                            spatials, centroid = hostSpatials.calc_spatials(depthData, (cx, cy))  # centroid == x/y in our case
                            print(id, cx, cy, spatials)
        
                if cv2.waitKey(1) == ord('q'):
                    break

          Hi jsiic
          Haven't checked deeper, but I think the queues get saturated because you are not consuming the disparity output on the host side even though you have created the XLinkOut node for it:

          jsiic
          xoutDepth = pipeline.create(dai.node.XLinkOut)
          xoutDepth.setStreamName("disp")
          stereo.disparity.link(xoutDepth.input)
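
          The simplest fix is to either remove those three lines, or create a queue for the "disp" stream and drain it inside the loop, along these lines (untested):

          dispQueue = device.getOutputQueue(name="disp", maxSize=4, blocking=False)

          # ... and inside the while loop, read it even if you don't use the frames:
          inDisp = dispQueue.tryGet()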

          Thanks,
          Jaka

            Thanks! jakaskerl

            That worked like a charm!

            But it got me thinking... does it make a difference whether I use the left or the right camera? They are physically a few centimeters apart, and the spatial map is presumably calculated from both of them, which would put an object seen by the left/right camera a few pixels off from the spatial map?

              Hi jsiic
              The depth map is by default aligned to the right mono camera (could be the left, not completely sure). You can set the alignment to any camera you wish, even the RGB. That way the depth pixels lie on top of the pixels from the RGB sensor.
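
              For example (assuming the pipeline from your script above), something along these lines before starting the device should do it:

              # Align the depth map to the left mono camera (the frame mediapipe
              # runs on), so landmark pixel coordinates line up with depth pixels
              stereo.setDepthAlign(dai.CameraBoardSocket.LEFT)

              # or align it to the color camera instead:
              # stereo.setDepthAlign(dai.CameraBoardSocket.RGB)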

              Thanks,
              Jaka