Hi,

I am trying to use Oak-D to get spatial data of the pose landmarks from mediapipe. For example, XYZ data of all the points that mark the joints.

My understanding of how the Oak-D does spatial detection on objects is that it draws a bounding box around an object and then averages the XYZ data of all the points inside the box from a depth map (created using disparity matching). Is this correct?

If so, what happens when a box is drawn around a person and some of the points in the box belong to objects behind the person (such as the wall)?

In the case of getting spatial data for the body landmarks, do you think it's wiser to draw boxes around the landmarks and then average out the depth data? Or just get the depth data precisely at the landmark points?

Thanks in advance,
Jae

    Hi jsiic
    For SpatialYolo and SpatialMBnetSSD, a bounding box is drawn around the detected person (this includes some of the surrounding area) and a smaller ROI is created that should lie completely on the person (see the example demo). This ensures the calculated average is based only on pixels that are actually part of the detected person.

    If you wish to get a more accurate location, you could use the pixels around the keypoints (mainly shoulders and hips) to average out the position. In that case you would likely need your own code for calculating spatials, since multiple ROIs are not possible with the spatial calculator as of now.
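
    A minimal sketch of that averaging step (untested; it assumes you already fetch the uint16 depth frame to the host as a numpy array and have the keypoint's pixel coordinates):

    import numpy as np

    def average_depth_around_keypoint(depth_frame, x, y, delta=5):
        # depth_frame: uint16 depth output of the StereoDepth node, fetched to
        # the host as a numpy array (values in millimeters)
        # (x, y): pixel coordinates of the keypoint; delta: half-size of the ROI
        roi = depth_frame[max(y - delta, 0):y + delta + 1,
                          max(x - delta, 0):x + delta + 1]
        valid = roi[roi != 0]  # 0 means no depth was measured for that pixel
        return float(np.mean(valid)) if valid.size else None

    You would then still need to convert the averaged depth into XYZ using the camera intrinsics, the same way the on-device spatial calculator does.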

    Thoughts?
    Jaka

      Hey jakaskerl

      It's fine to write my own code for calculating spatials. But I am wondering if there is a more straightforward way of accessing the xyz spatial data?

      For example, let's say:
      Let L = (h, v), where h and v refer to the (horizontal, vertical) pixel location. So on a 720p resolution image, (1280, 720) would be the bottom-right corner.

      Using the Oak-D and Google MediaPipe (running on my computer), my code will locate the 33 landmarks: L1, ..., L33 = (h1, v1), ..., (h33, v33).

      Then I just want to look those coordinates up on the Oak-D spatial map (depth map?), read out the XYZ spatial coordinates, and return those values. So if L15 = (920, 385), I would get back something like (x=1m, y=2.5m, z=2m).

      If I can accomplish this, then I think I can also write the code to test out different ROIs and average them on the computer.

      What would be the most direct way of just getting those spatial values?
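
      Just to make the question concrete, this is roughly the kind of lookup I imagine (untested sketch; I'm assuming the depth frame is aligned to the frame mediapipe runs on, and the HFOV value below is only a placeholder that should really come from the device calibration):

      import math

      def pixel_to_xyz(depth_frame, px, py, hfov_deg=72.0):
          # depth_frame: uint16 depth in millimeters, aligned to the frame the
          # landmarks were detected on; hfov_deg is a placeholder value
          h, w = depth_frame.shape
          z = depth_frame[py, px] / 1000.0  # mm -> m
          if z == 0:
              return None  # no depth measured at this pixel
          focal_px = w / (2.0 * math.tan(math.radians(hfov_deg) / 2.0))
          x = (px - w / 2.0) * z / focal_px  # pinhole model, camera-centered
          y = (py - h / 2.0) * z / focal_px
          return x, y, z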

        Hey jakaskerl,

        So I read through the host-side spatials material and it makes a lot of sense. But now I am having issues when I combine it with my mediapipe pose estimation script running on the Oak-D mono feed. It runs for a few frames, prints out the data, and then freezes. I think the issue is coming from the way I set up the pipeline using the API. I am setting up a left mono camera and a stereo depth node, but maybe I am doing this the wrong way. If you could take a look at my code, that would be great!

        import cv2
        import depthai as dai
        import mediapipe as mp
        from calc import HostSpatialsCalc
        from utility import *
        
        # Create pipeline
        pipeline = dai.Pipeline()
        
        # Define sources and outputs
        monoLeft = pipeline.create(dai.node.MonoCamera)
        monoRight = pipeline.create(dai.node.MonoCamera)
        stereo = pipeline.create(dai.node.StereoDepth)
        
        # Properties
        monoLeft.setResolution(dai.MonoCameraProperties.SensorResolution.THE_720_P)
        monoLeft.setBoardSocket(dai.CameraBoardSocket.LEFT)
        monoRight.setResolution(dai.MonoCameraProperties.SensorResolution.THE_720_P)
        monoRight.setBoardSocket(dai.CameraBoardSocket.RIGHT)
        
        stereo.initialConfig.setConfidenceThreshold(255)
        stereo.setLeftRightCheck(True)
        stereo.setSubpixel(False)
        
        # Linking left
        xoutLeft = pipeline.create(dai.node.XLinkOut)
        xoutLeft.setStreamName('left')
        monoLeft.out.link(xoutLeft.input)
        
        # Linking depth
        monoLeft.out.link(stereo.left)
        monoRight.out.link(stereo.right)
        
        xoutDepth = pipeline.create(dai.node.XLinkOut)
        xoutDepth.setStreamName("depth")
        stereo.depth.link(xoutDepth.input)
        
        xoutDepth = pipeline.create(dai.node.XLinkOut)
        xoutDepth.setStreamName("disp")
        stereo.disparity.link(xoutDepth.input)
        
        # Pose detection (mediapipe)
        mpDraw = mp.solutions.drawing_utils
        mpPose = mp.solutions.pose
        pose = mpPose.Pose()
        
        
        # Connect to device and start pipeline
        with dai.Device(pipeline) as device:
        
            # Output queues will be used to get the grayscale frames from the outputs defined above
            qLeft = device.getOutputQueue(name="left", maxSize=4, blocking=False)
            depthQueue = device.getOutputQueue(name="depth")
        
            device.setIrLaserDotProjectorBrightness(200)  # in mA, 0..1200
            device.setIrFloodLightBrightness(1000)  # in mA, 0..1500
        
            # Host-side spatials calculator
            hostSpatials = HostSpatialsCalc(device)
        
        
            while True:
                # Instead of get (blocking), we use tryGet (non-blocking) which will return the available data or None otherwise
                inLeft = qLeft.tryGet()
                depthData = depthQueue.get()
        
                if inLeft is not None:
                    rgb_img = cv2.cvtColor(inLeft.getCvFrame(), cv2.COLOR_GRAY2RGB)
                    results = pose.process(rgb_img)
        
                    if results.pose_landmarks:
        
                        mpDraw.draw_landmarks(rgb_img, results.pose_landmarks, mpPose.POSE_CONNECTIONS)
                        cv2.imshow("Image", rgb_img)
                        for id, lm in enumerate(results.pose_landmarks.landmark):
                            h, w, c = rgb_img.shape
                            cx, cy = int(lm.x * w), int(lm.y * h)
                            spatials, centroid = hostSpatials.calc_spatials(depthData, (cx, cy))  # centroid == x/y in our case
                            print(id, cx, cy, spatials)
        
                if cv2.waitKey(1) == ord('q'):
                    break

          Hi jsiic
          Haven't checked deeper, but I think the queues get saturated because you are not consuming the disparity output on the host side even though you have created the XLinkOut node for it:

          jsiic
          xoutDepth = pipeline.create(dai.node.XLinkOut)
          xoutDepth.setStreamName("disp")
          stereo.disparity.link(xoutDepth.input)
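
          The simplest fix is to either remove those three lines, or create a queue for the "disp" stream and drain it inside the loop, along these lines (untested):

          dispQueue = device.getOutputQueue(name="disp", maxSize=4, blocking=False)

          # ... and inside the while loop, read it even if you don't use the frames:
          inDisp = dispQueue.tryGet()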

          Thanks,
          Jaka

            Thanks! jakaskerl

            That worked like a charm!

            But it got me thinking... does it make a difference whether I use the left or the right camera? They are physically a few centimeters apart, and the spatial map is presumably calculated from both of them, which would put an object seen by the left/right camera a few pixels off from the spatial map?

              Hi jsiic
              The depth map is by default aligned to the right mono camera (could be the left, not completely sure). You can set the alignment to any camera you wish, even the RGB. That way the depth pixels lie on top of the pixels from the RGB sensor.
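
              For example (assuming the pipeline from your script above), something along these lines before starting the device should do it:

              # Align the depth map to the left mono camera (the frame mediapipe
              # runs on), so landmark pixel coordinates line up with depth pixels
              stereo.setDepthAlign(dai.CameraBoardSocket.LEFT)

              # or align it to the color camera instead:
              # stereo.setDepthAlign(dai.CameraBoardSocket.RGB)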

              Thanks,
              Jaka