• DepthAI-v2
  • Having trouble with results from human-pose-estimation-0001

I'm trying to get this to a point where some middle-school students can experiment with it, and I'd be able to hand it off except that the data isn't making sense. I also still need to correlate the detection points with their interpretation, e.g., elbow, shoulder, etc.

The thing that has me 100% stuck right now is getting near-zero values for xmax and ymax. If I can fix that, at least I can show it to them tomorrow.

I'll post my Python code, a portion of my shell output, and a screenshot. If there's anything else I can provide, or provide differently, please let me know.

Thanks!

# The starting point for this was:
# https://github.com/luxonis/depthai-tutorials/blob/master/1-hello-world/hello_world.py
# Description is at https://docs.luxonis.com/projects/api/en/latest/tutorials/hello_world

# first, import all necessary modules
from pathlib import Path

import blobconverter
import cv2
import depthai
import numpy as np

import inspect

# Pipeline tells DepthAI what operations to perform when running - you define all of the resources used and flows here
pipeline = depthai.Pipeline()
print("pipeline =", pipeline)

# First, we want the Color camera as the output
cam_rgb = pipeline.createColorCamera()
cam_rgb.setPreviewSize(456, 256)  # the preview frame size, available as the 'preview' output of the node; matches the 456x256 input of human-pose-estimation-0001
#cam_rgb.setPreviewSize(448, 448)  # input size for human-pose-estimation-0007
cam_rgb.setInterleaved(False)
#cam_rgb.setFps(1)

# Next, we want a neural network that will produce the detections
detection_nn = pipeline.createMobileNetDetectionNetwork()
# Blob is the Neural Network file, compiled for MyriadX. It contains both the definition and weights of the model
# We're using the blobconverter tool to retrieve the model blob automatically from the OpenVINO Model Zoo
detection_nn.setBlobPath(blobconverter.from_zoo(name='human-pose-estimation-0001', shaves=6))
# Next, we filter out the detections that are below a confidence threshold. Confidence can be anywhere between <0..1>
detection_nn.setConfidenceThreshold(0.5)
# Next, we link the camera 'preview' output to the neural network detection input, so that it can produce detections
cam_rgb.preview.link(detection_nn.input)

# XLinkOut is a "way out" from the device. Any data you want to transfer to the host needs to be sent via XLink
xout_rgb = pipeline.createXLinkOut()
# For the rgb camera output, we want the XLink stream to be named "rgb"
xout_rgb.setStreamName("rgb")
# Linking camera preview to XLink input, so that the frames will be sent to host
cam_rgb.preview.link(xout_rgb.input)

# The same XLinkOut mechanism will be used to receive nn results
xout_nn = pipeline.createXLinkOut()
xout_nn.setStreamName("nn")
detection_nn.out.link(xout_nn.input)

# Pipeline is now finished, and we need to find an available device to run our pipeline
# we are using context manager here that will dispose the device after we stop using it
with depthai.Device(pipeline) as device:
    # From this point, the Device will be in "running" mode and will start sending data via XLink

    # To consume the device results, we get two output queues from the device, with stream names we assigned earlier
    q_rgb = device.getOutputQueue("rgb")
    q_nn = device.getOutputQueue("nn")

    # Here, some of the default values are defined. Frame will be an image from "rgb" stream, detections will contain nn results
    frame = None
    detections = []

# Since the detections returned by the nn have values in the <0..1> range, they need to be multiplied by the frame width/height
# to get the actual position of the bounding box on the image
    def frameNorm(frame, bbox):
        normVals = np.full(len(bbox), frame.shape[0])
        normVals[::2] = frame.shape[1]
        return (np.clip(np.array(bbox), 0, 1) * normVals).astype(int)


    # Main host-side application loop
    while True:
        # we try to fetch the data from nn/rgb queues. tryGet will return either the data packet or None if there isn't any
        in_rgb = q_rgb.tryGet()
        in_nn = q_nn.tryGet()

        if in_rgb is not None:
            # If the packet from RGB camera is present, we're retrieving the frame in OpenCV format using getCvFrame
            frame = in_rgb.getCvFrame()

            if frame is not None and in_nn is not None:
                # when data from nn is received, we take the detections array that contains mobilenet-ssd results
                detections = in_nn.detections
                if len(detections) > 0:
                    #print("detections=", detections)
                    for detection in detections:
                        label = detection.label
                        confidence = detection.confidence
                        xmin = detection.xmin
                        ymin = detection.ymin
                        xmax = detection.xmax
                        ymax = detection.ymax
                        #print(detection)
                        print("label =", label, "; confidence =", confidence, "; xmin =", xmin, "; ymin =", ymin, "; xmax =", xmax, "; ymax =", ymax)
                        bbox = frameNorm(frame, (detection.xmin, detection.ymin, detection.xmax, detection.ymax))
                        cv2.line(frame, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (255, 0, 0), 2)
                        # After the drawing is finished, we show the frame on the screen
                        cv2.imshow("preview", frame)

        # at any time, you can press "q" and exit the main loop, therefore exiting the program itself
        if cv2.waitKey(1) == ord('q'):
            break
label = 0 ; confidence = 0.74169921875 ; xmin = 0.7783203125 ; ymin = 0.246826171875 ; xmax = 0.00054931640625 ; ymax = 0.00054931640625
label = 0 ; confidence = 0.60107421875 ; xmin = 0.42529296875 ; ymin = 0.105712890625 ; xmax = 0.03570556640625 ; ymax = 0.0181884765625
label = 0 ; confidence = 0.6875 ; xmin = 0.4169921875 ; ymin = 0.0377197265625 ; xmax = 0.00335693359375 ; ymax = 0.0025634765625
label = 0 ; confidence = 0.84716796875 ; xmin = 0.046630859375 ; ymin = 0.0 ; xmax = 6.103515625e-05 ; ymax = 0.0006103515625
label = 0 ; confidence = 0.837890625 ; xmin = 0.92333984375 ; ymin = 0.05877685546875 ; xmax = 6.103515625e-05 ; ymax = 6.103515625e-05
label = 0 ; confidence = 0.52783203125 ; xmin = 0.10552978515625 ; ymin = 0.00970458984375 ; xmax = 0.01348876953125 ; ymax = 0.00103759765625
label = 0 ; confidence = 0.9228515625 ; xmin = 0.92919921875 ; ymin = 0.91064453125 ; xmax = 0.8486328125 ; ymax = 0.6025390625
label = 0 ; confidence = 0.64892578125 ; xmin = 0.6376953125 ; ymin = 0.1900634765625 ; xmax = 0.01153564453125 ; ymax = 0.000244140625
label = 0 ; confidence = 0.91943359375 ; xmin = 0.6884765625 ; ymin = 0.02935791015625 ; xmax = 0.000244140625 ; ymax = 0.000244140625
label = 0 ; confidence = 0.58447265625 ; xmin = -6.103515625e-05 ; ymin = -6.103515625e-05 ; xmax = -6.103515625e-05 ; ymax = -6.103515625e-05
label = 0 ; confidence = 0.97265625 ; xmin = 0.57958984375 ; ymin = -6.103515625e-05 ; xmax = -6.103515625e-05 ; ymax = -6.103515625e-05
label = 0 ; confidence = 0.64794921875 ; xmin = 0.41455078125 ; ymin = 0.00909423828125 ; xmax = 0.00018310546875 ; ymax = 6.103515625e-05
label = 0 ; confidence = 0.52392578125 ; xmin = 0.390380859375 ; ymin = 0.00244140625 ; xmax = 0.0081787109375 ; ymax = -0.00048828125
label = 0 ; confidence = 0.8427734375 ; xmin = 0.06243896484375 ; ymin = 0.00213623046875 ; xmax = 0.003173828125 ; ymax = 0.0030517578125
label = 0 ; confidence = 0.93603515625 ; xmin = 0.77978515625 ; ymin = 0.02655029296875 ; xmax = -0.00146484375 ; ymax = 0.00042724609375
label = 0 ; confidence = 0.56884765625 ; xmin = 0.07354736328125 ; ymin = 0.00042724609375 ; xmax = -0.01214599609375 ; ymax = -0.01409912109375
label = 0 ; confidence = 0.8115234375 ; xmin = 0.464111328125 ; ymin = 0.0447998046875 ; xmax = 0.00042724609375 ; ymax = -0.00469970703125
label = 0 ; confidence = 0.619140625 ; xmin = 0.76171875 ; ymin = 0.308349609375 ; xmax = 0.01397705078125 ; ymax = 0.00042724609375
label = 0 ; confidence = 0.50341796875 ; xmin = 0.703125 ; ymin = 0.154296875 ; xmax = 0.0057373046875 ; ymax = 0.00653076171875
label = 0 ; confidence = 0.91357421875 ; xmin = 0.369873046875 ; ymin = 0.00054931640625 ; xmax = 0.00054931640625 ; ymax = 0.0015869140625
label = 0 ; confidence = 0.64599609375 ; xmin = 0.06475830078125 ; ymin = 0.00054931640625 ; xmax = 0.00054931640625 ; ymax = 0.0006103515625
label = 0 ; confidence = 0.79052734375 ; xmin = 0.767578125 ; ymin = 0.193603515625 ; xmax = 0.00054931640625 ; ymax = 0.00054931640625
label = 0 ; confidence = 0.6181640625 ; xmin = 0.424560546875 ; ymin = 0.09307861328125 ; xmax = 0.0433349609375 ; ymax = 0.0252685546875
label = 0 ; confidence = 0.681640625 ; xmin = 0.384521484375 ; ymin = 0.025390625 ; xmax = 0.00335693359375 ; ymax = 0.00225830078125
label = 0 ; confidence = 0.8330078125 ; xmin = 0.042236328125 ; ymin = -6.103515625e-05 ; xmax = 6.103515625e-05 ; ymax = 0.000732421875
label = 0 ; confidence = 0.8486328125 ; xmin = 0.9111328125 ; ymin = 0.0548095703125 ; xmax = 6.103515625e-05 ; ymax = 6.103515625e-05
label = 0 ; confidence = 0.5390625 ; xmin = 0.08209228515625 ; ymin = 0.00091552734375 ; xmax = 0.010986328125 ; ymax = 0.00189208984375
label = 0 ; confidence = 0.9208984375 ; xmin = 0.931640625 ; ymin = 0.9267578125 ; xmax = 0.85888671875 ; ymax = 0.603515625
label = 0 ; confidence = 0.63818359375 ; xmin = 0.611328125 ; ymin = 0.1787109375 ; xmax = 0.012451171875 ; ymax = 0.000244140625
label = 0 ; confidence = 0.91064453125 ; xmin = 0.6923828125 ; ymin = 0.03057861328125 ; xmax = 0.000244140625 ; ymax = 0.000244140625

    Hi GaryG
    The model you are using is not a MobileNet-SSD, so the MobileNetDetectionNetwork node can't decode its output; that's why the bounding-box fields look like garbage. Anyway, GPT and I wrote you a custom decoding function. Now you get a list of points back from the model, which you can use to draw out the skeleton. Keep in mind that the point x, y values are multiplied by 2 to fit the frame, since I didn't write a proper scaling function.

    # The starting point for this was:
    # https://github.com/luxonis/depthai-tutorials/blob/master/1-hello-world/hello_world.py
    # Description is at https://docs.luxonis.com/projects/api/en/latest/tutorials/hello_world
    
    # first, import all necessary modules
    from pathlib import Path
    
    import blobconverter
    import cv2
    import depthai
    import numpy as np
    import time
    import argparse
    import depthai as dai
    import inspect
    
    threshold = 0.4
    
    def parse_heatmaps(heatmaps):
        keypoints = []
        for heatmap in heatmaps:
            # Find the pixel with the highest value
            y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
            confidence = heatmap[y, x]
            if confidence > threshold:  # Define a threshold value
                keypoints.append((x, y, confidence))
        return keypoints
    
    from scipy.ndimage.filters import maximum_filter
    
    def non_maximum_suppression(heatmaps, window_size=3):
        # Apply a maximum filter
        max_filtered = maximum_filter(heatmaps, size=window_size)
        # Keep only the points that are equal to the local maximum
        peaks = (heatmaps == max_filtered) & (heatmaps > threshold)
        return peaks
    
    def group_keypoints(keypoints, embeddings):
        poses = []
        # This is a simplified version. In practice, you'd use a more complex algorithm
        # to group keypoints based on the distance between their embeddings.
        for keypoint, embedding in zip(keypoints, embeddings):
            # Find which pose this keypoint belongs to based on the embedding
            pose_id = find_closest_pose(embedding, poses)
            if pose_id is not None:
                poses[pose_id].append(keypoint)
            else:
                poses.append([keypoint])
        return poses
    
    def find_closest_pose(embedding, poses):
        # Implement a method to find the closest pose based on embeddings
        pass
    
    
    
    
    # Create pipeline
    pipeline = dai.Pipeline()
    
    # Define sources and outputs
    camRgb = pipeline.create(dai.node.ColorCamera)
    nn = pipeline.create(dai.node.NeuralNetwork)
    xoutRgb = pipeline.create(dai.node.XLinkOut)
    nnOut = pipeline.create(dai.node.XLinkOut)
    
    xoutRgb.setStreamName("rgb")
    nnOut.setStreamName("nn")
    
    
    # Properties
    camRgb.setPreviewSize(288, 288)
    camRgb.setInterleaved(False)
    camRgb.setFps(40)
    # Define a neural network that will make predictions based on the source frames
    nn.setBlobPath(blobconverter.from_zoo(name='human-pose-estimation-0005', shaves=6))
    nn.setNumInferenceThreads(2)
    nn.input.setBlocking(False)
    
    # Linking
    
    nn.passthrough.link(xoutRgb.input)
    
    camRgb.preview.link(nn.input)
    nn.out.link(nnOut.input)
    
    
    # Connect to device and start pipeline
    with dai.Device(pipeline) as device:
    
        # Output queues will be used to get the rgb frames and nn data from the outputs defined above
        qRgb = device.getOutputQueue(name="rgb", maxSize=4, blocking=False)
        qDet = device.getOutputQueue(name="nn", maxSize=4, blocking=False)
    
        frame = None
        detections = []
        startTime = time.monotonic()
        counter = 0
        color2 = (255, 255, 255)
    
        # nn data (bounding box locations) are in <0..1> range - they need to be normalized with frame width/height
        def frameNorm(frame, bbox):
            normVals = np.full(len(bbox), frame.shape[0])
            normVals[::2] = frame.shape[1]
            return (np.clip(np.array(bbox), 0, 1) * normVals).astype(int)
    
        def displayFrame(name, frame, points):
            color = (255, 0, 0)
            try:
                for point in points:
                    point = point[0]
                    cv2.circle(frame, (int(point[1]*2), int(point[2]*2)), 2, color, cv2.FILLED)
                # Show the frame
                cv2.imshow(name, frame)
            except Exception as e:
                print(e)
    
        printOutputLayersOnce = True
    
        while True:
            inRgb = qRgb.get()
            inDet = qDet.get()
    
            if inRgb is not None:
                frame = inRgb.getCvFrame()
                cv2.putText(frame, "NN fps: {:.2f}".format(counter / (time.monotonic() - startTime)),
                            (2, frame.shape[0] - 4), cv2.FONT_HERSHEY_TRIPLEX, 0.4, color2)
    
            
    
            # Assuming `heatmaps_output` and `embeddings_output` are the outputs from the model
            heatmaps_output = np.array(inDet.getLayerFp16("heatmaps")).reshape((1, 17, 144, 144))
            embeddings_output = np.array(inDet.getLayerFp16("embeddings")).reshape((1, 17, 144, 144))
    
            print(heatmaps_output.shape)
            print(embeddings_output.shape)
    
            # Parse the heatmaps to get initial keypoints
            keypoints = parse_heatmaps(heatmaps_output[0])
    
            # Apply non-maximum suppression to filter the keypoints
            peaks = non_maximum_suppression(heatmaps_output[0])
    
            # Now, we need to extract the keypoints from the peaks
            filtered_keypoints = []
            for i, peak in enumerate(peaks):
                y, x = np.where(peak)
                for j in range(len(x)):
                    confidence = heatmaps_output[0][i, y[j], x[j]]
                    filtered_keypoints.append((i, x[j], y[j], confidence))
    
            # Group keypoints into individual poses
            poses = group_keypoints(filtered_keypoints, embeddings_output[0])
    
    
            
            # Now `poses` contains grouped keypoints for each detected person
    
            # If the frame is available, draw bounding boxes on it and show the frame
            if frame is not None:
                displayFrame("rgb", frame, points=poses)
    
            if cv2.waitKey(1) == ord('q'):
                break
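
    As a side note: the hard-coded 2x in displayFrame is just the preview size (288) divided by the heatmap size (144). If you change either of those, a small helper like this would be safer. This is only a sketch of the idea, not tested on a device:

    def scale_point(x, y, frame, heatmap_size=144):
        # Map heatmap coordinates to frame (preview) coordinates instead of hard-coding the 2x factor.
        return int(x * frame.shape[1] / heatmap_size), int(y * frame.shape[0] / heatmap_size)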

    Hopefully you can now complete your code to show it to the students 🙂
    Let me know if you need anything.

    Thanks,
    Jaka

      jakaskerl Thank you for your help! I only had time to paste the code in and give it a quick try. Because of the scipy import, I get an error about libopenblas.so.0 not being found.

        GaryG
        Maybe try installing a different version of scipy, or just install openblas manually.

        @jakaskerl that helped.

        Things I did:

        • Installed libopenblas-dev, as shown in the shell history below.
        • I'm teaching them with Thonny, so I let Thonny (re)install scipy after the "global" libopenblas-dev install.
        • Changed the import path for maximum_filter to avoid a deprecation warning (see the diff below). This might have helped with functionality; I'm not sure.
        • I'm on a Pi 3, so 40 fps was extremely wishful thinking; I dropped it to 6.

        Sorry if this is disjointed; it's late and I start early.

        It's great that I have something to show tomorrow. But there's a lot more here than I can hope to explain, so I'll need to revisit this before next Wednesday. I'm hoping to keep it simpler, like some of the online articles suggest it can be.

        Also, we can assume only one person, if that helps. (Their desired use case is giving someone feedback about a martial arts pose. I don't think it's Tai Chi, but I'm not sure what it is :-) I'm only an ad hoc volunteer helping a teacher at a different school in the district than the one my son attends.)

        I think there's supposed to be some text labeling, but I don't see any. (I made the text red vs. blue, and 3px vs. 2px, so I'm definitely not seeing any text, just fat blue dots.) I don't need timestamps etc., but "elbow", "eye", etc. would be soooo good.

        Any additional help streamlining the code (especially) and improving performance would be most appreciated. I do plan to move this to a Pi 4, and I can drop the frame rate even more if that would help.

        I can push more code into functions myself to make it clearer if necessary, and just treat the functions as black boxes. But anything we can actually get rid of or get from a library is better. "Less is more."

        Gary

          263  apt search libopenblas
          264  sudo apt install libopenblas-dev
          265  pip uninstall scipy
        
        31c29
        < from scipy.ndimage.filters import maximum_filter
        ---
        > from scipy.ndimage import maximum_filter
        
        76c70
        < camRgb.setFps(40)
        ---
        > camRgb.setFps(6)

          GaryG I'm hoping to keep it simpler, like some of the online articles suggest it can be.

          Source? The OAK devices can only run on-device decoding for models like YOLO and MobileNet-SSD; other models have to be decoded on the host side.
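
          To make the distinction concrete, here is a minimal sketch of the two kinds of network nodes in the DepthAI v2 API. It only creates the nodes to show the difference; it is not a working pipeline:

          import depthai as dai

          pipeline = dai.Pipeline()

          # On-device decoding is only available for these detector-style nodes,
          # which output ready-made detections (label, confidence, bounding box):
          mobilenet_nn = pipeline.create(dai.node.MobileNetDetectionNetwork)
          yolo_nn = pipeline.create(dai.node.YoloDetectionNetwork)  # also needs classes/anchors configured

          # Everything else (pose models included) goes through the generic node,
          # and you parse its raw output layers on the host, as in the code above:
          generic_nn = pipeline.create(dai.node.NeuralNetwork)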

          GaryG I think there's supposed to be some text labeling, but I don't see any. (I made the text red vs. blue, and 3px vs. 2px, so I'm definitely not seeing any text, just fat blue dots.) I don't need timestamps etc., but "elbow", "eye", etc. would be soooo good.

          The code used for drawing the keypoints currently only draws circles:

          def displayFrame(name, frame, points):
              color = (255, 0, 0)
              try:
                  for point in points:
                      point = point[0]
                      cv2.circle(frame, (int(point[1]*2), int(point[2]*2)), 2, color, cv2.FILLED)
                  # Show the frame
                  cv2.imshow(name, frame)
              except Exception as e:
                  print(e)

          Each point is a list of [index, x, y, confidence]. You would need to map the indices 0-16 to actual joints. The mapping may be described somewhere in the model documentation, but you might need to do it by hand.
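
          For example, here is a sketch of a labeled drop-in replacement for displayFrame. The KEYPOINT_NAMES list assumes the 17 heatmap channels follow the standard COCO keypoint order; that is an assumption on my part, so verify it against the model docs:

          # Hypothetical index-to-joint mapping; assumes COCO keypoint order (verify for this model).
          KEYPOINT_NAMES = [
              "nose", "left eye", "right eye", "left ear", "right ear",
              "left shoulder", "right shoulder", "left elbow", "right elbow",
              "left wrist", "right wrist", "left hip", "right hip",
              "left knee", "right knee", "left ankle", "right ankle",
          ]

          def displayFrame(name, frame, points):
              color = (255, 0, 0)
              for point in points:
                  idx, x, y, conf = point[0]       # each entry is (index, x, y, confidence)
                  px, py = int(x * 2), int(y * 2)  # same hard-coded 2x scaling as above
                  cv2.circle(frame, (px, py), 2, color, cv2.FILLED)
                  cv2.putText(frame, KEYPOINT_NAMES[int(idx)], (px + 4, py - 4),
                              cv2.FONT_HERSHEY_SIMPLEX, 0.4, (0, 0, 255), 1)
              cv2.imshow(name, frame)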

          GaryG Any additional help at streamlining the code (especially), and improving performance most appreciated. I do plan to move this to a pi4, and can drop the framerate even more if it would help.

          Maybe you could use a smaller model, or one with a smaller input size, which would give better overall performance.

          Thanks,
          Jaka

          Thanks for that input. I'll send links to some of the code I've seen online, though it might not be today. I'll keep in mind the limitations on which models can be used. I know some of the things I looked at involved OpenVINO.

          Gary

          25 days later

          @Jaka, sorry for going dark, but I wanted to thank you very much for your help.

          With your code, slightly modified, they were able to present big blue dots where body keypoints were identified, e.g., wrist, elbow, shoulder. I helped them write some simple Python code to convert two [x, y] tuples into an angle from the horizontal, and had them read those values from a file. Being able to demo the camera finding the points in real time, along with some simple Python code that they understood, was really all they needed, and they explained it well to the judges. So while I don't know how they scored, they all felt really good about themselves this past Saturday.
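
          In case it helps anyone finding this thread later, the angle helper was essentially something like this (the names and example values are made up for illustration, not the students' exact code):

          import math

          def angle_from_horizontal(p1, p2):
              # Angle of the line from p1 to p2, measured from the horizontal, in degrees.
              dx = p2[0] - p1[0]
              dy = p2[1] - p1[1]
              return math.degrees(math.atan2(dy, dx))

          # e.g. shoulder at (100, 200) and wrist at (180, 150) read from the file
          print(angle_from_horizontal((100, 200), (180, 150)))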

          So, thanks again for your help!!

          Gary