• DepthAI-v2
  • Having trouble with results from human-pose-estimation-0001

I'm trying to get this to a point where some middle-school students can experiment with it, and I'd be able to hand it off except that the data isn't making sense. I also still need to correlate the detection points with their interpretation, e.g., elbow, shoulder, etc.

The thing that has me 100% stuck right now is getting near-zero values for xmax and ymax. If I can fix that, at least I can show it to them tomorrow.

I'll post my Python code, a portion of my shell output, and a screenshot. If there's anything else I can provide, or provide differently, please let me know.

Thanks!

# The starting point for this was:
# https://github.com/luxonis/depthai-tutorials/blob/master/1-hello-world/hello_world.py
# Description is at https://docs.luxonis.com/projects/api/en/latest/tutorials/hello_world

# first, import all necessary modules
from pathlib import Path

import blobconverter
import cv2
import depthai
import numpy as np

import inspect

# Pipeline tells DepthAI what operations to perform when running - you define all of the resources used and flows here
pipeline = depthai.Pipeline()
print("pipeline =", pipeline)

# First, we want the Color camera as the output
cam_rgb = pipeline.createColorCamera()
cam_rgb.setPreviewSize(456, 256)  # the preview frame size, available as the 'preview' output of the node; matches the 456x256 input of human-pose-estimation-0001
#cam_rgb.setPreviewSize(448, 448)  # input size for human-pose-estimation-0007
cam_rgb.setInterleaved(False)
#cam_rgb.setFps(1)

# Next, we want a neural network that will produce the detections
detection_nn = pipeline.createMobileNetDetectionNetwork()
# Blob is the Neural Network file, compiled for MyriadX. It contains both the definition and weights of the model
# We're using the blobconverter tool to retrieve the model blob automatically from the OpenVINO Model Zoo
detection_nn.setBlobPath(blobconverter.from_zoo(name='human-pose-estimation-0001', shaves=6))
# Next, we filter out the detections that are below a confidence threshold. Confidence can be anywhere between <0..1>
detection_nn.setConfidenceThreshold(0.5)
# Next, we link the camera 'preview' output to the neural network detection input, so that it can produce detections
cam_rgb.preview.link(detection_nn.input)

# XLinkOut is a "way out" from the device. Any data you want to transfer to the host needs to be sent via XLink
xout_rgb = pipeline.createXLinkOut()
# For the rgb camera output, we want the XLink stream to be named "rgb"
xout_rgb.setStreamName("rgb")
# Linking camera preview to XLink input, so that the frames will be sent to host
cam_rgb.preview.link(xout_rgb.input)

# The same XLinkOut mechanism will be used to receive nn results
xout_nn = pipeline.createXLinkOut()
xout_nn.setStreamName("nn")
detection_nn.out.link(xout_nn.input)

# Pipeline is now finished, and we need to find an available device to run our pipeline
# we are using context manager here that will dispose the device after we stop using it
with depthai.Device(pipeline) as device:
    # From this point, the Device will be in "running" mode and will start sending data via XLink

    # To consume the device results, we get two output queues from the device, with stream names we assigned earlier
    q_rgb = device.getOutputQueue("rgb")
    q_nn = device.getOutputQueue("nn")

    # Here, some of the default values are defined. Frame will be an image from "rgb" stream, detections will contain nn results
    frame = None
    detections = []

# Since the detections returned by the nn have values in the <0..1> range, they need to be multiplied by the frame width/height
# to get the actual position of the bounding box on the image
    def frameNorm(frame, bbox):
        normVals = np.full(len(bbox), frame.shape[0])
        normVals[::2] = frame.shape[1]
        return (np.clip(np.array(bbox), 0, 1) * normVals).astype(int)


    # Main host-side application loop
    while True:
        # we try to fetch the data from nn/rgb queues. tryGet will return either the data packet or None if there isn't any
        in_rgb = q_rgb.tryGet()
        in_nn = q_nn.tryGet()

        if in_rgb is not None:
            # If the packet from RGB camera is present, we're retrieving the frame in OpenCV format using getCvFrame
            frame = in_rgb.getCvFrame()

            if frame is not None and in_nn is not None:
                # when data from nn is received, we take the detections array that contains mobilenet-ssd results
                detections = in_nn.detections
                if len(detections) > 0:
                    #print("detections=", detections)
                    for detection in detections:
                        label = detection.label
                        confidence = detection.confidence
                        xmin = detection.xmin
                        ymin = detection.ymin
                        xmax = detection.xmax
                        ymax = detection.ymax
                        #print(detection)
                        print("label =", label, "; confidence =", confidence, "; xmin =", xmin, "; ymin =", ymin, "; xmax =", xmax, "; ymax =", ymax)
                        bbox = frameNorm(frame, (detection.xmin, detection.ymin, detection.xmax, detection.ymax))
                        cv2.line(frame, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (255, 0, 0), 2)
                        # After the drawing is finished, we show the frame on the screen
                        cv2.imshow("preview", frame)

        # at any time, you can press "q" and exit the main loop, therefore exiting the program itself
        if cv2.waitKey(1) == ord('q'):
            break
label = 0 ; confidence = 0.74169921875 ; xmin = 0.7783203125 ; ymin = 0.246826171875 ; xmax = 0.00054931640625 ; ymax = 0.00054931640625
label = 0 ; confidence = 0.60107421875 ; xmin = 0.42529296875 ; ymin = 0.105712890625 ; xmax = 0.03570556640625 ; ymax = 0.0181884765625
label = 0 ; confidence = 0.6875 ; xmin = 0.4169921875 ; ymin = 0.0377197265625 ; xmax = 0.00335693359375 ; ymax = 0.0025634765625
label = 0 ; confidence = 0.84716796875 ; xmin = 0.046630859375 ; ymin = 0.0 ; xmax = 6.103515625e-05 ; ymax = 0.0006103515625
label = 0 ; confidence = 0.837890625 ; xmin = 0.92333984375 ; ymin = 0.05877685546875 ; xmax = 6.103515625e-05 ; ymax = 6.103515625e-05
label = 0 ; confidence = 0.52783203125 ; xmin = 0.10552978515625 ; ymin = 0.00970458984375 ; xmax = 0.01348876953125 ; ymax = 0.00103759765625
label = 0 ; confidence = 0.9228515625 ; xmin = 0.92919921875 ; ymin = 0.91064453125 ; xmax = 0.8486328125 ; ymax = 0.6025390625
label = 0 ; confidence = 0.64892578125 ; xmin = 0.6376953125 ; ymin = 0.1900634765625 ; xmax = 0.01153564453125 ; ymax = 0.000244140625
label = 0 ; confidence = 0.91943359375 ; xmin = 0.6884765625 ; ymin = 0.02935791015625 ; xmax = 0.000244140625 ; ymax = 0.000244140625
label = 0 ; confidence = 0.58447265625 ; xmin = -6.103515625e-05 ; ymin = -6.103515625e-05 ; xmax = -6.103515625e-05 ; ymax = -6.103515625e-05
label = 0 ; confidence = 0.97265625 ; xmin = 0.57958984375 ; ymin = -6.103515625e-05 ; xmax = -6.103515625e-05 ; ymax = -6.103515625e-05
label = 0 ; confidence = 0.64794921875 ; xmin = 0.41455078125 ; ymin = 0.00909423828125 ; xmax = 0.00018310546875 ; ymax = 6.103515625e-05
label = 0 ; confidence = 0.52392578125 ; xmin = 0.390380859375 ; ymin = 0.00244140625 ; xmax = 0.0081787109375 ; ymax = -0.00048828125
label = 0 ; confidence = 0.8427734375 ; xmin = 0.06243896484375 ; ymin = 0.00213623046875 ; xmax = 0.003173828125 ; ymax = 0.0030517578125
label = 0 ; confidence = 0.93603515625 ; xmin = 0.77978515625 ; ymin = 0.02655029296875 ; xmax = -0.00146484375 ; ymax = 0.00042724609375
label = 0 ; confidence = 0.56884765625 ; xmin = 0.07354736328125 ; ymin = 0.00042724609375 ; xmax = -0.01214599609375 ; ymax = -0.01409912109375
label = 0 ; confidence = 0.8115234375 ; xmin = 0.464111328125 ; ymin = 0.0447998046875 ; xmax = 0.00042724609375 ; ymax = -0.00469970703125
label = 0 ; confidence = 0.619140625 ; xmin = 0.76171875 ; ymin = 0.308349609375 ; xmax = 0.01397705078125 ; ymax = 0.00042724609375
label = 0 ; confidence = 0.50341796875 ; xmin = 0.703125 ; ymin = 0.154296875 ; xmax = 0.0057373046875 ; ymax = 0.00653076171875
label = 0 ; confidence = 0.91357421875 ; xmin = 0.369873046875 ; ymin = 0.00054931640625 ; xmax = 0.00054931640625 ; ymax = 0.0015869140625
label = 0 ; confidence = 0.64599609375 ; xmin = 0.06475830078125 ; ymin = 0.00054931640625 ; xmax = 0.00054931640625 ; ymax = 0.0006103515625
label = 0 ; confidence = 0.79052734375 ; xmin = 0.767578125 ; ymin = 0.193603515625 ; xmax = 0.00054931640625 ; ymax = 0.00054931640625
label = 0 ; confidence = 0.6181640625 ; xmin = 0.424560546875 ; ymin = 0.09307861328125 ; xmax = 0.0433349609375 ; ymax = 0.0252685546875
label = 0 ; confidence = 0.681640625 ; xmin = 0.384521484375 ; ymin = 0.025390625 ; xmax = 0.00335693359375 ; ymax = 0.00225830078125
label = 0 ; confidence = 0.8330078125 ; xmin = 0.042236328125 ; ymin = -6.103515625e-05 ; xmax = 6.103515625e-05 ; ymax = 0.000732421875
label = 0 ; confidence = 0.8486328125 ; xmin = 0.9111328125 ; ymin = 0.0548095703125 ; xmax = 6.103515625e-05 ; ymax = 6.103515625e-05
label = 0 ; confidence = 0.5390625 ; xmin = 0.08209228515625 ; ymin = 0.00091552734375 ; xmax = 0.010986328125 ; ymax = 0.00189208984375
label = 0 ; confidence = 0.9208984375 ; xmin = 0.931640625 ; ymin = 0.9267578125 ; xmax = 0.85888671875 ; ymax = 0.603515625
label = 0 ; confidence = 0.63818359375 ; xmin = 0.611328125 ; ymin = 0.1787109375 ; xmax = 0.012451171875 ; ymax = 0.000244140625
label = 0 ; confidence = 0.91064453125 ; xmin = 0.6923828125 ; ymin = 0.03057861328125 ; xmax = 0.000244140625 ; ymax = 0.000244140625

    Hi GaryG
    The model you are using is not a MobileNet-SSD, so the MobileNetDetectionNetwork node can't decode its output; that's why the bounding-box fields look like garbage. Anyway, GPT and I wrote you a custom decoding function. Now you get a list of points back from the model, which you can use to draw out the skeleton. Keep in mind that the point x, y values are multiplied by 2 to fit the frame, since I didn't write a proper scaling function.

    # The starting point for this was:
    # https://github.com/luxonis/depthai-tutorials/blob/master/1-hello-world/hello_world.py
    # Description is at https://docs.luxonis.com/projects/api/en/latest/tutorials/hello_world
    
    # first, import all necessary modules
    from pathlib import Path
    
    import blobconverter
    import cv2
    import depthai
    import numpy as np
    import time
    import argparse
    import depthai as dai
    import inspect
    
    threshold = 0.4
    
    def parse_heatmaps(heatmaps):
        keypoints = []
        for heatmap in heatmaps:
            # Find the pixel with the highest value
            y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
            confidence = heatmap[y, x]
            if confidence > threshold:  # Define a threshold value
                keypoints.append((x, y, confidence))
        return keypoints
    
    from scipy.ndimage.filters import maximum_filter
    
    def non_maximum_suppression(heatmaps, window_size=3):
        # Apply a maximum filter
        max_filtered = maximum_filter(heatmaps, size=window_size)
        # Keep only the points that are equal to the local maximum
        peaks = (heatmaps == max_filtered) & (heatmaps > threshold)
        return peaks
    
    def group_keypoints(keypoints, embeddings):
        poses = []
        # This is a simplified version. In practice, you'd use a more complex algorithm
        # to group keypoints based on the distance between their embeddings.
        for keypoint, embedding in zip(keypoints, embeddings):
            # Find which pose this keypoint belongs to based on the embedding
            pose_id = find_closest_pose(embedding, poses)
            if pose_id is not None:
                poses[pose_id].append(keypoint)
            else:
                poses.append([keypoint])
        return poses
    
    def find_closest_pose(embedding, poses):
        # Implement a method to find the closest pose based on embeddings
        pass
    
    
    
    
    # Create pipeline
    pipeline = dai.Pipeline()
    
    # Define sources and outputs
    camRgb = pipeline.create(dai.node.ColorCamera)
    nn = pipeline.create(dai.node.NeuralNetwork)
    xoutRgb = pipeline.create(dai.node.XLinkOut)
    nnOut = pipeline.create(dai.node.XLinkOut)
    
    xoutRgb.setStreamName("rgb")
    nnOut.setStreamName("nn")
    
    
    # Properties
    camRgb.setPreviewSize(288, 288)
    camRgb.setInterleaved(False)
    camRgb.setFps(40)
    # Define a neural network that will make predictions based on the source frames
    nn.setBlobPath(blobconverter.from_zoo(name='human-pose-estimation-0005', shaves=6))
    nn.setNumInferenceThreads(2)
    nn.input.setBlocking(False)
    
    # Linking
    
    nn.passthrough.link(xoutRgb.input)
    
    camRgb.preview.link(nn.input)
    nn.out.link(nnOut.input)
    
    
    # Connect to device and start pipeline
    with dai.Device(pipeline) as device:
    
        # Output queues will be used to get the rgb frames and nn data from the outputs defined above
        qRgb = device.getOutputQueue(name="rgb", maxSize=4, blocking=False)
        qDet = device.getOutputQueue(name="nn", maxSize=4, blocking=False)
    
        frame = None
        detections = []
        startTime = time.monotonic()
        counter = 0
        color2 = (255, 255, 255)
    
        # nn data (bounding box locations) are in <0..1> range - they need to be normalized with frame width/height
        def frameNorm(frame, bbox):
            normVals = np.full(len(bbox), frame.shape[0])
            normVals[::2] = frame.shape[1]
            return (np.clip(np.array(bbox), 0, 1) * normVals).astype(int)
    
        def displayFrame(name, frame, points):
            color = (255, 0, 0)
            try:
                for point in points:
                    point = point[0]
                    cv2.circle(frame, (int(point[1]*2), int(point[2]*2)), 2, color, cv2.FILLED)
                # Show the frame
                cv2.imshow(name, frame)
            except Exception as e:
                print(e)
    
        printOutputLayersOnce = True
    
        while True:
            inRgb = qRgb.get()
            inDet = qDet.get()
    
            if inRgb is not None:
                frame = inRgb.getCvFrame()
                cv2.putText(frame, "NN fps: {:.2f}".format(counter / (time.monotonic() - startTime)),
                            (2, frame.shape[0] - 4), cv2.FONT_HERSHEY_TRIPLEX, 0.4, color2)
    
            
    
            # Assuming `heatmaps_output` and `embeddings_output` are the outputs from the model
            heatmaps_output = np.array(inDet.getLayerFp16("heatmaps")).reshape((1, 17, 144, 144))
            embeddings_output = np.array(inDet.getLayerFp16("embeddings")).reshape((1, 17, 144, 144))
    
            print(heatmaps_output.shape)
            print(embeddings_output.shape)
    
            # Parse the heatmaps to get initial keypoints
            keypoints = parse_heatmaps(heatmaps_output[0])
    
            # Apply non-maximum suppression to filter the keypoints
            peaks = non_maximum_suppression(heatmaps_output[0])
    
            # Now, we need to extract the keypoints from the peaks
            filtered_keypoints = []
            for i, peak in enumerate(peaks):
                y, x = np.where(peak)
                for j in range(len(x)):
                    confidence = heatmaps_output[0][i, y[j], x[j]]
                    filtered_keypoints.append((i, x[j], y[j], confidence))
    
            # Group keypoints into individual poses
            poses = group_keypoints(filtered_keypoints, embeddings_output[0])
    
    
            
            # Now `poses` contains grouped keypoints for each detected person
    
            # If the frame is available, draw bounding boxes on it and show the frame
            if frame is not None:
                displayFrame("rgb", frame, points=poses)
    
            if cv2.waitKey(1) == ord('q'):
                break
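
    As a side note: the hard-coded 2x in displayFrame is just the preview size (288) divided by the heatmap size (144). If you change either of those, a small helper like this would be safer. This is only a sketch of the idea, not tested on a device:

    def scale_point(x, y, frame, heatmap_size=144):
        # Map heatmap coordinates to frame (preview) coordinates instead of hard-coding the 2x factor.
        return int(x * frame.shape[1] / heatmap_size), int(y * frame.shape[0] / heatmap_size)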

    Hopefully you can now complete your code to show it to the students 🙂
    Let me know if you need anything.

    Thanks,
    Jaka

      jakaskerl Thank you for your help! I only had time to paste the code in and give it a quick try. Because of the scipy import, I get an error about libopenblas.so.0 not being found.

        GaryG
        Maybe try installing a different version of scipy, or just install openblas manually.

        @jakaskerl that helped.

        Things I did:

        • Installed libopenblas-dev, as shown in the shell history below.
        • I'm teaching them with Thonny, so I let Thonny (re)install scipy after the "global" libopenblas-dev install.
        • Changed the import path for maximum_filter to avoid a deprecation warning (see the diff below). This might have helped with functionality; I'm not sure.
        • I'm on a Pi 3, so 40 fps was extremely wishful thinking; I dropped it to 6.

        Sorry if this is disjointed; it's late and I start early.

        It's great that I have something to show tomorrow. But there's a lot more here than I can hope to explain, so I'll need to revisit this before next Wednesday. I'm hoping to keep it simpler, like some of the online articles suggest it can be.

        Also, we can assume only one person, if that helps. (Their desired use case is giving someone feedback about a martial arts pose. I don't think it's Tai Chi, but I'm not sure what it is :-) I'm only an ad hoc volunteer helping a teacher at a different school in the district than the one my son attends.)

        I think there's supposed to be some text labeling, but I don't see any. (I made the text red vs. blue, and 3px vs. 2px, so I'm definitely not seeing any text, just fat blue dots.) I don't need timestamps etc., but "elbow", "eye", etc. would be soooo good.

        Any additional help streamlining the code (especially) and improving performance would be most appreciated. I do plan to move this to a Pi 4, and I can drop the frame rate even more if that would help.

        I can push more code into functions myself to make it clearer if necessary, and just treat the functions as black boxes. But anything we can actually get rid of or get from a library is better. "Less is more."

        Gary

          263  apt search libopenblas
          264  sudo apt install libopenblas-dev
          265  pip uninstall scipy
        
        31c29
        < from scipy.ndimage.filters import maximum_filter
        ---
        > from scipy.ndimage import maximum_filter
        
        76c70
        < camRgb.setFps(40)
        ---
        > camRgb.setFps(6)

          GaryG I'm hoping to keep it simpler, like some of the online articles suggest it can be.

          Source? The OAK devices can only run on-device decoding for models like YOLO and MobileNet-SSD; other models have to be decoded on the host side.
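
          To make the distinction concrete, here is a minimal sketch of the two kinds of network nodes in the DepthAI v2 API. It only creates the nodes to show the difference; it is not a working pipeline:

          import depthai as dai

          pipeline = dai.Pipeline()

          # On-device decoding is only available for these detector-style nodes,
          # which output ready-made detections (label, confidence, bounding box):
          mobilenet_nn = pipeline.create(dai.node.MobileNetDetectionNetwork)
          yolo_nn = pipeline.create(dai.node.YoloDetectionNetwork)  # also needs classes/anchors configured

          # Everything else (pose models included) goes through the generic node,
          # and you parse its raw output layers on the host, as in the code above:
          generic_nn = pipeline.create(dai.node.NeuralNetwork)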

          GaryG I think there's supposed to be some text labeling, but I don't see any. (I made the text red vs. blue, and 3px vs. 2px, so I'm definitely not seeing any text, just fat blue dots.) I don't need timestamps etc., but "elbow", "eye", etc. would be soooo good.

          The code used for drawing the keypoints currently only draws circles:

          def displayFrame(name, frame, points):
              color = (255, 0, 0)
              try:
                  for point in points:
                      point = point[0]
                      cv2.circle(frame, (int(point[1]*2), int(point[2]*2)), 2, color, cv2.FILLED)
                  # Show the frame
                  cv2.imshow(name, frame)
              except Exception as e:
                  print(e)

          Each point is a list of [index, x, y, confidence]. You would need to map the indices 0-16 to actual joints. The mapping may be described somewhere in the model documentation, but you might need to do it by hand.
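
          For example, here is a sketch of a labeled drop-in replacement for displayFrame. The KEYPOINT_NAMES list assumes the 17 heatmap channels follow the standard COCO keypoint order; that is an assumption on my part, so verify it against the model docs:

          # Hypothetical index-to-joint mapping; assumes COCO keypoint order (verify for this model).
          KEYPOINT_NAMES = [
              "nose", "left eye", "right eye", "left ear", "right ear",
              "left shoulder", "right shoulder", "left elbow", "right elbow",
              "left wrist", "right wrist", "left hip", "right hip",
              "left knee", "right knee", "left ankle", "right ankle",
          ]

          def displayFrame(name, frame, points):
              color = (255, 0, 0)
              for point in points:
                  idx, x, y, conf = point[0]       # each entry is (index, x, y, confidence)
                  px, py = int(x * 2), int(y * 2)  # same hard-coded 2x scaling as above
                  cv2.circle(frame, (px, py), 2, color, cv2.FILLED)
                  cv2.putText(frame, KEYPOINT_NAMES[int(idx)], (px + 4, py - 4),
                              cv2.FONT_HERSHEY_SIMPLEX, 0.4, (0, 0, 255), 1)
              cv2.imshow(name, frame)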

          GaryG Any additional help at streamlining the code (especially), and improving performance most appreciated. I do plan to move this to a pi4, and can drop the framerate even more if it would help.

          Maybe you could use a smaller model, or one with a smaller input size, which would give better overall performance.

          Thanks,
          Jaka

          Thanks for that input. I'll send links to some of the code I've seen online, though it might not be today. I'll keep in mind the limitations on which models can be used. I know some of the things I looked at involved OpenVINO.

          Gary

          25 days later

          @Jaka, sorry for going dark, but I wanted to thank you very much for your help.

          With your code, slightly modified, they were able to present big blue dots where body keypoints were identified, e.g., wrist, elbow, shoulder. I helped them write some simple Python code to convert two [x, y] tuples into an angle from the horizontal, and had them read those values from a file. Being able to demo the camera finding the points in real time, along with some simple Python code that they understood, was really all they needed, and they explained it well to the judges. So while I don't know how they scored, they all felt really good about themselves this past Saturday.
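
          In case it helps anyone finding this thread later, the angle helper was essentially something like this (the names and example values are made up for illustration, not the students' exact code):

          import math

          def angle_from_horizontal(p1, p2):
              # Angle of the line from p1 to p2, measured from the horizontal, in degrees.
              dx = p2[0] - p1[0]
              dy = p2[1] - p1[1]
              return math.degrees(math.atan2(dy, dx))

          # e.g. shoulder at (100, 200) and wrist at (180, 150) read from the file
          print(angle_from_horizontal((100, 200), (180, 150)))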

          So, thanks again for your help!!

          Gary