Around a year ago, I posted a question about using the SDK for human pose estimation (multiple persons). At that time, the answer was that multipose is not feasible. Since then, there have been some new developments that give me hope, at least for our modest application. Here are some experiences from revisiting the topic.

  1. The Pi 5 is significantly faster than the Pi 4, which I'd used previously. It may be possible to use the OAK to detect persons and get distances, plus feed a subset of the same images to a neural net on the Pi 5 and somehow merge the results (yes, of course timing will degrade the fusion).
  2. Ultralytics has a YOLO pose estimation model that works for multiple persons. The quality isn't as good as the previous models I tried with DepthAI, but it might be good enough. I experimented with the YOLO11 model on the Pi 5, and it took between a third and a half of a second of CPU time for preprocessing plus inference. Not great, but perhaps my application can still use this coarse sampling. I also tried their YOLOv8 pose model - it should be able to run on the OAK, yes? The model/net input size appears to be 348x640. Update: I downsized the model to 192x320 (see the export sketch after this list) and got around 9 fps, with almost the same quality of pose estimation, and still with multipose.
  3. OpenVINO has a new API (2.0), and I couldn't get things working on the Pi 5; at least, it's not obvious how to make Zoo models work with the new API. The way the Pi 5 is set up, it would be difficult to retrofit an earlier level of support. I did try geaxgx/openvino_movenet_multipose again, and after some fiddling to get the new 2.0 API calls in place (see the API sketch after this list), it ran on my Intel desktop; however, it failed to run on the Pi 5 (no crash, just silently not classifying images). I don't have any idea how to debug this. I think the OpenVINO multipose is better than YOLO, if it could work.
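For reference, here is a minimal sketch of the export step behind the 192x320 result in item 2. The weights file, test image, and opset are illustrative assumptions, not taken from the thread.

```python
# Minimal sketch of downsizing an Ultralytics pose model, assuming the
# nano pose weights as a starting point (file names are illustrative).
from ultralytics import YOLO

model = YOLO("yolov8n-pose.pt")        # pretrained multi-person pose model

# Quick CPU check at the reduced input size (test image is hypothetical).
model.predict("people.jpg", imgsz=320)

# Export to ONNX with a rectangular 192x320 (height x width) input;
# the resulting .onnx can then be converted to a .blob for the OAK.
model.export(format="onnx", imgsz=[192, 320], opset=12)
```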
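And for item 3, a minimal sketch of the API 2.0 calls that replace the old Inference Engine flow, along the lines of the fiddling described above; the model path and input shape are illustrative.

```python
# Minimal sketch of OpenVINO API 2.0 inference; paths/shapes are examples.
import numpy as np
import openvino as ov

core = ov.Core()                                    # replaces IECore()
model = core.read_model("movenet_multipose.xml")    # .bin found automatically
compiled = core.compile_model(model, "CPU")         # replaces load_network()

frame = np.zeros((1, 3, 256, 256), dtype=np.float32)  # dummy NCHW input
result = compiled([frame])[compiled.output(0)]        # replaces infer()
print(result.shape)
```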

    TedHerman The Pi 5 is significantly faster than the Pi 4, which I'd used previously. It may be possible to use the OAK to detect persons and get distances, plus feed a subset of the same images to a neural net on the Pi 5 and somehow merge the results (yes, of course timing will degrade the fusion).

    Yes, it's fully possible; however, as you have said, timing will be off by a second (depending on the complexity of the model).

    TedHerman I also tried their YOLOv8 pose model - it should be able to run on the OAK, yes?

    Yes, but decoding is currently not supported, so you will have to do it yourself.

    TedHerman I don't have any idea how to debug this. I think the OpenVINO multipose is better than YOLO, if it could work.

    I can't really help if you are running the model on the CPU; also, I am not familiar with the 2.0 API. It's likely best to use GPT for this task, as it is probably just some bug in the code.

    Thanks,
    Jaka

      5 days later

      jakaskerl GPT didn't really help solve my OpenVINO multipose problem. I did get the YOLO multipose running on the OAK. Some observations from that:

      1. With the Ultralytics YOLO tool, I scaled the NN input to 320, but was unable to use other tools to convert to a blob (due to version/distro conflicts). Thanks, Luxonis, for your online converter - that did the trick (the same conversion can also be scripted; see the first sketch after this list).
      2. As you suggested, I had to do all the decoding myself: the blob is not loaded as a YOLO network, nor do I get the benefit of automated post-processing, so my code has to filter through around 2K candidate bounding boxes (including some hand-crafted NMS-like filtering; see the decoding sketch after this list). Decoding keypoints was easy.
      3. My first version just read an mp4, then fed frames via XLinkIn to the NN. This had the advantage of making it easy to scale each frame to the square 320x320 shape the NN requires. Then, after decoding, it's simple to reverse the scaling and draw on the frame for display. The YOLO model is still not as good as the geaxgx OpenVINO multipose, though perhaps good enough (fingers crossed).
      4. My second version (still a work in progress) got images from the OAK, sent them to the NN, then read both the NN output and the preview output for decoding, drawing, and display. There are some design choices here.
      5. One way is to setPreviewSize to the square shape the NN needs (see the pipeline sketch after this list). This works pretty well, and the ideas from the first version carry over. The downside is that the OAK is using only about half its sensor area for pose estimation: the display window is square.
      6. Another way is to setPreviewSize to something like the aspect ratio of the ISP (wide view) and wire the output to an ImageManip node, which changes the shape to square. This did work: the NN was estimating poses over the whole rectangular display. But rescaling bounding boxes and keypoints is a problem, because we can't see exactly what ImageManip is doing. Ultimately I did get something working, but I don't really understand why it works (the last sketch after this list makes one plausible coordinate mapping explicit).
      7. The result runs at around 10 fps, which is decent enough for what we need.
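On item 1: the online converter step can also be scripted with Luxonis's blobconverter package, which calls the same service. A minimal sketch, assuming the ONNX export from earlier in the thread; the file name and shave count are illustrative.

```python
# Script the blob conversion via Luxonis's online service; the model file
# name and shave count are illustrative assumptions.
import blobconverter

blob_path = blobconverter.from_onnx(
    model="yolov8n-pose-192x320.onnx",
    data_type="FP16",
    shaves=6,
)
print(blob_path)  # path to the cached .blob, ready for a NeuralNetwork node
```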
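On item 2: a sketch of the kind of hand-rolled decoding described, assuming the blob keeps the standard YOLOv8-pose output layout (1, 56, N): 4 box values, 1 person score, and 17 keypoints times (x, y, conf) per candidate, all in input-pixel units. The thresholds are illustrative.

```python
# Decode raw YOLOv8-pose output: confidence filter, NMS, keypoint extraction.
# Assumes the standard (1, 56, N) layout; N is ~2100 for a 320x320 input.
import cv2
import numpy as np

def decode_pose(raw, conf_thres=0.5, iou_thres=0.45):
    preds = np.asarray(raw).reshape(56, -1).T   # (N, 56)
    preds = preds[preds[:, 4] > conf_thres]     # drop low-confidence candidates
    if len(preds) == 0:
        return np.empty((0, 4)), np.empty(0), np.empty((0, 17, 3))
    scores = preds[:, 4]

    # Convert center-based (cx, cy, w, h) boxes to top-left for OpenCV's NMS.
    boxes = preds[:, :4].copy()
    boxes[:, 0] -= boxes[:, 2] / 2
    boxes[:, 1] -= boxes[:, 3] / 2

    keep = cv2.dnn.NMSBoxes(boxes.tolist(), scores.tolist(),
                            conf_thres, iou_thres)
    keep = np.array(keep, dtype=int).flatten()

    kpts = preds[:, 5:].reshape(-1, 17, 3)      # (N, 17, [x, y, conf])
    return boxes[keep], scores[keep], kpts[keep]
```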
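On items 4-5: a minimal sketch of the square-preview design, using the DepthAI v2 Python API; the blob path and queue sizes are illustrative.

```python
# Square-preview pipeline: the 320x320 preview feeds the NN directly, so
# decoded coordinates map 1:1 onto the displayed frame.
import depthai as dai

pipeline = dai.Pipeline()

cam = pipeline.create(dai.node.ColorCamera)
cam.setPreviewSize(320, 320)   # square crop matching the NN input
cam.setInterleaved(False)

nn = pipeline.create(dai.node.NeuralNetwork)
nn.setBlobPath("yolov8n-pose-320x320.blob")  # illustrative path
cam.preview.link(nn.input)

xout_nn = pipeline.create(dai.node.XLinkOut)
xout_nn.setStreamName("nn")
nn.out.link(xout_nn.input)

xout_rgb = pipeline.create(dai.node.XLinkOut)
xout_rgb.setStreamName("rgb")
cam.preview.link(xout_rgb.input)

with dai.Device(pipeline) as device:
    q_nn = device.getOutputQueue("nn", maxSize=4, blocking=False)
    q_rgb = device.getOutputQueue("rgb", maxSize=4, blocking=False)
    while True:
        frame = q_rgb.get().getCvFrame()
        raw = q_nn.get().getFirstLayerFp16()  # flat FP16 output
        # ... decode_pose(raw), draw boxes/keypoints on frame, display ...
```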
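On item 6: one plausible explanation for why the rescaling "just works". My assumption here is that ImageManip's plain setResize() stretches the wide preview to a square without letterboxing (unlike setResizeThumbnail, which pads); in that case each axis is scaled uniformly, so NN coordinates map back with a simple per-axis ratio. A sketch under that assumption:

```python
# Wide preview stretched to a square by ImageManip, plus the explicit
# coordinate mapping back to the wide frame. Sizes are illustrative.
import depthai as dai

pipeline = dai.Pipeline()

cam = pipeline.create(dai.node.ColorCamera)
cam.setPreviewSize(640, 360)              # wide view, close to the ISP ratio
cam.setInterleaved(False)

manip = pipeline.create(dai.node.ImageManip)
manip.initialConfig.setResize(320, 320)   # stretch (not letterbox) to square
manip.setMaxOutputFrameSize(320 * 320 * 3)
cam.preview.link(manip.inputImage)

nn = pipeline.create(dai.node.NeuralNetwork)
nn.setBlobPath("yolov8n-pose-320x320.blob")
manip.out.link(nn.input)

# Because each axis was stretched uniformly, mapping NN pixel coordinates
# back to the 640x360 preview is just a per-axis rescale.
def to_preview(x_nn, y_nn, nn_size=320, prev_w=640, prev_h=360):
    return x_nn / nn_size * prev_w, y_nn / nn_size * prev_h
```

If ImageManip were letterboxing instead, the mapping would also need the padding offsets, and skipping them would show up as systematically shifted boxes and keypoints.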