erik So I tried 6 SHAVE cores and 2 inference threads at batch sizes 1-5, but I wasn't able to figure out how to actually run batched inference: even after flattening an [n,3,256,192] tensor and attaching it to a Buffer, I still only get back [1, 133, 384] and [1, 133, 512] results. The average round-trip inference time was 69-261 ms (averaged over 40 frames).
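For reference, this is roughly what I was trying, as a minimal sketch only. The blob path, stream names, input/output layer names, and the assumption that the blob was compiled with a fixed batch dimension are all mine:

```python
import numpy as np
import depthai as dai

pipeline = dai.Pipeline()

# NeuralNetwork node loaded with a blob compiled for a fixed batch size (n=2 here)
nn = pipeline.create(dai.node.NeuralNetwork)
nn.setBlobPath("pose_batch2.blob")   # placeholder path
nn.setNumInferenceThreads(2)

xin = pipeline.create(dai.node.XLinkIn)
xin.setStreamName("nn_in")
xin.out.link(nn.input)

xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName("nn_out")
nn.out.link(xout.input)

with dai.Device(pipeline) as device:
    q_in = device.getInputQueue("nn_in")
    q_out = device.getOutputQueue("nn_out")

    # Flatten the whole [n,3,256,192] batch into one planar buffer and send it as NNData
    batch = np.zeros((2, 3, 256, 192), dtype=np.float16)
    data = dai.NNData()
    data.setLayer("input", batch.flatten().tolist())   # "input" = model's input layer name (placeholder)
    q_in.send(data)

    result = q_out.get()
    # Outputs still come back batch-1 shaped, e.g. [1,133,384] / [1,133,512]
    out_x = result.getLayerFp16("output_x")            # placeholder output layer name
```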
It appears batch size 2 would be an excellent trade-off between latency and throughput, though again that only matters if I can figure out how to actually get batched results back from batched inference. I looked through other posts but didn't find anyone hitting a similar issue.
Edit: I think I'll just change the graph to have 2 inputs similar to this, each [1,3,256,192], and then Concat them into a [2,3,256,192] tensor within the graph, so that I can link the inputs straight to ImageManip nodes (rough sketch below).
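In case it helps anyone else, a minimal sketch of that graph change using the onnx Python package. File names and the new input names are placeholders, and it assumes the original model has a single [2,3,256,192] image input:

```python
import onnx
from onnx import helper, TensorProto

model = onnx.load("pose.onnx")            # placeholder file name
graph = model.graph
orig_name = graph.input[0].name           # assuming a single image input

# Two [1,3,256,192] inputs that ImageManip frames can feed directly
in_a = helper.make_tensor_value_info("input_a", TensorProto.FLOAT, [1, 3, 256, 192])
in_b = helper.make_tensor_value_info("input_b", TensorProto.FLOAT, [1, 3, 256, 192])

# Concat along the batch axis rebuilds the [2,3,256,192] tensor the rest of the graph expects
concat = helper.make_node(
    "Concat",
    inputs=["input_a", "input_b"],
    outputs=[orig_name],
    axis=0,
    name="batch_concat",
)

graph.node.insert(0, concat)
del graph.input[:]                        # drop the original batched input
graph.input.extend([in_a, in_b])

onnx.checker.check_model(model)
onnx.save(model, "pose_2inputs.onnx")
```

After converting the modified model to a blob, I'm expecting the two ImageManip outputs can then be linked to the NeuralNetwork node's named inputs on the DepthAI side, something like manip_a.out.link(nn.inputs["input_a"]).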