I've received an interesting question about whether it'd be possible to update https://github.com/geaxgx/depthai_hand_tracker so it'd work with multiple people instead of just one. In my opinion, while it would be possible, it might be too slow. Body pre-focusing works by running person detection first, then palm detection, and then the hand tracker (palm keypoint/landmarks detection). After that, the demo doesn't run either of the first 2 NNs, only the hand tracker NN, and it updates the ImageManip crop rectangle based on where the keypoints are in the frame, as shown on this flowchart (it's without the body pre-focusing, which does person detection first): ![flow](https://github.com/geaxgx/depthai_hand_tracker/blob/main/img/schema_hand_tracking.png?raw=true) As the hand tracker NN runs fast enough (and hands don't move that fast), you can get away with running inference on just 1 model instead of all 3. Adding support for multiple people wouldn't be that hard in principle (looping through each person found, running the hand detection NN, then the hand tracker NN). But since AI resources would be split between, say, 7 hands, the FPS per hand would also be one-seventh of what it previously was. That means that even a slowly moving hand would likely have left its crop rectangle by the next inference, so the app would need to redo the whole 3-NN process to find all palms. TL;DR: while it would be possible to add multi-person support to the hand-tracker demo, it would likely be too slow.
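To make the scaling argument above concrete, here is a minimal sketch of the arithmetic. It assumes the hand tracker NN is shared evenly between hands, and the 28 FPS and 100 px/s figures are hypothetical values chosen only to show the trend, not measurements from the demo.

```python
def per_hand_fps(tracker_fps: float, n_hands: int) -> float:
    """Landmark updates per second for each hand when one NN is shared evenly."""
    if n_hands < 1:
        raise ValueError("n_hands must be at least 1")
    return tracker_fps / n_hands


def motion_between_updates(hand_speed_px_s: float, tracker_fps: float, n_hands: int) -> float:
    """Pixels a hand travels between two consecutive updates of its own crop."""
    return hand_speed_px_s / per_hand_fps(tracker_fps, n_hands)


# Hypothetical numbers: 28 FPS tracker, hand moving at 100 px/s.
for hands in (1, 2, 7):
    print(hands, per_hand_fps(28.0, hands), motion_between_updates(100.0, 28.0, hands))
```

The further a hand moves between updates, the more likely it is to fall outside the crop rectangle computed from its previous landmarks, which is what forces the full detection pass again.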

Hey @"erik"#p12463 I'm working on multi-person hand tracking based on this project right now, and I was wondering whether it would be possible to substitute short-term object tracking for the palm detection. In the worst case the tracklets would fail and you'd have to rerun palm detection, but could this be enough of a performance gain, in situations where the hands aren't moving at high speeds, to bring us up to the 10-15 fps region? Also, to this end, do you have a MyriadX blob for MoveNet MultiPose (preferably 192x256, as my application is pretty close quarters) on hand? I tried (not very hard) to convert the model over at https://github.com/geaxgx/openvino_movenet_multipose but I hit an error and decided to come here and research instead of falling down a debugging/adding-nodes hole. I've also considered running the hand tracking round robin in order to spread out the processing requirements. Any thoughts on this? Of course, this means multiplying the latency per human tracked as opposed to dividing the frame rate, but it could at least be acceptable for another subset of applications (such as my own).
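A minimal sketch of the round-robin idea, assuming one hand-tracking inference per frame and a fixed set of tracked people. The function names are hypothetical and nothing here is part of depthai_hand_tracker.

```python
from itertools import cycle, islice
from typing import Hashable, List, Sequence


def round_robin_schedule(person_ids: Sequence[Hashable], n_frames: int) -> List[Hashable]:
    """Which person gets the single hand-tracking inference on each frame."""
    return list(islice(cycle(person_ids), n_frames))


def update_interval_s(n_people: int, inference_fps: float) -> float:
    """Seconds between two updates of the same person (latency grows with n_people)."""
    return n_people / inference_fps


print(round_robin_schedule(["a", "b", "c"], 5))
print(update_interval_s(3, 15.0))
```

The overall inference rate stays the same; what changes is how stale each person's result is by the time their turn comes around again.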

Hi @"TheHiddenWaffle"#p13576 , How would tracking help with palm detection? The pipeline runs detection only once; after that it runs landmark detection and crops the ROI for it dynamically, without running hand detection again until it loses the landmarks. WRT multi-person MoveNet, I don't think it works on RVC2 (Myriad X); more info here: https://github.com/geaxgx/depthai_movenet/issues/3
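The control flow described above can be sketched as a small host-side loop. This is only an illustration under stated assumptions: `detect_palm`, `run_landmarks` and `roi_from_landmarks` are hypothetical callables standing in for the NNs and the ROI update, and the 0.5 score threshold is made up.

```python
from typing import Any, Callable, Dict, Iterable, Optional, Tuple


def run_tracking(
    frames: Iterable[Any],
    detect_palm: Callable[[Any], Optional[Any]],
    run_landmarks: Callable[[Any, Any], Tuple[Any, float]],
    roi_from_landmarks: Callable[[Any], Any],
    min_score: float = 0.5,
) -> Dict[str, int]:
    """Run detection only when there is no ROI; otherwise reuse the landmark-derived ROI."""
    calls = {"detect": 0, "landmarks": 0}
    roi = None
    for frame in frames:
        if roi is None:
            # No hand is being followed: pay for the (slower) detection pass.
            calls["detect"] += 1
            roi = detect_palm(frame)
            if roi is None:
                continue
        calls["landmarks"] += 1
        landmarks, score = run_landmarks(frame, roi)
        # Keep following the hand if the landmarks are trusted, else fall back to detection.
        roi = roi_from_landmarks(landmarks) if score >= min_score else None
    return calls
```

With a hand that is lost once in five frames, detection runs twice and landmark inference runs on every frame, which is why a separate object tracker would have little detection work left to replace.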

@"erik"#p13580 Sorry, I misread the original idea; that makes sense. I understand that the current implementation would most likely be too slow to be useful. I'm looking more into the issues, and if I understand correctly, one of the main problems is the downscaling from fp32 to fp16 (as well as some unsupported operations). Because of this, I've gone back to research. During that research I stumbled on the newest version of [MMPose](https://github.com/open-mmlab/mmpose), which features multiple workflows for [COCO whole body](https://github.com/open-mmlab/mmpose/tree/main/projects%2Frtmpose) and appears to be readily convertible to TensorRT's fp16 format (though I don't know if this is relevant to MyriadX). I was able to test it on CPU and got rather remarkable results. I'm not asking for a definitive answer, but does it seem possible for these to be used?

Hi @"TheHiddenWaffle"#1171 , It mostly depends on the NN operations (layers). Could you share the link to the lightest model weights (onnx/.pt..)? We can try to convert them and deploy them to the device. Thanks, Erik

@"erik"#p13867 [from here](https://github.com/open-mmlab/mmpose/tree/6d10b2ec81da7e252016b3154c7fdb46c403ecd8/projects/rtmpose#%EF%B8%8F-how-to-deploy-) [upl-image-preview url=https://discuss.luxonis.com/assets/files/2023-10-31/1698755448-762808-screenshot-from-2023-10-31-08-21-49.png] This appears to be the only one of the RTM whole-body models that provides small/tiny versions. The first entry (t) seems to have a broken download link for the ONNX model (the ONNX link points to a pth file which is not available), so the pth is the only one available, whereas the small has both a pth and an ONNX version. Hope this information helps.
- tiny pth: https://download.openmmlab.com/mmpose/v1/projects/rtmposev1/rtmpose-t_simcc-ucoco_dw-ucoco_270e-256x192-dcf277bf_20230728.pth
- small ONNX: https://download.openmmlab.com/mmpose/v1/projects/rtmposev1/onnx_sdk/rtmpose-s_simcc-ucoco_dw-ucoco_270e-256x192-3fd922c8_20230728.zip
- small pth: https://download.openmmlab.com/mmpose/v1/projects/rtmposev1/rtmpose-s_simcc-ucoco_dw-ucoco_270e-256x192-3fd922c8_20230728.pth
The model structure is described by the [tiny config](https://github.com/open-mmlab/mmpose/blob/main/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmpose-t_8xb64-270e_coco-wholebody-256x192.py) and the [small config](https://github.com/open-mmlab/mmpose/blob/main/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmpose-s_8xb64-270e_coco-wholebody-256x192.py). I'm really interested to hear the results of this. Being able to consume from the OpenMM projects in a more standardized way could be huge for expanding the DepthAI model zoo and could provide a lot of really good models for future DAI projects, and I would love to aid in this effort in any manner I can.

@"erik"#p13867 Thanks again for the assistance and willingness to help, both you and Luxonis are doing such impactful work in this field that never ceases to amaze me 🙏

@"Matija"#379 , could someone from the ML team try to convert this model and deploy it to RVC2? The openpose2 that we have a demo for is quite slow, so perhaps this would be a good alternative.

@"KlemenSkrlj"#862 do you mind taking a look?

Hi @"TheHiddenWaffle"#1171 There were two problems with the ONNX model that you provided:
- The output shapes included dynamic dimensions, whereas blobconverter needs fixed shapes.
- The .onnx model included a HardSigmoid op, which is not supported by the AI accelerator on OAK, so we had to replace it with a combination of more basic operations.
I'm attaching the fixed models (.onnx and .blob) together with simple inference scripts to test them out - [link](https://drive.google.com/file/d/169r0OiG_bC7w-6ipUUV0rqMvkDAtM34c/view?usp=sharing). It also includes a README with more information and step-by-step instructions. If you have any additional questions, feel free to reach out.
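For readers wondering what such a replacement can look like, here is a minimal NumPy sketch. It assumes the ONNX definition HardSigmoid(x) = max(0, min(1, alpha * x + beta)) with the ONNX default attributes alpha = 0.2 and beta = 0.5; an exported model may carry different values (PyTorch's hardsigmoid corresponds to alpha = 1/6). The exact decomposition used in the attached models is not stated in the post, so this shows one possibility only.

```python
import numpy as np


def hard_sigmoid_reference(x: np.ndarray, alpha: float = 0.2, beta: float = 0.5) -> np.ndarray:
    """HardSigmoid as defined by ONNX: clip(alpha * x + beta, 0, 1)."""
    return np.clip(alpha * x + beta, 0.0, 1.0)


def hard_sigmoid_basic_ops(x: np.ndarray, alpha: float = 0.2, beta: float = 0.5) -> np.ndarray:
    """The same function built from Mul, Add, Relu and Min style operations."""
    y = alpha * x + beta        # Mul followed by Add
    y = np.maximum(y, 0.0)      # Relu
    return np.minimum(y, 1.0)   # Min against the constant 1


x = np.linspace(-5.0, 5.0, 11)
print(np.allclose(hard_sigmoid_reference(x), hard_sigmoid_basic_ops(x)))
```

Checking the two versions against each other on a range of inputs is a cheap way to confirm that a graph rewrite did not change the model's output.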

@"KlemenSkrlj"#p14010 thank you so much, I appreciate the steps too as at some point I'd like to train the MMPose model further.

Multi-person Hand Tracker

TheHiddenWaffle

TheHiddenWaffle Made the changes I outlined in my edit. My averages for the double-input blob came out to 120 ms, which is still 16% faster on a per-detection basis, but the cost is a 73% increase in latency and, of course, cutting the frame rate almost in half (vs the single-body model) when only 1 body is present to process. Here's the model if anyone on this thread is interested, but I don't expect many would be. I also tested 3 inputs, but there was no meaningful improvement in any performance metric, and the latency jumped another 60 ms.
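The figures above fit together as follows. Note that the single-input latency is not stated in the post; in this sketch it is inferred from the 120 ms and 73% numbers, so treat it as an assumption.

```python
double_latency_ms = 120.0   # stated: average latency of the double-input blob
latency_increase = 0.73     # stated: latency increase vs the single-body model

# Inferred, not stated in the post: latency of the single-input model.
single_latency_ms = double_latency_ms / (1.0 + latency_increase)

# Two detections come out of every double-input inference.
per_detection_ms = double_latency_ms / 2.0
throughput_gain = single_latency_ms / per_detection_ms - 1.0

# Frame rate when only one body is in view (the second input is wasted).
fps_ratio = single_latency_ms / double_latency_ms

print(f"single-input latency ~ {single_latency_ms:.1f} ms")
print(f"per-detection speed-up ~ {throughput_gain:.0%}")
print(f"frame rate with one body ~ {fps_ratio:.0%} of the single-body model")
```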

TheHiddenWaffle

TheHiddenWaffle do you have any insights or workarounds that could solve this? I've thought about it for the last couple of days, but the only solution I could come up with was concatenating an additional dummy row and then removing it afterwards, and that feels like it could get me into trouble considering how little I understand about how Jointformer works.

TheHiddenWaffle

@KlemenSkrlj

KlemenSkrlj

TheHiddenWaffle Sorry for the late response. We are still investigating the export of this model and we will let you know as soon as we have any updates. Currently we are having the same problem as you mentioned (rank-2 input).

TheHiddenWaffle

KlemenSkrlj Thanks for the update, my apologies if I was too persistent

TheHiddenWaffle

KlemenSkrlj I got it working, I think. I edited the PyTorch source for Jointformer and changed the offending tensor from [1, 17024] to [1, 1, 17024], then verified that none of the operations that depend on it were affected. I'm working on building out the example I spoke about above for (croppedRotatedFrame, centerSizeAngles) -> (2d keypoints & confidence, 3d keypoints, 3d error), but for now here's the ONNX.

TheHiddenWaffle

Also, for anyone out there looking to use pose detection: I built a PyTorch model to post-process the keypoints and then used onnx.compose.merge_models to connect it to rtmpose. I used techniques similar to the point cloud tutorial to pass in the center/size as values between 0.0 and 6553.5 (sent multiplied by 10; the model multiplies by 0.1), and the cos/sin of the rotation angle, which the model maps back to between -1 and 1 by multiplying by 0.0001 and then subtracting 1:

    import cv2
    import numpy as np

    # Example rotated rect: center and size in pixels, rotation in radians
    center_point, size, rot_angle_rad = (231., 318.), (209., 279.), np.pi / 3
    rect = cv2.RotatedRect(center_point, size, np.degrees(rot_angle_rad))

    # Center and size are sent as tenths of a pixel so they fit in uint16
    modified_center = (np.array(center_point) * 10).astype(np.uint16)
    modified_size = (np.array(size) * 10).astype(np.uint16)
    # cos/sin are shifted from [-1, 1] to [0, 2] and scaled by 10000
    modified_angles = (np.array([np.cos(rot_angle_rad) + 1.0, np.sin(rot_angle_rad) + 1.0]) * 10000).astype(np.uint16)

    # buff is the buffer message sent to the model (its creation is not shown here)
    buff.setData(np.frombuffer(bytes(np.stack([modified_center, modified_size, modified_angles])), dtype=np.uint8))

This allows you to put the RRect config into an ImageManip node and then supply those same params to the model as well and get back pixel coords in the space of the original(pre-manip) image. The rotated aspect of it may be less useful to most unless the camera is rotated on the roll axis but it's there nonetheless.
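As a rough illustration of the round trip, the sketch below packs the parameters the same way, decodes them with the stated factors (0.1 for center/size, 0.0001 then minus 1 for cos/sin), and maps a keypoint from the crop back to the original frame. The mapping step is an assumption about what the post-processing model does: it takes keypoints normalized to [0, 1] within the crop and a standard 2D rotation, and the function names are hypothetical.

```python
import numpy as np


def encode_rrect(center, size, angle_rad) -> np.ndarray:
    """Pack center, size and angle into a (3, 2) uint16 array, as in the post."""
    center_u16 = (np.array(center) * 10).astype(np.uint16)
    size_u16 = (np.array(size) * 10).astype(np.uint16)
    angles_u16 = (np.array([np.cos(angle_rad) + 1.0, np.sin(angle_rad) + 1.0]) * 10000).astype(np.uint16)
    return np.stack([center_u16, size_u16, angles_u16])


def decode_rrect(raw_bytes: np.ndarray):
    """Undo the packing from the uint8 view that is sent to the device."""
    values = raw_bytes.view(np.uint16).reshape(3, 2).astype(np.float64)
    center = values[0] * 0.1
    size = values[1] * 0.1
    cos_a, sin_a = values[2] * 0.0001 - 1.0
    return center, size, cos_a, sin_a


def crop_to_frame(kps_norm: np.ndarray, center, size, cos_a, sin_a) -> np.ndarray:
    """Map (N, 2) keypoints in [0, 1] crop coordinates to original-frame pixels."""
    offsets = (kps_norm - 0.5) * size                    # pixels from the rect center
    rot = np.array([[cos_a, -sin_a], [sin_a, cos_a]])    # assumed rotation convention
    return center + offsets @ rot.T


packed = encode_rrect((231.0, 318.0), (209.0, 279.0), np.pi / 3)
raw = np.frombuffer(packed.tobytes(), dtype=np.uint8)
center, size, cos_a, sin_a = decode_rrect(raw)
print(crop_to_frame(np.array([[0.5, 0.5]]), center, size, cos_a, sin_a))
```

The quantization costs at most 0.1 px on center and size and about 1e-4 on cos/sin, which is small compared to keypoint noise. The sign of the rotation may need flipping depending on how the ImageManip rotation is defined.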

I'll post it on a GitHub repo later on once I clean up and organize everything, but here it is for now.

Once Jointformer is adapted and working I'll be merging it onto the outputs of rtmpose_post_rrect model to get full RGB frame->XYZ kps inference all in one go!

TheHiddenWaffle

TheHiddenWaffle It should be noted that this model does not take resize operations into account, i.e., if the ImageManip rrect is 288 x 384 in the original frame, the scaling of the model will be wrong. However, you can just multiply the x and y of the modified_size param by the ratio between the frames (in the above example, 384/256 == 1.5).

    # ratio of bbox size to model input size
    scale_factor = np.array(size) / np.array(model_input_shape)
    modified_size = (np.array(size) * 10 * scale_factor).astype(np.uint16)

TheHiddenWaffle

TheHiddenWaffle full (croppedRotatedFrame, centerSizeAngles) -> (2d keypoints & confidence, 3d keypoints, 3d error) model. Uses 4 encoder layers within Jointformer. This model currently isn't "calibrated" for (my/the oak) environment but I think it's cool enough that I'm sharing it now 😅

8 shave

onnx
