Speed of "stereo neural inference" VS "neural inference fused with depth map"

TheHiddenWaffle

I'm looking for the fastest way to get 3D coordinates for a hand landmark inference, and I don't mind a significant drop in depth accuracy (even up to 5% less accuracy would be fine).

At https://docs.luxonis.com/en/latest/pages/spatial-ai/ there are two given options for obtaining a depth reading when working with landmarks, the ones listed in the title of this thread.

My question is: which of these would be faster? I have a lot of other unrelated processing to do, so any significant difference in time between the two would be impactful. Would the answer change for 4 hands vs. 1 hand?

erik

Hi TheHiddenWaffle,
Neural inference fused with the depth map would be faster, since the main bottleneck is AI performance.
With stereo neural inference, you run the same AI model twice (once on the left image and once on the right), then triangulate the matching results to get the 3D point.
With neural inference fused with the depth map, you run the AI model once, then map its results onto the stereo depth image to get the 3D point.
Thoughts?
Thanks, Erik
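
For readers comparing the two options, here is a rough sketch of the post-inference math in each case. It is plain Python, not the DepthAI API, and it assumes rectified images, a pinhole camera model, and a depth map aligned to the image the model ran on. The function names and the intrinsics (`fx`, `fy`, `cx`, `cy`, `baseline_m`) are hypothetical placeholders. In both cases the per-landmark math is cheap; the difference erik describes is that the first option needs two model runs and the second needs one.

```python
from statistics import median


def triangulate_landmark(x_left, x_right, y, fx, fy, cx, cy, baseline_m):
    """Stereo neural inference: the same landmark was detected in both
    rectified images (two model runs). Depth comes from the disparity."""
    disparity = x_left - x_right  # pixels
    if disparity <= 0:
        return None  # no valid match for this landmark
    z = fx * baseline_m / disparity
    return ((x_left - cx) * z / fx, (y - cy) * z / fy, z)


def landmark_from_depth(u, v, depth_mm, fx, fy, cx, cy, radius=2):
    """Inference fused with depth map: one model run, then read the depth
    around the landmark pixel (u, v) and back-project it."""
    height, width = len(depth_mm), len(depth_mm[0])
    samples = [
        depth_mm[r][c]
        for r in range(max(0, v - radius), min(height, v + radius + 1))
        for c in range(max(0, u - radius), min(width, u + radius + 1))
        if depth_mm[r][c] > 0  # 0 is treated here as "no depth"
    ]
    if not samples:
        return None
    z = median(samples) / 1000.0  # millimetres to metres
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)
```

The median over a small window is just one way to tolerate holes in the depth map around thin structures like fingers; the window size is an arbitrary choice here.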

TheHiddenWaffle

erik Intuitively that's what I thought would be the case, but I figured I should seek confirmation rather than just assume and possibly get it wrong with my beginner/intermediate grasp of DepthAI.

This might not be entirely relevant to the original thread, but I'm thinking of using the OAK-D-SR-POE for this hand tracking. I noticed that the docs say its RGB cameras are 1 MP, as opposed to the other OAK-D devices, which boast 12 MP RGB cameras. Will this be an issue for implementing hand tracking as in https://github.com/geaxgx/depthai_hand_tracker ?

TheHiddenWaffle

Given, of course, that the tracked hands will be at short range (30-50 in from the camera).

erik

Hi TheHiddenWaffle, At a maximum of 1.5 m, I don't think 1 MP will be the bottleneck for finding the hand or recognizing the hand landmarks. Above roughly 2 meters, I'd imagine this could become an issue.
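
As a rough way to sanity-check this for a given setup, the sketch below estimates how many pixels a hand spans at a given distance using a pinhole model. The image width, horizontal field of view, and hand width used in the example loop are illustrative assumptions, not OAK-D-SR-POE specifications; substitute the values from the device documentation.

```python
import math


def hand_width_px(image_width_px, hfov_deg, hand_width_m, distance_m):
    """Approximate width in pixels of an object facing the camera,
    using a pinhole model with the given horizontal field of view."""
    focal_px = (image_width_px / 2) / math.tan(math.radians(hfov_deg) / 2)
    return focal_px * hand_width_m / distance_m


if __name__ == "__main__":
    # Assumed values for illustration only: 1280 px wide image, 80 degree
    # HFOV, 0.10 m hand width. 30-50 in is roughly 0.76-1.27 m.
    for distance_m in (0.76, 1.27, 1.5, 2.0):
        px = hand_width_px(1280, 80.0, 0.10, distance_m)
        print(f"{distance_m:.2f} m -> about {px:.0f} px across the hand")
```

The pixel span falls off in proportion to distance, which is consistent with resolution mattering more beyond 2 m than within the 30-50 in range.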