Hey folks,

I want to write my first Python script with the OAK-1 Lite.

My project is "Air Drums": I want to play drums in the air, triggering drum sounds for certain hand gestures or for specific locations in the image where my hands "hit".

The closest concept to it that I found was the Sign Language project.

I'd like your guidance on what to start with and how to build it.

Thanks!

    Hey jakaskerl

    Thanks for your reply! Great input.

    I will have a look at that hand tracker.

    So what is the minimum FPS needed for all of this to work well, in your opinion?

    Given a chain of stages that each consume some processing time, I'd say it looks like this (ignoring communication costs):

    frame loading time -> detection time -> location calculation time (hand relative to the air drum) -> sound playback -> loading the next frame.

    Could there be another stage that consumes time here? And which one is the bottleneck that I should set the FPS constraint by?

    I'd say that for the sounds to play without cuts, I'd need a parallel process for the playback part.
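
    Something like this is what I have in mind for the playback part (just a rough sketch; the simpleaudio package and the local snare.wav file are placeholders on my side):

    ```python
    # Rough sketch: decouple sound playback from the camera/detection loop,
    # so a slow frame never delays or cuts a sample.
    # Assumes the "simpleaudio" package and a local "snare.wav" (placeholders).
    import queue
    import threading
    import simpleaudio as sa

    hit_queue = queue.Queue()

    def playback_worker():
        snare = sa.WaveObject.from_wave_file("snare.wav")  # hypothetical sample
        while True:
            hit = hit_queue.get()   # block until the vision loop reports a hit
            if hit is None:         # sentinel to stop the worker
                break
            snare.play()            # non-blocking; audio keeps playing while we wait

    threading.Thread(target=playback_worker, daemon=True).start()

    # In the vision loop: hit_queue.put("snare") whenever a hit is detected.
    ```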

      Hi leront
      You would need some extreme hardware to run NN-augmented virtual instruments. This is especially a problem with image/video recognition. The FPS of the camera should be >100 just to recognize your hand movement (hitting the drum takes on average under 0.1 s, depending on the style of music), and even then you would only have 1-2 frames of the moving hand (probably blurry due to the low shutter speed of the camera) for NN inference. Neural networks take a long time to run inference (time you don't have). Then there is the latency of the signal itself (https://docs.luxonis.com/projects/api/en/latest/tutorials/low-latency/). Personally, anything above 50 ms (a stretch already) is unusable.

      So if you are just doing research and can accept high latency and very slow hand movements, go ahead; it should work. But if you actually want to play something on beat, I would highly encourage you to look for other ways. Image recognition is currently not advanced/fast enough to allow that without some high-end hardware.
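
      If you do go ahead, I'd start by measuring the actual camera-to-host latency. A minimal sketch along the lines of the linked tutorial (the FPS value is just a placeholder):

      ```python
      # Print the per-frame camera-to-host latency in milliseconds.
      import depthai as dai

      pipeline = dai.Pipeline()
      cam = pipeline.create(dai.node.ColorCamera)
      cam.setFps(60)  # placeholder FPS
      xout = pipeline.create(dai.node.XLinkOut)
      xout.setStreamName("rgb")
      cam.isp.link(xout.input)

      with dai.Device(pipeline) as device:
          q = device.getOutputQueue("rgb", maxSize=1, blocking=False)
          while True:
              frame = q.get()
              latency_ms = (dai.Clock.now() - frame.getTimestamp()).total_seconds() * 1000
              print(f"camera-to-host latency: {latency_ms:.1f} ms")
      ```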

      Thanks,
      Jaka

        jakaskerl

        Man, that's a great reply. I've learnt so much from you. Thanks.

        So I've played with that hand tracker repo, and it seems to run some regression task to find out whether there is any hand, but once a hand is detected, there's another class, named "Hand_tracker", that just keeps tracking it, saving time and increasing performance. So, funnily enough, once there's a hand in the frame, the performance actually gets better.
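
        In pseudocode-ish Python, my understanding of that detect-then-track flow is roughly this (my own simplification, not the repo's actual code; the two run_* functions just stand in for the neural networks):

        ```python
        # Rough mental model of the detect-then-track logic (not the repo's code).
        import random

        def run_palm_detection(frame):          # stand-in for the slower detection model
            return "hand_region"

        def run_landmark_model(frame, region):  # stand-in for the landmark/tracking model
            return "landmarks", random.random() # fake confidence score

        hand_region = None
        for frame in range(100):                # stand-in for the camera loop
            if hand_region is None:
                hand_region = run_palm_detection(frame)  # only when no hand is tracked
            landmarks, confidence = run_landmark_model(frame, hand_region)
            if confidence < 0.5:
                hand_region = None              # hand lost -> fall back to detection
        ```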

        The latter link you shared taught me that I could maybe switch to mono H.265-encoded frames, and by trading off color and resolution for a higher FPS (120), I could then get around the bottleneck.
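
        For reference, this is roughly the kind of pipeline change I'm imagining (just a sketch on the color camera; I haven't checked whether the OAK-1 Lite sensor actually allows 120 FPS, so the numbers are placeholders):

        ```python
        # Sketch of the trade-off: scale the image down, ask for a higher FPS,
        # and optionally H.265-encode the stream before it goes over USB.
        # Resolution/FPS values are placeholders, not verified for the OAK-1 Lite.
        import depthai as dai

        pipeline = dai.Pipeline()
        cam = pipeline.create(dai.node.ColorCamera)
        cam.setResolution(dai.ColorCameraProperties.SensorResolution.THE_1080_P)
        cam.setIspScale(1, 3)      # downscale 1080p -> 640x360 to cut bandwidth
        cam.setVideoSize(640, 360) # keep the video output within the scaled size
        cam.setFps(120)            # placeholder target FPS

        enc = pipeline.create(dai.node.VideoEncoder)
        enc.setDefaultProfilePreset(120, dai.VideoEncoderProperties.Profile.H265_MAIN)
        cam.video.link(enc.input)

        xout = pipeline.create(dai.node.XLinkOut)
        xout.setStreamName("h265")
        enc.bitstream.link(xout.input)
        ```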

        Do you think the tracking/detection performance would get worse (or not significantly so) by giving up color/resolution?

        Thanks!

          Hi leront
          Currently, the way this is implemented for the host-side postprocessing:

          In this case, you could probably switch to the mono camera and both models would still work.

          However, when running the pipeline in edge mode (with --edge; everything runs on the device), you need both mono cameras anyway, since you are also using the MobilenetSpatialDetection model, so changing the main camera wouldn't achieve anything.

          I'd suggest you look inside HandTracker.py to see how the pipeline is created and whether you are able to change it so it performs faster.
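
          For orientation, the pipeline construction in there boils down to roughly this shape (a simplified sketch, not the actual repo code; the blob path and input size are placeholders):

          ```python
          # Simplified shape of a camera -> NN -> host pipeline (placeholder values).
          import depthai as dai

          pipeline = dai.Pipeline()

          cam = pipeline.create(dai.node.ColorCamera)
          cam.setPreviewSize(128, 128)               # must match the NN input size
          cam.setInterleaved(False)

          palm_nn = pipeline.create(dai.node.NeuralNetwork)
          palm_nn.setBlobPath("palm_detection.blob") # placeholder model file
          cam.preview.link(palm_nn.input)

          xout = pipeline.create(dai.node.XLinkOut)
          xout.setStreamName("palm")
          palm_nn.out.link(xout.input)
          ```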

          Also, the image above was created using the pipeline graph tool, in case you wish to visualize the pipelines yourself.

          Hope this helps you,
          Jaka