Hi everyone!
We have some questions regarding the pipeline.
We have tested the body detector from OpenVINO (person-detection-0201) alone in a single-NN pipeline (camera -> image manip -> NN -> output) and measured an average of 21 FPS; we then tested the face detector (face-detection-retail-0004) alone in the same simple pipeline and measured 30 FPS. We then linked the two neural networks together (one after the other: the body is cropped from the original image and the crop is sent to the face detector) and measured the resulting FPS. With no bodies on screen the FPS stayed at 21, but having one body in view (and thus "enabling" the face detector) brings the average down to 14 FPS. We don't understand this behavior: the second neural network should be faster than the first one and, as far as we understand pipelining, the overall throughput should be (almost) equal to the throughput of the slowest node.
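For reference, the average-FPS numbers above were measured on the host side with a simple rolling counter along these lines (a minimal sketch; the window size is an arbitrary choice, not from our actual test code):

```python
import time
from collections import deque

class FPSCounter:
    """Rolling average FPS over the last `window` received frames."""
    def __init__(self, window=100):
        self.timestamps = deque(maxlen=window)

    def tick(self):
        # Call once per frame received from the output queue.
        self.timestamps.append(time.monotonic())

    def fps(self):
        if len(self.timestamps) < 2:
            return 0.0
        elapsed = self.timestamps[-1] - self.timestamps[0]
        return (len(self.timestamps) - 1) / elapsed

# Simulate frames arriving at roughly 21 FPS:
counter = FPSCounter()
for _ in range(18):
    counter.tick()
    time.sleep(1 / 21)
print(f"measured ~{counter.fps():.0f} FPS")
```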

From a previous (private) reply from the team, it seems this is because we are now running two NNs that share the same resources. Does this mean we have already used up all the resources the device has? We followed the instructions here: https://docs.luxonis.com/projects/api/en/latest/tutorials/debugging/ with the two NNs, and here is the result:

Memory Usage - DDR: 235.12 / 340.43 MiB, CMX: 2.50 / 2.50 MiB, LeonOS Heap: 24.41 / 77.32 MiB, LeonRT Heap: 20.64 / 41.23 MiB
Temperatures - Average: 40.39 C, CSS: 41.66 C, MSS: 39.81 C, UPA: 40.04 C, DSS: 40.04 C
Cpu Usage - LeonOS 21.19%, LeonRT: 56.91%

From this it seems only the CMX is full, while the other memories and the CPUs still have headroom. Also, the two OpenVINO networks only require 1.768 and 1.067 GFLOPs per inference respectively, so the device should keep up (it's an OAK-1, which should have 1.4 TOPS for AI). Are we doing something wrong?
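A back-of-the-envelope calculation with the figures above shows why raw TOPS alone would predict far higher frame rates than we measure - which suggests the bottleneck is elsewhere (scheduling, memory bandwidth, SHAVE/NCE allocation rather than peak compute). The numbers are the ones quoted in this post:

```python
# Theoretical FPS if peak compute were the only limit.
TOPS = 1.4e12          # OAK-1 theoretical peak, operations per second
body_gflops = 1.768e9  # person-detection-0201, operations per inference
face_gflops = 1.067e9  # face-detection-retail-0004, operations per inference

body_theoretical = TOPS / body_gflops
face_theoretical = TOPS / face_gflops
print(f"body: {body_theoretical:.0f} FPS theoretical vs 21 measured")
print(f"face: {face_theoretical:.0f} FPS theoretical vs 30 measured")
```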

Thank you for your time
Kind regards

Simone


    Hi sbellini,

    the overall throughput time should be (almost) equal to the throughput of the slowest node.

    That's not the case in this example: you have 2 NNs sharing the same resources, and running them simultaneously will of course decrease the FPS. Example: running 3 NNs (that each run at 30 FPS alone) in parallel also wouldn't yield a final 30 FPS each - but likely a bit lower than 10 FPS each.
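A crude illustration of that point (this is not a precise performance model, and the overhead figure is an arbitrary assumption): with N networks time-slicing the same SHAVEs/NCEs, each gets roughly 1/N of the compute, minus some scheduling overhead.

```python
def parallel_fps(solo_fps, n_models, overhead=0.15):
    """Approximate per-model FPS when n_models share one accelerator.

    overhead: assumed fractional loss to scheduling/contention (made up
    for illustration - real overhead varies per model and pipeline).
    """
    return solo_fps / n_models * (1 - overhead)

print(parallel_fps(30, 3))  # 8.5 - below the naive 30 / 3 = 10 FPS
```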

    Regarding resources, you can check Resource debugging - it shows the SHAVEs/CMX slices that you have and the NCEs that are allocated per thread (docs here on HW resources). The CPU isn't used for NN inferencing - it would be very, very slow.
    Regarding TOPS/GFLOPs - perhaps it would be best to use OpenVINO's benchmark_app tool to determine how fast you can run the model on CPU/GPU/VPU (Myriad).
    TL;DR: yes, if a model can only run at 21 FPS, you have already hit the resource limit there. Adding a second one will just decrease performance/FPS.
    Thoughts?
    Thanks, Erik

    4 days later

    Thank you for the quick reply!
    We tried analyzing the HW resources as you suggested; here are the results.

    With a single NN (body detector, compiled with 5 SHAVEs) we have:
    NeuralNetwork allocated resources: shaves: [0-12] cmx slices: [0-12]
    ColorCamera allocated resources: no shaves; cmx slices: [13-15]
    ImageManip allocated resources: shaves: [15-15] no cmx slices.
    DetectionNetwork(5) - Needed resources: shaves: 5, ddr: 9142272
    DetectionNetwork(5) - Inference thread count: 2, number of shaves allocated per thread: 5, number of Neural Compute Engines (NCE) allocated per thread: 1
    As also shown in the documentation you linked, the NN has 13 SHAVEs and 13 CMX slices allocated. In total, 14 SHAVEs, 16 CMX slices and 2 NCEs are allocated (even though the DetectionNetwork only uses 10 SHAVEs).

    I also tested the single-stage pipeline using the face detector (the one running at 30 FPS, compiled with 4 SHAVEs), and here are the results:
    NeuralNetwork allocated resources: shaves: [0-12] cmx slices: [0-12]
    ColorCamera allocated resources: no shaves; cmx slices: [13-15]
    ImageManip allocated resources: shaves: [15-15] no cmx slices.
    DetectionNetwork(5) - Needed resources: shaves: 4, ddr: 2728832
    DetectionNetwork(5) - Inference thread count: 2, number of shaves allocated per thread: 4, number of Neural Compute Engines (NCE) allocated per thread: 1
    The allocated resources seem to be the same as before, even though the FPS is higher with this NN.

    With multiple NNs (body, face, age/gender) we have:
    NeuralNetwork allocated resources: shaves: [0-12] cmx slices: [0-12]
    ColorCamera allocated resources: no shaves; cmx slices: [13-15]
    ImageManip allocated resources: shaves: [15-15] no cmx slices.
    DetectionNetwork(5) - Needed resources: shaves: 5, ddr: 9142272
    NeuralNetwork(18) - Needed resources: shaves: 4, ddr: 368640
    DetectionNetwork(11) - Needed resources: shaves: 4, ddr: 2728832
    DetectionNetwork(5) - Inference thread count: 2, number of shaves allocated per thread: 5, number of Neural Compute Engines (NCE) allocated per thread: 1
    NeuralNetwork(18) - Inference thread count: 2, number of shaves allocated per thread: 4, number of Neural Compute Engines (NCE) allocated per thread: 1
    DetectionNetwork(11) - Inference thread count: 2, number of shaves allocated per thread: 4, number of Neural Compute Engines (NCE) allocated per thread: 1

    So, is the key point here that we are using all the NCEs and (almost) all the SHAVEs with a single NN? Here we would need 26 SHAVEs and 6 NCEs. If so, then I understand why it does not behave like a pipeline when using more than one network.

    I tried compiling all the NNs with 2 SHAVEs in order to keep the total number of allocated SHAVEs below the available amount (2 SHAVEs * 2 threads per NN * 3 NNs < 13), but performance worsened.

    Also, is the body detection NN slower than the face detection one because of the internal layout of the network?
    We tried running benchmark_app as you suggested, but we couldn't make it run on the Myriad X; on CPU, the FPS of face-detection is more than twice that of the body-detection NN.

    Thank you again and kind regards

    Simone


      Hi sbellini,
      NNs share the NCEs/SHAVE cores, so ideally you would compile all models for AVAILABLE_SHAVES/2 SHAVEs, and they will run in parallel.
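Applying that rule of thumb to this pipeline (assuming the 13 SHAVEs usable for inference reported in the logs above):

```python
# AVAILABLE_SHAVES / 2 rule of thumb for compiling each model.
AVAILABLE_SHAVES = 13
shaves_per_model = AVAILABLE_SHAVES // 2
print(shaves_per_model)  # 6

# With the blobconverter package this could look like the following
# (commented out here since it needs network access; a sketch, not
# taken from the thread):
# import blobconverter
# blob_path = blobconverter.from_zoo(
#     name="face-detection-retail-0004", shaves=shaves_per_model)
```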

      Also, is the body detection NN slower than the face detection because of the internal layout of the network?

      Yes, the latency/FPS depends on the network itself. Face detection is quite a simple network, as it has been designed specifically for that task.
      Regarding benchmark_app - I quickly tried it and it seems you need to activate the OpenVINO environment, but their docs are a bit lacking, so I wasn't able to run it myself.
      Thanks, Erik

      So if we get the best performance with AVAILABLE_SHAVES/2 SHAVEs, and we know that each NN uses two threads with one NCE and AVAILABLE_SHAVES/2 SHAVEs each, it means that each running NN will always be using all available resources. In that case we are not in fact exploiting the pipeline advantages when using more than one network, as the resources are indeed already saturated.
      Do you have any advice on how to speed up pipelines with multiple NNs besides using less computation-intensive networks? If I modified the face-detection NN to take 5 body images together (a 5x3xwidthxheight input), that would help increase the FPS when multiple bodies are detected, but how could I create such tensors? And would ImageManip nodes be able to handle them? Do you have any demo exploiting this technique?
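For clarity, the host-side part of what I have in mind would just be stacking crops into one tensor, e.g. with NumPy (whether the device-side nodes would accept such a batch is exactly my question; the crop size here is an arbitrary example):

```python
import numpy as np

# Hypothetical batching of body crops into one 5x3xHxW tensor.
H, W = 62, 62
crops = [np.random.randint(0, 255, (3, H, W), dtype=np.uint8) for _ in range(5)]
batch = np.stack(crops)  # shape (5, 3, H, W)
print(batch.shape)       # (5, 3, 62, 62)

# Pad with zero images when fewer than 5 bodies are detected:
two_bodies = crops[:2]
padded = np.stack(two_bodies + [np.zeros((3, H, W), np.uint8)] * 3)
print(padded.shape)      # (5, 3, 62, 62)
```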
      Thank you again!
      Kind regards
      Simone


        Hi sbellini,

        Do you have any advice on how to speed up pipelines with multiple NNs

        There are ways - editing the models, switching layers for faster ones with minimal accuracy loss, improving layer computation with OpenCL, etc. - but it's far from easy.
        The idea I shared might not actually work. And as it's quite specific and requires a lot of effort/know-how, we haven't created any example for it. I can ask the ML team for pointers, but a lot of the work/learning/trying out would still fall on your team.
        Thanks, Erik