Thank you for the quick reply!
We analyzed the HW resource allocation as you suggested; here are the results.
With a single NN (body detector, compiled with 5 SHAVEs) we have:
NeuralNetwork allocated resources: shaves: [0-12] cmx slices: [0-12]
ColorCamera allocated resources: no shaves; cmx slices: [13-15]
ImageManip allocated resources: shaves: [15-15] no cmx slices.
DetectionNetwork(5) - Needed resources: shaves: 5, ddr: 9142272
DetectionNetwork(5) - Inference thread count: 2, number of shaves allocated per thread: 5, number of Neural Compute Engines (NCE) allocated per thread: 1
As the documentation you linked also shows, the NN has been allocated 13 SHAVEs and 13 CMX slices. In total, 14 SHAVEs, 16 CMX slices and 2 NCEs are allocated, even though the DetectionNetwork only uses 10 SHAVEs (2 threads * 5 SHAVEs).
I also tested the one-stage pipeline using the face detector (the one running at 30 FPS, compiled with 4 SHAVEs), and here are the results:
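For reference, the totals above can be reproduced from the log with a quick sketch. The helper name `count_allocated` is just for illustration, and it assumes the allocation lines always use the inclusive `[a-b]` range format shown here:

```python
import re

# Allocation lines copied verbatim from the log above.
LOG = """\
NeuralNetwork allocated resources: shaves: [0-12] cmx slices: [0-12]
ColorCamera allocated resources: no shaves; cmx slices: [13-15]
ImageManip allocated resources: shaves: [15-15] no cmx slices.
"""

def count_allocated(log, resource):
    """Sum the sizes of all inclusive [a-b] ranges listed for `resource`."""
    pattern = re.compile(re.escape(resource) + r": \[(\d+)-(\d+)\]")
    return sum(int(b) - int(a) + 1 for a, b in pattern.findall(log))

total_shaves = count_allocated(LOG, "shaves")
total_cmx = count_allocated(LOG, "cmx slices")
print(total_shaves, total_cmx)
```

Entries such as "no shaves" or "no cmx slices" simply do not match the pattern, so they contribute nothing to the sum.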
NeuralNetwork allocated resources: shaves: [0-12] cmx slices: [0-12]
ColorCamera allocated resources: no shaves; cmx slices: [13-15]
ImageManip allocated resources: shaves: [15-15] no cmx slices.
DetectionNetwork(5) - Needed resources: shaves: 4, ddr: 2728832
DetectionNetwork(5) - Inference thread count: 2, number of shaves allocated per thread: 4, number of Neural Compute Engines (NCE) allocated per thread: 1
The allocated resources appear to be the same as before, even though the FPS is higher with this NN.
With multiple NNs (body, face, age/gender) we have:
NeuralNetwork allocated resources: shaves: [0-12] cmx slices: [0-12]
ColorCamera allocated resources: no shaves; cmx slices: [13-15]
ImageManip allocated resources: shaves: [15-15] no cmx slices.
DetectionNetwork(5) - Needed resources: shaves: 5, ddr: 9142272
NeuralNetwork(18) - Needed resources: shaves: 4, ddr: 368640
DetectionNetwork(11) - Needed resources: shaves: 4, ddr: 2728832
DetectionNetwork(5) - Inference thread count: 2, number of shaves allocated per thread: 5, number of Neural Compute Engines (NCE) allocated per thread: 1
NeuralNetwork(18) - Inference thread count: 2, number of shaves allocated per thread: 4, number of Neural Compute Engines (NCE) allocated per thread: 1
DetectionNetwork(11) - Inference thread count: 2, number of shaves allocated per thread: 4, number of Neural Compute Engines (NCE) allocated per thread: 1
So, is the key point here that we are already using all NCEs and (almost) all SHAVEs with a single NN? With three NNs we would need 26 SHAVEs (2 threads * (5 + 4 + 4) SHAVEs) and 6 NCEs (2 threads * 3 NNs). If so, then I understand why it does not behave as a pipeline when more than one NN is used.
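The arithmetic behind those numbers, as a small sketch. The per-network values are copied from the "Inference thread count" log lines. The "available" figures are my reading of the allocation log (13 SHAVEs in the NeuralNetwork pool, 2 NCEs), and treating each thread's SHAVEs and NCEs as exclusive is an assumption, not something I have confirmed:

```python
# (shaves per thread, inference threads, NCEs per thread), copied from the log.
networks = {
    "DetectionNetwork(5)": (5, 2, 1),   # body detector
    "NeuralNetwork(18)": (4, 2, 1),     # age/gender
    "DetectionNetwork(11)": (4, 2, 1),  # face detector
}

# Assumed budget: the NeuralNetwork pool [0-12] and the 2 NCEs mentioned above.
AVAILABLE_SHAVES = 13
AVAILABLE_NCES = 2

requested_shaves = sum(s * t for s, t, _ in networks.values())
requested_nces = sum(n * t for _, t, n in networks.values())

print(f"SHAVEs: {requested_shaves} requested vs {AVAILABLE_SHAVES} available")
print(f"NCEs:   {requested_nces} requested vs {AVAILABLE_NCES} available")
```

If that assumption holds, the three networks ask for twice the SHAVEs and three times the NCEs that exist, so they would have to take turns rather than run concurrently.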
I tried compiling all the NNs with 2 SHAVEs in order to keep the total number of allocated SHAVEs below the available amount (2 SHAVEs * 2 threads per NN * 3 NNs = 12 < 13), but performance got worse.
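This is how I arrived at 2 SHAVEs per network, written out as a sketch. The function name is hypothetical, and it again assumes that every inference thread needs its own SHAVEs out of the 13-SHAVE NeuralNetwork pool:

```python
def max_shaves_per_network(pool, threads_per_network, network_count):
    """Largest per-thread SHAVE count for which every thread fits in the pool."""
    return pool // (threads_per_network * network_count)

shaves_each = max_shaves_per_network(pool=13, threads_per_network=2, network_count=3)
total_used = shaves_each * 2 * 3
print(shaves_each, total_used)
```

Under that assumption, 2 is the largest value that fits, and it leaves one SHAVE of the pool unused.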
Also, is the body detection NN slower than the face detection NN because of the internal layout of the network?
We tried running the benchmark_app as you suggested, but we couldn't get it to run on MyriadX. On CPU, the FPS of the face-detection NN is more than twice that of the body-detection NN.
Thank you again and kind regards
Simone