alexandrebenoit
Hey, we didn't do a deep dive into the slow operations to pinpoint exactly where the issue lies. Based on experience, I would say it's slow because of:
- A lot of splitting, slicing, and concatenations.
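- SiLU - you can see there are a lot of "branch-outs" due to the SiLU activation: it computes x * sigmoid(x), so the input has to feed both the sigmoid and the multiply. Compare this with YoloV6, which uses ReLU and the reparameterization trick as in RepVGG, and you can see why V8 or similar is slower.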
- MHSA module definitely doesn't help and is likely the "cherry on top".
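To make the SiLU point a bit more concrete, here is a minimal sketch in plain Python (not the exported graph, just the math). It assumes the usual definition SiLU(x) = x * sigmoid(x); the function names are only for illustration.

```python
import math


def relu(x: float) -> float:
    # ReLU is a single elementwise op: one node, one input, no branching.
    return max(0.0, x)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def silu(x: float) -> float:
    # SiLU needs two ops, and x is consumed twice:
    # once by the sigmoid and once by the multiply.
    # In a graph this shows up as a branch after every conv.
    gate = sigmoid(x)
    return x * gate


if __name__ == "__main__":
    for v in (-2.0, 0.0, 2.0):
        print(v, relu(v), round(silu(v), 4))
```

So every SiLU costs an extra node plus a fork in the graph, while ReLU stays a single node that is typically cheap to fuse into the preceding convolution.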
You can see the ONNX file we use here for reference. Feel free to compile it to a blob and benchmark it. If you want to dive into optimization yourself, you can use it as the baseline.
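As a quick first look before benchmarking, you could count how often the suspect op types appear in the ONNX graph. This is only a rough sketch: the list of "suspect" ops is my assumption based on the points above, the example op list is made up, and "model.onnx" in the comment is a placeholder path. Op counts are not latencies, so treat this as a hint of where to look, nothing more.

```python
from collections import Counter
from typing import Dict, Iterable

# Assumed list of op types worth checking (split/slice/concat and the
# Sigmoid + Mul pair that SiLU usually exports to).
SUSPECT_OPS = ("Split", "Slice", "Concat", "Sigmoid", "Mul")


def suspect_summary(op_types: Iterable[str]) -> Dict[str, int]:
    # Count every op type, then report only the suspect ones.
    counts = Counter(op_types)
    return {op: counts.get(op, 0) for op in SUSPECT_OPS}


# With the onnx package installed, the op types can be read like this:
#   import onnx
#   model = onnx.load("model.onnx")  # placeholder path
#   ops = [node.op_type for node in model.graph.node]

# Made-up op list, just to show the output shape.
example_ops = ["Conv", "Sigmoid", "Mul", "Split", "Conv", "Sigmoid", "Mul", "Concat"]

if __name__ == "__main__":
    print(suspect_summary(example_ops))
```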
If you want per-op performance, OpenVINO also provides a benchmark app (benchmark_app) that can report per-layer latencies when run with performance counters enabled (the -pc option). Note that we are more focused on releasing this than on optimizing it, given that the gain for the nano version is 1% mAP compared to V6.