Multi-input Vision Transformer quantization

unseeing6866

Has anyone had success trying to convert with SNPE and quantize a vision transformer model? I've been trying to work with this exact one: https://huggingface.co/minchul/cvlface_adaface_vit_base_kprpe_webface12m.

I could not for the life of me get it working with the Hub, but I managed to gain some traction with the modelconverter library. Using the latest SNPE (2.44 something) I managed to get it converted down to FP16, but when I attempt quantization the results are unusable. The model is a feature extractor, but the features vary way too much from the original to be usable. Unsure if this is a function of trying to quantize a Vision Transformer model, or if I've done something wrong. I managed to convert and quantize a resnet100 with no issue, so I thought it would be possible for a ViT as well.
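One way to put a number on "the features vary too much" is to run the same images through the original model and the quantized one and compare the embeddings with cosine similarity. The sketch below is only an illustration: the function names are made up for this post, the vectors are toy data rather than real model outputs, and nothing here is part of SNPE or modelconverter.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embeddings given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero-norm embedding")
    return dot / (norm_a * norm_b)

def mean_similarity(reference, candidate):
    """Average per-image cosine similarity between two lists of embeddings."""
    sims = [cosine_similarity(r, c) for r, c in zip(reference, candidate)]
    return sum(sims) / len(sims)

# Toy stand-ins for embeddings of the same two images (hypothetical values).
original = [[0.2, 0.9, -0.1], [0.7, -0.3, 0.5]]
quantized = [[0.25, 0.85, -0.05], [0.1, 0.6, -0.4]]
print(mean_similarity(original, quantized))
```

A mean close to 1.0 would suggest the quantized features still point the same way as the originals; how low is too low depends on the matching threshold used downstream.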

Has anyone managed to quantize and run a Vision Transformer on depthaiv3 before? How about a multi-input model?

KlemenSkrlj

Hi @unseeing6866 ,
Happy to see that you are experimenting with more advanced models.
We are aware that transformer-based architectures are very tricky to quantize correctly. With regular quantization techniques (INT8 + "normal" calibration dataset) we observed similar issues: the conversion goes through, but the actual output of the model is not really usable because the drop in performance is too large. Unfortunately, we don't yet have a universal way to solve this.
One approach that should help is mixed precision conversion: either INT8/INT16, which is already supported in modelconverter, or INT8/FP16. The latter is not yet supported in modelconverter because of some pending bugs in SNPE, which we reported to Qualcomm and are awaiting results on. There is, however, a way to pass a custom quantization encoding .json file to the SNPE conversion commands, in which you specify which blocks should use which precision. This gives you full control, so you can put layers that are known to be unstable specifically into FP16. At the end of the day, there are probably just a few OPs in the network that make the whole model unstable after quantization. Of course, mixed precision models are not "free": they are slower than pure INT8, and if you are not careful they can end up even slower than FP16, because a conversion needs to happen before and after every change of precision inside the network.
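To illustrate why a handful of OPs can dominate the error, here is a small sketch of symmetric per-tensor quantization on synthetic values. It is an assumption-laden toy, not SNPE's actual quantizer: the helper names and the data are invented for this example. A single large outlier stretches the INT8 scale so that the small values collapse onto a few levels, while more bits on that one tensor recover most of the precision.

```python
def quantize_dequantize(values, bits):
    """Symmetric per-tensor quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax  # one scale shared by the whole tensor
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_abs_error(values, bits):
    """Mean absolute round-trip error at the given bit width."""
    restored = quantize_dequantize(values, bits)
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)

# Synthetic activations: well-behaved values, then the same plus one outlier.
small = [i / 100.0 for i in range(-50, 51)]
with_outlier = small + [60.0]

print("INT8, no outlier:   ", mean_abs_error(small, 8))
print("INT8, with outlier: ", mean_abs_error(with_outlier, 8))
print("INT16, with outlier:", mean_abs_error(with_outlier, 16))
```

In this toy setup the outlier tensor is the kind of candidate you would move to a higher precision while leaving the rest in INT8.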
Internally, we decided that we first need to build the tools that let us properly evaluate and compare models, and then use them to solve issues like this. To that end, we'll be releasing an evaluation framework where you'll be able to directly compare different SNPEs on the same dataset and actually compute metrics. This way we can see exactly how much the performance drops as we lower the precision. After that, we hope Qualcomm addresses the pending issues with mixed precision so that we can pick up the work we already started on automatic mixed precision conversion, which will decide which operations should run in which precision based on their quantization sensitivity.
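As a rough picture of what sensitivity-based precision assignment could look like, the sketch below keeps the most sensitive operations in a higher precision and leaves the rest in INT8. This is not the planned modelconverter implementation: the function, the op names, and the scores are all hypothetical, and the scores are assumed to come from some separate per-op evaluation.

```python
def assign_precisions(sensitivity, budget, high="fp16", low="int8"):
    """Keep the `budget` most sensitive ops in `high` precision, the rest in `low`.

    sensitivity: dict mapping op name -> sensitivity score (higher = more fragile).
    """
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    keep_high = set(ranked[:budget])
    return {op: (high if op in keep_high else low) for op in sensitivity}

# Hypothetical op names and scores, for illustration only.
scores = {"softmax_3": 0.91, "layernorm_7": 0.64, "matmul_2": 0.05, "conv_stem": 0.02}
plan = assign_precisions(scores, budget=2)
for op, precision in plan.items():
    print(op, precision)
```

Keeping the budget small matters because, as noted above, every precision change inside the network adds a conversion.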
Looking at the timeline, we hope to get this out in the next couple of months.
An extra note on SNPE versions: the latest DepthAI currently uses 2.32.6. We'll bump that to 2.41 with the next DepthAI release, and after that we'll bump the SNPE version every quarter. Our plan is to match the SNPE version shipped with the latest Qualcomm Linux Image, which is also released quarterly (with a bit of delay, because we first need to test and integrate every new SNPE version across our stack).

As for multi-input models: those should generally be supported. We already have quite a few of them in the Zoo and oak-examples, for example the YOLO-World model and its example here.

I hope this works.
Best,
Klemen