JanCuhel I did manage to train a YOLOv7 network and it works pretty well: excellent FPS and reasonably good detection. The NN input is [416,416], so the RGB preview is also that size; I also tried a larger preview linked to the host plus an ImageManip outputting [416,416] to the NN, and that appears to be equivalent. My question is about getting a wider aspect ratio to make better use of the OAK's FOV. What I get now is cropping to force the square NN dimensions, which leaves only about 2/3 of the FOV.
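For reference, the larger-preview variant I tried looks roughly like this (a minimal sketch using the standard depthai Python API; the blob path is a placeholder):

```python
import depthai as dai

pipeline = dai.Pipeline()

# Wide preview, closer to the sensor's full FOV
cam = pipeline.create(dai.node.ColorCamera)
cam.setPreviewSize(1280, 800)
cam.setInterleaved(False)

# Resize down to the square NN input
manip = pipeline.create(dai.node.ImageManip)
manip.initialConfig.setResize(416, 416)
manip.setMaxOutputFrameSize(416 * 416 * 3)
cam.preview.link(manip.inputImage)

# Feed the resized frames to the YOLO detection network
nn = pipeline.create(dai.node.YoloDetectionNetwork)
nn.setBlobPath("yolov7_416x416.blob")  # placeholder path
manip.out.link(nn.input)
```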
It may be possible to train a network that takes an RGB preview of [1280,800] scaled down to, say, [416,256], but how do I train for that? I did try `--rect` with `--img 416 256`: after 100 epochs the F1, R, and P graphs were terrible. Some searching around suggests that `--rect` plus `--img 416` alone might be the correct way, but how could that work? After all, the NN will expect a specific input image size. Is that simply specified when doing the blob conversion? By the way, the training images appear to be 480x360, in case that matters.
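If the input shape really is fixed at conversion time, I'm guessing it would look something like this (just my sketch with the blobconverter package; the ONNX file name and the [416,256] shape are placeholders, and I'm assuming the shape can be passed through to the OpenVINO model optimizer):

```python
import blobconverter

# Sketch: convert an ONNX export of the trained model, forcing a
# rectangular input shape at conversion time (NCHW: 256 high x 416 wide)
blob_path = blobconverter.from_onnx(
    model="yolov7_tiny.onnx",  # placeholder exported model
    data_type="FP16",
    shaves=6,
    optimizer_params=[
        "--input_shape=[1,3,256,416]",
    ],
)
print(blob_path)
```

Does something along those lines make sense, or does the rectangular shape need to be baked in at training/export time instead?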