I've been training my own YOLOv5 model, which gives quite a nice result when I run inference through the yolov5 repo, but the results are significantly different (and worse) when I run it on an OAK-D. I've been trying to figure out what the culprit is, but I'm unsure how to properly debug this.
My current focus is to get the same result from inference in PyTorch as on the camera, so for now I'm using the pretrained yolov5s model from the yolov5 repo to rule out my own model as the problem. I've resized one of their example images to 448x448, my target size (see below).
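For reference, a minimal sketch of that resize step (assuming a plain cv2.resize, without the letterboxing that detect.py applies internally):

import cv2

# resize the yolov5 example image to the 448x448 target size (no letterboxing)
image = cv2.imread("bus.jpg")
image = cv2.resize(image, (448, 448))
cv2.imwrite("bus2.jpg", image)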
In the PyTorch repo, I can get an output image with the following command:
python detect.py --weights yolov5s.pt --source bus2.jpg --img 448
This gives me the result below. I've removed the label names so that only the confidence values are printed on the boxes.
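To get those confidence values as numbers rather than reading them off the image, something like this should also work on the PyTorch side; this is a sketch using the torch.hub interface to yolov5, which I'm assuming behaves the same as detect.py apart from preprocessing details:

import torch

# load the pretrained yolov5s model through torch.hub
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')

# run inference on the resized image at the 448 target size
results = model('bus2.jpg', size=448)

# results.xyxy[0] is an (N, 6) tensor: x1, y1, x2, y2, confidence, class
for *box, conf, cls in results.xyxy[0].tolist():
    print(box, conf, int(cls))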
Then I use the code below to run the same image through the OAK-D, with the .blob generated by the Luxonis conversion tool (adjust the locations of the yolov5s .blob, .json and the image yourself).
from pathlib import Path
import depthai as dai
import matplotlib.pyplot as plt
import numpy as np
import cv2
import os
import json
# load model json and blob
model_name = "yolov5s"
model_dir = Path("yolov5s")
model_config_path = model_dir / (model_name + '.json')
with open(model_config_path) as fp:
    config = json.load(fp)
model_blob_path = model_dir / (model_name + '.blob')
model_config = config['nn_config']
labels = config['mappings']['labels']
metadata = model_config['NN_specific_metadata']
coordinate_size = metadata['coordinates']
anchors = metadata['anchors']
anchor_masks = metadata['anchor_masks']
iou_threshold = metadata['iou_threshold']
confidence_threshold = metadata['confidence_threshold']
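# For reference, the json written by the Luxonis export tool has roughly this shape
# (example values for a stock yolov5s at 448x448 -- an assumption for illustration,
#  not copied from my actual file):
#
#   "nn_config": {
#       "NN_specific_metadata": {
#           "classes": 80,
#           "coordinates": 4,
#           "anchors": [10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119,
#                       116, 90, 156, 198, 373, 326],
#           "anchor_masks": {"side56": [0, 1, 2], "side28": [3, 4, 5], "side14": [6, 7, 8]},
#           "iou_threshold": 0.5,
#           "confidence_threshold": 0.5
#       }
#   },
#   "mappings": {"labels": ["person", "bicycle", ...]}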
# build pipeline
pipeline = dai.Pipeline()
detection = pipeline.create(dai.node.YoloDetectionNetwork)
detection.setBlobPath(model_blob_path)
detection.setAnchors(anchors)
detection.setAnchorMasks(anchor_masks)
detection.setConfidenceThreshold(confidence_threshold)
detection.setNumClasses(len(labels))
detection.setCoordinateSize(coordinate_size)
detection.setIouThreshold(iou_threshold)
detection.setNumInferenceThreads(2)
detection.input.setBlocking(False)
detection.input.setQueueSize(1)
# XLinkIn pushes frames from the host into the detection network,
# XLinkOut sends the parsed detections back to the host
xin = pipeline.create(dai.node.XLinkIn)
xin.setStreamName("frameIn")
xin.out.link(detection.input)
detection_out = pipeline.create(dai.node.XLinkOut)
detection_out.setStreamName("detectionOut")
detection.out.link(detection_out.input)
device = dai.Device(pipeline)
qIn = device.getInputQueue("frameIn")
qOut = device.getOutputQueue("detectionOut", maxSize=10, blocking=False)
# Create ImgFrame message
image = cv2.imread("bus2.jpg")  # already resized to 448x448
imsize = 448
img = dai.ImgFrame()
img.setData(image.transpose(2, 0, 1))  # interleaved HWC -> planar CHW
img.setWidth(imsize)
img.setHeight(imsize)
qIn.send(img)
frame_out = qOut.get()  # blocks until the detections for this frame arrive
fig, ax = plt.subplots()
ax.imshow(image[:, :, [2, 1, 0]])  # BGR -> RGB for matplotlib
for detection in frame_out.detections:
    xmin = detection.xmin
    xmax = detection.xmax
    ymin = detection.ymin
    ymax = detection.ymax
    # coordinates are normalised to [0, 1], so scale back to pixels
    xpos = np.array([xmin, xmax, xmax, xmin, xmin]) * imsize
    ypos = np.array([ymax, ymax, ymin, ymin, ymax]) * imsize
    print(detection.confidence)
    ax.plot(xpos, ypos)
plt.show()
This gives me roughly the same bounding boxes as the PyTorch implementation, but not exactly the same. The detection confidences are also a bit off (compare the values below with those in the image):
0.887598991394043
0.8628273010253906
0.8618507385253906
0.7907819747924805
0.4408433437347412
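To make that comparison less eyeball-based, here is a rough sketch of how the two sets of detections could be matched up by IoU; the compare helper below is hypothetical and expects both lists as ([xmin, ymin, xmax, ymax], confidence) tuples, taken from detect.py and the script above:

def iou(a, b):
    # a, b: [xmin, ymin, xmax, ymax], either normalised or in pixels
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def compare(pytorch_dets, oak_dets):
    # match every PyTorch detection to the OAK-D detection with the highest IoU
    for box, conf in pytorch_dets:
        best_box, best_conf = max(oak_dets, key=lambda d: iou(box, d[0]))
        print(f"IoU {iou(box, best_box):.3f}: "
              f"conf {conf:.3f} (pytorch) vs {best_conf:.3f} (oak)")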
Unfortunately, when I go through the same steps with my own network on one of my own images, the results are significantly off. The fact that there is still a difference even with the stock yolov5s model makes me believe that something in the conversion from .pt to .blob is messing things up for me.
How would you suggest I debug this further? Is it reasonable to believe something is happening in the conversion from .pt to .blob, and could I counteract it? If you want, I can also send you my own trained model and example image, but I'd rather not share those publicly.