Problem using Yolov5 detection/segmentation custom model on host

Rezahojjatysaeedy

Hi,

I'm following this tutorial for a model that I trained on my custom dataset with YOLOv5. I only have one class (eye surface) and I want to detect the bounding box around it. The model works fine in the YOLOv5 pipeline, but when I convert it using the Luxonis tool and follow the code in the host-decoding folder above, I get the following error:

Traceback (most recent call last):
  File "~/depthai-experiments/gen2-yolo/host-decoding/main.py", line 170, in <module>
    boxes = non_max_suppression(output, conf_thres=conf_thresh, iou_thres=iou_thresh)
  File "~/depthai-experiments/gen2-yolo/host-decoding/util/functions.py", line 29, in non_max_suppression
    xc = prediction[..., 4] > conf_thres  # candidates
IndexError: index 0 is out of bounds for dimension 1 with size 0

I wonder what I'm doing wrong or what I need to change?

I also tried other values instead of 4 in xc = prediction[..., 4] and got the same error.
I trained the NN with a 640x640 image size. For now I really want to get past this error, but my final goal is to get the polygon masks that I used to label the data for instance segmentation. Is that possible?
By the way, when I print prediction.shape I get torch.Size([1, 10647, 0]).

I can provide the blob if you want to test it yourself.

jakaskerl

Hi Rezahojjatysaeedy
Have you tried using the YoloDetectionNetwork node instead? It should feature on-device decoding.

Example here.
You can provide blob too if you wish, or maybe some minimal code I can use directly.

Thanks,
Jaka

Rezahojjatysaeedy

Hi jakaskerl,

Thanks for the reply. The device side won't work for me: I already have a heavy pipeline and I need at least 20 fps. I tried it and it only gives me 7 fps. As for the code, I have not incorporated it into my pipeline yet. All I'm doing is replacing the blob on line 42 here with my trained blob and changing the 80-element labelMap list to a one-element labelMap = ["fissure"], because I only have one class, 'fissure'. I cannot upload the model here, probably because it's too large (14 MB), but I'm sharing a link to it.

https://drive.google.com/drive/folders/1pQaj04wSzYs5fZlmfM1ZQYmKzkFnMa20?usp=sharing

jakaskerl

Hi Rezahojjatysaeedy
The host decoding currently expects a different model with different output dimensions. I tested your model, and upon running

layers = in_nn.getAllLayerNames()
print("Layers: ", layers)

I get ['output1_yolov5', 'output2_yolov5', 'output3_yolov5']. Not sure which one to use and what the end resolution should be. Basically every output here is scaled down by a factor of 4 (first 115200, then 28800, then 7200).
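The three sizes quoted above are consistent with raw YOLOv5 head tensors, assuming the usual three anchors per grid cell and strides of 8, 16 and 32 (neither is confirmed for this particular blob). A quick check of that arithmetic for a 640x640 input and one class, where head_sizes is a hypothetical helper and not part of depthai:

```python
def head_sizes(input_size, num_classes, anchors_per_cell=3, strides=(8, 16, 32)):
    """Element count of each raw head, assuming a [anchors * (5 + classes), H, W] layout."""
    values_per_cell = anchors_per_cell * (5 + num_classes)
    return [values_per_cell * (input_size // s) ** 2 for s in strides]

# 640x640 input, one class: 18 values per cell on 80x80, 40x40 and 20x20 grids
print(head_sizes(640, 1))  # [115200, 28800, 7200]
```

Under that assumption the factor of 4 between outputs is simply the grid halving in both width and height from one stride to the next.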

Thanks,
Jaka

Rezahojjatysaeedy

Thanks jakaskerl, that was really helpful for debugging. In main.py there is the line cols = output.shape[0]//10647, which, given output.shape[0] = 63888, makes cols = 6, but all of this looks a bit arbitrary. Can you please elaborate on where these numbers come from? Maybe that will help me understand better what's going on. By the way, I have no idea why I have three outputs. In the end it should just detect a box around the eye.

jakaskerl

Hi Rezahojjatysaeedy
The number 10647 seems to be specific to the stock model used. It's used to properly parse the results from the model.
When making the model, you should have specified the output layer size. This should carry over to the .blob file as well, but it will be specific to your model and how you configured the layers.
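One plausible origin for 10647, assuming the stock model takes a 416x416 input and uses the standard YOLOv5 layout (three anchors per cell at strides 8, 16 and 32), is the total number of candidate boxes across all three grids. The sketch below checks that arithmetic; total_predictions is a hypothetical helper:

```python
def total_predictions(input_size, anchors_per_cell=3, strides=(8, 16, 32)):
    """Number of candidate boxes summed over all grids for a square input."""
    return sum(anchors_per_cell * (input_size // s) ** 2 for s in strides)

print(total_predictions(416))  # 10647 = 3 * (52*52 + 26*26 + 13*13)
print(total_predictions(640))  # 25200, so 10647 would not fit a 640x640 model

# cols is then the number of values per box: 5 + num_classes
print(63888 // 10647)          # 6
# Floor division hides a mismatch: 6 * 10647 is 63882, not 63888
print(63888 % 10647)           # 6
```

If this reading is right, a model trained at a different input size needs the constant replaced, and a non-zero remainder is a hint that the buffer does not have the layout the script expects.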

Thanks,
Jaka

Rezahojjatysaeedy

Thanks jakaskerl

I managed to make it work. But it's very slow, about 7 fps, the same speed I was getting on device, and unlike the device deployment I'm now not getting any detections. When I tested the default blob, `yolov5s_sku_openvino_2021.4_6shave.blob`, on host I was getting 18 fps, and both blobs have the same size (about 14 MB). Do you have any idea what might be causing this issue?

jakaskerl

Rezahojjatysaeedy
Check what the bottleneck is:

- size of the frame passed to the NN
- host decoding
- model
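A minimal way to separate those candidates is to time each stage of the host loop individually. The sketch below uses only the standard library; the stage names and the timed/report helpers are illustrative and not part of the host-decoding example:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(list)  # stage name -> list of durations in seconds

@contextmanager
def timed(stage):
    """Record how long the enclosed block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage].append(time.perf_counter() - start)

def report():
    """Average duration per stage in milliseconds."""
    return {stage: 1000 * sum(v) / len(v) for stage, v in timings.items()}

# Stand-in workload; in the real loop this would wrap the queue read,
# the decoding call and the drawing code separately.
with timed("decode"):
    sum(i * i for i in range(100_000))
print(report())
```

If the wait for the NN output dominates, the model or its input size is the limit; if the decode stage dominates, the host-side post-processing is.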

Rezahojjatysaeedy wrote: "I'm not getting any detection"

Can you make sure this is not just a decoding issue? Perhaps something is decoded incorrectly (consult GPT-4 with the output if you can).

Thanks,
Jaka

Rezahojjatysaeedy

Hi jakaskerl,

I trained another network with the same input size as your example and I'm getting a similar fps to yours. But I noticed a difference between host visualization and device visualization. On device you used the frameNorm() function, which scales the boxes with respect to the frame shape. No such scaling exists in host-decoding, which leaves the box coordinates as small float numbers. Now when I use frameNorm on host, these are my only boxes:

x1: 0 y1: 0 x2: 208 y2: 208
x1: 208 y1: 208 x2: 416 y2: 416

Playing with the IoU and confidence thresholds doesn't make it better. I know the model should do better than this, since it detects correctly on the device side. Do you have any idea what might be going wrong in the host implementation?
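One thing that may be worth ruling out is the value range going into the scaling step. The sketch below is a stand-in for a frameNorm-style helper (clip to [0, 1], then multiply by frame width and height), written from the description above rather than copied from the example code, so treat its exact behavior as an assumption:

```python
import numpy as np

def frame_norm(frame_shape, bbox):
    """Scale a normalized [x1, y1, x2, y2] box to pixel coordinates."""
    h, w = frame_shape[:2]
    scale = np.array([w, h, w, h], dtype=np.float32)
    clipped = np.clip(np.asarray(bbox, dtype=np.float32), 0.0, 1.0)
    return (clipped * scale).astype(int)

def looks_normalized(bbox):
    """True if every coordinate already lies in [0, 1]."""
    b = np.asarray(bbox, dtype=np.float32)
    return bool(b.min() >= 0.0 and b.max() <= 1.0)

shape = (416, 416, 3)
print(frame_norm(shape, [0.25, 0.5, 0.75, 1.0]))  # [104 208 312 416]
print(frame_norm(shape, [120, 80, 300, 260]))     # [416 416 416 416], pixels saturate
```

On a 416-pixel frame, 208 is exactly 0.5 of the width, so boxes made only of 0, 208 and 416 correspond to inputs of exactly 0, 0.5 and 1. Printing the raw values before scaling would show whether the decoder is producing real coordinates at that point.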

jakaskerl

Hi Rezahojjatysaeedy
Could you post your current code to the drive? If you have problems with host-side decoding, you can usually consult GPT, and there is a high chance it will solve it for you.

Thanks,
Jaka

Rezahojjatysaeedy

Hi jakaskerl

I tried ChatGPT. It just gives some general advice about how to debug it and so on. I copied the link to the entire host-decoding folder. Currently I'm using best.blob, which is a detection model trained in YOLOv5. It's supposed to draw bounding boxes around the eye fissure.

https://drive.google.com/drive/folders/1H9SQyroWo9O4fa_6Pe8lPZC8INk4gLPw?usp=sharing

jakaskerl

Hi Rezahojjatysaeedy
You are only looking at the largest output.

Understanding the Outputs: YOLOv5 typically gives three outputs corresponding to three different scales. Each output contains a set of bounding boxes predicted at that scale. The shape of these outputs is usually [number_of_boxes, 5 + number_of_classes], where number_of_boxes depends on the scale.
Processing Each Scale: You need to process each of these outputs separately. Each output will have its own set of bounding boxes, and you'll need to apply the same decoding logic (converting center coordinates to corner coordinates, applying confidence threshold, and NMS) to each.
Combining Results from All Scales: After processing each output, you should combine the results to get the final set of detections. This is where NMS is crucial to remove duplicates and overlapping boxes.
Coordinate Scaling: Since YOLOv5 operates on a normalized coordinate system, you might need to scale the bounding box coordinates back to the original image dimensions.

Here's a more detailed approach:


import numpy as np

# Assumed to be defined elsewhere in the script: labelMap, conf_thresh,
# iou_thresh, draw_boxes, and a non_max_suppression(boxes, iou_thresh) helper
# that takes rows of [x1, y1, x2, y2, score, class_id]. That call shape differs
# from non_max_suppression(prediction, conf_thres=..., iou_thres=...) in
# util/functions.py, so one of the two has to be adapted.

def process_output(output, img_width, img_height):
    num_classes = len(labelMap)
    num_values_per_detection = 5 + num_classes
    # Assumes a flat, already-decoded buffer laid out as
    # [num_detections, 5 + num_classes] with normalized coordinates.
    num_detections = len(output) // num_values_per_detection
    detections = np.asarray(output).reshape((num_detections, num_values_per_detection))

    processed_boxes = []
    for detection in detections:
        x_center, y_center, width, height, confidence = detection[:5]
        class_probs = detection[5:]

        class_id = int(np.argmax(class_probs))
        class_confidence = class_probs[class_id]

        # Final score is objectness times the best class probability
        score = confidence * class_confidence
        if score < conf_thresh:
            continue

        # Scale coordinates back to original image size
        x1 = (x_center - width / 2) * img_width
        y1 = (y_center - height / 2) * img_height
        x2 = (x_center + width / 2) * img_width
        y2 = (y_center + height / 2) * img_height

        processed_boxes.append([x1, y1, x2, y2, score, class_id])

    # Apply Non-Maximum Suppression within this scale
    boxes_nms = non_max_suppression(processed_boxes, iou_thresh)
    return boxes_nms

# Assuming you have three outputs: output1, output2, output3
# And assuming you have the original image dimensions: img_width, img_height

boxes_all_scales = []
for output in [output1, output2, output3]:
    boxes = process_output(output, img_width, img_height)
    boxes_all_scales.extend(boxes)

# Final NMS across all scales removes duplicates found at different scales
final_boxes = non_max_suppression(boxes_all_scales, iou_thresh)

# Now draw these boxes on the frame
for box in final_boxes:
    frame = draw_boxes(frame, box, len(labelMap))

This code assumes that output1, output2, and output3 are the outputs from the three scales of the YOLOv5 model. The process_output function processes each output, scales the coordinates, and applies NMS. Finally, it combines the results from all scales and applies NMS again to get the final set of detections.
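The approach above treats each output as a flat list of already-decoded boxes. If the blob instead emits raw head tensors, which the sizes reported earlier in the thread (115200, 28800, 7200) would be consistent with, each scale has to be decoded against its grid and anchors first. The sketch below shows that step under several assumptions that have not been verified for this model: a [3, 5 + classes, H, W] layout per output, the default YOLOv5 anchors (autoanchor may have changed them during training), and values that are already sigmoid-activated unless apply_sigmoid=True is passed. decode_scale is a hypothetical helper.

```python
import numpy as np

# Default YOLOv5 anchors in pixels, keyed by stride (assumption for custom models)
ANCHORS = {
    8: [(10, 13), (16, 30), (33, 23)],
    16: [(30, 61), (62, 45), (59, 119)],
    32: [(116, 90), (156, 198), (373, 326)],
}

def decode_scale(raw, num_classes, input_size, apply_sigmoid=False):
    """Decode one flat head output into rows of
    [x1, y1, x2, y2, score, class_id] in input-image pixels."""
    vals = 5 + num_classes
    raw = np.asarray(raw, dtype=np.float32)
    grid = int(round(np.sqrt(raw.size // (3 * vals))))
    assert 3 * vals * grid * grid == raw.size, "unexpected output size"
    stride = input_size // grid
    p = raw.reshape(3, vals, grid, grid)
    if apply_sigmoid:
        p = 1.0 / (1.0 + np.exp(-p))

    gy, gx = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    anchors = np.array(ANCHORS[stride], dtype=np.float32)  # shape (3, 2)

    # YOLOv5 box decoding: centre offset within the cell, size relative to anchor
    cx = (p[:, 0] * 2 - 0.5 + gx) * stride
    cy = (p[:, 1] * 2 - 0.5 + gy) * stride
    w = (p[:, 2] * 2) ** 2 * anchors[:, 0, None, None]
    h = (p[:, 3] * 2) ** 2 * anchors[:, 1, None, None]

    class_scores = p[:, 5:]                       # (3, classes, grid, grid)
    class_id = class_scores.argmax(axis=1).astype(np.float32)
    score = p[:, 4] * class_scores.max(axis=1)    # objectness * best class

    boxes = np.stack(
        [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score, class_id], axis=-1
    )
    return boxes.reshape(-1, 6)
```

The rows from all three scales can then be concatenated, filtered by score, and passed through a single NMS. Because the result is in pixels of the network input, no further multiplication by frame size is needed unless the displayed frame has a different resolution.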

Hope this helps,
Jaka