  • Problem using Yolov5 detection/segmentation custom model on host

Hi Rezahojjatysaeedy
Have you tried using the YoloDetectionNetwork node instead? It should feature on-device decoding.

Example here.
You can provide the blob too if you wish, or perhaps some minimal code I can use directly.
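
A minimal sketch of what that could look like, assuming a 416x416 input and the default YOLOv5 anchors; the blob path, anchors and anchor masks are placeholders that have to match how your model was trained:

import depthai as dai

pipeline = dai.Pipeline()

# Camera preview sized to the model input
cam = pipeline.create(dai.node.ColorCamera)
cam.setPreviewSize(416, 416)
cam.setInterleaved(False)

# YoloDetectionNetwork decodes the YOLO outputs on the device itself
nn = pipeline.create(dai.node.YoloDetectionNetwork)
nn.setBlobPath("path/to/your_model.blob")   # your trained blob
nn.setNumClasses(1)                         # e.g. just ["fissure"]
nn.setCoordinateSize(4)
nn.setConfidenceThreshold(0.5)
nn.setIouThreshold(0.5)
nn.setAnchors([10, 13, 16, 30, 33, 23,
               30, 61, 62, 45, 59, 119,
               116, 90, 156, 198, 373, 326])
nn.setAnchorMasks({"side52": [0, 1, 2], "side26": [3, 4, 5], "side13": [6, 7, 8]})

cam.preview.link(nn.input)

# Detections come back already decoded as ImgDetections
xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName("det")
nn.out.link(xout.input)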

Thanks,
Jaka

    Hi jakaskerl,

    Thanks for the reply. The device-side approach won't work for me. I already have a heavy pipeline and I need at least 20 fps; I tried it and it only gives me 7 fps. As for the code, I have not incorporated it into my pipeline yet. All I'm doing is replacing the blob from line 42 here with my trained blob and changing the 80-element labelMap list to a single element, labelMap = ["fissure"], because I only have one class, 'fissure'. I cannot upload the model here (it's probably too large at 14 MB), but I'm sharing a link to it.
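
    In other words, the only edits were along these lines (the blob filename here is just a placeholder for mine):

    nnPath = "fissure_yolov5_openvino_2021.4_6shave.blob"  # hypothetical name for my trained blob
    labelMap = ["fissure"]                                 # single class instead of the original 80-element list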

    https://drive.google.com/drive/folders/1pQaj04wSzYs5fZlmfM1ZQYmKzkFnMa20?usp=sharing

      Hi Rezahojjatysaeedy
      The host decoding currently expects a different model with different dimensions. I tested your model, and upon running

      layers = in_nn.getAllLayerNames()
      print("Layers: ", layers)

      I get ['output1_yolov5', 'output2_yolov5', 'output3_yolov5']. I'm not sure which one to use or what the end resolution should be. Basically, each output here is smaller than the previous one by a factor of 4 (first 115200 values, then 28800, then 7200).
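
      For reference, a quick way to look at each head separately on the host (assuming in_nn is the NNData packet from the output queue):

      import numpy as np

      # Pull each YOLOv5 output head by name and check its size
      for name in ["output1_yolov5", "output2_yolov5", "output3_yolov5"]:
          data = np.array(in_nn.getLayerFp16(name))
          print(name, data.shape)   # 115200, 28800 and 7200 values respectively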

      Thanks,
      Jaka

        Thanks jakaskerl, that was really helpful for debugging. In main.py there is this line, cols = output.shape[0]//10647, and given output.shape[0] = 63888 that makes cols = 6, but these numbers look a bit arbitrary. Can you please elaborate a little on where they come from? Maybe that will help me understand better what's going on. By the way, I have no idea why I have three outputs; in the end it just needs to detect a box around the eye.

          Hi Rezahojjatysaeedy
          The number 10647 seems to be specific to the stock model used. It's used to properly parse the results from the model.
          When making the model, you should have specified the output layer size. This should carry over to the .blob file as well, but it will be specific to your model and how you configured the layers.
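
          If the stock model takes a 416x416 input (which 10647 suggests), the number falls out of the three grid sizes with 3 anchor boxes per cell:

          # 10647 = 3 anchors per cell over 52x52, 26x26 and 13x13 grids (strides 8, 16, 32)
          grid_sizes = [416 // stride for stride in (8, 16, 32)]   # [52, 26, 13]
          num_boxes = 3 * sum(g * g for g in grid_sizes)
          print(num_boxes)   # 10647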

          Thanks,
          Jaka

            Thanks jakaskerl

            I managed to make it work, but it's very slow, about 7 fps, the same speed I was getting on-device, and unlike the device deployment I'm now not getting any detections. When I tested the default blob, `yolov5s_sku_openvino_2021.4_6shave.blob`, on the host I was getting 18 fps, and both blobs are about the same size (roughly 14 MB). Do you have any idea what might be causing this?

              Rezahojjatysaeedy
              Check what the bottleneck is; a rough timing sketch follows the list below.

              • size of the frame passed to the NN
              • host decoding
              • model
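
              A rough way to time the stages on the host (q_nn and decode() are placeholders for your own output queue and decoding function):

              import time

              t0 = time.monotonic()
              in_nn = q_nn.get()          # wait for the NN packet
              t1 = time.monotonic()
              boxes = decode(in_nn)       # your host-side decoding
              t2 = time.monotonic()
              print(f"NN wait: {(t1 - t0) * 1000:.1f} ms, host decode: {(t2 - t1) * 1000:.1f} ms")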

              Rezahojjatysaeedy I'm not getting any detection

              Can you make sure this is not just a decoding issue? Perhaps something is incorrectly decoded (consult GPT-4 with the output if you can).

              Thanks,
              Jaka

                Hi jakaskerl,

                I trained another network with the same input size as your example and I'm getting a similar fps to yours. But I noticed a difference between the host and device visualizations. On the device you used the frameNorm() function, which normalizes the boxes w.r.t. the frame shape. No such normalization exists in the host decoding, which leaves the box coordinates as small float numbers. Now, when I use frameNorm on the host, these are my only boxes:

                x1: 0 y1: 0 x2: 208 y2: 208
                x1: 208 y1: 208 x2: 416 y2: 416

                Playing with the IoU and confidence thresholds doesn't make it better. I know the model should do better than this, as it detects correctly on the device side. Do you have any idea what might be going wrong in the host implementation?
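
                For reference, the frameNorm() helper in the device-side example is roughly this (it clips values to 0..1 and scales them by the frame size):

                import numpy as np

                def frameNorm(frame, bbox):
                    # bbox is expected as normalized 0..1 values [x1, y1, x2, y2];
                    # map them to pixel coordinates of the frame
                    normVals = np.full(len(bbox), frame.shape[0])
                    normVals[::2] = frame.shape[1]
                    return (np.clip(np.array(bbox), 0, 1) * normVals).astype(int)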

                  Hi Rezahojjatysaeedy
                  Could you post your current code to the Drive folder? If you have problems with host-side decoding, you can usually consult GPT and there is a high chance it will solve it for you.

                  Thanks,
                  Jaka

                    Hi Rezahojjatysaeedy
                    You are only looking at the largest output.

                    1. Understanding the Outputs: YOLOv5 typically gives three outputs corresponding to three different scales. Each output contains a set of bounding boxes predicted at that scale. The shape of these outputs is usually [number_of_boxes, 5 + number_of_classes], where number_of_boxes depends on the scale.

                    2. Processing Each Scale: You need to process each of these outputs separately. Each output will have its own set of bounding boxes, and you'll need to apply the same decoding logic (converting center coordinates to corner coordinates, applying confidence threshold, and NMS) to each.

                    3. Combining Results from All Scales: After processing each output, you should combine the results to get the final set of detections. This is where NMS is crucial to remove duplicates and overlapping boxes.

                    4. Coordinate Scaling: Since YOLOv5 operates on a normalized coordinate system, you might need to scale the bounding box coordinates back to the original image dimensions.

                    Here's a more detailed approach:

                    
                    import numpy as np

                    # conf_thresh, iou_thresh, labelMap, non_max_suppression() and draw_boxes()
                    # are assumed to already exist in the host-decoding script (main.py).

                    def process_output(output, img_width, img_height):
                        num_classes = len(labelMap)
                        num_values_per_detection = 5 + num_classes
                        num_detections = len(output) // num_values_per_detection
                        # getLayerFp16() returns a flat list, so convert and reshape it
                        detections = np.asarray(output)[:num_detections * num_values_per_detection]
                        detections = detections.reshape((num_detections, num_values_per_detection))

                        processed_boxes = []
                        for detection in detections:
                            x_center, y_center, width, height, confidence = detection[:5]
                            class_probs = detection[5:]

                            # Skip low-objectness boxes early
                            if confidence < conf_thresh:
                                continue

                            class_id = np.argmax(class_probs)
                            class_confidence = class_probs[class_id]  # could be multiplied into the score if desired

                            # Convert (center, size) to corner coordinates and scale to the original image size
                            x1 = (x_center - width / 2) * img_width
                            y1 = (y_center - height / 2) * img_height
                            x2 = (x_center + width / 2) * img_width
                            y2 = (y_center + height / 2) * img_height

                            processed_boxes.append([x1, y1, x2, y2, confidence, class_id])

                        # Apply Non-Maximum Suppression per scale
                        boxes_nms = non_max_suppression(processed_boxes, iou_thresh)
                        return boxes_nms

                    # Assuming you have three outputs: output1, output2, output3
                    # (e.g. np.array(in_nn.getLayerFp16("output1_yolov5")) and so on)
                    # and the original image dimensions: img_width, img_height

                    boxes_all_scales = []
                    for output in [output1, output2, output3]:
                        boxes = process_output(output, img_width, img_height)
                        boxes_all_scales.extend(boxes)

                    # Final NMS across all scales to remove duplicates between heads
                    final_boxes = non_max_suppression(boxes_all_scales, iou_thresh)

                    # Now draw these boxes on the frame
                    for box in final_boxes:
                        frame = draw_boxes(frame, box, len(labelMap))

                    This code assumes that output1, output2, and output3 are the outputs from the three scales of the YOLOv5 model. The process_output function processes each output, scales the coordinates, and applies NMS. Finally, it combines the results from all scales and applies NMS again to get the final set of detections.

                    Hope this helps,
                    Jaka