Hi all, maybe this is a bit of a general question, but I thought I could find people with the relevant knowledge here. I am currently using this yolov8n model from the DepthAI model zoo and I can run it fine on an OAK camera.

Now what I would like to do is have a script that runs it on an OAK with a recorded video as input and, if no OAK is found, runs the model on the laptop CPU with OpenVINO.
I was able to do this with other models, but I'm struggling with this specific YOLO version: the OpenVINO demos only go up to YOLOv4, and when I try to plot the output the boxes take up most of the image, which makes me think I might need to adapt the code from the OpenVINO Python demos specifically for YOLOv8.
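Roughly, the fallback I have in mind looks like this (just a sketch; run_on_oak and run_on_cpu_openvino are placeholders for the two code paths):

import depthai as dai

def oak_available() -> bool:
    # True if at least one OAK device is currently connected
    return len(dai.Device.getAllAvailableDevices()) > 0

if oak_available():
    run_on_oak("recording.mp4")            # placeholder: DepthAI pipeline, video frames fed from the host
else:
    run_on_cpu_openvino("recording.mp4")   # placeholder: OpenVINO inference on the CPU with the same video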

Since the model was created as part of the DepthAI model zoo, I was wondering if someone could give me some insight into this model or into how its output needs to be processed. Thanks in advance!

    Letty
    Can you show the host-side decoding you are using? This is generally not a DepthAI question (the CPU is running the inference), so GPT can probably help you a ton as well.

    Thanks,
    Jaka

    I know it's not strictly DepthAI, but if I could get any help it would be very much appreciated!

    I tried a few different things; the most recent was the script from this answer https://stackoverflow.com/a/77532605/14098183 with the relevant modifications:

    import cv2
    from openvino.runtime import Core
    
    ie = Core()
    
    model = ie.read_model(model="models/yolov8n_coco_640x352/yolov8n_coco_640x352.xml")
    compiled_model = ie.compile_model(model=model, device_name="CPU")
    
    input_layer_ir = compiled_model.input(0)
    output_layer_ir = compiled_model.outputs
    print(output_layer_ir)
    
    image = cv2.imread("../test.jpg")
    N, C, H, W = 1, 3, 352, 640  # NCHW: batch, channels, height, width
    resized_image = cv2.resize(image, (W, H))
    input_image = cv2.dnn.blobFromImage(resized_image, 1/255, (W, H), [0,0,0], 1, crop=False)
    # this assumed a single output layer, which is where it breaks for this model
    output = compiled_model([input_image])[output_layer_ir]
    output = cv2.transpose(output[0])

    but I encountered a problem: the DepthAI model doesn't return a single output, it has three output layers

    [<ConstOutput: names[output3_yolov6r2] shape[1,85,11,20] type: f32>, <ConstOutput: names[output2_yolov6r2] shape[1,85,22,40] type: f32>, <ConstOutput: names[output1_yolov6r2] shape[1,85,44,80] type: f32>]

    So what I'm not sure about now is how to process the results from these three layers to obtain the final detections.
    I assume they correspond to the three detection scales, but how do I scale the detection coordinates and sizes back to the image? I can find examples for YOLOv3, but that one had anchors, and YOLOv8 shouldn't have anchors, so I'm wondering how DepthAI processes these outputs.

    The only other option I can think of is to convert the yolov8n model from Ultralytics and use that one instead, but I wanted to use the DepthAI one to make sure that the results from the laptop and from an OAK camera are the same.
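    If I do end up going the Ultralytics route, the conversion would be roughly this (a sketch; the export arguments should be double-checked against the Ultralytics docs):

    from ultralytics import YOLO

    # Export the stock yolov8n weights to an OpenVINO IR at the same 352x640 (HxW) input size
    model = YOLO("yolov8n.pt")
    model.export(format="openvino", imgsz=(352, 640))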

      Hi Letty
      GPT:

      Understanding the Output Structure

      Output Shapes:

      •	Output1: [1, 85, 44, 80] (Large scale)
      •	Output2: [1, 85, 22, 40] (Medium scale)
      •	Output3: [1, 85, 11, 20] (Small scale)

      Interpretation:

      •	Batch Size: 1
      •	Channels: 85 (typically 4 bbox parameters + 1 objectness score + 80 class probabilities for the COCO dataset)
      •	Height & Width: Grid dimensions for that scale.
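      To make the channel layout concrete (illustrative only; pred stands for one grid cell's 85-element vector):

      box = pred[0:4]            # x, y, w, h for that cell
      objectness = pred[4]       # objectness score
      class_scores = pred[5:85]  # 80 COCO class scores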

      Grid Sizes:

      •	Output1 (Large scale): Grid size 80 x 44
      •	Output2 (Medium scale): Grid size 40 x 22
      •	Output3 (Small scale): Grid size 20 x 11

      Strides:

      Given your input image size is 640 x 352, the strides can be calculated as:

      •	Stride1: 8 (640 / 80)
      •	Stride2: 16 (640 / 40)
      •	Stride3: 32 (640 / 20)
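      These don't have to be hardcoded; the grid size and stride can be read from the compiled model's outputs (a small sketch, assuming outputs shaped [1, 85, grid_h, grid_w] and a 640-wide input):

      input_width = 640
      for out in compiled_model.outputs:
          n, c, grid_h, grid_w = list(out.shape)
          print(out.any_name, (grid_h, grid_w), "stride", input_width // grid_w)

      Putting it together: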
      import cv2
      import numpy as np
      from openvino.runtime import Core
      
      # Initialize OpenVINO
      ie = Core()
      model = ie.read_model(model="models/yolov8n_coco_640x352/yolov8n_coco_640x352.xml")
      compiled_model = ie.compile_model(model=model, device_name="CPU")
      
      # Get input and output layers
      input_layer_ir = compiled_model.input(0)
      output_layers_ir = compiled_model.outputs
      
      # Read and preprocess the image
      image = cv2.imread("../test.jpg")
      input_height, input_width = 352, 640
      resized_image = cv2.resize(image, (input_width, input_height))
      input_image = resized_image.transpose(2, 0, 1)  # HWC to CHW
      input_image = input_image[np.newaxis, :].astype(np.float32) / 255.0  # Normalize, add batch dim, cast to float32
      
      # Run inference
      outputs = compiled_model([input_image])
      
      # Processing parameters
      num_classes = 80
      conf_threshold = 0.25
      iou_threshold = 0.45
      
      # For collecting all detections
      all_detections = []
      
      # Grid size and stride are derived per output from its shape, since the order of
      # compiled_model.outputs is not guaranteed to be finest-to-coarsest
      # (stride = input_width / grid_w, e.g. 640 / 80 = 8)
      
      for output_layer in output_layers_ir:
          output = outputs[output_layer]
          _, _, grid_h, grid_w = output.shape
          stride = input_width // grid_w
      
          # Permute from [channels, grid_h, grid_w] to [grid_h, grid_w, channels], then flatten the grid
          output = output[0].transpose(1, 2, 0)
          output = output.reshape(-1, 85)
      
          # Apply sigmoid to the objectness score and class scores
          output[:, 4:] = 1 / (1 + np.exp(-output[:, 4:]))
      
          # Filter out low confidence detections
          objectness = output[:, 4]
          mask = objectness > conf_threshold
          filtered_output = output[mask]
      
          if filtered_output.size == 0:
              continue
      
          # Get coordinates, objectness, and class scores
          x = filtered_output[:, 0]
          y = filtered_output[:, 1]
          w = filtered_output[:, 2]
          h = filtered_output[:, 3]
          scores = filtered_output[:, 5:] * filtered_output[:, 4:5]
      
          # Get class IDs and scores
          class_ids = np.argmax(scores, axis=1)
          class_scores = scores[np.arange(len(scores)), class_ids]
      
          # Only keep detections with class score above threshold
          keep = class_scores > conf_threshold
          x = x[keep]
          y = y[keep]
          w = w[keep]
          h = h[keep]
          class_ids = class_ids[keep]
          class_scores = class_scores[keep]
      
          # Calculate positions on the original image
          grid_x, grid_y = np.meshgrid(np.arange(grid_w), np.arange(grid_h))
          grid_x = grid_x.flatten()[mask][keep]
          grid_y = grid_y.flatten()[mask][keep]
      
          # Decode bounding boxes
          x = (x + grid_x) * stride
          y = (y + grid_y) * stride
          w = np.exp(w) * stride
          h = np.exp(h) * stride
      
          # Convert to [x1, y1, x2, y2]
          x1 = x - w / 2
          y1 = y - h / 2
          x2 = x + w / 2
          y2 = y + h / 2
      
          # Append detections
          for i in range(len(x1)):
              detection = [x1[i], y1[i], x2[i], y2[i], class_scores[i], class_ids[i]]
              all_detections.append(detection)
      
      # Convert to numpy array
      all_detections = np.array(all_detections)
      
      # Apply Non-Maximum Suppression
      if len(all_detections) > 0:
          boxes = all_detections[:, :4]
          scores = all_detections[:, 4]
          class_ids = all_detections[:, 5].astype(int)
      
          # Perform NMS (cv2.dnn.NMSBoxes expects boxes as [x, y, w, h])
          nms_boxes = np.column_stack(
              [boxes[:, 0], boxes[:, 1], boxes[:, 2] - boxes[:, 0], boxes[:, 3] - boxes[:, 1]]
          )
          indices = cv2.dnn.NMSBoxes(
              bboxes=nms_boxes.tolist(),
              scores=scores.tolist(),
              score_threshold=conf_threshold,
              nms_threshold=iou_threshold,
          )
      
          # Scale factors to map boxes from the 640x352 model input back to the original image size
          scale_x = image.shape[1] / input_width
          scale_y = image.shape[0] / input_height
      
          # Draw detections (flatten handles both old list-of-lists and new flat-array NMS returns)
          for i in np.array(indices).flatten():
              x1, y1, x2, y2 = boxes[i] * np.array([scale_x, scale_y, scale_x, scale_y])
              conf = scores[i]
              class_id = class_ids[i]
      
              # Draw bounding box
              cv2.rectangle(
                  image,
                  (int(x1), int(y1)),
                  (int(x2), int(y2)),
                  color=(0, 255, 0),
                  thickness=2,
              )
              # Put label
              label = f"{class_id}: {conf:.2f}"
              cv2.putText(
                  image,
                  label,
                  (int(x1), int(y1) - 10),
                  cv2.FONT_HERSHEY_SIMPLEX,
                  0.5,
                  (0, 255, 0),
                  2,
              )
      
          # Show the image
          cv2.imshow("Detections", image)
          cv2.waitKey(0)
          cv2.destroyAllWindows()
      else:
          print("No detections")

      Thanks,
      Jaka
