Run Yolov8n on OpenVINO

Letty · Sep 25, 2024

Hi all, maybe this is a bit of a general question but I thought I could find people with the specific knowledge here. I am currently using this yolov8n model from the DepthAI model zoo and I can run it fine on an OAK camera.

Now what I would like to do is have a script that runs it on an OAK with a recorded video as input and, if no OAK is found, runs the model on the laptop CPU with OpenVINO.
I was able to do this with other models, but I'm struggling with this specific YOLO version because the OpenVINO demos only go up to YOLOv4. When I plot the output, the boxes take up most of the image, which makes me think I need to adapt the code from the OpenVINO Python demos specifically for YOLOv8.
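
The fallback described above could be sketched roughly as follows. This is only an illustration: it assumes the `depthai` Python package is what detects the camera, and `oak_available` and `choose_backend` are hypothetical helper names, not part of any library.

```python
def oak_available():
    """Return True if depthai is importable and reports at least one device."""
    try:
        import depthai as dai  # optional dependency, may be missing on the laptop
    except ImportError:
        return False
    return len(dai.Device.getAllAvailableDevices()) > 0


def choose_backend(probe=oak_available):
    """Pick "OAK" when the probe finds a camera, otherwise fall back to "CPU"."""
    return "OAK" if probe() else "CPU"


if __name__ == "__main__":
    print(f"Running inference on: {choose_backend()}")
```

Passing the probe as an argument keeps the selection logic testable on a machine without a camera attached.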

Since the model was created as part of the DepthAI model zoo I was wondering if someone could give me some insight on this model or on how the output needs to be processed. Thanks in advance!

jakaskerl · Sep 26, 2024

Letty
Can you show the host-side decoding you are using? This is generally not a DepthAI question (the CPU is running the inference), so GPT can probably help you a ton as well.

Thanks,
Jaka

Letty · Sep 26, 2024

I know it's not strictly DepthAI but if I could get any help it would be very much appreciated!

I tried a few different things, but the last thing I tried was using the script from this answer https://stackoverflow.com/a/77532605/14098183 with the relevant modifications:

import cv2
from openvino.runtime import Core

ie = Core()

model = ie.read_model(model="models/yolov8n_coco_640x352/yolov8n_coco_640x352.xml")
compiled_model = ie.compile_model(model=model, device_name="CPU")

input_layer_ir = compiled_model.input(0)
output_layer_ir = compiled_model.outputs
print(output_layer_ir)

image = cv2.imread("../test.jpg")
N, C, H, W = 1, 1, 352, 640
resized_image = cv2.resize(image, (W, H))
input_image = cv2.dnn.blobFromImage(resized_image, 1/255, (W, H), [0,0,0], 1, crop=False)
output = compiled_model([input_image])[output_layer_ir]
output = cv2.transpose(output[0])

but I encountered a problem: the DepthAI model doesn't return a single output but three output layers:

[<ConstOutput: names[output3_yolov6r2] shape[1,85,11,20] type: f32>, <ConstOutput: names[output2_yolov6r2] shape[1,85,22,40] type: f32>, <ConstOutput: names[output1_yolov6r2] shape[1,85,44,80] type: f32>]

So what I'm not sure about now is how to process the results from these three layers to obtain the final detections.
I assume they correspond to the three detection scales, but how do I scale the detection coordinates and sizes? I can find examples for YOLOv3, but that used anchors, and YOLOv8 shouldn't have anchors, so I'm wondering how DepthAI processes the results.

The only other option I can think of is to convert the yolov8n model from ultralytics and use that one, but I wanted to use the DepthAI one to make sure that the results from the laptop and an OAK camera are the same.

jakaskerl · Sep 27, 2024

Hi Letty
GPT:

Understanding the Output Structure

Output Shapes:

•	Output1: [1, 85, 44, 80] (largest grid, stride 8)
•	Output2: [1, 85, 22, 40] (medium grid, stride 16)
•	Output3: [1, 85, 11, 20] (smallest grid, stride 32)

Interpretation:

•	Batch Size: 1
•	Channels: 85 (This usually represents 4 box values + 1 confidence score + 80 class scores for the COCO dataset)
•	Height & Width: Grid dimensions for that scale.
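
To make the channel layout concrete, here is a small sketch that slices a dummy tensor of the same shape into its parts. The tensor is all zeros and purely illustrative; the split into 4 + 1 + 80 follows the interpretation above and is an assumption about this model, not something read from its export.

```python
import numpy as np

# Dummy tensor with the same layout as one output: [batch, 85, grid_h, grid_w]
dummy = np.zeros((1, 85, 11, 20), dtype=np.float32)

# Move channels last and flatten the grid: one row of 85 values per grid cell
cells = dummy[0].transpose(1, 2, 0).reshape(-1, 85)

box = cells[:, 0:4]          # 4 box values per cell
confidence = cells[:, 4]     # 1 confidence score per cell
class_scores = cells[:, 5:]  # 80 class scores per cell

print(box.shape, confidence.shape, class_scores.shape)
```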

Grid Sizes:

•	Output1 (largest grid): 80 x 44 cells (width x height)
•	Output2 (medium grid): 40 x 22 cells (width x height)
•	Output3 (smallest grid): 20 x 11 cells (width x height)

Strides:

Given your input image size is 640 x 352 (width x height), the strides can be calculated from either dimension:

•	Stride1: 8 (640 / 80 = 352 / 44)
•	Stride2: 16 (640 / 40 = 352 / 22)
•	Stride3: 32 (640 / 20 = 352 / 11)
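
The same arithmetic can be done from the output shapes themselves, which avoids relying on the order in which the outputs are listed. A minimal sketch, where `stride_for` is a hypothetical helper name and the shapes are copied from the printed outputs earlier in the thread:

```python
def stride_for(shape, input_w=640, input_h=352):
    """Return the stride implied by a [batch, channels, grid_h, grid_w] shape."""
    _, _, grid_h, grid_w = shape
    stride_w = input_w // grid_w
    stride_h = input_h // grid_h
    if stride_w != stride_h:
        raise ValueError(f"inconsistent strides for shape {shape}")
    return stride_w


# Shapes in the order the model reported them: output3, output2, output1
shapes = [(1, 85, 11, 20), (1, 85, 22, 40), (1, 85, 44, 80)]
strides = [stride_for(s) for s in shapes]
print(strides)
```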

import cv2
import numpy as np
from openvino.runtime import Core

# Initialize OpenVINO
ie = Core()
model = ie.read_model(model="models/yolov8n_coco_640x352/yolov8n_coco_640x352.xml")
compiled_model = ie.compile_model(model=model, device_name="CPU")

# Get input and output layers
input_layer_ir = compiled_model.input(0)
output_layers_ir = compiled_model.outputs

# Read and preprocess the image
image = cv2.imread("../test.jpg")
input_height, input_width = 352, 640
resized_image = cv2.resize(image, (input_width, input_height))
input_image = resized_image.transpose(2, 0, 1)  # HWC to CHW
# Normalize, add batch dimension, and match the model's f32 input type
input_image = (input_image[np.newaxis, :] / 255.0).astype(np.float32)

# Run inference
outputs = compiled_model([input_image])

# Processing parameters
num_classes = 80
conf_threshold = 0.25
iou_threshold = 0.45

# For collecting all detections
all_detections = []

# Derive the grid size and stride from each output's own shape, because the
# outputs are not guaranteed to be listed in stride order
# (the model above reports them as output3, output2, output1)
for output_layer in output_layers_ir:
    output = outputs[output_layer]
    _, _, grid_h, grid_w = output.shape
    stride = input_height // grid_h

    # Move channels last and flatten the grid: one row of 85 values per cell
    output = output[0].transpose(1, 2, 0)
    output = output.reshape(-1, 85)

    # Apply sigmoid to the objectness score and class scores
    # (this assumes the model emits raw logits for these channels)
    output[:, 4:] = 1 / (1 + np.exp(-output[:, 4:]))

    # Filter out low confidence detections
    objectness = output[:, 4]
    mask = objectness > conf_threshold
    filtered_output = output[mask]

    if filtered_output.size == 0:
        continue

    # Get coordinates, objectness, and class scores
    x = filtered_output[:, 0]
    y = filtered_output[:, 1]
    w = filtered_output[:, 2]
    h = filtered_output[:, 3]
    scores = filtered_output[:, 5:] * filtered_output[:, 4:5]

    # Get class IDs and scores
    class_ids = np.argmax(scores, axis=1)
    class_scores = scores[np.arange(len(scores)), class_ids]

    # Only keep detections with class score above threshold
    keep = class_scores > conf_threshold
    x = x[keep]
    y = y[keep]
    w = w[keep]
    h = h[keep]
    class_ids = class_ids[keep]
    class_scores = class_scores[keep]

    # Build the grid-cell indices of the detections that were kept
    grid_x, grid_y = np.meshgrid(np.arange(grid_w), np.arange(grid_h))
    grid_x = grid_x.flatten()[mask][keep]
    grid_y = grid_y.flatten()[mask][keep]

    # Decode bounding boxes into network-input pixel coordinates
    # (this assumes x, y are offsets within a cell and w, h are log-scale sizes)
    x = (x + grid_x) * stride
    y = (y + grid_y) * stride
    w = np.exp(w) * stride
    h = np.exp(h) * stride

    # Convert to [x1, y1, x2, y2]
    x1 = x - w / 2
    y1 = y - h / 2
    x2 = x + w / 2
    y2 = y + h / 2

    # Append detections
    for i in range(len(x1)):
        detection = [x1[i], y1[i], x2[i], y2[i], class_scores[i], class_ids[i]]
        all_detections.append(detection)

# Convert to numpy array
all_detections = np.array(all_detections)

# Apply Non-Maximum Suppression
if len(all_detections) > 0:
    boxes = all_detections[:, :4]
    scores = all_detections[:, 4]
    class_ids = all_detections[:, 5].astype(int)

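    # Perform NMS (cv2.dnn.NMSBoxes expects boxes as [x, y, width, height])
    boxes_xywh = np.column_stack(
        (boxes[:, 0], boxes[:, 1], boxes[:, 2] - boxes[:, 0], boxes[:, 3] - boxes[:, 1])
    )
    indices = cv2.dnn.NMSBoxes(
        bboxes=boxes_xywh.tolist(),
        scores=scores.tolist(),
        score_threshold=conf_threshold,
        nms_threshold=iou_threshold,
    )

    # Boxes are in network-input coordinates; scale them to the original image
    scale_x = image.shape[1] / input_width
    scale_y = image.shape[0] / input_height

    # Draw detections
    # Flattening handles both the flat and the nested index layouts that
    # different OpenCV versions return
    for i in np.array(indices).flatten():
        x1, y1, x2, y2 = boxes[i] * [scale_x, scale_y, scale_x, scale_y]
        conf = scores[i]
        class_id = class_ids[i]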

        # Draw bounding box
        cv2.rectangle(
            image,
            (int(x1), int(y1)),
            (int(x2), int(y2)),
            color=(0, 255, 0),
            thickness=2,
        )
        # Put label
        label = f"{class_id}: {conf:.2f}"
        cv2.putText(
            image,
            label,
            (int(x1), int(y1) - 10),
            cv2.FONT_HERSHEY_SIMPLEX,
            0.5,
            (0, 255, 0),
            2,
        )

    # Show the image
    cv2.imshow("Detections", image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
else:
    print("No detections")
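
If the boxes still look wrong with the decoding above, one possibility worth testing is that the exported head is anchor-free and already activated. The sketch below decodes a single output under that assumption: channels 0-3 are treated as left/top/right/bottom distances from the cell center in grid units, and channels 4 onward as scores that already went through a sigmoid. None of this is confirmed for the model zoo export, and `decode_distance_output` is a hypothetical name.

```python
import numpy as np


def decode_distance_output(output, stride, conf_threshold=0.25):
    """Decode one [1, 85, H, W] output into rows of [x1, y1, x2, y2, score, class_id].

    ASSUMPTION: channels 0-3 are left/top/right/bottom distances in grid units,
    and the score channels are already sigmoid-activated.
    """
    _, _, grid_h, grid_w = output.shape
    cells = output[0].transpose(1, 2, 0).reshape(-1, 85)

    # Cell centers in grid units, in the same row-major order as `cells`
    grid_x, grid_y = np.meshgrid(np.arange(grid_w), np.arange(grid_h))
    cx = grid_x.flatten() + 0.5
    cy = grid_y.flatten() + 0.5

    # Best class per cell and its score
    class_ids = np.argmax(cells[:, 5:], axis=1)
    scores = cells[np.arange(len(cells)), 5 + class_ids]
    keep = scores > conf_threshold

    # Distances to corners, then grid units to network-input pixels
    x1 = (cx[keep] - cells[keep, 0]) * stride
    y1 = (cy[keep] - cells[keep, 1]) * stride
    x2 = (cx[keep] + cells[keep, 2]) * stride
    y2 = (cy[keep] + cells[keep, 3]) * stride
    return np.stack([x1, y1, x2, y2, scores[keep], class_ids[keep]], axis=1)
```

The returned boxes are in 640 x 352 network-input coordinates, so they would still need scaling to the original image and NMS, as in the script above.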

Thanks,
Jaka

Letty · Oct 10, 2024

Thanks a lot @jakaskerl!