Hello Luxonis community,

I've been working on a project where I want to run a YOLO segmentation model on the DepthAI camera and extract the predicted masks from the model's output. So far I've used the ultralytics library to run the YOLO model on a video input and overlay the predicted masks on the video frames.

Here's a brief overview of the steps I've taken:

  1. Loaded the YOLO model from a checkpoint.

  2. Opened a video file using OpenCV's VideoCapture.

  3. Iterated through the frames, and for each frame:

    • Converted the frame to a PIL image.

    • Predicted objects and masks using the YOLO model.

    • If masks were detected, drew them on the image.

  4. Wrote the processed frames to a new video file.

Here's a snippet of the code:

$$
import cv2
from ultralytics import YOLO
from PIL import Image, ImageDraw
from shapely.geometry import Polygon
import numpy as np

model = YOLO("best.pt")

# Open video capture

cap = cv2.VideoCapture('video.mp4')

# Get video properties

fourcc = cv2.VideoWriter_fourcc(*'XVID')
fps = int(cap.get(cv2.CAP_PROP_FPS))
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

# Initialize VideoWriter object

out = cv2.VideoWriter('video.avi', fourcc, fps, (width, height))

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Convert the frame to an RGB PIL Image
    img_pil = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    results = model.predict(img_pil, conf=0.85)
    result = results[0]
    masks = result.masks

    # Check if we have masks before any conversion or drawing
    if masks is not None:
        # Convert the image to RGBA for transparency support
        img_pil = img_pil.convert("RGBA")
        overlay = Image.new('RGBA', img_pil.size, (255, 255, 255, 0))
        overlay_draw = ImageDraw.Draw(overlay)

        for mask in masks:
            # mask.xy[0] is an (N, 2) array of polygon points; PIL expects a list of tuples
            polygon = [tuple(point) for point in mask.xy[0]]
            if len(polygon) >= 3:
                overlay_draw.polygon(polygon, outline=(0, 255, 0, 255), fill=(0, 255, 0, 127))

                # Mark the polygon centroid with a red dot
                polygon_shapely = Polygon(polygon)
                centroid = polygon_shapely.centroid
                circle_radius = 5
                left_up_point = (centroid.x - circle_radius, centroid.y - circle_radius)
                right_down_point = (centroid.x + circle_radius, centroid.y + circle_radius)
                overlay_draw.ellipse([left_up_point, right_down_point], fill=(255, 0, 0, 255))

        img_pil = Image.alpha_composite(img_pil, overlay)

    # Convert back to 3-channel BGR (dropping the alpha channel) before writing
    frame = cv2.cvtColor(np.array(img_pil.convert("RGB")), cv2.COLOR_RGB2BGR)
    out.write(frame)

# Release the video objects

cap.release()
out.release()
$$

I would love to get some insights on:

  1. Is it possible to run this directly on the DepthAI camera to take advantage of its processing capabilities?

  2. Can the DepthAI SDK handle the mask overlay tasks directly?

Looking forward to any advice, suggestions, or relevant experiences you can share!

    jakaskerl

    That's a good starting point. However, my objective is to import the data points generated by the YOLO library into my code. Specifically, I need to extract the mask array from the YOLO output and then utilize it within my own codebase.
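
    On the ultralytics side, the arrays I'm after look roughly like this (a minimal sketch, assuming I'm reading the masks API correctly):

    $$
    results = model.predict(img_pil, conf=0.85)
    masks = results[0].masks
    if masks is not None:
        mask_arrays = masks.data.cpu().numpy()  # (num_objects, H, W) binary masks
        polygons = masks.xy                     # list of (N, 2) polygon outlines in pixel coordinates
    $$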

    To clarify my objective further, I've provided a visualization of the goal I aim to achieve below.

      Hi KoenvanWijlick,
      By midpoint mask, do you mean the raw (undecoded) results that your model returns? Usually the OAK camera does the decoding on the device itself, so you don't have to work with that data.

        erik

        I'm working on extracting the mask array from my YOLO model. In the YOLO setup, I achieve this using the following code:
        results = model.predict(img_pil, conf=0.85)
        result = results[0]
        masks = result.masks

        Afterward, I process this array to determine the midpoint of the polygon formed by it. My goal is to relay this midpoint back to the camera, aiming to pinpoint the exact depth at that specific pixel or dimension. Fortunately, I've already set up the code needed to fetch the depth at a particular pixel. The current challenge is to retrieve the array from the camera.
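
        For reference, this is roughly the host-side step I mean; a minimal sketch, assuming the centroid comes from shapely as in my snippet above and that depth_frame is a depth map (in millimetres), aligned to the colour frame, that I've already fetched from the camera as a numpy array (depth_at_point is just an illustrative helper name):

        $$
        import numpy as np

        def depth_at_point(depth_frame: np.ndarray, x: float, y: float, win: int = 3) -> float:
            """Median depth (mm) in a small window around (x, y), ignoring invalid zero pixels."""
            h, w = depth_frame.shape[:2]
            x0, x1 = max(int(x) - win, 0), min(int(x) + win + 1, w)
            y0, y1 = max(int(y) - win, 0), min(int(y) + win + 1, h)
            roi = depth_frame[y0:y1, x0:x1]
            valid = roi[roi > 0]
            return float(np.median(valid)) if valid.size else 0.0

        # e.g. with the shapely centroid from above:
        # z_mm = depth_at_point(depth_frame, centroid.x, centroid.y)
        $$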

        Any insights would be appreciated.

        Currently, this is the result I obtain with the array from the image captured by the camera. The red dot marks the polygon's centroid, computed with the shapely library in Python.

          Hi KoenvanWijlick,
          So I believe this does mean undecoded/raw results. You can get those results via myYoloNode.outNetwork.link() (e.g. link it to an XLinkOut node to send the raw results to the host). This output provides NNData messages, which can hold multiple layers (if your model has multiple outputs, which YOLO models do).
          Here's an example for a model with 3 outputs, where rec is the NNData object: https://github.com/luxonis/depthai-experiments/blob/master/gen2-head-posture-detection/main.py#L21-L24
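
          Here's a minimal sketch of that wiring, assuming your YoloDetectionNetwork is already configured for your model (the stream name, blob path and the layer you read are placeholders):

          $$
          import depthai as dai
          import numpy as np

          pipeline = dai.Pipeline()

          cam = pipeline.create(dai.node.ColorCamera)
          cam.setPreviewSize(640, 640)            # must match the model's input size
          cam.setInterleaved(False)

          nn = pipeline.create(dai.node.YoloDetectionNetwork)
          nn.setBlobPath("yolo_seg.blob")         # placeholder: your compiled blob
          # ...plus the usual YOLO decoding settings (classes, anchors, thresholds)
          cam.preview.link(nn.input)

          # Send the raw, undecoded network output to the host
          xout_raw = pipeline.create(dai.node.XLinkOut)
          xout_raw.setStreamName("raw_nn")
          nn.outNetwork.link(xout_raw.input)

          with dai.Device(pipeline) as device:
              q_raw = device.getOutputQueue("raw_nn", maxSize=4, blocking=False)
              nn_data = q_raw.get()                           # one NNData message
              names = nn_data.getAllLayerNames()              # one entry per model output
              raw = np.array(nn_data.getLayerFp16(names[0]))  # flat float array; reshape as needed
              print(names, raw.shape)
          $$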

            erik

            How can I properly construct a layer within the YOLO architecture to extract the relevant data? The provided code offers some clarity on this matter.

              Hi KoenvanWijlick,
              Your YOLO model likely already provides these outputs, otherwise the code you shared above wouldn't work (I assume; I'm not sure what model.predict(img_pil, conf=0.85) returns).

                erik

                "Thank you, Erik, for the help. I got it to work as intended. I chose a different route and used the API directly within a class, instead of the SDK."

                  2 months later

                  KoenvanWijlick Hi, thank you for your post; I am trying to achieve something very similar. Can you please explain a bit further how you achieved this using the API?

                  Do you think it would be possible to run YOLO object detection on the same frame? I managed to implement this on .mp4 videos in batches, but I'm quite lost when trying to get these masks from the OAK-D device.

                  Would highly appreciate your ideas on this.
                  Regards

                    Companion

                    My post from September 21 accurately reflects what I did. As jakaskerl mentioned, the current converter does not support converting YOLOv7 and YOLOv8 segmentation models, so I am not using the onboard AI in this part of my code.

                    I managed to get this code working, but I decided to revert to a YOLO model that does plain object detection rather than segmentation; after extensive research, that approach proved more suitable for my use case. If the converter is updated to include segmentation support, I might consider switching to it.

                    Regarding your situation, here's what I did: first, I requested a photo from the camera. Then I ran the YOLO model on that photo. Next, I used Python code to determine the midpoint of the mask. Finally, I requested depth data from the camera at that midpoint and converted the resulting x, y, z coordinates into an array.
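
                    To make that concrete, here's a rough sketch of that flow; a minimal sketch only, with stream names and the model path as placeholders, and the coordinate scaling between the preview and depth frames glossed over:

                    $$
                    import depthai as dai
                    import numpy as np
                    from shapely.geometry import Polygon
                    from ultralytics import YOLO

                    model = YOLO("best.pt")
                    pipeline = dai.Pipeline()

                    cam = pipeline.create(dai.node.ColorCamera)
                    cam.setPreviewSize(640, 640)

                    left = pipeline.create(dai.node.MonoCamera)
                    right = pipeline.create(dai.node.MonoCamera)
                    left.setBoardSocket(dai.CameraBoardSocket.LEFT)
                    right.setBoardSocket(dai.CameraBoardSocket.RIGHT)
                    stereo = pipeline.create(dai.node.StereoDepth)
                    stereo.setDepthAlign(dai.CameraBoardSocket.RGB)   # align depth to the colour camera
                    left.out.link(stereo.left)
                    right.out.link(stereo.right)

                    xout_rgb = pipeline.create(dai.node.XLinkOut)
                    xout_rgb.setStreamName("rgb")
                    cam.preview.link(xout_rgb.input)
                    xout_depth = pipeline.create(dai.node.XLinkOut)
                    xout_depth.setStreamName("depth")
                    stereo.depth.link(xout_depth.input)

                    with dai.Device(pipeline) as device:
                        q_rgb = device.getOutputQueue("rgb", maxSize=4, blocking=False)
                        q_depth = device.getOutputQueue("depth", maxSize=4, blocking=False)

                        frame = q_rgb.get().getCvFrame()            # 1. "photo" from the camera
                        depth = q_depth.get().getFrame()            # depth map in millimetres

                        results = model.predict(frame, conf=0.85)   # 2. run YOLO on the host
                        points = []
                        if results[0].masks is not None:
                            for mask in results[0].masks:
                                c = Polygon(mask.xy[0]).centroid    # 3. midpoint of the mask polygon
                                # 4. depth at that pixel; the depth frame may be a different
                                #    resolution than the preview, so scale coordinates if needed
                                z = float(depth[int(c.y), int(c.x)])
                                points.append([c.x, c.y, z])
                        xyz = np.array(points)                      # x, y in pixels, z in mm
                    $$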

                    If you need additional assistance, whether it's code or other questions, feel free to ask.