To clarify my objective further, I've provided a visualization of the goal I aim to achieve below.

  • erik replied to this.

    Hi KoenvanWijlick,
    Mid-point mask: do you mean the raw results that your model returns to you (undecoded)? Usually the OAK camera does the decoding on the device itself, so you don't have to deal with that data.

      erik

      I'm working on extracting the mask array from my YOLO model. In the YOLO setup, I achieve this using the following code:
      results = model.predict(img_pil, conf=0.85)  # run inference on a PIL image
      result = results[0]                          # result for the first (and only) image
      masks = result.masks                         # segmentation masks for that image

      Afterward, I process this array to determine the midpoint of the polygon it forms. My goal is to relay this midpoint back to the camera so I can read the exact depth at that specific pixel. Fortunately, I've already set up the code needed to fetch the depth at a particular pixel. The current challenge is retrieving the array from the camera.
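      Roughly, the midpoint step looks like the sketch below (a minimal example only: it assumes result.masks.xy holds the mask polygon vertices in pixel coordinates, and it takes the centroid of those vertices as the midpoint; the library I actually used may do this differently):

      import numpy as np

      results = model.predict(img_pil, conf=0.85)
      polygon = results[0].masks.xy[0]             # (N, 2) vertices of the first mask polygon, in pixels
      cx, cy = polygon.mean(axis=0)                # centroid of the vertices as a simple midpoint
      midpoint = (int(round(cx)), int(round(cy)))  # pixel coordinate to query the depth at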

      Any insights would be appreciated.

      Currently, this is the result I obtain from the array for the image captured by the camera. The red dot is pinpointed using a specific Python library.

      • erik replied to this.

        Hi KoenvanWijlick,
        So I believe this does mean undecoded/raw results. You can get those results via myYoloNode.outNetwork.link() (e.g. link to an XLinkOut node to send those raw results to the host). This output provides NNData messages, which can hold multiple layers (if your model has multiple outputs, which YOLO models do).
        Example for a model that has 3 outputs, where rec is the NNData object: https://github.com/luxonis/depthai-experiments/blob/master/gen2-head-posture-detection/main.py#L21-L24
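        As a rough sketch of the idea (it assumes the YoloDetectionNetwork node already has its blob set and a camera linked to its input, and the stream name "raw_nn" is just an example):

        import depthai as dai

        pipeline = dai.Pipeline()
        yolo = pipeline.create(dai.node.YoloDetectionNetwork)
        # ... set the blob, thresholds, and link a camera to yolo.input here ...

        xout_raw = pipeline.create(dai.node.XLinkOut)
        xout_raw.setStreamName("raw_nn")
        yolo.outNetwork.link(xout_raw.input)           # send undecoded NN results to the host

        with dai.Device(pipeline) as device:
            q = device.getOutputQueue("raw_nn", maxSize=4, blocking=False)
            nn_data = q.get()                          # dai.NNData message
            names = nn_data.getAllLayerNames()         # one entry per model output
            first_layer = nn_data.getLayerFp16(names[0])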

          erik

          How can I properly construct a layer within the YOLO architecture to extract the relevant data? The code you provided clarifies this somewhat.

          • erik replied to this.

            Hi KoenvanWijlick,
            Your YOLO model likely already provides these outputs; otherwise the code you shared above wouldn't work (I assume; I'm not sure what model.predict(img_pil, conf=0.85) returns).

              erik

              "Thank you, Erik, for the help. I got it to work as intended. I chose a different route and used the API directly within a class, instead of the SDK."

                2 months later

                KoenvanWijlick Hi, thank you for your post; I am trying to achieve something very similar. Can you please explain a bit further how you achieved this using the API?

                Do you think it would be possible to run YOLO object detection on the same frame? I managed to implement this on .mp4 videos in batches, but I am quite lost when trying to get these masks from the OAK-D device.

                Would highly appreciate your ideas on this.
                Regards

                  Companion

                  My post from September 21 accurately describes what I did. As jakaskerl mentioned, the current converter does not support converting segmentation models for versions 7 and 8, so I am not using the onboard AI in this part of my code.

                  I managed to get this code functioning, but I decided to revert to the YOLO model, which focuses solely on object detection rather than segmentation. This approach proved to be more suitable for my use case after extensive research. If the converter is updated to include segmentation capabilities, I might consider switching to it.

                  Regarding your situation, here's what I did: First, I requested a photo from the camera. Then I ran the YOLO model on that photo. Next, I used Python code to determine the midpoint of the mask. Finally, I requested depth data at that midpoint from the camera and combined the x, y, z coordinates into an array.
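                  In code, that flow is roughly the sketch below (queue names "rgb" and "depth" are placeholders for whatever output queues your pipeline exposes, and it assumes the depth frame is aligned to the color frame so the same pixel coordinates apply):

                  import numpy as np

                  frame = rgb_queue.get().getCvFrame()         # 1) photo from the camera (BGR numpy array)
                  results = model.predict(frame, conf=0.85)    # 2) run the YOLO model on the host
                  polygon = results[0].masks.xy[0]             # 3) mask polygon -> midpoint
                  cx, cy = polygon.mean(axis=0).astype(int)

                  depth_frame = depth_queue.get().getFrame()   # 4) depth map (uint16, millimetres)
                  z = int(depth_frame[cy, cx])                 # depth at the midpoint pixel

                  point_xyz = np.array([cx, cy, z])            # 5) x, y, z packed into one array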

                  If you need additional assistance, whether it's code or other questions, feel free to ask.