How to run two inferences together on RVC2?

ShivamSharma

I have two models, first one is object detection YOLOv8, and the second one is mobilenetv2_100. The complicated part is that I want to run the classification model on the crop of the object detection model, and I am not able to do that because I keep getting errors. Is it possible to run these two models in sequence or concurrently on an RVC2 device? Can you give me some resource which has instructions on this.

So far, I have been able to run the following:


import depthai as dai
from depthai_nodes.node import ParsingNeuralNetwork, ImgDetectionsBridge
DEVICE = "169.254.1.222" # Set to None to use the default device, or you can specify a specific device IP
device = dai.Device(dai.DeviceInfo(DEVICE)) if DEVICE else dai.Device()
platform = device.getPlatform()
img_frame_type = dai.ImgFrame.Type.BGR888i if platform.name == "RVC4" else dai.ImgFrame.Type.BGR888p
visualizer = dai.RemoteConnection(httpPort=8082)

with dai.Pipeline(device) as pipeline:
    cam = pipeline.create(dai.node.Camera).build()
    nn_archive = dai.NNArchive(MODEL_PATH)
    # Create the neural network node
    nn_with_parser = pipeline.create(ParsingNeuralNetwork).build(
        cam.requestOutput((512, 288), type=img_frame_type, fps=30), 
        nn_archive
    )

    # Bridge the detections to the visualizer
    label_encoding = {k: v for k, v in enumerate(nn_archive.getConfig().model.heads[0].metadata.classes)}
    bridge = pipeline.create(ImgDetectionsBridge).build(nn_with_parser.out)
    bridge.setLabelEncoding(label_encoding)

    # Configure the visualizer node
    visualizer.addTopic("Video", nn_with_parser.passthrough, "images")
    visualizer.addTopic("Detections", bridge.out, "detections")

    pipeline.start()
    visualizer.registerPipeline(pipeline)
    
    while pipeline.isRunning():
        key = visualizer.waitKey(1)
        if key == ord("q"):
            print("Got q key from the remote connection!")
            break

KlemenSkrlj

Hi @ShivamSharma ,
You can check out oak-examples to see how we have already implemented something similar. In your case you can check out the human-pose example which runs object detection and then feeds crops into a second stage pose estimation NN (you would change this for your classification NN).
Best,
Klemen

ShivamSharma

KlemenSkrlj Thank you for the response. The ReadMe for this repo says that for real time application it will only work on RVC4. I tried the following on RVC2 and I only saw detection. Also, I saw sometimes the Jupyter Notebook's kernel will crash which indicates to me that the device is not capable of running both models at the same time.:

import depthai as dai
from depthai_nodes.node import (
    ParsingNeuralNetwork,
    ImgDetectionsBridge,
    GatherData,
    ImgDetectionsFilter,
)

# Your model paths
DETECTION_MODEL_PATH = r"C:\Users\ssharm21\depthai-core\depthai-ml-training\conversion\best.rvc2.tar.xz"
CLASSIFICATION_MODEL_PATH = r"C:\Users\ssharm21\depthai-core\depthai-ml-training\conversion\MobileNetV2 RVC2 Compatible Attribute Classifier.rvc2.tar.xz"

DEVICE = "169.254.1.222"
PADDING = 0.1  # Add padding around detected objects

device = dai.Device(dai.DeviceInfo(DEVICE)) if DEVICE else dai.Device()
platform = device.getPlatform()
img_frame_type = dai.ImgFrame.Type.BGR888i if platform.name == "RVC4" else dai.ImgFrame.Type.BGR888p
visualizer = dai.RemoteConnection(httpPort=8082)

print(f"Platform: {platform.name}")

with dai.Pipeline(device) as pipeline:
    print("Creating detection + classification pipeline...")

    # === CAMERA ===
    cam = pipeline.create(dai.node.Camera).build()
    camera_out = cam.requestOutput((512, 288), type=img_frame_type, fps=15)

    # === DETECTION MODEL ===
    detection_archive = dai.NNArchive(DETECTION_MODEL_PATH)
    detection_nn = pipeline.create(ParsingNeuralNetwork).build(
        camera_out,
        detection_archive
    )
    detection_nn.input.setBlocking(False)
    detection_nn.input.setMaxSize(1)

    # Detection bridge
    detection_label_encoding = {k: v for k, v in enumerate(detection_archive.getConfig().model.heads[0].metadata.classes)}
    detection_bridge = pipeline.create(ImgDetectionsBridge).build(detection_nn.out)
    detection_bridge.setLabelEncoding(detection_label_encoding)

    # === OPTIONAL: FILTER SPECIFIC CLASSES ===
    # If you want to classify only specific detected objects
    # valid_labels = [0, 1, 2]  # Bear_nest, Lot_box, cassette - adjust as needed
    # detections_filter = pipeline.create(ImgDetectionsFilter).build(
    #     detection_nn.out, labels_to_keep=valid_labels
    # )

    # === SCRIPT NODE FOR CROPPING ===
    script_node = pipeline.create(dai.node.Script)
    detection_nn.out.link(script_node.inputs["det_in"])
    detection_nn.passthrough.link(script_node.inputs["preview"])

    # Script to generate crop configurations
    script_content = f"""
import time

def generate_crops(detections, img_width, img_height, target_width, target_height, padding):
    crops = []
    for detection in detections.detections:
        # Get bounding box
        x1 = max(0, detection.xmin - padding)
        y1 = max(0, detection.ymin - padding) 
        x2 = min(1, detection.xmax + padding)
        y2 = min(1, detection.ymax + padding)
        
        # Create crop config
        cfg = ImageManipConfig()
        cfg.setCropRect(x1, y1, x2, y2)
        cfg.setResize({224}, {224})  # Classification model input size
        crops.append(cfg)
    
    return crops

while True:
    try:
        detections = node.io['det_in'].get()
        frame = node.io['preview'].get()
        
        if detections is not None and frame is not None:
            crops = generate_crops(
                detections, 
                frame.getWidth(), 
                frame.getHeight(),
                224, 224,  # Classification input size
                {PADDING}
            )
            
            # Send crops one by one
            for i, crop_cfg in enumerate(crops):
                node.io['manip_cfg'].send(crop_cfg)
                node.io['manip_img'].send(frame)
                
    except:
        pass
"""

    script_node.setScript(script_content)

    # === IMAGE CROPPER ===
    crop_node = pipeline.create(dai.node.ImageManip)
    crop_node.initialConfig.setOutputSize(224, 224)  # Classification input size
    crop_node.inputConfig.setWaitForMessage(True)

    script_node.outputs["manip_cfg"].link(crop_node.inputConfig)
    script_node.outputs["manip_img"].link(crop_node.inputImage)

    # === CLASSIFICATION MODEL ===
    classification_archive = dai.NNArchive(CLASSIFICATION_MODEL_PATH)
    classification_nn = pipeline.create(ParsingNeuralNetwork).build(
        crop_node.out,
        classification_archive
    )
    classification_nn.input.setBlocking(False)
    classification_nn.input.setMaxSize(1)

    # === SYNCHRONIZATION ===
    # Sync classification results with detections
    gather_data_node = pipeline.create(GatherData).build(camera_fps=15)
    classification_nn.out.link(gather_data_node.input_data)
    detection_bridge.out.link(gather_data_node.input_reference)

    # === VISUALIZATION ===
    visualizer.addTopic("Video", detection_nn.passthrough, "images")
    visualizer.addTopic("Detections", detection_bridge.out, "detections")
    visualizer.addTopic("Cropped_Objects", crop_node.out, "images")
    visualizer.addTopic("Classifications", classification_nn.out, "classifications")
    visualizer.addTopic("Synced_Results", gather_data_node.out, "detections")

    print(f"Detection classes: {detection_label_encoding}")
    print(f"Classification classes: {list(classification_archive.getConfig().model.heads[0].metadata.classes)}")
    print("Open http://localhost:8082 to view results")

    pipeline.start()
    visualizer.registerPipeline(pipeline)

    while pipeline.isRunning():
        key = visualizer.waitKey(1)
        if key == ord("q"):
            print("Got q key. Exiting...")
            break

KlemenSkrlj

Could you elaborite a bit more what you mean by you don't see detections? You don't see them on the visualizer or also nothing is printed by print command I see in your code?

The realtime only on RVC4 comment is explicitly for that example. This is dependent on which 2 models you have and how compute intensive are together. Best way is to try and see with your combination.

ShivamSharma

KlemenSkrlj I apologize for being unclear. I changed the code similar to what the pose example has and kept the fps as low as 1.25 but the python kernel was still crashing. So, I guess it means that I need more memory efficient model.

KlemenSkrlj

I would suggest to try a smaller classification model yes. Or if you can optimize the pipeline in a way that you only have the detection model and use the classes predicted for specific bounding boxes. This way you have only 1 model instead of 2. But it doesn't work for all cases, maybe just something to consider.

ShivamSharma

KlemenSkrlj Yes, I had some wrong data in the NN archive configuration that was causing the kernel crash. I was able to run both, but it looks like the classification parser and head does not have confidence threshold variable. Also, I am still trying to figure out the Annotation class for visualization. I am using YOLOv8, and TinyNet-A and they are working okay at 5fps.

Edit 1: I was able to make it work with annotations. But I would still like to know if there is some way to set confidence threshold in classification on device.

Edit 2: I know it is a little off topic, but do you know how can I avoid the confidence or prediction in the classification from fluctuating? In a video its confidence fluctuates a lot but based on my testing on images it performs good on images. Is it possible to capture only one image and run inference on device. How would I go about that?

KlemenSkrlj

Edit 1: I was able to make it work with annotations. But I would still like to know if there is some way to set confidence threshold in classification on device.

You can create a custom host node, link parsed classification messages into it and perform any other postprocessing (like filtering based on confidence)

Edit 2: I know it is a little off topic, but do you know how can I avoid the confidence or prediction in the classification from fluctuating? In a video its confidence fluctuates a lot but based on my testing on images it performs good on images.

This largely depends on how well the model is trained and how generalizable it is. If detections are jumping than it might be that it is too fine-tunned to your training set and lacks some harder example that happen in real world. You can also play around with augmentations to see if you can get a more generalizable model without needing to collect more data - combination of both usually gives the best results

Is it possible to capture only one image and run inference on device. How would I go about that?

You can create an output queue on the stream you want to get the image from. And then in the while pipeline.isRunning() block you get he ImgFrame, get its CvFrame (.getCvFrame()) and save it. For testing on this frame the process is similar but in reverse: You crete an inputQueue and then read the image, create a new ImgFrame, set the data (.setCvFrame()) and send it to the input queue.