Description
I'm experiencing a significant performance issue when using a custom converted YOLOv8n model compared to the Luxonis model zoo version. The model zoo yolov6-nano runs at ~35 FPS, but my custom YOLOv8n (same nano size) only achieves 6-7 FPS using the exact same pipeline.
Setup:
Device: OAK-D (RVC2)
Custom model: YOLOv8n converted via HubAI SDK
Conversion settings: INT8 quantization, 640x640 input, GENERAL quantization data
The conversion completes successfully without errors
What I've tried:
Using the same pipeline code for both models
Verified the model is properly quantized (INT8)
Same confidence threshold (0.5)
Same input resolution for my model (640x640) — although, on closer inspection, the model zoo ID says 512x288, so the two models may not actually run at the same resolution
I'm not sure what could be causing such a dramatic performance difference. Any insights would be greatly appreciated!
Conversion Script
$$
import os

from hubai_sdk import HubAIClient
from ultralytics import YOLO

# HubAI client authenticated via the HUBAI_API_KEY environment variable.
client = HubAIClient(api_key=os.getenv("HUBAI_API_KEY"))

# Load the SAME checkpoint that is converted below.
# BUG FIX: the original loaded models/yolov8l.pt here while converting
# models/yolov8n.pt. The class names happen to coincide for stock COCO
# checkpoints, but for custom-trained weights this mismatch would embed
# the wrong labels into the converted archive.
model = YOLO("models/yolov8n.pt")

# model.names maps class id -> name in id order, so the values are the
# class-name list directly (replaces the manual loop that shadowed the
# builtin `id`).
class_names = list(model.names.values())

# Submit the conversion job targeting the RVC2 platform (OAK-D).
response = client.convert.RVC2(
    path="models/yolov8n.pt",
    name="quantized-yolov8n",
    target_precision="INT8",
    quantization_data="GENERAL",
    yolo_input_shape=[640, 640],
    yolo_class_names=class_names,
    yolo_version="yolov8",
    superblob=False,
)
print(f"Converted model downloaded to: {response.downloaded_path}")
$$
Pipeline Code
$$
import time
import depthai as dai
import cv2
import numpy as np
def frameNorm(frame, bbox):
    """Convert normalized <0..1> bbox values into absolute pixel coordinates.

    Even-indexed entries (x values) scale by the frame width, odd-indexed
    entries (y values) by the frame height. Values are clipped to [0, 1]
    first so out-of-range detections stay inside the image.
    """
    height, width = frame.shape[:2]
    pixels = []
    for idx, value in enumerate(np.clip(bbox, 0, 1)):
        extent = width if idx % 2 == 0 else height
        pixels.append(int(value * extent))
    return np.array(pixels)
# Open the first available DepthAI device (OAK-D / RVC2).
device = dai.Device()
with dai.Pipeline(device) as pipeline:
print('Creating pipeline...')
# Color camera source: 1352x1012 sensor mode at 52 FPS feeds the network.
cam = pipeline.create(dai.node.Camera)
cam.build(dai.CameraBoardSocket.CAM_A, sensorResolution=(1352, 1012), sensorFps=52)
# Custom model (6-7 FPS): converted YOLOv8n archive with a 640x640 input.
nn_archive = dai.NNArchive('quantized-yolov8n-exported-to-target-rvc2/yolov8n.rvc2.tar.xz')
# Model zoo version (35 FPS)
# NOTE(review): the zoo model ID below says its input is 512x288 — roughly
# 2.8x fewer pixels per inference than 640x640, which alone could explain a
# large share of the FPS gap. Confirm by re-converting the custom model at
# a matching resolution before comparing throughput.
# model_description = dai.NNModelDescription("luxonis/yolov6-nano:r2-coco-512x288")
# model_description.platform = device.getPlatformAsString()
# nn_archive = dai.NNArchive(dai.getModelFromZoo(model_description))
# DetectionNetwork wires the camera into the NN and parses its detections.
detection = pipeline.create(dai.node.DetectionNetwork).build(input=cam, nnArchive=nn_archive)
detection.setConfidenceThreshold(0.5)
# Host-side queues: passthrough frames for display, parsed detection messages.
videoQueue = detection.passthrough.createOutputQueue()
detectionQueue = detection.out.createOutputQueue()
# Class-id -> label mapping taken from the NN archive.
labelMap = detection.getClasses()
# Shared display state mutated by the main loop below.
frame = None
detections = []
startTime = time.monotonic()
counter = 0
color2 = (255, 255, 255)
def displayFrame(name, frame):
    """Overlay every cached detection (label, confidence, box) on *frame* and show it."""
    box_color = (255, 0, 0)
    for det in detections:
        # Detection coords are normalized [0,1]; map them to pixels.
        x1, y1, x2, y2 = frameNorm(frame, (det.xmin, det.ymin, det.xmax, det.ymax))
        cv2.putText(
            frame,
            labelMap[det.label],
            (x1 + 10, y1 + 20),
            cv2.FONT_HERSHEY_TRIPLEX,
            0.5,
            255,
        )
        cv2.putText(
            frame,
            f"{int(det.confidence * 100)}%",
            (x1 + 10, y1 + 40),
            cv2.FONT_HERSHEY_TRIPLEX,
            0.5,
            255,
        )
        cv2.rectangle(frame, (x1, y1), (x2, y2), box_color, 2)
    cv2.imshow(name, frame)
print('Pipeline created.')
pipeline.start()
while pipeline.isRunning():
# Non-blocking reads: either queue may return None on any iteration.
videoIn = videoQueue.tryGet()
img_detections = detectionQueue.tryGet()
if videoIn is not None:
frame = videoIn.getCvFrame()
# Overlay the measured NN throughput: detection messages per elapsed second.
cv2.putText(frame,
f"NN fps: {counter / (time.monotonic() - startTime):.2f}",
(2, frame.shape[0] - 4),
cv2.FONT_HERSHEY_TRIPLEX,
0.4,
color2,
)
# Cache the latest detections; counter counts NN result messages for the
# FPS estimate above.
if img_detections:
detections = img_detections.detections
counter += 1
# Redraw with the most recent frame + detections (they may be from
# different loop iterations).
if frame is not None:
displayFrame("Detections", frame)
# 'q' stops the pipeline and exits the loop.
if cv2.waitKey(1) == ord('q'):
pipeline.stop()
break
$$
I'm trying to figure out whether the issue is with my conversion settings or my pipeline configuration. Any help identifying the root cause would be greatly appreciated! Also, if anyone has resources on optimizing either the model conversion process or the pipeline setup for better performance, I'd love to learn more.