Hey, here is the code I'm using. It runs a YOLO-tiny model on an OAK-1 camera for barcode detection. Hope this helps.
from pathlib import Path
import datetime
import sys
import cv2
import depthai as dai
import numpy as np
import time
labelMap = ["barcode"]
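# Single-class model: label index 0 maps to "barcode"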
syncNN = True
# Create pipeline
pipeline = dai.Pipeline()
# Define sources and outputs
camRgb = pipeline.create(dai.node.ColorCamera)
detectionNetwork = pipeline.create(dai.node.YoloDetectionNetwork)
xoutRgb = pipeline.create(dai.node.XLinkOut)
nnOut = pipeline.create(dai.node.XLinkOut)
xoutRgb.setStreamName("rgb")
nnOut.setStreamName("nn")
# Properties
camRgb.setPreviewSize(320, 320)
camRgb.setResolution(dai.ColorCameraProperties.SensorResolution.THE_4_K)
camRgb.setInterleaved(False)
camRgb.setColorOrder(dai.ColorCameraProperties.ColorOrder.BGR)
camRgb.setFps(40)
# Network specific settings
detectionNetwork.setConfidenceThreshold(0.5)
detectionNetwork.setNumClasses(1)
detectionNetwork.setCoordinateSize(4)
detectionNetwork.setAnchors(
    [23.09375, 6.93359375, 16.28125, 33.5625, 35.5, 24.859375, 71.25, 17.328125, 39.59375, 32.125, 52.25, 36.125,
     70.875, 33.5625, 65.5, 48.9375, 88.5625, 63.75])
detectionNetwork.setAnchorMasks({"side40": [0, 1, 2], "side20": [3, 4, 5], "side10": [6, 7, 8]})
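# The mask names correspond to the YOLO output grid sizes for a 320x320 input:
# 320/8 = 40, 320/16 = 20 and 320/32 = 10, i.e. strides 8, 16 and 32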
detectionNetwork.setBlobPath('model/barcode.blob')
detectionNetwork.setIouThreshold(0.5)
detectionNetwork.setNumInferenceThreads(2)
detectionNetwork.input.setBlocking(False)
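# Non-blocking input: the NN skips frames it cannot keep up with instead of
# stalling the camera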
camRgb.setVideoSize(960, 960)
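# NOTE: the 960x960 video output is never linked below; only the 320x320
# preview feeds both the NN and the display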
# Linking
camRgb.preview.link(detectionNetwork.input)
if syncNN:
    detectionNetwork.passthrough.link(xoutRgb.input)
else:
    camRgb.preview.link(xoutRgb.input)
detectionNetwork.out.link(nnOut.input)
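# With syncNN the displayed frames are the NN passthrough, so the drawn boxes
# always match the exact frame they were computed on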
# Connect to device and start pipeline
with dai.Device(pipeline) as device:
    # Non-blocking output queues for the rgb frames and nn data: when they
    # fill up, the oldest packets are dropped so the host never stalls the device
    qRgb = device.getOutputQueue(name="rgb", maxSize=250, blocking=False)
    qDet = device.getOutputQueue(name="nn", maxSize=250, blocking=False)

    frame = None
    detections = []
    startTime = time.monotonic()
    counter = 0
    count = 0  # detections drawn since the last saved snapshot
    color2 = (255, 255, 255)
    # Placeholders for values that live elsewhere in my project; adjust for your setup
    deviceName = "oak1"
    pathC = Path(".")
    dtypesmall = "-small"
    # The NN returns bounding boxes normalized to <0..1>; scale them to pixel
    # coordinates with the frame width/height
    def frameNorm(frame, bbox):
        normVals = np.full(len(bbox), frame.shape[0])
        normVals[::2] = frame.shape[1]
        return (np.clip(np.array(bbox), 0, 1) * normVals).astype(int)
    def displayFrame(name, frame):
        color = (255, 0, 0)
        global count
        for detection in detections:
            bbox = frameNorm(frame, (detection.xmin, detection.ymin, detection.xmax, detection.ymax))
            cv2.putText(frame, labelMap[detection.label], (bbox[0] + 10, bbox[1] + 20), cv2.FONT_HERSHEY_TRIPLEX, 0.5, 255)
            cv2.putText(frame, f"{int(detection.confidence * 100)}%", (bbox[0] + 10, bbox[1] + 40), cv2.FONT_HERSHEY_TRIPLEX, 0.5, 255)
            cv2.rectangle(frame, (bbox[0], bbox[1]), (bbox[2], bbox[3]), color, 2)
            count += 1
            if count == 30:
                # Save a snapshot of the annotated frame, then restart the counter
                a = deviceName + '-' + datetime.datetime.utcnow().strftime('%Y%m%d%H%M%S%f')
                cv2.imwrite(str(pathC) + "/data/small/" + a + dtypesmall + '.jpg', frame)
                count = 0
        # Show the frame
        cv2.imshow(name, frame)
    while True:
        if syncNN:
            # Blocking get() keeps frames and detections in lockstep
            inRgb = qRgb.get()
            inDet = qDet.get()
        else:
            inRgb = qRgb.tryGet()
            inDet = qDet.tryGet()

        if inRgb is not None:
            frame = inRgb.getCvFrame()
            cv2.putText(frame, "NN fps: {:.2f}".format(counter / (time.monotonic() - startTime)),
                        (2, frame.shape[0] - 4), cv2.FONT_HERSHEY_TRIPLEX, 0.4, color2)

        if inDet is not None:
            detections = inDet.detections
            counter += 1

        if frame is not None:
            displayFrame("rgb", frame)

        if cv2.waitKey(1) == ord('q'):
            break
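By the way, if you want to save just the barcode region instead of the full frame, something like this minimal sketch works (cropDetection and the margin parameter are my own names, not part of the pipeline above):

def cropDetection(frame, detection, margin=10):
    # Scale the normalized bbox to pixel coordinates with frameNorm, pad it
    # a little, and clamp to the frame borders before slicing
    bbox = frameNorm(frame, (detection.xmin, detection.ymin, detection.xmax, detection.ymax))
    x1 = max(bbox[0] - margin, 0)
    y1 = max(bbox[1] - margin, 0)
    x2 = min(bbox[2] + margin, frame.shape[1])
    y2 = min(bbox[3] + margin, frame.shape[0])
    return frame[y1:y2, x1:x2]

You could call it inside the detection loop and pass the result to cv2.imwrite instead of the full frame.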