Hi @erik and everyone,

I trained a ResNet18 model with PyTorch to classify excavator actions and converted it to a .blob file for the OAK D camera. However, its accuracy was lower on the camera than on the host.

This is part of the script I run with my OAK-D camera:

import cv2
import depthai as dai
import numpy as np


def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

def to_planar(arr: np.ndarray, shape: tuple) -> np.ndarray:
    # Resize to the NN input size and convert HWC -> CHW planar layout
    resized = cv2.resize(arr, shape)
    return resized.transpose(2, 0, 1).flatten()
             
# Pipeline defined, now the device is assigned and pipeline is started
with dai.Device(pipeline) as device:
        
    # Input queue will be used to send video frames to the device.
    qIn = device.getInputQueue(name="inFrame")
    # Output queue will be used to get nn data from the video frames.
    qDet = device.getOutputQueue(name="nn", maxSize=4, blocking=False) # maxSize=6, blocking=True)
    qPass = device.getOutputQueue("pass")

    frame = None
    result = None
            
    cap = cv2.VideoCapture(videoPath)
    while cap.isOpened():
        read_correctly, frame = cap.read()
        if not read_correctly:
            break
        
        frame_planar = to_planar(frame, (224, 224))
            
        # Create a dai.ImgFrame and send to the device
        img = dai.ImgFrame()
        img.setType(dai.RawImgFrame.Type.BGR888p) 
        img.setSize(224, 224)
        img.setData(frame_planar)
        qIn.send(img)
    
        inDet = qDet.tryGet()

        if inDet is not None:
            data = softmax(inDet.getFirstLayerFp16())
            result_conf = np.max(data)
            if result_conf > 0.2:
                result = {
                    "name": labels[np.argmax(data)],
                    "conf": round(100 * result_conf, 2)
                }
            else:
                result = None
            
            frame_main = qPass.get().getCvFrame()
            if result is not None:
                cv2.putText(frame_main, "{}".format(result["name"]), (5, 10), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (0, 0, 255))
                cv2.putText(frame_main, "{}%".format(result["conf"]), (5, 20), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (0, 0, 255))
                
            cv2.imshow("passthrough", cv2.resize(frame_main, (224, 224)))

        # imshow needs a waitKey call to actually refresh the window
        if cv2.waitKey(1) == ord('q'):
            break

And this is how I run the model on my host (without the OAK-D camera):

cap = cv2.VideoCapture(video_path)
while cap.isOpened():
    ret, frame = cap.read()
    
    if not ret:
        break
    
    # Convert frame to RGB for model inference
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    
    # Resize the frame to 224x224
    frame_resized = cv2.resize(frame_rgb, (224, 224))
    
    # Normalize the image and add the batch dimension
    img_tensor = torch.from_numpy(frame_resized / 255.0).permute(2, 0, 1).float().unsqueeze(0)
    
    # Inference with the ResNet18 model
    with torch.no_grad():
        outputs = model(img_tensor)
        probs = F.softmax(outputs, dim=1)
        
    # Get class with highest confidence
    confidences, class_idx = probs.squeeze(0).max(0)
    label = f'{class_names[class_idx]} {confidences:.2f}'
    
    # Draw class and confidence on the frame
    cv2.putText(frame_resized, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 2)
    
    # Convert back to BGR for displaying
    frame_display = cv2.cvtColor(frame_resized, cv2.COLOR_RGB2BGR)
    
    # Display the frame
    cv2.imshow('Video', frame_display)

    # imshow needs a waitKey call to actually refresh the window
    if cv2.waitKey(1) == ord('q'):
        break

I ran the model on both the OAK-D and my host. While both worked, the OAK-D's results were significantly less accurate. Could a difference in the neural network processing be the cause?

I also trained a YOLOv5s model and saw the same pattern: better accuracy on the host than on the OAK-D. This suggests the issue is not with the models themselves.

I've been puzzled by this for quite a while, and I just want the NN node on my OAK-D to produce equally accurate results.

I'd be glad to hear your thoughts!

Regards,

Austin


    Hi YWei,
    It might be due to quantization differences - (RVC2) OAK cameras only support FP16. Perhaps it would be best to try using OpenVINO's inference engine for FP16 inference, to cross-check whether that's the issue.
    More docs here:
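    A minimal sketch of such a cross-check with the OpenVINO runtime (assuming the FP16 IR produced as an intermediate step of the blob conversion is available as best.xml/best.bin, with the mean/scale values already baked in at conversion time) could look like:

        import cv2
        import numpy as np
        from openvino.runtime import Core

        core = Core()
        # best.xml / best.bin: the FP16 IR generated during blob conversion
        compiled = core.compile_model(core.read_model("best.xml"), "CPU")
        out = compiled.output(0)

        frame = cv2.imread("test_frame.jpg")               # placeholder test image
        blob = cv2.resize(frame, (224, 224)).astype(np.float32)
        blob = blob.transpose(2, 0, 1)[np.newaxis, ...]    # HWC -> NCHW
        # no extra normalization here if mean/scale were baked into the IR
        logits = compiled([blob])[out]
        print(logits.argmax(), logits.max())

    If the FP16 IR matches the host results, the accuracy drop is more likely caused by input preprocessing than by FP16 quantization.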

    YWei

    Can you also provide the mean values and scale values you use when exporting ONNX to blob for both models (ResNet18 and YOLOv5-cls mentioned in another thread)? It's likely that the pre-processing is different for the two.

    Best,
    Matija


      Hi Matija,

      When converting the model files from ONNX to .blob, I set the parameters this way, as recommended in the link:

      --data_type=FP16 --mean_values=[0,0,0] --scale_values=[255,255,255]

      To add, when converting my ResNet18 model to ONNX, I ran this code block:

      import torch
      
      # Load the trained model first; torch.onnx.export needs a model object, not a file path
      # (assumes best.pt stores the full model; if it only holds a state_dict, build the
      # ResNet18 architecture first and load the weights into it)
      model = torch.load('best.pt', map_location='cpu')
      model.eval()
      
      dummy_input = torch.randn(1, 3, 224, 224, device='cpu')  # adjust as necessary
      torch.onnx.export(model, dummy_input, "best.onnx")

      And I ran this Colab script to convert my YOLOv5-cls model to ONNX.

      I'm not sure whether these steps could also be causing the problem?

      Cheers,

      Austin

        YWei

        When you are training, do you normalize the images? You would have to check how the image is transformed after it is read and before it is passed to the model during training. I would assume the image is read as a tensor with values between 0 and 1 and then normalized with the ImageNet mean and std (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]). Since images on OAK come with values in the range [0-255], you have to multiply these by 255 and then set:

        --mean_values [123.675,116.28,103.53]
        --scale_values [58.395,57.12,57.375]

        Also, the images are likely read in RGB during training, while on OAK they are passed in BGR unless you have specifically set them to RGB. This means you will additionally have to use the --reverse_input_channels flag.

        For YOLOv5-cls, just using this flag together with the mean and scale values you currently use should suffice; a conversion sketch with these flags is shown below.
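        For reference, a minimal sketch of passing these flags through the blobconverter Python package (the ONNX file name and shave count are placeholders, adjust to your setup):

            import blobconverter

            # Convert the ONNX to a .blob, baking ImageNet normalization and the
            # RGB<->BGR channel swap into the model itself
            blob_path = blobconverter.from_onnx(
                model="best.onnx",
                data_type="FP16",
                shaves=6,
                optimizer_params=[
                    "--mean_values=[123.675,116.28,103.53]",
                    "--scale_values=[58.395,57.12,57.375]",
                    "--reverse_input_channels",
                ],
            )
            print(blob_path)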


          Hi Matija,

          Thank you, I think I have found the root of the problem: when OpenCV captures frames from a video, it returns them with the colour channels in BGR order, which doesn't match the RGB order the model was trained on. So I simply added a line like this to fix it:

          frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

          It's a tricky one but easy to troubleshoot, so I don't think I need any more complicated processing like normalization or scaling 🙂
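          For context, in the device-side loop the conversion goes right after reading the frame and before to_planar (a small sketch, assuming the model was trained on RGB input):

              read_correctly, frame = cap.read()
              # OpenCV decodes video frames as BGR; the network expects RGB
              frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
              frame_planar = to_planar(frame, (224, 224))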

          Thanks,

          Austin