Hi @erik and everyone,

I trained a ResNet18 model with PyTorch to classify excavator actions and converted it to a .blob file for the OAK D camera. However, its accuracy was lower on the camera than on the host.

This is part of the script I run with my OAK-D camera:

import cv2
import depthai as dai
import numpy as np


def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

def to_planar(arr: np.ndarray, shape: tuple) -> np.ndarray:
    # Resize to the NN input size and convert HWC -> CHW planar layout
    resized = cv2.resize(arr, shape)
    return resized.transpose(2, 0, 1).flatten()
             
# Pipeline defined, now the device is assigned and pipeline is started
with dai.Device(pipeline) as device:
        
    # Input queue will be used to send video frames to the device.
    qIn = device.getInputQueue(name="inFrame")
    # Output queue will be used to get nn data from the video frames.
    qDet = device.getOutputQueue(name="nn", maxSize=4, blocking=False) # maxSize=6, blocking=True)
    qPass = device.getOutputQueue("pass")

    frame = None
    result = None
            
    cap = cv2.VideoCapture(videoPath)
    while cap.isOpened():
        read_correctly, frame = cap.read()
        if not read_correctly:
            break
        
        frame_planar = to_planar(frame, (224, 224))
            
        # Create a dai.ImgFrame and send to the device
        img = dai.ImgFrame()
        img.setType(dai.RawImgFrame.Type.BGR888p) 
        img.setSize(224, 224)
        img.setData(frame_planar)
        qIn.send(img)
    
        inDet = qDet.tryGet()

        if inDet is not None:
            data = softmax(inDet.getFirstLayerFp16())
            result_conf = np.max(data)
            if result_conf > 0.2:
                result = {
                    "name": labels[np.argmax(data)],
                    "conf": round(100 * result_conf, 2)
                }
            else:
                result = None
            
            frame_main = qPass.get().getCvFrame()
            if result is not None:
                cv2.putText(frame_main, "{}".format(result["name"]), (5, 10), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (0, 0, 255))
                cv2.putText(frame_main, "{}%".format(result["conf"]), (5, 20), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (0, 0, 255))
                
            cv2.imshow("passthrough", cv2.resize(frame_main, (224, 224)))

        # imshow needs a waitKey call to actually refresh the window
        if cv2.waitKey(1) == ord('q'):
            break

And this is how I run the model on my host (without the OAK-D camera):

cap = cv2.VideoCapture(video_path)
while cap.isOpened():
    ret, frame = cap.read()
    
    if not ret:
        break
    
    # Convert frame to RGB for model inference
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    
    # Resize the frame to 224x224
    frame_resized = cv2.resize(frame_rgb, (224, 224))
    
    # Normalize the image and add the batch dimension
    img_tensor = torch.from_numpy(frame_resized / 255.0).permute(2, 0, 1).float().unsqueeze(0)
    
    # Inference with the ResNet18 model
    with torch.no_grad():
        outputs = model(img_tensor)
        probs = F.softmax(outputs, dim=1)
        
    # Get class with highest confidence
    confidences, class_idx = probs.squeeze(0).max(0)
    label = f'{class_names[class_idx]} {confidences:.2f}'
    
    # Draw class and confidence on the frame
    cv2.putText(frame_resized, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 2)
    
    # Convert back to BGR for displaying
    frame_display = cv2.cvtColor(frame_resized, cv2.COLOR_RGB2BGR)
    
    # Display the frame
    cv2.imshow('Video', frame_display)

    # imshow needs a waitKey call to actually refresh the window
    if cv2.waitKey(1) == ord('q'):
        break

I ran the model on both the OAK-D and my host. While both worked, the OAK-D's results were significantly less accurate. Could a difference in the neural network processing be the cause?

I also trained a YOLOv5s model and saw the same pattern: better accuracy on the host than on the OAK-D. This suggests the issue is not with the models themselves.

I've been puzzled by this for quite a while, and I just want the NN node on my OAK-D to produce equally accurate results.

I'd be glad to hear your thoughts!

Regards,

Austin


    Hi YWei,
    It might be due to quantization differences - (RVC2) OAK cameras only support FP16. Perhaps it would be best to try using OpenVINO's inference engine for FP16 inference, to cross-check whether that's the issue.
    More docs here:
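    A minimal sketch of such a cross-check with the OpenVINO runtime (assuming the FP16 IR produced as an intermediate step of the blob conversion is available as best.xml/best.bin, with the mean/scale values already baked in at conversion time) could look like:

        import cv2
        import numpy as np
        from openvino.runtime import Core

        core = Core()
        # best.xml / best.bin: the FP16 IR generated during blob conversion
        compiled = core.compile_model(core.read_model("best.xml"), "CPU")
        out = compiled.output(0)

        frame = cv2.imread("test_frame.jpg")               # placeholder test image
        blob = cv2.resize(frame, (224, 224)).astype(np.float32)
        blob = blob.transpose(2, 0, 1)[np.newaxis, ...]    # HWC -> NCHW
        # no extra normalization here if mean/scale were baked into the IR
        logits = compiled([blob])[out]
        print(logits.argmax(), logits.max())

    If the FP16 IR matches the host results, the accuracy drop is more likely caused by input preprocessing than by FP16 quantization.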

    YWei

    Can you also provide the mean values and scale values you use when exporting ONNX to blob for both models (ResNet18 and YOLOv5-cls mentioned in another thread)? It's likely that the pre-processing is different for the two.

    Best,
    Matija


      Hi Matija,

      When converting the model files from ONNX to .blob, I set the parameters this way, as recommended in the link:

      --data_type=FP16 --mean_values=[0,0,0] --scale_values=[255,255,255]

      To add, when converting my ResNet18 model to ONNX, I ran this code block:

      import torch
      
      # Load the trained model first; torch.onnx.export needs a model object, not a file path
      # (assumes best.pt stores the full model; if it only holds a state_dict, build the
      # ResNet18 architecture first and load the weights into it)
      model = torch.load('best.pt', map_location='cpu')
      model.eval()
      
      dummy_input = torch.randn(1, 3, 224, 224, device='cpu')  # adjust as necessary
      torch.onnx.export(model, dummy_input, "best.onnx")

      And I ran this Colab script to convert my YOLOv5-cls model to ONNX.

      I'm not sure whether these steps could also be causing the problem?

      Cheers,

      Austin

        YWei

        When you are training, do you normalize the images? You would have to check how the image is transformed after it is read and before it is passed to the model during training. I would assume the image is read as a tensor with values between 0 and 1 and then normalized with the ImageNet mean and std (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]). Since images on OAK come with values in the range [0-255], you have to multiply these by 255 and then set:

        --mean_values [123.675,116.28,103.53]
        --scale_values [58.395,57.12,57.375]

        Also, the images are likely read in RGB during training, while on OAK they are passed in BGR unless you have specifically set them to RGB. This means you will additionally have to use the --reverse_input_channels flag.

        For YOLOv5-cls, just using this flag together with the mean and scale values you currently use should suffice; a conversion sketch with these flags is shown below.
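        For reference, a minimal sketch of passing these flags through the blobconverter Python package (the ONNX file name and shave count are placeholders, adjust to your setup):

            import blobconverter

            # Convert the ONNX to a .blob, baking ImageNet normalization and the
            # RGB<->BGR channel swap into the model itself
            blob_path = blobconverter.from_onnx(
                model="best.onnx",
                data_type="FP16",
                shaves=6,
                optimizer_params=[
                    "--mean_values=[123.675,116.28,103.53]",
                    "--scale_values=[58.395,57.12,57.375]",
                    "--reverse_input_channels",
                ],
            )
            print(blob_path)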


          Hi Matija,

          Thank you, I think I have found the root of the problem: when OpenCV captures frames from a video, it returns them with the colour channels in BGR order, which doesn't match the RGB order the model was trained on. So I simply added a line like this to fix it:

          frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

          It's a tricky one but easy to troubleshoot, so I don't think I need any more complicated processing like normalization or scaling 🙂
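          For context, in the device-side loop the conversion goes right after reading the frame and before to_planar (a small sketch, assuming the model was trained on RGB input):

              read_correctly, frame = cap.read()
              # OpenCV decodes video frames as BGR; the network expects RGB
              frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
              frame_planar = to_planar(frame, (224, 224))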

          Thanks,

          Austin