  • How to get label from TwoStagePacket and append to string

I am trying to use a Yolo model for OCR. It has a label for each character. I pass a detection into the model, and I would like the OCR stage to detect each character, append its label to a string, and return the string once all characters have been detected. I can't figure out how to parse the label from the NN once the packet has been passed. How can I do this?

(FYI, your in-browser editor implements the 'code' block function incorrectly: it inserts one backtick where three are required.)

Thanks.

from depthai_sdk.oak_camera import OakCamera
from depthai_sdk.visualize.configs import BboxStyle, TextPosition
from depthai_sdk.visualize import *
import time
def ocr(packet):
    timestamp = int(time.time() * 10000)   
    i = 0
    for detection in packet.detections:
        # store label as string in a string
        # (detection.label[0] == r, decrec == r; detection.label[1] == a, decrec == ra)
        # i += 1
        break

with OakCamera() as oak:
    color = oak.create_camera('color')
    det = oak.create_nn('c:/users/user/code/models/best.blob', color, nn_type='Yolo', tracker=True)
    rec = oak.create_nn('c:/users/user/code/models/ocr.json', input=det, nn_type='Yolo')
    det.config_yolo(num_classes=1, coordinate_size=4,
                    anchors=[4.703125, 3.501953125, 7.5703125, 5.8203125, 13.46875, 9.1953125,
                             27.671875, 16.53125, 61.4375, 31.984375, 111.625, 62.09375,
                             184.125, 83.8125, 251.75, 129.25, 344.5, 185.875],
                    masks={"side52": [0, 1, 2], "side26": [3, 4, 5], "side13": [6, 7, 8]},
                    iou_threshold=0.5, conf_threshold=0.5)
    rec.config_yolo(num_classes=36, coordinate_size=4,
                    anchors=[10.0, 13.0, 16.0, 30.0, 33.0, 23.0, 30.0, 61.0, 62.0, 45.0,
                             59.0, 119.0, 116.0, 90.0, 156.0, 198.0, 373.0, 326.0],
                    conf_threshold=0.5, iou_threshold=0.5,
                    masks={"side52": [0, 1, 2], "side26": [3, 4, 5], "side13": [6, 7, 8]})
    rec.config_multistage_nn(debug=True)
    det.config_nn(resize_mode='crop')
    # oak.visualize(det.out.passthrough)
    oak.visualize(rec.out.twostage_crops, scale=2.0)
    visualizer = oak.visualize(rec.out.main, fps=True)
    visualizer.detections(
        color=(0, 255, 0),
        thickness=2,
        bbox_style=BboxStyle.RECTANGLE,
        label_position=TextPosition.MID,
    ).text(
        font_color=(255, 255, 0),
        auto_scale=True
    ).tracking(
        line_thickness=5
    )
    oak.callback(rec, callback=ocr)
    oak.start(blocking=True)

    Hi vital,
    If your ocr.json has label mappings, your code should work.

    from depthai_sdk.classes import DetectionPacket

    def cb(packet: DetectionPacket):
        text = ''
        for dets in packet.detections:
            text += dets.label
        print(text)

    In my case it printed "laptoppersonperson" (as it found a laptop and two people). Just note that using an object detection model for OCR isn't really ideal (as you would quickly find out); I would suggest looking at text detection + text recognition models instead, e.g. https://github.com/luxonis/depthai-experiments/tree/master/gen2-ocr

    Thoughts?
    Thanks, Erik

      Thanks Erik for your help, but I think my problem is that the second stage is using the detection model 'det' and not the recognition model 'rec'. The first has only one class. My output from your code is '0'.

      I am trying to crop that detection and pass it to the second NN so that it can read the characters. I know this isn't ideal for OCR in general, but this is a very specific use case, and it works perfectly if I manually crop the detections and feed them into the recognition model as images, as a single NN.

      erik

      Can you please tell me how I get the 'rec' (ocr) NN to run its own detections on the packet sent to it by 'det'? I get the cropped image, but when I send 'rec' to the callback, it just gives me the 'det' detections, when it should be detecting the letters on the plate. See the screenshot.

      This is what I get when I run the 'ocr' NN by itself:


        Hi vital,
        I'd rather suggest using an actual license plate recognition model. There are a few pretrained LPR models if you check on github/google.
        Thanks, Erik

          erik

          So the answer is... don't bother trying?

          I have a specific reason why I am trying to figure this out. If you can't or don't want to answer the question then that's fine but let me know so that I can go somewhere else for help.


            vital My suggestion was to use a different approach. If you want a concise answer, ask a concise and thought-through question, or provide a minimal repro example where something doesn't work. The above is neither.

            'How to get a label from a TwoStagePacket' in a two-stage inference using your SDK is something I consider concise and answerable. I appreciate the effort, but the answer I was given was how to get the label of the detection packet, not the second stage. If you don't understand what I am doing, you can ask a follow-up. I try not to overburden with information, especially when I suspect that I am just missing something simple due to a lack of experience.

            A problem that many people have when dealing with complex subjects with which they are unfamiliar is that they don't know exactly what to ask. If my bike chain kept falling off when changing gears and I didn't know anything about bikes, I couldn't know to ask how to adjust a derailleur; I would have to figure out how the mechanism worked and learn the terms for the components, or I would just have to ask something like 'how do I keep my chain from falling off when I go down hills?'. Hope this helps.

              Hi vital
              I'm not sure if you have checked this out, but here. This does a very similar thing to what you are doing.

              Hope it helps,
              Jaka

                jakaskerl I have seen that. Can you please explain something for me?

                What does this do, how exactly does it work, and how does one know to perform this function:

                    for det, rec in zip(packet.detections, packet.nnData):
                        emotion_results = np.array(rec.getFirstLayerFp16())
                        emotion_name = emotions[np.argmax(emotion_results)]

                The issue I have at the moment is that the examples help structure my code, but when I try to figure out what specifically happens and why, I hit a roadblock like the function above. 'getFirstLayer', or even layers, or how tensors interact with DepthAI, or how the models are integrated (what is a blob? what is in it? what is the XML file? how does one know what layers are in the model?) -- none of these things are commented in the code or mentioned beyond passing in the docs.

                I appreciate your taking the time to help.
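                On the 'what layers are in the model?' question, a small sketch: depthai 2.x can parse a compiled blob on the host, without a device, via dai.OpenVINO.Blob (the path below just reuses the model path from the code above):

                    import depthai as dai

                    # Parse the compiled .blob and list its input/output layers.
                    blob = dai.OpenVINO.Blob('c:/users/user/code/models/best.blob')

                    for name, tensor in blob.networkInputs.items():
                        print(f'input:  {name}  dims={tensor.dims}')
                    for name, tensor in blob.networkOutputs.items():
                        print(f'output: {name}  dims={tensor.dims}')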

                  vital Well, for anyone interested, I think I figured it out:

                          # Go through the detections; zip pairs each first-stage
                          # detection with its corresponding second-stage NN result.
                          for det, rec in zip(packet.detections, packet.nnData):
                              # The recognition output is prob_emotion, shape: 1, 5, 1, 1 --
                              # a softmax across five emotions
                              # (0 - 'neutral', 1 - 'happy', 2 - 'sad', 3 - 'surprise', 4 - 'anger').
                              # getFirstLayerFp16() returns the raw FP16 values of the model's
                              # first output layer as a flat list of floats; wrapping it in
                              # np.array lets numpy operations such as argmax work on it.
                              emotion_results = np.array(rec.getFirstLayerFp16())
                              # Pick the emotion whose score is highest.
                              emotion_name = emotions[np.argmax(emotion_results)]
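                  A self-contained toy version of that decoding step (the score values below are made up purely for illustration):

                          import numpy as np

                          emotions = ['neutral', 'happy', 'sad', 'surprise', 'anger']
                          # Pretend this flat list came from rec.getFirstLayerFp16():
                          raw = [0.05, 0.80, 0.05, 0.05, 0.05]
                          # argmax gives the index of the largest score; use it to
                          # index into the label list.
                          print(emotions[int(np.argmax(np.array(raw)))])  # -> happy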

                    vital So you are first running text detection, then cropping the frame to the text and running a "character" detection model. So each rec is a dai.ImgDetections instance; you should loop through it and get each char_det.label:

                    import depthai as dai
                    # ...
                    for text_det, char_dets in zip(packet.detections, packet.nnData):
                        text = ''
                        char_dets: dai.ImgDetections
                        for char_det in char_dets.detections:
                            # label is an integer class index, so convert it
                            # (or map it to a character) before appending
                            text += str(char_det.label)
                        print(text)

                      erik I shall name my first born in honor of you!

                      It is getting the characters out of order, but a big hurdle has been cleared thanks to that tip about the 'ImgDetections' class and the char_dets loop.

                      I also have to fix it crashing when there is no valid detection in the frame. Onward and upward.
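                      A minimal sketch of a guard for that no-detection crash, assuming the same callback shape used elsewhere in this thread:

                          def cb(packet):
                              # Skip frames where either stage produced nothing,
                              # instead of iterating over empty results and crashing.
                              if not packet.detections or not packet.nnData:
                                  return
                              # ... normal processing continues here ...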

                      Question: does Python automatically know that if you have a plural variable and name something the same without an 's', it is a part of that collection?

                      Question 2: what is 'char_dets: dai.ImgDetections' doing? Why can't we just write 'for dai.ImgDetection in dai.ImgDetections.detections'?

                        Hi vital

                        Answer 1: I'm not entirely sure what you are asking, but dai.ImgDetections.detections is an iterable of dai.ImgDetection objects, which each have their own confidence, label, etc... A char_det is an ImgDetection instance derived from char_dets, which is an ImgDetections instance.

                        Answer 2: By declaring char_dets as dai.ImgDetections, you are essentially stating that char_dets will hold an object of the dai.ImgDetections type. This allows the IDE or code editor to provide suggestions and perform type checking for that variable, helping with code correctness and avoiding potential errors.
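                        A DepthAI-independent illustration of that point: inside a function body, a bare annotation is never evaluated at runtime (PEP 526); it only informs tooling:

                            def shout(words):
                                w: str  # bare annotation: binds no value, no runtime effect
                                for w in words:
                                    # the IDE/type checker now treats w as a str
                                    print(w.upper())

                            shout(['ocr', 'depthai'])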

                        Hope this helps,
                        Jaka

                        Before naming your firstborn, note that "Erik" with a "k" isn't all that common 😅

                        Jokes aside, great that you got it working. You'd likely want to sort the detections by char_det.xmin, increasing, so the detection (character) on the left comes first.

                          4 days later

                          erik

                              # Assumes: import depthai as dai,
                              # from depthai_sdk.classes import TwoStagePacket,
                              # and a 'chars' list mapping label indices to characters.
                              def cb(packet: TwoStagePacket):
                                  text = ''
                                  for det, char_dets in zip(packet.detections, packet.nnData):
                                      char_dets: dai.ImgDetections

                                      # Sort character detections left-to-right by xmin
                                      sorted_detections = sorted(char_dets.detections, key=lambda d: d.xmin)

                                      # Map each label index to its character and append
                                      text += ''.join(chars[d.label] for d in sorted_detections)

                                  print(text)
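                          For the snippet to run, 'chars' must map label indices to characters. Assuming the 36 classes are digits followed by letters (a guess; the real ordering comes from the model's training labels), the map might look like:

                              # Hypothetical label map: 36 classes, assumed to be 0-9 then A-Z.
                              chars = list('0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ')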

                          I feel like I kinda am starting to understand this stuff...

                            vital That's impressive! I was a bit skeptical of this approach, but it looks like the second-stage model (for characters) is quite robust 👍