I think I'm finally comfortable with the OAK-D and DepthAI in terms of the ability to determine depth. I'd now like to move on to object detection. I've run many of the examples that combine object detection and depth, but what I really want to do is define a custom set of objects in my environment; initially they will be used to assist navigation of that environment.

I don't really understand enough about NNs, AI, or ML to even get started, at least not without some extra help. Pointers to some good references would be very helpful, but I've also got some specific questions below.

I've found the "Custom Training" page under Tutorials, and have opened and browsed through the "Easy Object Detector Training" Colab. I have not yet had the guts to run it. That Colab suggests you can train using your own images, but they have to be annotated. I did a bit of research, and it seems there may be different forms of annotation, and it is quite clear there are many different annotation tools. So, I have to ask:

  • What is the form of annotation required to run the Colab successfully?
  • Are there annotation tools that produce this format? Are there free tools that do so?

I'd also appreciate some guidelines on the training, validation, and final test images. I've read that for any object there should be images from different views, different lighting conditions, and probably other variations.

  • What factors really matter?
  • Also, how many images of a single object are sufficient? Generally there will be significant differences between most of the objects I want to detect; I'm not sure if that matters.

Thanks very much for any assistance.

I shall comment on my post, if only to show that I'm not totally lazy. While I think I've answered some of my initial questions, I've only generated more.

I looked a bit deeper into the "Easy Object Detector Training" Colab, especially in the area of the kind of data it expects. I tracked down the "fruit" image and annotation repo on GitHub and found that the annotations are in what at least one reference calls the "Pascal VOC" format (an XML form). I assume that means VOC works for the Colab.
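For anyone else following along, here is roughly what one of those VOC XML files looks like and how it can be read with Python's standard library; the filename, label ("doorway"), and box coordinates below are made-up examples, not taken from the fruit repo:

```python
# Minimal sketch of a Pascal VOC annotation and how to read it.
# The filename, label, and coordinates are made-up examples.
import xml.etree.ElementTree as ET

VOC_EXAMPLE = """<annotation>
  <filename>doorway_01.jpg</filename>
  <size><width>600</width><height>600</height><depth>3</depth></size>
  <object>
    <name>doorway</name>
    <difficult>0</difficult>
    <bndbox>
      <xmin>120</xmin><ymin>85</ymin><xmax>310</xmax><ymax>470</ymax>
    </bndbox>
  </object>
</annotation>"""

root = ET.fromstring(VOC_EXAMPLE)
for obj in root.iter("object"):
    label = obj.findtext("name")
    box = obj.find("bndbox")
    coords = tuple(int(box.findtext(t)) for t in ("xmin", "ymin", "xmax", "ymax"))
    print(label, coords)  # -> doorway (120, 85, 310, 470)
```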

After poking around more in the repo, I found a recommendation for an annotation tool (labelImg). So that means I can use that tool (I've found other candidates that are also free, and potentially easier to use).

In the repo, I also found additional helpful instructions on how to train a model with my own set of objects. For example, it discusses image sizes. The image size fed to the DepthAI pipeline is 300x300. It seems, however, that the training images are best at 800x600 or 600x600; this seems to contradict the instructions in the Colab code comments, which say "For faster training time, images should be resized to 300x300 and then annotated". Looking at the "fruit" training images, however, they range from 350x350 to at least 1500x1500, and the annotation file content corresponds to those sizes.
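As I understand it, if you resize an already-annotated image, the box coordinates in the XML have to be scaled by the same factors. A rough sketch of that idea (the function name, paths, and the 600x600 target are my own placeholders, and I have not verified this against the Colab):

```python
# Sketch: resize a training image and scale its VOC boxes to match.
# Uses Pillow; the paths and 600x600 target are placeholders.
import xml.etree.ElementTree as ET
from PIL import Image

def resize_with_annotation(img_path, xml_path, target=(600, 600)):
    img = Image.open(img_path)
    sx, sy = target[0] / img.width, target[1] / img.height
    img.resize(target).save(img_path)          # overwrite with resized image

    tree = ET.parse(xml_path)
    size = tree.find("size")
    size.find("width").text = str(target[0])
    size.find("height").text = str(target[1])
    for box in tree.iter("bndbox"):            # scale every bounding box
        for tag, scale in (("xmin", sx), ("xmax", sx), ("ymin", sy), ("ymax", sy)):
            el = box.find(tag)
            el.text = str(round(float(el.text) * scale))
    tree.write(xml_path)
```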

QUESTION: I wonder about the pixel aspect ratio. Does that matter? Should I capture training/test images with square pixels and crop (and maybe scale) to get 800x600 or 600x600 with square pixels? Or is it OK to take a picture at any resolution and then scale it to 800x600 or 600x600? The same question applies in the DepthAI pipeline. In a DepthAI example I examined, the color camera preview resolution is set to 300x300 and that gets fed to the MobileNetDetectionNetwork, and presumably that means the pixels will not be square.

QUESTION: I am also curious about the following in the Colab code comments: "Images should contain the objects of interest at various scales". Is there a limit? I assume that the object in some training images will almost fill the entire image. Is there a minimum size that is useful? Should there be images where the object fills only 1/10 of the image?

Hey,
I’ll try to answer both your comments 🙂

Regarding good references: machine learning is a very broad field, so I’m not sure exactly what to recommend. I’ve heard there are good courses on Coursera and similar sites to get you started.
For learning PyTorch, I’d recommend https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html, and for TensorFlow this one from Google looks really nice at a quick glance: https://developers.google.com/machine-learning/crash-course. I think it also covers some necessary topics, like gradient descent, so it might be OK to just start with that. If it’s too complex, I believe there are many introductory tutorials online.

What is the form of annotation required to run the Colab successfully?

We are actually updating this Colab, but as you already figured out, it’s trained on a Pascal VOC-style dataset with annotations in XML format. The tool I’m linking below also has an export option for this format.

Are there annotation tools that produce this format? Are there free tools that do so?

I haven’t annotated any datasets myself, but this looks like a good annotation tool: https://imglab.in/. It has multiple export options, including the one required to successfully run the notebook on custom data. If you find another tool that’s easier to use, feel free to use it; just make sure the XMLs are in the same format.

What factors really matter?

Usually, the goal when training a machine learning model is to make it as general as possible. That’s why it’s good to have images with different lighting, scales, and rotations. So, the more varied images of each object you have, the better the model should perform on general images. This also partly answers your second question in the second comment: will your model in production be faced with such images? If yes, then it’s useful to train it on such images.

Also, how many images of a single object are sufficient? Generally there will be significant differences between most of the objects I want to detect; I'm not sure if that matters.

So, for the same reason as above, the more the better. There’s no “right” number of sufficient images, as this also depends on how the model is going to be used. But the general rule here is the same as above: the more you have, the bigger the chance that the model will learn the features of the objects. If you have a smaller number of images, transforms like flipping, rotations, resizing, and photometric variations (hue, saturation, …) are usually performed during the training process.
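Just for illustration, this is the kind of variation I mean, sketched with torchvision transforms (the Colab’s training script may handle augmentation differently or internally, and for detection the bounding boxes must be transformed together with the image, so detection frameworks usually provide their own pipeline):

```python
# Examples of typical augmentations, sketched with torchvision transforms.
import torchvision.transforms as T

train_transforms = T.Compose([
    T.RandomHorizontalFlip(p=0.5),               # mirror left/right
    T.RandomRotation(degrees=10),                # small rotations
    T.ColorJitter(brightness=0.3, contrast=0.3,
                  saturation=0.3, hue=0.05),     # photometric variation
    T.RandomResizedCrop(300, scale=(0.6, 1.0)),  # vary apparent object scale
    T.ToTensor(),
])
```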

I wonder about the pixel aspect ratio.

There is no need to have a 300x300 image size. You can define the size in the configuration file in config. Once you set it up, I’d say it’s best to train the model on images of that size and perform inference at that size as well. You can of course also try resizing them and training on such images, but I’m not sure what the exact effect would be; I’d assume the performance would be slightly impacted.
In the pipeline, by default, I think the preview keeps the original aspect ratio and just crops the image in the center. You could also use ImageManip to resize the frame if you’d want the preview to be bigger.
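Something along these lines, as a rough sketch (method names are from the depthai Python API; exact defaults may differ between releases):

```python
# Sketch: 300x300 preview for the detector, plus an ImageManip-resized
# copy for a bigger viewing window.
import depthai as dai

pipeline = dai.Pipeline()

cam = pipeline.create(dai.node.ColorCamera)
cam.setPreviewSize(300, 300)              # what the detection network consumes
cam.setInterleaved(False)

manip = pipeline.create(dai.node.ImageManip)
manip.initialConfig.setResize(600, 600)   # bigger frame just for display
manip.setMaxOutputFrameSize(600 * 600 * 3)
cam.preview.link(manip.inputImage)

xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName("view")
manip.out.link(xout.input)
```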

Thanks very much, Matija! I think you identified the root of the problem: the field is so broad, I'm not even sure of the right search terms to use to find helpful material. But you've given me some useful pointers.

Your response to the questions about training image sizes and the pipeline made me modify an example, rgb_preview.py, to display both the isp output and the preview output. You are correct about the behavior. The ColorCamera node, when given a preview size of 300x300, crops the width of the WxH image to a 1:1 aspect ratio and then scales both dimensions of that result by 300/H to produce the preview. This made me curious about detecting objects off-center, which I think will be critical for my navigation approach. I then looked at the example rgb_mobilenet_4k.py; it configures non-square pixels, so the original WxH scene is scaled with different scale factors in each dimension and the 300x300 preview contains the entire original scene. That suggests that object detection is somewhat tolerant of distortion of various sorts. Lovely!
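Here is roughly the modification I made, stripped down (the stream names are my own):

```python
# Stream both the isp output and the 300x300 preview to the host
# and display them side by side for comparison.
import cv2
import depthai as dai

pipeline = dai.Pipeline()
cam = pipeline.create(dai.node.ColorCamera)
cam.setPreviewSize(300, 300)
cam.setInterleaved(False)

xout_isp = pipeline.create(dai.node.XLinkOut)
xout_isp.setStreamName("isp")
cam.isp.link(xout_isp.input)

xout_prev = pipeline.create(dai.node.XLinkOut)
xout_prev.setStreamName("preview")
cam.preview.link(xout_prev.input)

with dai.Device(pipeline) as device:
    q_isp = device.getOutputQueue("isp", maxSize=4, blocking=False)
    q_prev = device.getOutputQueue("preview", maxSize=4, blocking=False)
    while True:
        cv2.imshow("isp", q_isp.get().getCvFrame())
        cv2.imshow("preview", q_prev.get().getCvFrame())
        if cv2.waitKey(1) == ord("q"):
            break
```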

Thanks again.

    Hello gregflurry ,
    on this topic, you should check out the How to maximize FOV tutorial. TL;DR: you can either thumbnail the image or change the aspect ratio (squeeze it) to get the full width at a 1:1 aspect ratio (as many NN models require).
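    A rough sketch of the two options as I would set them up (see the tutorial for the authoritative version; option 2 is commented out since you would pick one or the other):

    ```python
    import depthai as dai

    pipeline = dai.Pipeline()
    cam = pipeline.create(dai.node.ColorCamera)

    # Option 1: "squeeze" - scale the whole field of view into 300x300,
    # accepting non-uniform scaling (some distortion).
    cam.setPreviewSize(300, 300)
    cam.setPreviewKeepAspectRatio(False)

    # Option 2: "thumbnail"/letterbox - keep the aspect ratio and pad the
    # rest of the 300x300 frame (use instead of option 1: give the preview
    # the full 16:9 FOV and let ImageManip letterbox it).
    # cam.setPreviewSize(533, 300)
    # manip = pipeline.create(dai.node.ImageManip)
    # manip.initialConfig.setResizeThumbnail(300, 300)
    # cam.preview.link(manip.inputImage)
    ```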
    Thanks, Erik

      erik Thanks for the reference. I had not thought about letterboxing.

      Given my unfortunate curiosity, I have to ask two follow-up questions:

      • How does one determine if an NN model requires a 1:1 ratio? Is it about how it is trained, or something else?
      • I have found several places in the documentation that say the MobileNetDetectionNetwork uses 300x300 pixels. All the examples I've studied set the image going to that node to 300x300. I don't remember seeing anything that says why it uses 300x300; I assume performance, or maybe some physical limitation. Must the size be 300x300, or is that just for optimal performance? What is the reason?

      Thanks.


        Hello gregflurry ,
        the NN input size depends on the NN itself. The pretrained MobileNet model has an input size of 300x300 (documentation here), and the pretrained YOLO model has an input size of 416x416 (docs here). The "why" is a trade-off between accuracy and speed: bigger models are usually slower but more accurate.
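        Roughly, for example (the blob path below is just a placeholder):

        ```python
        # The preview size must match whatever input size the blob was
        # compiled for; 300x300 is MobileNet-SSD's input, not a DepthAI limit.
        import depthai as dai

        pipeline = dai.Pipeline()

        cam = pipeline.create(dai.node.ColorCamera)
        cam.setPreviewSize(300, 300)      # must equal the model's input size
        cam.setInterleaved(False)

        nn = pipeline.create(dai.node.MobileNetDetectionNetwork)
        nn.setBlobPath("mobilenet-ssd.blob")   # placeholder path
        nn.setConfidenceThreshold(0.5)
        cam.preview.link(nn.input)

        # A YOLO blob would use dai.node.YoloDetectionNetwork with a 416x416
        # preview plus its own anchor/class configuration.
        ```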
        Thanks, Erik