Hey,
I’ll try to answer both your comments 🙂
Regarding good references, machine learning is a very broad field, so it’s hard to recommend a single resource. I’ve heard there are good introductory courses on Coursera and similar websites to get you started.
For learning PyTorch, I’d recommend https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html, and for TensorFlow this course from Google looks really nice at a quick glance: https://developers.google.com/machine-learning/crash-course. I think it also covers some of the necessary background, like gradient descent, so it might be fine to just start with that. If it’s too complex, there are many introductory tutorials online.
What is the form of annotation required to run the Colab successfully?
We are actually updating this Colab, but as you already figured out, it’s trained on the PASCAL VOC dataset and expects annotations in the PASCAL VOC XML format. The tool I link below also has an export option for this format.
Are there annotation tools that produce this format? Are there free tools that do so?
I haven’t annotated any datasets myself, but this looks like a good annotation tool: https://imglab.in/. It also has multiple export options, including the one required to run the notebook on custom data. If you find another tool that’s easier to use, feel free to use it; just make sure the exported XMLs follow the same format, roughly like the sketch below.
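For reference, a minimal PASCAL VOC–style annotation file looks roughly like this (the file name, class name, and box coordinates are just placeholders):

```xml
<annotation>
  <folder>images</folder>
  <filename>img_001.jpg</filename>   <!-- placeholder image name -->
  <size>
    <width>640</width>               <!-- image width in pixels -->
    <height>480</height>             <!-- image height in pixels -->
    <depth>3</depth>                 <!-- number of channels -->
  </size>
  <object>                           <!-- one <object> block per labeled box -->
    <name>bottle</name>              <!-- placeholder class name -->
    <difficult>0</difficult>
    <bndbox>                         <!-- box corners in pixel coordinates -->
      <xmin>120</xmin>
      <ymin>80</ymin>
      <xmax>260</xmax>
      <ymax>300</ymax>
    </bndbox>
  </object>
</annotation>
```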
What factors really matter?
Usually, the goal when training a machine learning model is to make it as general as possible. That’s why it’s good to have images with different lighting, scales, and rotations. So, the more varied the images of each object, the better the model should perform on unseen images. This also partly answers the second question in your second comment: will your model face such images in production? If yes, then it’s useful to train it on such images.
Also, how many images of a single object are sufficient? Generally there will be significant differences between most of the objects I want to detect; I'm not sure if that matters?
So, for the same reason as above, the more the better. There’s no “right” number of images, as it also depends on how the model is going to be used. But the general rule is the same: the more you have, the better the chance that the model learns the features of the objects. If you only have a smaller number of images, augmentations like flipping, rotation, resizing, and photometric variations (hue, saturation, …) are usually applied during training; see the sketch after this paragraph.
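As a rough illustration, a minimal augmentation pipeline with torchvision could look like this (the transforms and parameters are just an assumption; for object detection you’d use box-aware transforms such as torchvision.transforms.v2 or albumentations so the boxes move with the image):

```python
from torchvision import transforms
from PIL import Image

# Minimal sketch of image-level augmentations; parameters are illustrative only.
# NOTE: for object detection the bounding boxes must be transformed together with
# the image, so in practice prefer box-aware transforms (torchvision.transforms.v2
# or albumentations) instead of these image-only ones.
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),    # geometric: flip
    transforms.RandomRotation(degrees=10),     # geometric: small rotation
    transforms.ColorJitter(brightness=0.2,     # photometric: lighting changes
                           contrast=0.2,
                           saturation=0.2,
                           hue=0.05),
    transforms.Resize((300, 300)),             # resize to the training resolution
    transforms.ToTensor(),
])

img = Image.open("example.jpg")                # placeholder path
augmented = train_transforms(img)              # tensor of shape [3, 300, 300]
```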
I wonder about the pixel aspect ratio.
There is no need to use a 300 x 300 image size. You can define the input size in the configuration file in config. Once you set it, I’d say it’s best to train and run inference on images of that same size. You can of course also resize your images and train on those, but I’m not sure what the exact effect would be; I’d assume the accuracy would be slightly impacted. A sketch of the relevant config section is below.
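If the Colab uses the TensorFlow Object Detection API pipeline config (an assumption on my part, based on the SSD 300 x 300 setup), the input size is set by the image resizer block, roughly like this:

```
model {
  ssd {
    image_resizer {
      fixed_shape_resizer {
        height: 300   # change these to your training resolution
        width: 300
      }
    }
    # ... rest of the model / training config unchanged
  }
}
```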
In the DepthAI pipeline, by default, I think the preview keeps the sensor aspect ratio and just center-crops the image. You could also use an ImageManip node to resize the frame if you want the preview to be bigger; a sketch follows below.
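A minimal sketch of what that could look like with the DepthAI Python API (stream name and sizes are just placeholders):

```python
import depthai as dai

pipeline = dai.Pipeline()

# Color camera with a small square preview (center-cropped by default);
# cam.setPreviewKeepAspectRatio(False) would stretch instead of crop.
cam = pipeline.create(dai.node.ColorCamera)
cam.setPreviewSize(300, 300)

# ImageManip node that rescales the preview to a larger frame
manip = pipeline.create(dai.node.ImageManip)
manip.initialConfig.setResize(640, 640)          # placeholder target size
manip.setMaxOutputFrameSize(640 * 640 * 3)       # allow frames larger than the default limit

# Output stream to the host
xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName("preview")                    # placeholder stream name

cam.preview.link(manip.inputImage)
manip.out.link(xout.input)

with dai.Device(pipeline) as device:
    queue = device.getOutputQueue("preview", maxSize=4, blocking=False)
    frame = queue.get().getCvFrame()             # BGR numpy frame from the device
```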