Now that we’ve covered the basic concepts of Spatial AI, let’s move on to Machine Learning (ML). If you missed Part 1 of our Beginner’s Guide, you can go back and read it here.
Throughout that first post we mentioned the concept of “training” a few times, like when we talked about a pre-trained object detector, or the training a system undergoes for semantic segmentation, but we didn’t define how that training actually happens.
While different kinds of training exist across the robotic vision landscape, what we’re usually referring to is the kind that takes place in ML. Just as a person can train by lifting weights to get stronger, or practice painting to become a better artist, so too can a robot learn to better identify and respond to objects. It just needs our help.
To help ground our discussion, let’s define an example project where ML would be useful in solving a problem. Congratulations, we’ve just founded an automated delivery company. Our vehicles are amazing. They already navigate using existing road infrastructure, weave in and out of traffic with ease, and stay powered all day using only solar energy. We’re going to be a smashing success. The one problem? After only a short time in service we’re noticing an excessive amount of damage to wheels and shock absorbers, and we soon discover why: potholes.
What’s the solution? Well, we could pull our entire fleet out of service and redesign it from the ground up, or we could try to lobby local governments to commit more resources to road maintenance. But neither plan is viable. Thankfully, we have a third option: we can teach our fleet to simply recognize and avoid the potholes.
To navigate successfully, let’s assume that our vehicles already employ depth estimation and semantic segmentation to identify where they’re permitted to drive, and a combination of instance segmentation and object detection to avoid colliding with other vehicles or hazards around them. What we need them to do now is layer in the nuance of potholes, in all their shapes and sizes, as another obstacle to avoid, while still maintaining all of this other context. The first step in accomplishing this is developing a pothole image database.
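To make that layered stack a bit more concrete, here’s a minimal sketch in Python using torchvision’s stock pre-trained models as stand-ins for whatever networks a real vehicle would actually run. The model choices, image path, and 0.5 score threshold are all illustrative assumptions, not a description of a production system:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Stock pre-trained models standing in for a real vehicle's networks.
seg_model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()
det_model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

# Hypothetical camera frame.
frame = to_tensor(Image.open("street.jpg").convert("RGB"))

with torch.no_grad():
    # Layer 1: semantic segmentation answers "where are we permitted to drive?"
    seg_logits = seg_model(frame.unsqueeze(0))["out"]   # (1, num_classes, H, W)
    drivable = seg_logits.argmax(dim=1).squeeze(0)      # (H, W) per-pixel class ids

    # Layer 2: object detection answers "what hazards should we avoid?"
    detections = det_model([frame])[0]                  # boxes, labels, scores
    hazards = detections["boxes"][detections["scores"] > 0.5]
```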
Developing databases for ML is no small task. In a nutshell, in order for our model to learn how to identify a pothole, we need to show it hundreds, if not thousands, of examples. It needs to see shallow holes and deep holes, wide holes and narrow holes, dry holes and holes filled with water, holes at night and holes during the day. And all of these images need to be labeled, a process also referred to as image annotation. In many cases this labeling happens manually (more on that in a moment), which is just another way of saying that a person needs to appropriately classify the object in question and place a bounding box around it, or draw a semantic segmentation mask classifying each pixel. Eventually, our model will be able to do this automatically, but first we need to teach it what’s significant and what isn’t. With these images we are developing a baseline of what a pothole “is,” and by feeding more and more images into the model we can gradually increase its confidence to the point where it can correctly identify novel examples of potholes.
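To give a feel for what a label actually looks like, here’s a hypothetical annotation record loosely in the spirit of the COCO format. The file name, coordinates, and field names are invented for illustration; real labeling tools export their own variations on this kind of structure:

```python
# A hypothetical annotation record, loosely in the spirit of the COCO format.
annotation = {
    "image": "frames/cam0_000123.jpg",   # invented file name
    "width": 1920,
    "height": 1080,
    "objects": [
        {
            "label": "pothole",
            # Bounding box as [x_min, y_min, width, height], in pixels.
            "bbox": [742, 603, 118, 64],
        }
    ],
}
```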
Collecting, labeling, and training a model from the ground up can be both time- and labor-intensive, but luckily there are a number of options to help keep things moving forward. As alluded to in Part 1, there are all kinds of pre-trained models out there, covering a huge range of purposes, where all of the heavy lifting has already been done. The robotics community is nothing if not collaborative and hardworking. Open-source work is hugely beneficial to everyone, and it’s why open source is such a big point of emphasis for us at Luxonis. It’s also possible to supplement image datasets with synthetic components, as discussed here. With synthetic supplementation, it’s less about collecting images and more about creating them. And finally, there are many services available to help companies with dataset development, including those offered by Luxonis, which you can read more about here.
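As a rough sketch of what “starting from a pre-trained model” means in practice, here’s one common pattern with torchvision: keep the pre-trained backbone and swap in a new prediction head for our single pothole class. The framework and class count are our assumptions for illustration, not the only way to do it:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a detector pre-trained on COCO, so the heavy lifting of
# learning general visual features is already done.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Swap the prediction head for our two classes: background and "pothole".
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

# From here, a standard torchvision detection training loop over our
# labeled pothole images fine-tunes the model for the new task.
```

Because the backbone already understands general visual features, this kind of fine-tuning typically needs far fewer labeled images than training from scratch would.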
There’s no other way to say it: Machine Learning is a complicated process. There is a lot of highly specialized programming, math, and training (of humans) required to teach an AI to think how we want it to. Conceptually, what’s happening makes perfect sense, since learning through experience is something people do naturally. But while AIs are powerful tools, they lack intuition, meaning significant effort is needed to teach them to differentiate between things that we as people would never confuse in the first place.
Is the Grand Canyon a pothole? What about an errant t-shirt tossed on the ground? What about a shadow? Countless permutations need to be fine-tuned before a model gets where it needs to be (which is why synthetic image supplementation is so helpful). Luckily, once the hard work of training is complete and a model is ready for real-world scenarios, the automation it offers is extremely useful.
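To close the loop, here’s a hedged sketch of running the fine-tuned detector on a novel frame. The checkpoint name, image path, and 0.8 confidence threshold are all hypothetical; in practice the threshold would be tuned against a validation set full of hard negatives like shadows and stray t-shirts:

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Rebuild the two-class detector from the earlier sketch and load a
# hypothetical fine-tuned checkpoint.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)
model.load_state_dict(torch.load("pothole_detector.pt"))  # hypothetical weights
model.eval()

frame = to_tensor(Image.open("novel_street.jpg").convert("RGB"))
with torch.no_grad():
    pred = model([frame])[0]

# Shadows and t-shirts tend to score lower than real potholes, so a
# confidence threshold is the first line of defense. The 0.8 cutoff is an
# assumption to be tuned against a validation set of hard negatives.
potholes = pred["boxes"][pred["scores"] > 0.8]
```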