What's going to win long-term in autonomous driving? Tesla? Waymo?
My view on this subject is largely not shared by the industry. The industry tends to fall into 2 camps:
1. LiDAR. "No one ever got fired for buying LiDAR." The Waymo camp.
2. Monocular depth. "LiDAR is a crutch - solving vision is what really matters." The Tesla camp.
And I actually subscribe to a secret option 3, which says that camp 2 is mostly right, but forgets how valuable more information is: monocular is itself a crutch, and it becomes insufficient once AI/CV matures.
Why camp 2 is mostly right:
Solving vision is where all the value lies. That's where the real context is, and that's where the additionally monetizable data lies (e.g. it's hard to spot child traffickers in LiDAR data).
With monocular, information is fundamentally missing. To make up for it, the time domain is used instead, which makes monocular slower, worse-performing, higher-latency, and prone to awful corner cases where it doesn't work at all. The idea is that, lacking alternate views of an object (which give a neural network the information necessary to know depth), a priori knowledge of similar scenes and/or the time domain is used to synthesize those alternate views. But in some cases the time domain simply won't contain the requisite information, or it will produce worse depth or bad latency.
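To make "alternate views give depth" concrete, here's a minimal sketch of stereo triangulation. The camera parameters are illustrative assumptions (not any specific product's specs); the point is that depth falls out of a single frame pair, with no time-domain inference needed:

```python
# Minimal sketch: with a calibrated stereo pair, depth comes from simple
# triangulation in one shot. Parameters below are assumed for illustration.

def stereo_depth_m(disparity_px: float,
                   focal_length_px: float = 800.0,    # assumed focal length, pixels
                   baseline_m: float = 0.075) -> float:  # assumed 7.5 cm baseline
    """Depth from disparity: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px

# A feature matched 10 px apart between the two views is ~6 m away.
# Monocular has no such per-frame measurement; it must infer depth from
# motion over time or from learned priors.
print(stereo_depth_m(10.0))  # -> 6.0
```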
Why camp 1 is a distraction when playing the long game:
Shortcuts are great for winning the short game. LiDAR is a shortcut. In college, back in 2006, there was a robotics competition where you had to navigate a maze, avoid obstacles along the way, pick up an object, and then repeat all of that in reverse to return it to where the robot started. I competed against people way smarter than me, with way more experience.
I knew I couldn't compete with those guys. I knew the proper navigation planning required was way more than I could accomplish. So I looked for a shortcut that let me skip it entirely. I experimented with how accurate the motor encoders could be, and since it was a controlled indoor environment, they turned out to be SUPER accurate. After some tweaking and trial and error to get an idea of when they had issues, and by how much, my team and I were able to literally solve the whole problem hard-coded. We hard-coded every step required, and the motor controllers/encoders/wheels/arm/etc. were good enough to do it.
So our robot looked FREAKING AWESOME and did the whole thing on the first try, completing every challenge (you got points along the way for each stage you passed, as a tie-break in case no one completed the whole thing), while the best competitor got at most 50% of the way through.
Now this was cool, and we won and got prizes and stuff - but it was freaking useless. It was a shortcut to make something impressive, fast. It was a crutch. And if you wanted to build on it, you couldn't. You'd have to start over.
I view LiDAR the same way. Since you get accurate sparse-point measurements at both short and long range, you can make something that drives well in many conditions, easily and quickly. It's like the hard-coding. The trouble is that LiDAR is sparse compared to CV. It gives enough information to "demo". But when you go from winning the short-term competition of who can look further along, faster (just like I did with hard-coding) to actually building a full production solution that matters to the world and scales, LiDAR doesn't have the requisite information. Vision does.
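A back-of-envelope sketch of that sparsity gap, using rough ballpark figures I'm assuming here (not any specific product's datasheet): a high-end spinning LiDAR is on the order of a million-ish points per second, versus one 12 MP camera at 30 fps.

```python
# Back-of-envelope sparsity comparison. Both figures are assumed,
# order-of-magnitude ballparks for illustration only.

lidar_points_per_sec = 1.3e6        # assumed: high-end 64-beam spinning LiDAR
camera_pixels_per_sec = 12e6 * 30   # assumed: one 12 MP camera @ 30 fps

print(f"LiDAR:  {lidar_points_per_sec:,.0f} measurements/sec")
print(f"Camera: {camera_pixels_per_sec:,.0f} pixel samples/sec")
print(f"Ratio:  ~{camera_pixels_per_sec / lidar_points_per_sec:,.0f}x denser")
# -> roughly two orders of magnitude more raw spatial samples from a
#    single camera, before even adding cameras for 360-degree coverage.
```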
And don't get me wrong, CV + LiDAR is great for super-safety-critical stuff. But CV is where the real value is; LiDAR is then an idiot-check, hard-stop backup system, just like most life-critical systems have.
But that LiDAR backup system is still missing a lot of information. So ultimately I think a redundant CV system will win, as then you have two systems each with sufficient information to "really understand".
And this brings up another point: any LiDAR-based solution that wants to "get serious" will also need CV, as LiDAR doesn't have enough information. So eventually, LiDAR-heavy teams end up having to solve CV anyway to win/scale.
And in the long run, ignoring investor optics, the pressure to show progress, etc., LiDAR is actually a distraction tech-stack-wise, as LiDAR solutions can't operate truly robustly without CV. So the more time put into LiDAR, the less time goes into solving CV.
That said, for an autonomous-driving startup trying to wow investors, LiDAR is absolutely the right choice. Just like in that robotics competition, using the shortcut produces a huge WOW effect, and that's super useful for closing funding rounds, etc. It's still a distraction to tech-stack development - but if it closes a $1 billion funding round, it's what enables building the right tech stack.
And this is why I actually think that right now, for any autonomous-driving company in startup war mode, LiDAR is the right choice. But they need to keep their eye on the long term: use LiDAR to catapult their finances, then pivot to CV.
Note that the above is purely an analysis WRT 75 mph+ autonomous driving for moving people (e.g. Tesla, Waymo, etc.). For autonomous mobile robots (AMRs; forklifts, food delivery, etc.) there are similar trades, but vision becomes even more of a "no-brainer". In people-moving, speeds of 75 mph+ require depth vision out to 350 meters and beyond, which prior to DepthAI and OAK was "hard". For AMRs (typically <<75 mph), the depth-sensing needs are "not hard".

So LiDAR is an even worse choice for AMRs: the transition to vision will happen sooner there, so the risk/failure probability of investing in LiDAR is significantly higher, and the "WOW factor" is largely non-existent. Conversely, there are insane "WOW factor" capabilities from DepthAI/OAK-based vision on such platforms that are nearly impossible to pull off with LiDAR - and flat-out impossible once you factor in that AMRs are much more cost-sensitive, so the LiDARs used have to be sparser, with even worse performance relative to vision.
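One rough way to sanity-check that 350 m figure (my arithmetic here, with an assumed worst case of head-on closing speed - not numbers from any spec):

```python
# Rough sanity check of why 75 mph people-moving pushes depth sensing
# out to ~350 m. The head-on closing-speed scenario is an assumption
# chosen as a worst case for illustration.

MPH_TO_MPS = 0.44704

speed_mps = 75 * MPH_TO_MPS     # ~33.5 m/s
closing_mps = 2 * speed_mps     # head-on closing speed, ~67 m/s

for range_m in (100, 200, 350):
    print(f"{range_m:>3} m of depth range -> "
          f"{range_m / closing_mps:4.1f} s to react at head-on closing speed")
# -> 350 m buys ~5 s at a 150 mph closing speed; 100 m buys ~1.5 s,
#    which is little more than human reaction time.
```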
And if this gets read, someone will likely make the point that "LiDAR isn't sparse", to which my response is that you can build a 360° stereo-depth CV solution with 36.8 million depth points and 300+ meter range for <$900. You simply can't build that in LiDAR. No company can. And anything coming close costs $100,000.
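Putting those figures side by side (using only the numbers above, nothing new):

```python
# Quick arithmetic on the figures stated in the text: a 36.8M-point,
# 300+ m stereo-depth CV rig for <$900, vs. a ~$100,000 LiDAR system
# that only comes close.

cv_points, cv_cost = 36.8e6, 900
lidar_cost = 100_000

print(f"CV:       {cv_points / cv_cost:,.0f} depth points per dollar")
print(f"Cost gap: ~{lidar_cost / cv_cost:.0f}x")  # -> ~111x on cost alone
```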
So not only does vision provide the long-term value, it's also orders of magnitude less expensive.