hdnzngr

  • Joined Jun 15, 2023
  • 0 best answers
  • I'm working on a mobile robotics project that's inching towards production. Some of our partners have expressed concerns about the long-term viability of the OAK platform: we know the RVC2 cameras will be available until 2030, but there are already some recent AI models we cannot use due to OpenVINO dropping Myriad X support and the lack of an alternative architecture. This will only get worse as the AI ecosystem moves on.

    I subscribed to the RVC4 newsletter last year, but have yet to receive any issues — if this was simply a delivery problem, it would be nice to have a page where we could read past issues. We would also like to know whether we'll still be using OpenVINO for the AI models (and if not, what will take its place), and whether / when the new RVC4 cameras will be available for pre-order.

  • erik

    Hi Erik, thank you for your reply. Is it true that the only connection I need is from the STROBE (or FSYNC, please correct me) channel of the master camera to the FSYNC channel of the slave cameras? I'm asking because then I may not need to build a full M8 connection box, but could just use some simple jumper wires on a breadboard for an early prototype.

    Also, could you please share the conventions of the signal that triggers FSYNC? What voltage, waveform, duration, etc.? Thanks!

  • What's going to win long-term in autonomous driving? Tesla? Waymo?

    My view on this subject is largely not shared by the industry, which tends to fall into two camps:

    1. LiDAR. "No one ever got fired for buying LiDAR." The Waymo camp.
    2. Monocular depth. "LiDAR is a crutch - solving vision is what really matters." The Tesla camp.

    And I actually subscribe to a secret option 3, which says that camp 2 is mostly right, but forgets that more information is enormously valuable - and that doing without it is a crutch that becomes insufficient once AI/CV has matured.

    Why 2 is mostly right:

    Solving vision is where all the value lies. That's where the real context is. That's where the supplementary, monetizable data lies (e.g. it's hard to pick out child traffickers from LiDAR data).

    With monocular, information is fundamentally missing. So to make up for the missing information, the time domain is used instead - which makes monocular slower, worse-performing, higher-latency, and prone to awful corner cases where it doesn't work at all. The idea is that, lacking the alternate views of an object (which provide the neural network with the information necessary to know depth), a-priori knowledge of similar scenes and/or the time domain is used to stand in for those alternate views... but in some cases the time domain simply won't have the requisite information, or it will just yield worse depth or bad latency.
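
    To make the "missing information" point concrete, here is a minimal sketch (my own addition, with purely illustrative focal-length and baseline numbers) of the standard rectified-stereo relation: a second view yields a disparity, and depth falls directly out of geometry rather than out of the time domain.

    ```python
    # Minimal sketch (illustrative numbers, not any specific camera):
    # for a rectified stereo pair, depth Z = f * B / d.
    def stereo_depth_m(focal_px: float, baseline_m: float, disparity_px: float) -> float:
        return focal_px * baseline_m / disparity_px

    # Hypothetical setup: 800 px focal length, 7.5 cm baseline.
    print(stereo_depth_m(800, 0.075, 10))  # 10 px disparity -> 6.0 m
    print(stereo_depth_m(800, 0.075, 1))   # 1 px disparity  -> 60.0 m
    ```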

    Why 1 is a distraction when playing the long-game:

    Shortcuts are great for winning the short game. LiDAR is a shortcut. In college back in 2006 there was a robotics competition where you had to navigate a maze, avoid obstacles while doing so, pick up an object, and then repeat all of that to return it back to where the robot started. I competed with people way smarter than me, and with way more experience.

    I knew I couldn't compete with those guys. I knew the proper navigation planning required was way more than I could accomplish. So I looked for a shortcut that would let me avoid needing it at all. I experimented with how accurate the motor encoders could be. And since it was a controlled/indoor environment, they turned out to be SUPER accurate. After some tweaking and trial/error to get an idea of when they had issues, and by how much, my team and I were able to literally solve the whole problem hard-coded. We literally hard-coded all the steps required, and the motor controllers/encoders/wheels/arm/etc. were good enough to do it.

    So our robot looked FREAKING AWESOME and did the whole thing first try, completing every challenge (you got points along the way for each thing you passed, as a tie-break in case no one completed the whole thing), compared to the best competitor, who got at most 50% of the way there.

    Now this was cool, and we won and had prizes and stuff - but it was freaking useless. It was a shortcut to make something impressive, fast. It was a crutch. And if you wanted to build off of this, you couldn't. You'd have to just start over.

    I view LiDAR the same way. Since you get accurate sparse point measurements, short-range and long, you can make something that drives well in many conditions, easily/quickly. It's like the hard-coding. The trouble is that LiDAR is sparse in comparison to CV. It gives enough information to "demo". But when you go from winning the short-term competition of who can look like they're furthest along, fastest (just like I did with hard-coding), to actually trying to make a full production solution that matters to the world, and is scalable - LiDAR doesn't have the requisite information. Vision does.

    And don't get me wrong, CV + LiDAR is great for super safety-critical stuff. But CV is where the real value is. LiDAR is then an idiot-check, hard-stop backup system - just like most life-critical systems have.

    But that LiDAR backup system is still missing a lot of information. So ultimately I think a redundant CV system will win - as then you have two systems with sufficient information to "really understand".

    And this brings up another point: any LiDAR-based solution that wants to "get serious" will also need CV, as LiDAR doesn't have enough information. So eventually, LiDAR-heavy teams end up having to solve CV to win/scale.

    And long-run, ignoring investor optics, the demands to show progress, etc. - tech-stack-wise, LiDAR is actually a distraction, as LiDAR solutions can't truly operate robustly without CV. So the more time put into LiDAR, the less time put into solving CV.

    That said, for a startup in autonomous driving trying to wow investors, LiDAR is absolutely the right choice. Just like in that robotics competition, using the shortcut produces a huge WOW effect. And that's super useful for closing funding rounds, etc. It's just a distraction to the tech-stack development. But if you close a $1 billion funding round because of it - it's what enabled building the right tech stack.

    And this is why I actually think that right now, for any autonomous driving company in startup war-mode, LiDAR is the right choice. But they need to keep their eye on the long term, using LiDAR to catapult their finances so they can pivot to CV.

    Note that the above is purely an analysis WRT 75 mph+ autonomous driving for moving people (e.g. Tesla, Waymo, etc.). For autonomous mobile robots (AMRs; forklifts, food delivery, etc.) there are similar trades, but vision becomes even more of a "no brainer". In people-moving, the speeds are 75 mph+, which requires depth vision to 350 meters and beyond - which, prior to DepthAI and OAK, was "hard". Whereas for autonomous mobile robots (typically <<75 mph), the depth-sensing needs are "not hard". So LiDAR is an even worse choice for such AMRs: the transition to vision will happen sooner there, so the risk/failure-probability of investing in LiDAR is significantly higher, and the "WOW factor" is largely non-existent - while conversely there are insane "WOW factor" capabilities from DepthAI/OAK-based vision on such platforms that are nearly impossible to pull off with LiDAR. And particularly impossible when factoring in that on an AMR, cost is a lot more sensitive, so the LiDARs used have to be sparser, with even worse performance relative to vision.
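
    As a rough illustration of why 350-meter depth vision was "hard" (my own sketch with assumed numbers, not the specs of any OAK product): stereo depth error grows with the square of range, so long-range depth demands a long baseline, a long focal length, and subpixel disparity accuracy.

    ```python
    # Rough sketch (assumed numbers): stereo depth error dZ ~= Z^2 * dd / (f * B),
    # so error grows quadratically with range.
    def depth_error_m(z_m: float, focal_px: float, baseline_m: float, disp_err_px: float) -> float:
        return (z_m ** 2) * disp_err_px / (focal_px * baseline_m)

    # Hypothetical long-range setup: 2000 px focal length, 15 cm baseline, 0.1 px accuracy.
    for z in (50, 150, 350):
        print(f"{z} m -> ~{depth_error_m(z, 2000, 0.15, 0.1):.1f} m error")
    ```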

    And if this is read, the point will likely be made that "LiDAR isn't sparse" - followed by a response (by me) showing that you can build a 360° stereo-depth CV solution with 36.8 million depth points and 300+ meter range for <$900. You actually just can't build that with LiDAR. No company can. And anything coming close is $100,000.
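
    For a sense of scale on that depth-point figure, here is a back-of-the-envelope sketch (the sensor resolution and number of stereo pairs are my own illustrative assumptions, not a statement of the actual configuration): dense stereo produces a depth value per pixel, so a handful of high-resolution pairs quickly reaches tens of millions of points.

    ```python
    # Back-of-the-envelope only (assumed configuration, not a product spec).
    width_px, height_px = 4032, 3040   # assumed ~12 MP sensors
    stereo_pairs = 3                   # assumed, to cover 360 degrees
    points = width_px * height_px * stereo_pairs
    print(f"{points / 1e6:.1f} million depth points")  # ~36.8 million
    ```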

    So not only does vision provide the long-term value, it's also orders of magnitude less expensive.

  • Object Avoidance

    This problem involves avoiding both objects seen before and objects never before seen. The approach Luxonis likes to take for such tasks is to use at least semantic depth, usually in addition to known-object detection, depending on the needs of a given application.

    Semantic Depth for Unknown Unknown Object Detection and Avoidance

    One of the classic problems in autonomous robotic navigation or actuation is avoiding impact with both known and unknown objects. Known objects are things that are known a-priori to the installation to be encountered - such as tools, other machines, workers, equipment, and facilities. Unknown objects are things that may not be anticipated - or even things that are completely unknowable or never before seen.

    For known objects, training an object detector is sufficient as this is a “positive” form of object detection: “Cat in the path, stop.” “Soccer ball in the path, stop.” etc.

    But the most important thing in object avoidance is actually unknown unknown items.

    To make up an example, imagine a person occluded in some unknown way such that only part of a limb is visible, and they are wearing clothing where a "flying taco squirrel" print is the only portion visible to the perception system. Given that a "flying taco squirrel" is unknown (as of this writing no such thing exists - but it could in the future), and the only visible portion of the human is this "flying taco squirrel", there is no possible way that a "positive" form of object detection will be able to detect such an object. A "positive" system requires being trained on the class of object - or at least on a set of things similar enough that a class-agnostic object detector can be used - neither of which is possible in this case. (And since we have no idea in the slightest what a "flying taco squirrel" would look like, we cannot guarantee any semblance of similarity. Worse, this example is a "known unknown"; the problem we actually want to solve is the "unknown unknown".)

    And this is where a "negative" object detection system is required in such generic obstacle avoidance scenarios. A very effective technique is to use semantic segmentation of RGB, Depth, or RGB+Depth.
    In such a "negative" system, the semantic segmentation network is trained on all the surfaces that are not objects. Anything that is not one of those surfaces is considered an object - allowing the navigation system to know its location and take commensurate action (stop, go around, turn around, etc.).
    Luxonis uses simulation here as well to train this semantic-depth-based "negative" object detection system. Luxonis has used this technique with success in many object avoidance applications, including significantly unstructured environments such as public parks, in the presence of the public.
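
    A minimal sketch of that "negative" logic (my own illustration - the class IDs, shapes, and stop distance are assumptions, and this is not the actual Luxonis architecture): everything the segmentation network does not label as a safe surface is treated as an obstacle, and its depth decides whether to act.

    ```python
    import numpy as np

    # "Negative" obstacle detection sketch (illustrative class IDs / shapes only).
    TRAVERSABLE = 1   # e.g. ground the robot may drive over
    SKY = 2           # ignorable for a ground robot

    def obstacle_mask(seg: np.ndarray) -> np.ndarray:
        """seg: HxW array of per-pixel class IDs; True where pixel is NOT a safe surface."""
        return ~np.isin(seg, (TRAVERSABLE, SKY))

    def should_stop(seg: np.ndarray, depth_m: np.ndarray, stop_dist_m: float = 1.5) -> bool:
        """Stop if any never-before-seen obstacle pixel is closer than the stop distance."""
        mask = obstacle_mask(seg) & (depth_m > 0)   # ignore invalid (zero) depth
        return bool(mask.any()) and float(depth_m[mask].min()) < stop_dist_m

    # Tiny fabricated example: one unknown-object pixel at 1.2 m.
    seg = np.array([[2, 2, 2],
                    [1, 0, 1]])      # 0 = something never seen before
    depth = np.array([[0.0, 0.0, 0.0],
                      [3.0, 1.2, 3.0]])
    print(should_stop(seg, depth))   # True
    ```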

    Some public portions of that work are shared here, along with examples of the simulation environment; an example from that public talk is reproduced below:

    It is worth noting that this is real-world testing of a semantic depth system which was:

    1. Trained only in simulation and tested on a real-world autonomous vehicle using OAK-D.
    2. Trained on only 80 images (intentionally, to see how quickly the network converged).
    3. Based on an internal semantic architecture which we developed for this purpose.

    As one can see, several objects that are VERY hard for traditional depth systems to pick up properly are picked up here, and properly labeled at 10+ FPS, including (red = object, green = traversable, blue = sky):

    1. The chainlink fence.
      a. The entire fence is properly segmented as an object that is not traversable. Chainlink fences are a canonical problem for every mechanism of depth sensing (stereo, ToF, LiDAR, structured light, etc.), but are easily perceived by this semantic depth system.
    2. The repeating pattern of the warehouse.
      a. This is a canonical problem for stereo systems.
      b. Much work has gone into trying to solve it (e.g. here).
      c. Despite this, with only 80 synthetic images, this semantic depth system is already identifying a large portion of the warehouse correctly.
    3. The root beds around the trees.
      a. Running over roots is one of the pernicious problems in this industry.
      b. Semantic depth quickly converged to properly labeling them as objects, despite only 80 training images from simulation.

    So for the unknown-unknown, this sort of "negative" object detection is extremely valuable: you don't need to have ever seen an object before - you just know it's not one of the safe things to drive over (or fly through, or swim through, etc.) and can thereby avoid it or stop.

    [Known] Object Detection

    And best of all, semantic depth for unknown-unknown object detection can be combined with standard object detection of known objects, so that known objects can have pre-programmed behavior - e.g. like below, detecting a person and then following commands from that person:

    Person detection
    Source: https://github.com/geaxgx/depthai_hand_tracker

    And in parallel, the robotic system can then not run into things that it doesn't understand or has never seen before.
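
    For the known-object half, below is a condensed sketch in the spirit of the public DepthAI spatial-detection examples (the model blob path is a placeholder and exact API details may differ between DepthAI versions): a detection network is fused with stereo depth on-device, so each known object comes back with 3D coordinates that pre-programmed behavior can act on.

    ```python
    import depthai as dai

    # Condensed sketch (placeholder blob path; API details vary by DepthAI version).
    pipeline = dai.Pipeline()

    camRgb = pipeline.create(dai.node.ColorCamera)
    camRgb.setPreviewSize(300, 300)
    camRgb.setInterleaved(False)

    monoLeft = pipeline.create(dai.node.MonoCamera)
    monoRight = pipeline.create(dai.node.MonoCamera)
    monoLeft.setBoardSocket(dai.CameraBoardSocket.LEFT)
    monoRight.setBoardSocket(dai.CameraBoardSocket.RIGHT)

    stereo = pipeline.create(dai.node.StereoDepth)
    stereo.setDepthAlign(dai.CameraBoardSocket.RGB)   # align depth to the RGB frame
    monoLeft.out.link(stereo.left)
    monoRight.out.link(stereo.right)

    nn = pipeline.create(dai.node.MobileNetSpatialDetectionNetwork)
    nn.setBlobPath("person-detection.blob")           # placeholder model path
    nn.setConfidenceThreshold(0.5)
    camRgb.preview.link(nn.input)
    stereo.depth.link(nn.inputDepth)

    xout = pipeline.create(dai.node.XLinkOut)
    xout.setStreamName("detections")
    nn.out.link(xout.input)

    with dai.Device(pipeline) as device:
        q = device.getOutputQueue("detections", maxSize=4, blocking=False)
        while True:
            for det in q.get().detections:
                # spatialCoordinates are millimeters relative to the camera
                print(det.label, det.spatialCoordinates.x,
                      det.spatialCoordinates.y, det.spatialCoordinates.z)
    ```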

    Summary

    Together, semantic depth + object detection, when run with DepthAI, can give unknown-unknown object detection/avoidance and known-object detection (and control) - with both giving 3D results - so that unknown-unknown objects and known objects alike have locations in physical space, which is incredibly important/necessary for safe robotic operation.

    3D hand perception is shown below as another example of known-object detection in 3D space:
    3D Hand detection
    Source: https://github.com/geaxgx/depthai_hand_tracker/tree/main/examples/3d_visualization#3d-visualization-and-smoothing-filter