• DepthAI
  • Naive question regarding StereoDepth disparity and depth outputs

I am sure this question is addressed somewhere in the documentation or a discussion post, but I could not find it. I apologize up front.

Based on the discussion of disparity (https://docs.luxonis.com/projects/api/en/latest/components/nodes/stereo_depth/?highlight=disparity#disparity), it seems that one can calculate a disparity value only for points/features that appear in both the left and right images. Said differently, one cannot calculate a disparity value for points/features where the left and right images don't overlap. If one cannot calculate disparity, one cannot calculate depth.
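
For reference, my understanding of the disparity-to-depth relation, roughly as the docs describe it, is that depth is inversely proportional to disparity, something like:

# Rough sketch: depth is inversely proportional to disparity, so wherever
# disparity cannot be computed, neither can depth.
def depth_cm(disparity_px, focal_length_px, baseline_cm=7.5):
    # 7.5 cm is the OAK-D stereo baseline
    if disparity_px <= 0:
        return None  # feature not matched in both images -> no depth
    return focal_length_px * baseline_cm / disparity_px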

Let's say the left and right images are W pixels wide. It seems there would be a vertical 'band' B pixels wide on the left side of the left image where it does not overlap the right image, and a similar band B pixels wide on the right side of the right image where it does not overlap the left image. Thus, it seems that the disparity (and depth) calculations could only be valid where the left and right images overlap, in a region W - (2 * B) wide, with a band, B pixels wide, of at best estimated values on each side. Sort of like this: |<-B->|<-- W - (2 * B) -->|<-B->|.

I must be missing something. Any help appreciated.

  • erik replied to this.

    Correct on the first assumption gregflurry (can't get disparity/depth where the images don't overlap). And correct on the second one as well. That's why you can see that values aren't perfect/accurate at the edges of the depth/disparity map. But for scenes far away this is almost unnoticeable - just a small, few-pixel-wide band at the left & right edges of the disparity/depth map.

      erik Thanks! I'd begun to doubt my ability to interpret documentation, since I could find no mention of the "restriction".

      But your answer leads me to ask a follow-up question. I used a perhaps inappropriate method to make a very likely inaccurate measurement of the width B, by simply showing both the left and right mono camera frames and then estimating the width of the left portion of the left frame not visible in the right frame. For a 400_P resolution (640 wide) frame, I got B=30 pixels. That seems a bit more than a "small few-pixel-wide" band. How should one really measure B?

      • erik replied to this.

        Hello gregflurry. I haven't done the calculations myself yet, but could you share your formula/calculations? For very short distances (20 cm) this number can be quite high.

          Thanks erik. I don't have equations; what I do have is heuristics that apparently validate your statements. I will explain what I did and offer the results. Prior to reading, you have to promise not to laugh at my radically unscientific approach.

          As I said in an earlier post, I'm running the mono cameras at 400_P (640x400). I get the frames from the OAK-D and simply do a cv2.imshow() of both frames. I'm running on a MacBook and for reasons I don't really understand, the images that get shown actually are 1280x400 on my screen.
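
          The display side is essentially just the following minimal DepthAI script (a sketch, not my exact code; the socket names may differ slightly between DepthAI versions):

          import cv2
          import depthai as dai

          # Minimal sketch: stream both 400_P mono cameras and show each frame with cv2.imshow().
          pipeline = dai.Pipeline()
          for name, socket in (("left", dai.CameraBoardSocket.LEFT),
                               ("right", dai.CameraBoardSocket.RIGHT)):
              mono = pipeline.create(dai.node.MonoCamera)
              mono.setBoardSocket(socket)
              mono.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)
              xout = pipeline.create(dai.node.XLinkOut)
              xout.setStreamName(name)
              mono.out.link(xout.input)

          with dai.Device(pipeline) as device:
              qLeft = device.getOutputQueue("left", maxSize=4, blocking=False)
              qRight = device.getOutputQueue("right", maxSize=4, blocking=False)
              while True:
                  cv2.imshow("left", qLeft.get().getCvFrame())
                  cv2.imshow("right", qRight.get().getCvFrame())
                  if cv2.waitKey(1) == ord("q"):
                      break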

          I pointed my OAK-D at something that had a somewhat distinct edge. I rotated the device so that the left edge of the image from the right camera aligned with the edge. I then superimposed (manually moving the window) the right image over the left image and used Screenshot to capture the area around the two views of the edge. I loaded the captured screenshot into Preview, used the crop tool to create a box to estimate the width, B, of the potentially inaccurate range calculations, and did a screen capture of that to show you the results. You can see that the edge in this case is a couch leg that was approximately 2.1 meters from the device. The crop tool indicated roughly B=32 pixels. That is similar to what I reported in my earlier post. I used the exact same technique earlier, but I am certain the device was a bit farther away at the time.

          Based on your second response, I thought it prudent to measure B for something closer. So I used the same technique on a box that I placed about 0.9 meters from the device. In this case, the width of the crop box is roughly 80 pixels, i.e., B=80.

          Conceptually these findings make sense, in the same way that an object closer to a camera fills more horizontal pixels than the same object does farther from the camera. That suggests there is a formula for B related to the baseline between the stereo cameras and the distance of the object from the cameras, and probably other things like HFOV. I don't think I have enough knowledge to derive it, however.

          In any case, this was an interesting experience and again, I appreciate your participation.

          • erik replied to this.

            gregflurry interesting approach that validates your thinking. And as you have already found out, when the scene is further away (e.g. your couch), this band gets narrower.
            You can also see this band if you just display any depth map; it's on the right side of the frame. Why the right? Because depth is calculated from the right mono camera's perspective. The left camera can see the left band (and some additional area to its left), but can't see the right band, which is why there is depth info missing only on the right side of the depth map.

            Now that is interesting! I am going to have to ponder that for a while. I've assumed that there would be a band on the depth map on both sides.

            However, I've been pondering the question you asked about "an equation". I am usually happy with empirical results, but figured that since all of this stuff is basically geometry, there should be an equation. I think I derived it, and it is so ridiculously simple that I'll show you the work, to the extent the somewhat limited forum tools allow.

            Consider the figure below, which shows the HFOV of the stereo cameras. The yellow cone represents the left camera view and the blue cone represents the right camera view. The greenish cone represents where the views overlap and thus accurate disparity/depth calculation is possible.

            The cameras are separated by a distance (in cm) ‘BL’; for the OAK-D, BL=7.5 cm. ‘B’ is the width of the area where the camera views do not overlap and thus accurate disparity/depth calculation is not possible. B units can be in cm or pixels. It seems intuitive from the picture that in cm, B=BL, always.

            ‘W’ is the width of an image in pixels, which is constant for a given resolution. ‘F’ is the width of an image in cm at a distance ‘D’ cm from the cameras; F varies with D.

            ‘DV’ represents the minimum distance in cm at which both cameras can view an object and thus accurate disparity/range calculation is possible. DV depends on BL and HFOV. Further, one can intuit that at DV, B in pixels equals W. That means, in effect, no depth is accurate!

            The next figure aids in deriving the needed equations.

            Simple trigonometry shows that

            F = 2 * D * tan(HFOV/2) [cm]

            The figure also shows that one can calculate DV using

            DV = (BL/2) * tan(90° - HFOV/2) [cm]

            HFOV for the OAK-D mono cameras is roughly 72°, so, with BL=7.5 cm,

            DV = 5.16 cm

            As a check, one can calculate B based on DV.

            B = 2 * DV * tan(HFOV/2) = 7.44 [cm]

            Not a bad agreement.

            W/F describes the pixels/cm at any given F. So the width of B in pixels is
            B = 2 * DV * tan(HFOV/2) * (W/F) [pixel]
            B = 2 * DV * tan(HFOV/2) * (W/(2 * D * tan(HFOV/2))) [pixel]
            B = W * (DV / D) [pixel]

            That is the "magic" equation! Consider my experimental situation, where W=640. I measured B for an object about 210 cm from the cameras and one about 90 cm from the cameras. Since the displayed frames were 1280 wide (twice the actual 640-pixel frame width), my on-screen measurements of 32 and 80 correspond to B=16 at 210 cm and B=40 at 90 cm. Using the equation, at 210, B=16; at 90, B=37. Not too bad!

            This gives me some confidence that the equation is correct.

            I also set up a bit more controlled experiment with an object at 35 cm from the OAK-D, and a tape measure across the entire area. B measured 7.5 cm as expected. Using the equation for pixels, B=94. I measured B=93. Satisfying!
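
            The whole derivation fits in a few lines of Python, in case anyone wants to plug in their own numbers (a sketch using the values above; HFOV = 72° and BL = 7.5 cm are the approximate OAK-D figures I used):

            import math

            def band_width(W_px, D_cm, baseline_cm=7.5, hfov_deg=72.0):
                # Width of the non-overlapping band B, in cm and in pixels, at distance D.
                half = math.radians(hfov_deg) / 2
                DV = (baseline_cm / 2) / math.tan(half)  # same as (BL/2) * tan(90° - HFOV/2), about 5.16 cm
                B_cm = 2 * DV * math.tan(half)           # always works out to the baseline BL
                B_px = W_px * DV / D_cm                  # the "magic" equation
                return B_cm, B_px

            for d_cm in (210, 90, 35):
                b_cm, b_px = band_width(640, d_cm)
                print(f"D = {d_cm} cm: B = {b_cm:.1f} cm = {b_px:.0f} px")
            # Prints roughly 16 px at 210 cm, 37 px at 90 cm, and 94 px at 35 cm.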

            Hope you enjoyed the diatribe.


              gregflurry awesome, love it!! Great problem solving on your part, and the images help a lot with understanding the concept and how you got to your end formula🙂 From my quick glance over the equation, it checks out, and it's awesome that you measured it as well! I will definitely go through this post again when I'm less tired🙂
              Thanks again, Erik

                erik It took a while, but I now understand your statement "depth is calculated from the right mono camera's perspective". And I think that means the depth values (x,y,z) are from the perspective of the right camera. Assuming the OAK-D is mounted horizontally, I believe that y and z would be the same value whether measured from the perspective of the right mono, the left mono, or the color camera. But x would be different!

                And that probably accounts for the StereoDepth.setDepthAlign() method/function. It is overloaded and one can provide either a camera (RGB, mono right, mono left), or a property (center, right, left). I want the depths aligned with the RGB camera, but in my current investigation, I'm not using the RGB camera, so I will have to use "center".
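
                In code, the two forms I'm comparing look something like this (a sketch; the enum names are as I understand them from the API reference and may differ between DepthAI versions; in practice one would call only one of the two):

                import depthai as dai

                pipeline = dai.Pipeline()
                stereo = pipeline.create(dai.node.StereoDepth)

                # Overload 1: align the depth output to a specific camera, e.g. the RGB/color camera.
                stereo.setDepthAlign(dai.CameraBoardSocket.RGB)

                # Overload 2: align using the DepthAlign property; CENTER needs no camera at all.
                stereo.setDepthAlign(dai.StereoDepthProperties.DepthAlign.CENTER)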

                A couple of questions:

                • does that sound right to you?
                • is there any real difference between using the camera or the property?

                Thanks.

                  gregflurry, all of the above seems correct to me. And I don't think there's any real difference between them; they should both achieve the same thing.
                  Thanks, Erik

                    erik Sorry for dropping the conversation. I had to tackle another project for a bit and then do some more work to verify a suspicion.

                    You said in a post a few days ago that "there is depth info missing only on the right side of the depth map". I think that needs some qualification. It is certainly true for the "raw" depth information, since there is a band on the right of the right mono image where there is no overlap with the left mono image, so disparity/depth cannot be calculated there, while the rest of the right image does overlap the left image.

                    On the other hand, if one chooses to align the depth map with the RGB camera, the depth map must be shifted right. I've done a bit of work and convinced myself that the amount of the shift always equals BL/2, i.e., half the baseline in cm. Further, since the depth image shifts right, there will be an invalid band on the left side of the shifted image.

                    An earlier discussion showed that the size of the initial invalid band on the right of the right image = BL cm. Since the shift (if I've calculated correctly) is BL/2, the size of the left band is BL/2, and the size of the right band is reduced to BL/2. So an aligned depth map has an invalid band on both sides of the map, and the sizes of the two bands are equal. And, the size of the band in pixels varies with distance.
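
                    If my reasoning holds, the per-side band widths in pixels at a given distance follow directly from the earlier equation (a sketch using my numbers; the BL/2 shift is my own conclusion, not something I've verified against the documentation):

                    import math

                    def aligned_band_widths_px(W_px, D_cm, baseline_cm=7.5, hfov_deg=72.0):
                        # Approximate invalid-band widths (left, right) in pixels after aligning
                        # the right-camera depth map to the centered RGB camera, assuming the
                        # map shifts right by half the baseline (BL/2).
                        half = math.radians(hfov_deg) / 2
                        px_per_cm = W_px / (2 * D_cm * math.tan(half))  # W / F from the derivation above
                        half_bl_px = (baseline_cm / 2) * px_per_cm      # BL/2 in pixels at distance D
                        return half_bl_px, half_bl_px                   # equal bands on both sides

                    print(aligned_band_widths_px(640, 90))  # roughly (18, 18) pixels at 0.9 m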