• DepthAI-v2
  • Wrong depth estimation caused by repetitive patterns

We have been using OAK-D Gen 1 cameras in our robots, mainly for obstacle avoidance. In most scenarios we get good depth from the cameras; however, there are some hard cases where wrong depth estimates are produced due to repetitive patterns in the working environment, as shown in the images below:

Case 1: wrong depth estimation of a fridge grill.

Case 2: wrong depth estimation of the grill of a door.

In these two images, the three sub-images from left to right are: the left rectified image, the right rectified image, and the depth image colorized by depth value, where red -> purple means close -> far. In both cases, the estimated depth of the grill was much closer than it actually is, which causes ghost obstacles and can potentially stop the robots' movement.

We know that getting a good disparity calculation in these cases is a hard problem, especially without knowing the implementation details of the stereo matching algorithm. That is why we would like to get help from Luxonis.

The engineers who designed the Intel RealSense cameras have written a really nice article (https://dev.intelrealsense.com/docs/mitigate-repetitive-pattern-effect-stereo-depth-cameras) listing solutions that can mitigate this issue. Among these, we are especially interested in the method "Second Peak Threshold and Census Enable Registers", since it requires no additional hardware and can be done by changing the stereo matching algorithm alone on the firmware side. We are wondering whether Luxonis has spent any effort in this direction?
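For illustration, the second-peak idea can be sketched in a few lines of NumPy, independent of any camera firmware (the function below is our own toy example, not the DepthAI implementation): for each pixel, look at the matching-cost curve over the candidate disparities; on a repetitive pattern the curve has several near-equal minima, and when the runner-up cost is too close to the best one, the match is ambiguous and should be invalidated rather than reported as depth.

```python
import numpy as np

def second_peak_confidence(costs, ratio=0.9):
    """Given a 1-D array of matching costs over candidate disparities,
    return (best_disparity, is_confident). The match is rejected when
    the second-best cost is within `ratio` of the best cost, i.e. the
    cost curve has two nearly equal minima, which is typical of
    repetitive patterns such as grills."""
    order = np.argsort(costs)
    best, second = costs[order[0]], costs[order[1]]
    # Accept only if the best cost is clearly lower than the runner-up.
    confident = best < ratio * second
    return int(order[0]), bool(confident)

# Unambiguous cost curve: one clear minimum at disparity 3.
unambiguous = np.array([9.0, 7.0, 5.0, 1.0, 6.0, 8.0, 9.0])
# Repetitive pattern: two near-equal minima at disparities 1 and 5.
repetitive = np.array([9.0, 1.0, 8.0, 9.0, 8.0, 1.1, 9.0])

print(second_peak_confidence(unambiguous))  # (3, True)
print(second_peak_confidence(repetitive))   # (1, False)
```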

EDIT 1: The rectified images we feed into the StereoDepth node are 400p. Here is the StereoDepth node config we are using, taken from the printed pipeline schema:

"name": "StereoDepth",
"properties": {
    "alphaScaling": null,
    "baseline": null,
    "depthAlignCamera": -1,
    "depthAlignmentUseSpecTranslation": true,
    "disparityToDepthUseSpecTranslation": true,
    "enableRectification": true,
    "enableRuntimeStereoModeSwitch": false,
    "focalLength": null,
    "focalLengthFromCalibration": true,
    "height": null,
    "initialConfig": {
        "algorithmControl": {
            "centerAlignmentShiftFactor": null,
            "customDepthUnitMultiplier": 1000.0,
            "depthAlign": 0,
            "depthUnit": 2,
            "disparityShift": 0,
            "enableExtended": true,
            "enableLeftRightCheck": true,
            "enableSubpixel": true,
            "leftRightCheckThreshold": 5,
            "numInvalidateEdgePixels": 0,
            "subpixelFractionalBits": 3
        },
        "censusTransform": {
            "enableMeanMode": true,
            "kernelMask": 0,
            "kernelSize": 0,
            "threshold": 0
        },
        "costAggregation": {
            "divisionFactor": 1,
            "horizontalPenaltyCostP1": 250,
            "horizontalPenaltyCostP2": 500,
            "verticalPenaltyCostP1": 250,
            "verticalPenaltyCostP2": 500
        },
        "costMatching": {
            "confidenceThreshold": 255,
            "disparityWidth": 1,
            "enableCompanding": false,
            "invalidDisparityValue": 0,
            "linearEquationParameters": {
                "alpha": 0,
                "beta": 1,
                "threshold": 127
            }
        },
        "postProcessing": {
            "bilateralSigmaValue": 0,
            "brightnessFilter": {
                "maxBrightness": 256,
                "minBrightness": 0
            },
            "decimationFilter": {
                "decimationFactor": 1,
                "decimationMode": 0
            },
            "median": 0,
            "spatialFilter": {
                "alpha": 0.5,
                "delta": 0,
                "enable": false,
                "holeFillingRadius": 2,
                "numIterations": 1
            },
            "speckleFilter": {
                "enable": true,
                "speckleRange": 50
            },
            "temporalFilter": {
                "alpha": 0.4000000059604645,
                "delta": 0,
                "enable": true,
                "persistencyMode": 2
            },
            "thresholdFilter": {
                "maxRange": 65535,
                "minRange": 0
            }
        }
    },
    "mesh": {
        "meshLeftUri": "",
        "meshRightUri": "",
        "meshSize": null,
        "stepHeight": 16,
        "stepWidth": 16
    },
    "numFramesPool": 3,
    "numPostProcessingMemorySlices": 4,
    "numPostProcessingShaves": 8,
    "outHeight": null,
    "outKeepAspectRatio": true,
    "outWidth": null,
    "rectificationUseSpecTranslation": false,
    "rectifyEdgeFillColor": 0,
    "useHomographyRectification": null,
    "width": null
}
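As background on the censusTransform section of the config above: the classic census transform replaces each pixel with a bit string of brightness comparisons against its neighbours, and matching costs are then Hamming distances between bit strings, which makes matching robust to brightness differences between the two cameras. A minimal NumPy sketch of the textbook version (with wrap-around borders for brevity; this is illustrative, not Luxonis's firmware implementation):

```python
import numpy as np

def census_transform(img, window=5):
    """Classic census transform: for each pixel, build a bit string by
    comparing every neighbour in a window x window patch against the
    centre pixel. A 5x5 window yields 24 comparison bits, which fit in
    a uint32. Borders wrap around (np.roll) as a simplification."""
    r = window // 2
    out = np.zeros(img.shape, dtype=np.uint32)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue  # skip the centre pixel itself
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            # Append one bit: is the neighbour darker than the centre?
            out = (out << 1) | (shifted < img).astype(np.uint32)
    return out
```

On a perfectly flat image every comparison bit is zero, so the descriptor is 0 everywhere; any texture produces nonzero bit strings.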

And here is a sample pair of rectified images which Luxonis can use as test input:

left rect:

right rect:

These images are already rectified, so you should be able to reproduce the issue without our calibration data. We follow the implementation of https://github.com/luxonis/depthai-python/blob/main/examples/StereoDepth/stereo_depth_from_host.py to generate depth on the host by feeding the test image pair to the StereoDepth node directly.

3 months later

Hi, were you ever able to resolve this? We have the same problem, but with a different scene:

Seems like a fundamental problem with stereo vision, but wondered if you've found a technique or OAK-D-specific settings for mitigating the issue.

Rick

Hi Rick, yuanma,
The "Second Peak Threshold and Census Enable Registers" is basically just a stereo confidence threshold, which you can set with costMatching.confidenceThreshold. You have it set to 255 (the maximum value, so least confident). I'd suggest trying values of 100/150/200 - the fill rate will be reduced, but confidence will increase.
Thanks, Erik

    Another alternative approach would be triple stereo - horizontal + diagonal - which we also implemented in our Percept devices (which haven't been released). IIRC there are 2 papers on this topic (Triple-SGM and another one), and it should reduce issues with repetitive textures. Thoughts?

    erik We tried this yesterday, and it helped immensely. We already had the threshold set down to 100; we now have it at 10, and it appears to work well enough for our purposes. Further testing will tell. We're using the OAK-D as a low-budget obstacle sensor and we cook the 3D data down to a 2D scan, so we don't need a super-great depth map.
    Thanks for the advice,

    Rick