Hello,
I've recently obtained an OAK-D Lite camera for an object size measurement application.
This is the flow I want some help with:
- I use cv2.aruco to detect Aruco markers in the RGB frame on the host (a minimal detection sketch follows this list). I want to use the detected Aruco markers to identify a plane; I know a priori that they will all be coplanar.
- I understand (please correct me if I'm wrong) that the depth and disparity frames are aligned to the Left Rectified frame.
- I want to take these detections from the RGB frame into the Rectified Left frame, and use the depth frame to determine the spatial locations of the Aruco markers, to get a reliable fix on the plane.
- To start, I'm simply trying to draw the detections on the Rectified left frame to ensure the transformations are correct.
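For reference, the detection itself is the simple part; it's roughly something like this (assuming OpenCV >= 4.7 with the ArucoDetector API; DICT_4X4_50 is just a placeholder for whichever dictionary the markers actually use):
"""
import cv2

def detect_aruco_centers(frame_bgr):
    # Placeholder dictionary; use whichever dictionary the markers were generated from
    dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
    detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    corners, ids, _rejected = detector.detectMarkers(gray)
    if ids is None:
        return {}
    # Marker centers in RGB pixel coordinates, keyed by marker id
    return {int(i): c.reshape(4, 2).mean(axis=0) for i, c in zip(ids.flatten(), corners)}
"""
The corners come back in RGB pixel coordinates, which is what I then want to carry over into the Rectified Left frame.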
This is what I have so far, with a lot of help from ChatGPT (I'm sure all of this must be documented somewhere, but I wasn't able to find it; I looked mostly on the Luxonis website and the DepthAI docs):
Loading calibration data from the camera
"""
def loadCalibrationData(self):
    calib_data = self.device.readCalibration()

    # Intrinsics
    self.intrinsics['rgb'] = np.array(calib_data.getCameraIntrinsics(depthai.CameraBoardSocket.RGB))
    self.intrinsics['left'] = np.array(calib_data.getCameraIntrinsics(depthai.CameraBoardSocket.LEFT))
    self.intrinsics['right'] = np.array(calib_data.getCameraIntrinsics(depthai.CameraBoardSocket.RIGHT))

    # Distortion coefficients
    self.dist_coeffs['rgb'] = np.array(calib_data.getDistortionCoefficients(depthai.CameraBoardSocket.RGB))
    self.dist_coeffs['left'] = np.array(calib_data.getDistortionCoefficients(depthai.CameraBoardSocket.LEFT))
    self.dist_coeffs['right'] = np.array(calib_data.getDistortionCoefficients(depthai.CameraBoardSocket.RIGHT))

    # Assume rectified frames have the same intrinsics as the original left/right frames (?)
    self.intrinsics['left_rectified'] = self.intrinsics['left']
    self.intrinsics['right_rectified'] = self.intrinsics['right']

    # No distortion in rectified images
    self.dist_coeffs['left_rectified'] = np.zeros_like(self.dist_coeffs['left'])
    self.dist_coeffs['right_rectified'] = np.zeros_like(self.dist_coeffs['right'])

    # Assume identity extrinsics for the rectification transformations
    identity_extrinsic_matrix = np.matrix([[1, 0, 0, 0],
                                           [0, 1, 0, 0],
                                           [0, 0, 1, 0],
                                           [0, 0, 0, 1]])
    self.extrinsics['left_to_left_rectified'] = identity_extrinsic_matrix
    self.extrinsics['left_rectified_to_left'] = identity_extrinsic_matrix
    self.extrinsics['right_to_right_rectified'] = identity_extrinsic_matrix
    self.extrinsics['right_rectified_to_right'] = identity_extrinsic_matrix

    # Extrinsics
    self.extrinsics['rgb_to_left'] = np.matrix(calib_data.getCameraExtrinsics(depthai.CameraBoardSocket.RGB, depthai.CameraBoardSocket.LEFT))
    self.extrinsics['left_to_rgb'] = np.matrix(calib_data.getCameraExtrinsics(depthai.CameraBoardSocket.LEFT, depthai.CameraBoardSocket.RGB))
    self.extrinsics['left_to_right'] = np.matrix(calib_data.getCameraExtrinsics(depthai.CameraBoardSocket.LEFT, depthai.CameraBoardSocket.RIGHT))
    self.extrinsics['right_to_left'] = np.matrix(calib_data.getCameraExtrinsics(depthai.CameraBoardSocket.RIGHT, depthai.CameraBoardSocket.LEFT))
    self.extrinsics['right_rectified_to_left_rectified'] = self.extrinsics['left_to_right']  # Same as left to right
"""
This results in the following data being loaded, with the frame sizes set manually (the code sets them to THE_400_P and THE_1080_P in the pipeline):
"""
##### Frame Sizes #####
{'left': (640, 400), 'right': (640, 400), 'depth': (640, 400), 'disparity': (640, 400), 'rgb': (1920, 1080)}
##### Intrinsics #####
{'rgb': array([[2.97167920e+03, 0.00000000e+00, 2.00892078e+03],
[0.00000000e+00, 2.97050610e+03, 1.00848987e+03],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00]]),
'left': array([[451.64602661, 0. , 336.55810547],
[ 0. , 451.57305908, 241.29995728],
[ 0. , 0. , 1. ]]),
'right': array([[449.44226074, 0. , 312.17462158],
[ 0. , 449.58377075, 239.12860107],
[ 0. , 0. , 1. ]]),
'left_rectified': array([[451.64602661, 0. , 336.55810547],
[ 0. , 451.57305908, 241.29995728],
[ 0. , 0. , 1. ]]),
'right_rectified': array([[449.44226074, 0. , 312.17462158],
[ 0. , 449.58377075, 239.12860107],
[ 0. , 0. , 1. ]])}
##### Dist Coeffs #####
{'rgb': array([-1.98579645e+00, -5.05813503e+00, -1.44903013e-03, 3.96117466e-05,
        3.41700439e+01, -2.07788634e+00, -4.72139835e+00, 3.36404686e+01,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00]),
 'left': array([-8.79453564e+00, 2.61839314e+01, -2.68959237e-04, 1.66668301e-03,
        -1.02677860e+01, -8.80316162e+00, 2.62347431e+01, -1.04048386e+01,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00]),
 'right': array([-2.41038156e+00, -7.64782906e+00, 2.54874420e-03, 7.66101934e-04,
        1.87184925e+01, -2.39909816e+00, -7.70643520e+00, 1.87949905e+01,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00]),
 'left_rectified': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 'right_rectified': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])}
##### Extrinsics #####
{'left_to_left_rectified': matrix([[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]]),
'left_rectified_to_left': matrix([[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]]),
'right_to_right_rectified': matrix([[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]]),
'right_rectified_to_right': matrix([[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]]),
'rgb_to_left': matrix([
[ 9.99790311e-01, -2.07729940e-03, 2.03714576e-02, 3.71858263e+00],
[ 2.25434313e-03, 9.99959886e-01, -8.67166370e-03, 6.50952458e-02],
[-2.03526262e-02, 8.71577021e-03, 9.99754846e-01, 2.67473280e-01],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00]]),
'left_to_rgb': matrix([
[ 9.99790311e-01, 2.25434313e-03, -2.03526262e-02, -3.71250582e+00],
[-2.07729940e-03, 9.99959886e-01, 8.71577021e-03, -5.96992597e-02],
[ 2.03714576e-02, -8.67166370e-03, 9.99754846e-01, -3.42596173e-01],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00]]),
'left_to_right': matrix([
[ 9.99977887e-01, -3.74306655e-05, -6.65248651e-03, -7.43405294e+00],
[ 1.20629622e-04, 9.99921799e-01, 1.25064831e-02, -6.17185049e-02],
[ 6.65149791e-03, -1.25070084e-02, 9.99899685e-01, -1.17352411e-01],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00]]),
'right_to_left': matrix([
[ 9.99977887e-01, 1.20629622e-04, 6.65149791e-03, 7.43467665e+00],
[-3.74306655e-05, 9.99921799e-01, -1.25070084e-02, 5.99676892e-02],
[-6.65248651e-03, 1.25064831e-02, 9.99899685e-01, 6.86575770e-02],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00]]),
'right_rectified_to_left_rectified': matrix([
[ 9.99977887e-01, -3.74306655e-05, -6.65248651e-03, -7.43405294e+00],
[ 1.20629622e-04, 9.99921799e-01, 1.25064831e-02, -6.17185049e-02],
[ 6.65149791e-03, -1.25070084e-02, 9.99899685e-01, -1.17352411e-01],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00]])}
"""
I've tried to format it to be readable. I don't really know what most of these numbers are, so I can't tell whether they are all correct. I did find the documentation for the extrinsic matrix, and those values do look somewhat reasonable. These are the numbers returned by the camera, though, and I expect they should at least be good enough to get a reasonably correct answer.
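One thing I'm not sure about: the RGB intrinsics above look like they correspond to the full sensor resolution (the principal point is around (2009, 1008)), while the RGB frames I'm actually processing are 1920x1080. If getCameraIntrinsics() returns values for a different resolution than the frames being used, that alone could push projected points outside the target frame. The calibration handler appears to accept a target size, so perhaps the loading code should fetch intrinsics like this instead (sketch only, not verified):
"""
# Sketch only, not verified: ask for intrinsics scaled to the resolutions the
# pipeline actually outputs, rather than the defaults stored on the device.
calib_data = self.device.readCalibration()
M_rgb = np.array(calib_data.getCameraIntrinsics(
    depthai.CameraBoardSocket.RGB, 1920, 1080))
M_left = np.array(calib_data.getCameraIntrinsics(
    depthai.CameraBoardSocket.LEFT, 640, 400))
"""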
This is the function I'm using to try to transform a point from one frame to another. This is mostly ChatGPT's doing, unfortunately, and it took a lot of tweaking to make it generate the sequence of operations shown here. I know this sequence is incorrect, but I'm not sure what sequence I should be using. Specifically, the code I was using originally (also ChatGPT's work, admittedly) did not involve normalization against frame size, and the resulting transformed points landed outside the bounds of the target frame. I thought it might have something to do with frame sizes, and had ChatGPT add a normalization step to produce this code. Unfortunately, the code also denormalizes before doing anything useful with the points, so we're essentially back where we started: the results point outside of the frame bounds.
"""
def transformPoint(self, point, from_frame, to_frame):
    M_from = self.intrinsics[from_frame]
    D_from = self.dist_coeffs[from_frame]
    M_to = self.intrinsics[to_frame]
    D_to = self.dist_coeffs[to_frame]
    extrinsics = self.extrinsics[f'{from_frame}_to_{to_frame}']
    R = extrinsics[:3, :3]
    t = extrinsics[:3, 3]
    from_size = self.frame_sizes[from_frame]
    to_size = self.frame_sizes[to_frame]

    # Normalize the point to the range [0, 1] based on the from_frame size
    point_normalized = np.array([point[0] / from_size[0], point[1] / from_size[1]], dtype=np.float32)
    logging.debug(f"Normalized point: {point_normalized}")

    # Convert normalized 2D point to pixel coordinates in from_frame
    point_pixel = np.array([point_normalized[0] * from_size[0], point_normalized[1] * from_size[1]], dtype=np.float32)

    # Undistort the point
    point_undistorted = cv2.undistortPoints(np.array([[point_pixel]], dtype=np.float32), M_from, D_from, None, M_from)
    point_undistorted = np.append(point_undistorted[0][0], 1.0)
    logging.debug(f"Undistorted point: {point_undistorted}")

    # Convert to 3D point in from_frame coordinates
    point_3d = np.dot(np.linalg.inv(M_from), point_undistorted)
    logging.debug(f"Point in 3D (from_frame coords): {point_3d}")

    # Apply extrinsic transformation to get the point in to_frame coordinates
    point_transformed_3d = np.dot(R, point_3d) + t
    logging.debug(f"Point transformed in 3D (to_frame coords): {point_transformed_3d}")

    # Project transformed 3D point back to 2D in to_frame coordinates
    point_transformed_2d, _ = cv2.projectPoints(point_transformed_3d[:3], np.zeros(3), np.zeros(3), M_to, D_to)
    logging.debug(f"Point projected in 2D (to_frame): {point_transformed_2d}")

    # Convert projected 2D point to normalized coordinates in to_frame
    point_transformed_2d_normalized = np.array([point_transformed_2d[0][0][0] / to_size[0], point_transformed_2d[0][0][1] / to_size[1]], dtype=np.float32)

    # Denormalize the point back to the pixel coordinates of the to_frame
    point_denormalized = np.array([point_transformed_2d_normalized[0] * to_size[0], point_transformed_2d_normalized[1] * to_size[1]], dtype=np.float32)
    logging.debug(f"Denormalized point: {point_denormalized}")

    return point_denormalized
"""
Here's an example of one transformation, with plenty of debug output:
"""
Trying to draw aruco on left
ORIGINAL
(array([[[323., 448.],
[616., 337.],
[745., 585.],
[467., 719.]]], dtype=float32),)
...
Transforming point [745. 585.] from rgb to left
Intrinsics from: [[2.97167920e+03 0.00000000e+00 2.00892078e+03]
[0.00000000e+00 2.97050610e+03 1.00848987e+03]
[0.00000000e+00 0.00000000e+00 1.00000000e+00]],
Distortion from: [-1.98579645e+00 -5.05813503e+00 -1.44903013e-03 3.96117466e-05
3.41700439e+01 -2.07788634e+00 -4.72139835e+00 3.36404686e+01
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00]
Intrinsics to: [[451.64602661 0. 336.55810547]
[ 0. 451.57305908 241.29995728]
[ 0. 0. 1. ]],
Distortion to: [-8.79453564e+00 2.61839314e+01 -2.68959237e-04 1.66668301e-03
-1.02677860e+01 -8.80316162e+00 2.62347431e+01 -1.04048386e+01
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00]
Extrinsics: [[ 9.99790311e-01 -2.07729940e-03 2.03714576e-02 3.71858263e+00]
[ 2.25434313e-03 9.99959886e-01 -8.67166370e-03 6.50952458e-02]
[-2.03526262e-02 8.71577021e-03 9.99754846e-01 2.67473280e-01]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00]]
Normalized point: [0.38802084 0.5416667 ]
Undistorted point: [762.49359131 591.69934082 1. ]
Point in 3D (from_frame coords): [-0.41943531 -0.1403096 1. ]
Point transformed in 3D (to_frame coords): [[ 3.31989819 3.56866144 4.72565118]
[-0.33358919 -0.08482594 1.0721638 ]
[-0.13121116 0.11755209 1.27454183]]
Point projected in 2D (to_frame): [[[659.47425238 587.36658511]]
[[196.05700567 205.54649539]]
[[290.08669194 282.93753844]]]
Denormalized point: [659.47424 587.3666 ]
...
TRANSFORMED
(array([[[645.196 , 582.2823 ],
[655.05133, 579.08887],
[659.47424, 587.3666 ],
[650.25116, 591.37274]]], dtype=float32),)
Point [645.196 582.2823] out of bounds for frame of size (640, 400)
"""
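For what it's worth, my current understanding is that a single pixel only defines a ray, so without a real depth value the implicit Z = 1 in the code above can't land on the actual marker position. (I also notice that the "transformed in 3D" debug output above is a 3x3 matrix rather than a vector, presumably because t comes out of the np.matrix as a column and broadcasts.) If I had a depth Z for the pixel, in the same units as the extrinsic translation (centimetres, if I've read the docs right), my mental model of the chain would be roughly the sketch below; please correct me if this structure is wrong:
"""
def transformPointWithDepth(self, point, Z, from_frame, to_frame):
    # Sketch, not verified: same chain as above, but with an explicit depth Z
    # (same units as the extrinsic translation) instead of an implicit Z = 1.
    M_from = self.intrinsics[from_frame]
    D_from = self.dist_coeffs[from_frame]
    M_to = self.intrinsics[to_frame]
    D_to = self.dist_coeffs[to_frame]
    extrinsics = np.asarray(self.extrinsics[f'{from_frame}_to_{to_frame}'])
    R, t = extrinsics[:3, :3], extrinsics[:3, 3]

    # Undistort to normalized image coordinates (no P argument -> output is x/z, y/z)
    xn, yn = cv2.undistortPoints(
        np.array([[point]], dtype=np.float32), M_from, D_from)[0, 0]

    # Back-project: the pixel fixes a ray, the depth picks the point on it
    p_from = np.array([xn * Z, yn * Z, Z])

    # Move into the target camera's coordinate frame
    p_to = R @ p_from + t

    # Re-project with the target camera's model
    projected, _ = cv2.projectPoints(
        p_to.reshape(1, 3), np.zeros(3), np.zeros(3), M_to, D_to)
    return projected[0, 0]
"""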
I've been trying to find documentation or an example that tells me how this can be done, but have so far come up empty. If there is something I've missed, please point me in the right direction.
I've pushed the full code I'm using for testing to GitHub: chintal/depthai-sandbox. Note that this code needs PyQt6 in addition to DepthAI and OpenCV to run. Everything outside of DepthAI is pip-installable, but I haven't validated exactly what the dependencies are yet.