Hello,
I've recently obtained an OAK-D Lite camera for an object size measurement application.
This is the flow I want some help with:
- I use cv2.aruco to detect Aruco markers in the RGB frame on the host (a minimal detection sketch follows this list). I want to use the detected Aruco markers to identify a plane; I know a priori that they will all be coplanar.
- I understand (please correct me if I'm wrong) that the depth and disparity frames are aligned to the Left Rectified frame.
- I want to take these detections from the RGB frame into the Rectified Left frame, and use the depth frame to determine the spatial locations of the Aruco markers, to get a reliable fix on the plane.
- To start, I'm simply trying to draw the detections on the Rectified left frame to ensure the transformations are correct.
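For reference, the detection itself is the simple part; it's roughly something like this (assuming OpenCV >= 4.7 with the ArucoDetector API; DICT_4X4_50 is just a placeholder for whichever dictionary the markers actually use):
"""
import cv2

def detect_aruco_centers(frame_bgr):
    # Placeholder dictionary; use whichever dictionary the markers were generated from
    dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
    detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    corners, ids, _rejected = detector.detectMarkers(gray)
    if ids is None:
        return {}
    # Marker centers in RGB pixel coordinates, keyed by marker id
    return {int(i): c.reshape(4, 2).mean(axis=0) for i, c in zip(ids.flatten(), corners)}
"""
The corners come back in RGB pixel coordinates, which is what I then want to carry over into the Rectified Left frame.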
This is what I have so far, with a lot of help from ChatGPT (I'm sure all of this must be documented somewhere, but I wasn't able to find it; I looked mostly on the Luxonis website and the DepthAI docs):
Loading calibration data from the camera
"""
def loadCalibrationData(self):
    calib_data = self.device.readCalibration()

    # Intrinsics
    self.intrinsics['rgb'] = np.array(calib_data.getCameraIntrinsics(depthai.CameraBoardSocket.RGB))
    self.intrinsics['left'] = np.array(calib_data.getCameraIntrinsics(depthai.CameraBoardSocket.LEFT))
    self.intrinsics['right'] = np.array(calib_data.getCameraIntrinsics(depthai.CameraBoardSocket.RIGHT))

    # Distortion coefficients
    self.dist_coeffs['rgb'] = np.array(calib_data.getDistortionCoefficients(depthai.CameraBoardSocket.RGB))
    self.dist_coeffs['left'] = np.array(calib_data.getDistortionCoefficients(depthai.CameraBoardSocket.LEFT))
    self.dist_coeffs['right'] = np.array(calib_data.getDistortionCoefficients(depthai.CameraBoardSocket.RIGHT))

    # Assume rectified frames have the same intrinsics as the original left/right frames (?)
    self.intrinsics['left_rectified'] = self.intrinsics['left']
    self.intrinsics['right_rectified'] = self.intrinsics['right']

    # No distortion in rectified images
    self.dist_coeffs['left_rectified'] = np.zeros_like(self.dist_coeffs['left'])
    self.dist_coeffs['right_rectified'] = np.zeros_like(self.dist_coeffs['right'])

    # Assume identity extrinsics for the rectification transformations
    identity_extrinsic_matrix = np.matrix([[1, 0, 0, 0],
                                           [0, 1, 0, 0],
                                           [0, 0, 1, 0],
                                           [0, 0, 0, 1]])
    self.extrinsics['left_to_left_rectified'] = identity_extrinsic_matrix
    self.extrinsics['left_rectified_to_left'] = identity_extrinsic_matrix
    self.extrinsics['right_to_right_rectified'] = identity_extrinsic_matrix
    self.extrinsics['right_rectified_to_right'] = identity_extrinsic_matrix

    # Extrinsics
    self.extrinsics['rgb_to_left'] = np.matrix(calib_data.getCameraExtrinsics(depthai.CameraBoardSocket.RGB, depthai.CameraBoardSocket.LEFT))
    self.extrinsics['left_to_rgb'] = np.matrix(calib_data.getCameraExtrinsics(depthai.CameraBoardSocket.LEFT, depthai.CameraBoardSocket.RGB))
    self.extrinsics['left_to_right'] = np.matrix(calib_data.getCameraExtrinsics(depthai.CameraBoardSocket.LEFT, depthai.CameraBoardSocket.RIGHT))
    self.extrinsics['right_to_left'] = np.matrix(calib_data.getCameraExtrinsics(depthai.CameraBoardSocket.RIGHT, depthai.CameraBoardSocket.LEFT))
    self.extrinsics['right_rectified_to_left_rectified'] = self.extrinsics['left_to_right']  # Same as left to right
"""
This results in the following data being loaded, with the frame sizes set manually (the code sets them to THE_400_P and THE_1080_P in the pipeline):
"""
##### Frame Sizes #####
{'left': (640, 400), 'right': (640, 400), 'depth': (640, 400), 'disparity': (640, 400), 'rgb': (1920, 1080)}
##### Intrinsics #####
{'rgb': array([[2.97167920e+03, 0.00000000e+00, 2.00892078e+03],
[0.00000000e+00, 2.97050610e+03, 1.00848987e+03],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00]]),
'left': array([[451.64602661, 0. , 336.55810547],
[ 0. , 451.57305908, 241.29995728],
[ 0. , 0. , 1. ]]),
'right': array([[449.44226074, 0. , 312.17462158],
[ 0. , 449.58377075, 239.12860107],
[ 0. , 0. , 1. ]]),
'left_rectified': array([[451.64602661, 0. , 336.55810547],
[ 0. , 451.57305908, 241.29995728],
[ 0. , 0. , 1. ]]),
'right_rectified': array([[449.44226074, 0. , 312.17462158],
[ 0. , 449.58377075, 239.12860107],
[ 0. , 0. , 1. ]])}
##### Dist Coeffs #####
{'rgb': array([-1.98579645e+00, -5.05813503e+00, -1.44903013e-03, 3.96117466e-05,
        3.41700439e+01, -2.07788634e+00, -4.72139835e+00, 3.36404686e+01,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00]),
 'left': array([-8.79453564e+00, 2.61839314e+01, -2.68959237e-04, 1.66668301e-03,
        -1.02677860e+01, -8.80316162e+00, 2.62347431e+01, -1.04048386e+01,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00]),
 'right': array([-2.41038156e+00, -7.64782906e+00, 2.54874420e-03, 7.66101934e-04,
        1.87184925e+01, -2.39909816e+00, -7.70643520e+00, 1.87949905e+01,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00]),
 'left_rectified': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 'right_rectified': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])}
##### Extrinsics #####
{'left_to_left_rectified': matrix([[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]]),
'left_rectified_to_left': matrix([[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]]),
'right_to_right_rectified': matrix([[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]]),
'right_rectified_to_right': matrix([[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]]),
'rgb_to_left': matrix([
[ 9.99790311e-01, -2.07729940e-03, 2.03714576e-02, 3.71858263e+00],
[ 2.25434313e-03, 9.99959886e-01, -8.67166370e-03, 6.50952458e-02],
[-2.03526262e-02, 8.71577021e-03, 9.99754846e-01, 2.67473280e-01],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00]]),
'left_to_rgb': matrix([
[ 9.99790311e-01, 2.25434313e-03, -2.03526262e-02, -3.71250582e+00],
[-2.07729940e-03, 9.99959886e-01, 8.71577021e-03, -5.96992597e-02],
[ 2.03714576e-02, -8.67166370e-03, 9.99754846e-01, -3.42596173e-01],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00]]),
'left_to_right': matrix([
[ 9.99977887e-01, -3.74306655e-05, -6.65248651e-03, -7.43405294e+00],
[ 1.20629622e-04, 9.99921799e-01, 1.25064831e-02, -6.17185049e-02],
[ 6.65149791e-03, -1.25070084e-02, 9.99899685e-01, -1.17352411e-01],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00]]),
'right_to_left': matrix([
[ 9.99977887e-01, 1.20629622e-04, 6.65149791e-03, 7.43467665e+00],
[-3.74306655e-05, 9.99921799e-01, -1.25070084e-02, 5.99676892e-02],
[-6.65248651e-03, 1.25064831e-02, 9.99899685e-01, 6.86575770e-02],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00]]),
'right_rectified_to_left_rectified': matrix([
[ 9.99977887e-01, -3.74306655e-05, -6.65248651e-03, -7.43405294e+00],
[ 1.20629622e-04, 9.99921799e-01, 1.25064831e-02, -6.17185049e-02],
[ 6.65149791e-03, -1.25070084e-02, 9.99899685e-01, -1.17352411e-01],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00]])}
"""
I've tried to format it to be readable. I don't really know what most of these numbers are, so I can't tell whether they are all correct. I did find the documentation for the extrinsic matrix, and those values do look somewhat reasonable. These are the numbers returned by the camera, though, and I expect they should at least be good enough to get a reasonably correct answer.
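One thing I'm not sure about: the RGB intrinsics above look like they correspond to the full sensor resolution (the principal point is around (2009, 1008)), while the RGB frames I'm actually processing are 1920x1080. If getCameraIntrinsics() returns values for a different resolution than the frames being used, that alone could push projected points outside the target frame. The calibration handler appears to accept a target size, so perhaps the loading code should fetch intrinsics like this instead (sketch only, not verified):
"""
# Sketch only, not verified: ask for intrinsics scaled to the resolutions the
# pipeline actually outputs, rather than the defaults stored on the device.
calib_data = self.device.readCalibration()
M_rgb = np.array(calib_data.getCameraIntrinsics(
    depthai.CameraBoardSocket.RGB, 1920, 1080))
M_left = np.array(calib_data.getCameraIntrinsics(
    depthai.CameraBoardSocket.LEFT, 640, 400))
"""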
This is the function I'm using to try to transform a point from one frame to another. This is mostly ChatGPT's doing, unfortunately, and it took a lot of tweaking to make it generate the sequence of operations shown here. I know this sequence is incorrect, but I'm not sure what sequence I should be using. Specifically, the code I was using originally (also ChatGPT's work, admittedly) did not involve normalization against frame size, and the resulting transformed points landed outside the bounds of the target frame. I thought it might have something to do with frame sizes, and had ChatGPT add a normalization step to produce this code. Unfortunately, the code also denormalizes before doing anything useful with the points, so we're essentially back where we started: the results point outside of the frame bounds.
"""
def transformPoint(self, point, from_frame, to_frame):
    M_from = self.intrinsics[from_frame]
    D_from = self.dist_coeffs[from_frame]
    M_to = self.intrinsics[to_frame]
    D_to = self.dist_coeffs[to_frame]
    extrinsics = self.extrinsics[f'{from_frame}_to_{to_frame}']
    R = extrinsics[:3, :3]
    t = extrinsics[:3, 3]
    from_size = self.frame_sizes[from_frame]
    to_size = self.frame_sizes[to_frame]

    # Normalize the point to the range [0, 1] based on the from_frame size
    point_normalized = np.array([point[0] / from_size[0], point[1] / from_size[1]], dtype=np.float32)
    logging.debug(f"Normalized point: {point_normalized}")

    # Convert normalized 2D point to pixel coordinates in from_frame
    point_pixel = np.array([point_normalized[0] * from_size[0], point_normalized[1] * from_size[1]], dtype=np.float32)

    # Undistort the point
    point_undistorted = cv2.undistortPoints(np.array([[point_pixel]], dtype=np.float32), M_from, D_from, None, M_from)
    point_undistorted = np.append(point_undistorted[0][0], 1.0)
    logging.debug(f"Undistorted point: {point_undistorted}")

    # Convert to 3D point in from_frame coordinates
    point_3d = np.dot(np.linalg.inv(M_from), point_undistorted)
    logging.debug(f"Point in 3D (from_frame coords): {point_3d}")

    # Apply extrinsic transformation to get the point in to_frame coordinates
    point_transformed_3d = np.dot(R, point_3d) + t
    logging.debug(f"Point transformed in 3D (to_frame coords): {point_transformed_3d}")

    # Project transformed 3D point back to 2D in to_frame coordinates
    point_transformed_2d, _ = cv2.projectPoints(point_transformed_3d[:3], np.zeros(3), np.zeros(3), M_to, D_to)
    logging.debug(f"Point projected in 2D (to_frame): {point_transformed_2d}")

    # Convert projected 2D point to normalized coordinates in to_frame
    point_transformed_2d_normalized = np.array([point_transformed_2d[0][0][0] / to_size[0], point_transformed_2d[0][0][1] / to_size[1]], dtype=np.float32)

    # Denormalize the point back to the pixel coordinates of the to_frame
    point_denormalized = np.array([point_transformed_2d_normalized[0] * to_size[0], point_transformed_2d_normalized[1] * to_size[1]], dtype=np.float32)
    logging.debug(f"Denormalized point: {point_denormalized}")

    return point_denormalized
"""
Here's an example of one transformation, with plenty of debug output:
"""
Trying to draw aruco on left
ORIGINAL
(array([[[323., 448.],
[616., 337.],
[745., 585.],
[467., 719.]]], dtype=float32),)
...
Transforming point [745. 585.] from rgb to left
Intrinsics from: [[2.97167920e+03 0.00000000e+00 2.00892078e+03]
[0.00000000e+00 2.97050610e+03 1.00848987e+03]
[0.00000000e+00 0.00000000e+00 1.00000000e+00]],
Distortion from: [-1.98579645e+00 -5.05813503e+00 -1.44903013e-03 3.96117466e-05
3.41700439e+01 -2.07788634e+00 -4.72139835e+00 3.36404686e+01
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00]
Intrinsics to: [[451.64602661 0. 336.55810547]
[ 0. 451.57305908 241.29995728]
[ 0. 0. 1. ]],
Distortion to: [-8.79453564e+00 2.61839314e+01 -2.68959237e-04 1.66668301e-03
-1.02677860e+01 -8.80316162e+00 2.62347431e+01 -1.04048386e+01
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00]
Extrinsics: [[ 9.99790311e-01 -2.07729940e-03 2.03714576e-02 3.71858263e+00]
[ 2.25434313e-03 9.99959886e-01 -8.67166370e-03 6.50952458e-02]
[-2.03526262e-02 8.71577021e-03 9.99754846e-01 2.67473280e-01]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00]]
Normalized point: [0.38802084 0.5416667 ]
Undistorted point: [762.49359131 591.69934082 1. ]
Point in 3D (from_frame coords): [-0.41943531 -0.1403096 1. ]
Point transformed in 3D (to_frame coords): [[ 3.31989819 3.56866144 4.72565118]
[-0.33358919 -0.08482594 1.0721638 ]
[-0.13121116 0.11755209 1.27454183]]
Point projected in 2D (to_frame): [[[659.47425238 587.36658511]]
[[196.05700567 205.54649539]]
[[290.08669194 282.93753844]]]
Denormalized point: [659.47424 587.3666 ]
...
TRANSFORMED
(array([[[645.196 , 582.2823 ],
[655.05133, 579.08887],
[659.47424, 587.3666 ],
[650.25116, 591.37274]]], dtype=float32),)
Point [645.196 582.2823] out of bounds for frame of size (640, 400)
"""
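For what it's worth, my current understanding is that a single pixel only defines a ray, so without a real depth value the implicit Z = 1 in the code above can't land on the actual marker position. (I also notice that the "transformed in 3D" debug output above is a 3x3 matrix rather than a vector, presumably because t comes out of the np.matrix as a column and broadcasts.) If I had a depth Z for the pixel, in the same units as the extrinsic translation (centimetres, if I've read the docs right), my mental model of the chain would be roughly the sketch below; please correct me if this structure is wrong:
"""
def transformPointWithDepth(self, point, Z, from_frame, to_frame):
    # Sketch, not verified: same chain as above, but with an explicit depth Z
    # (same units as the extrinsic translation) instead of an implicit Z = 1.
    M_from = self.intrinsics[from_frame]
    D_from = self.dist_coeffs[from_frame]
    M_to = self.intrinsics[to_frame]
    D_to = self.dist_coeffs[to_frame]
    extrinsics = np.asarray(self.extrinsics[f'{from_frame}_to_{to_frame}'])
    R, t = extrinsics[:3, :3], extrinsics[:3, 3]

    # Undistort to normalized image coordinates (no P argument -> output is x/z, y/z)
    xn, yn = cv2.undistortPoints(
        np.array([[point]], dtype=np.float32), M_from, D_from)[0, 0]

    # Back-project: the pixel fixes a ray, the depth picks the point on it
    p_from = np.array([xn * Z, yn * Z, Z])

    # Move into the target camera's coordinate frame
    p_to = R @ p_from + t

    # Re-project with the target camera's model
    projected, _ = cv2.projectPoints(
        p_to.reshape(1, 3), np.zeros(3), np.zeros(3), M_to, D_to)
    return projected[0, 0]
"""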
I've been trying to find documentation or an example that tells me how this can be done, but have so far come up empty. If there is something I've missed, please point me in the right direction.
I've pushed the full code I'm using for testing to GitHub: chintal/depthai-sandbox. Note that this code needs PyQt6 in addition to DepthAI and OpenCV to run. Everything outside of DepthAI is pip-installable, but I haven't validated exactly what the dependencies are yet.