I want to use an OAK-D Pro PoE and a YOLOv8 or YOLOv9 instance segmentation model to detect metal cubes:

1. Detect objects using the ColorCamera.

2. Calculate world coordinates using the depthFrame derived from the MonoCameras.

To achieve this goal, I referred to the following documents.

## DepthAI Reference Documents

1. multi-cam calibration

  • estimate extrinsics["cam_to_world"] using the ColorCamera intrinsic_mat at (3840, 2160) (sketched below, after this list).

  • **intrinsic_mat is dependent on camera resolution.**

2. Calc spatials on host

  • extract the HFOV for dai.CameraBoardSocket.CAM_C.

  • estimate [X_cam, Y_cam, Z_cam] using the depthFrame (sketched below, after this list).

  • **HFOV is independent of camera resolution.**

3. rgb-depth alignment

  • set the ColorCamera and MonoCameras to the same resolution, 720P.

  • but the ColorCamera doesn't support 720P, so it is set to 1080P.

  • set the stereo node's depthAlign to rgbCamSocket (sketched below, after this list).

  • in the end, all camera resolutions are set to 1080P.

4. YOLO Segment & Depth | OAK-D Pro PoE
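
For concreteness, here are minimal host-side sketches of the three techniques above. They are illustrative only: all function and variable names are mine, the DepthAI calls follow the v2 Python API, and details like the checkerboard geometry are assumptions rather than anything from the linked documents.

Sketch for 1 (multi-cam calibration): estimate cam_to_world from a single checkerboard view with cv2.solvePnP, assuming a 9x6 board with 25 mm squares:

```python
import cv2
import numpy as np

PATTERN = (9, 6)        # inner corners per row/column (assumed board)
SQUARE_M = 0.025        # square size in meters (assumed)

# World-frame positions of the checkerboard corners (board plane is Z = 0)
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_M

def estimate_cam_to_world(gray, K, dist):
    """Return a 4x4 cam_to_world matrix from one checkerboard view."""
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    assert found, "checkerboard not detected"
    _, rvec, tvec = cv2.solvePnP(objp, corners, K, dist)  # world -> camera
    R, _ = cv2.Rodrigues(rvec)
    world_to_cam = np.eye(4)
    world_to_cam[:3, :3] = R
    world_to_cam[:3, 3] = tvec.ravel()
    return np.linalg.inv(world_to_cam)   # cam_to_world is the inverse
```

Sketch for 2 (calc spatials on host): HFOV-based estimation of [X_cam, Y_cam, Z_cam] for a bbox, assuming a uint16 depth frame in millimeters:

```python
import math
import numpy as np

def calc_spatials(depth_frame, bbox, hfov_deg):
    """Estimate [X_cam, Y_cam, Z_cam] in meters for a bbox on the depth frame."""
    xmin, ymin, xmax, ymax = bbox
    roi = depth_frame[ymin:ymax, xmin:xmax]
    z_mm = np.median(roi[roi > 0])       # median of valid pixels is robust

    h, w = depth_frame.shape
    # Focal length in pixels, derived from the HFOV (resolution-independent)
    f_px = w / (2.0 * math.tan(math.radians(hfov_deg) / 2.0))
    u, v = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
    x_mm = z_mm * (u - w / 2.0) / f_px
    y_mm = -z_mm * (v - h / 2.0) / f_px  # image y grows downward
    return np.array([x_mm, y_mm, z_mm]) / 1000.0
```

Sketch for 3 (rgb-depth alignment): a pipeline fragment that aligns depth to the RGB socket (XLink outputs omitted):

```python
import depthai as dai

pipeline = dai.Pipeline()

cam_rgb = pipeline.create(dai.node.ColorCamera)
cam_rgb.setBoardSocket(dai.CameraBoardSocket.CAM_A)
cam_rgb.setResolution(dai.ColorCameraProperties.SensorResolution.THE_1080_P)

mono_left = pipeline.create(dai.node.MonoCamera)
mono_left.setBoardSocket(dai.CameraBoardSocket.CAM_B)
mono_right = pipeline.create(dai.node.MonoCamera)
mono_right.setBoardSocket(dai.CameraBoardSocket.CAM_C)
for mono in (mono_left, mono_right):
    mono.setResolution(dai.MonoCameraProperties.SensorResolution.THE_720_P)

stereo = pipeline.create(dai.node.StereoDepth)
mono_left.out.link(stereo.left)
mono_right.out.link(stereo.right)
# Align the depth output to CAM_A so depth pixels match rgbFrame pixels
stereo.setDepthAlign(dai.CameraBoardSocket.CAM_A)
```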

## Development Plan and Questions

1. cameraPose.py generates extrinsics["cam_to_world"] from a checkerboard.

  • get and save extrinsics["cam_to_world"].

  • but which camera's intrinsic_mat should be used: the ColorCamera's or a MonoCamera's?

  • and at which resolution should it be retrieved: 1080P or 720P?

  • or should I use the stereo node's rectifiedRight or syncedRight, in the same pipeline as main.py's pipeline?

2. main.py detects objects and estimates world coordinates.

  • the host detects cubes using the rgbFrame.

    • this way I can use a model with any input size and resolution.
  • get post-processed depth with the stereo node's depthAlign set to rgbCamSocket.

    • but we obtained extrinsics["cam_to_world"] for CAM_C.

    • doesn't stereo depthAlign affect the extrinsics["cam_to_world"] matrix?

    • if I set depthAlign to CAM_A in StereoDepth, does each depth value become the Z_cam of the ColorCamera?

  • estimate camera coordinates using the depthFrame, BBox, and segmentation mask.

  • finally, convert camera coordinates to world coordinates using extrinsics["cam_to_world"] (sketched right after this list).
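
The final conversion in the plan is a single homogeneous transform; a minimal sketch, assuming cam_to_world is the 4x4 matrix saved by cameraPose.py:

```python
import numpy as np

def to_world(p_cam, cam_to_world):
    """Map [X_cam, Y_cam, Z_cam] into world coordinates.

    cam_to_world: 4x4 extrinsic matrix saved by cameraPose.py (assumed).
    """
    p_h = np.append(np.asarray(p_cam, dtype=float), 1.0)  # homogeneous point
    return (cam_to_world @ p_h)[:3]
```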

Are there any additional issues that need to be carefully considered in the above plan?

    OseongKwon but which camera's intrinsic_mat should be used: the ColorCamera's or a MonoCamera's?

    Intrinsics are calibrated in the factory. No need to redo them.

    OseongKwon and at which resolution should it be retrieved: 1080P or 720P?

    getCameraIntrinsics allows you to request a custom W/H, so the intrinsics are scaled accordingly. W/H are up to you.

    OseongKwon or should I use the stereo node's rectifiedRight or syncedRight, in the same pipeline as main.py's pipeline?

    Not sure what you mean.

    OseongKwon doesn't stereo depthAlign affect the extrinsics["cam_to_world"] matrix?

    No, it just changes which matrix applies: that of the camera the image is aligned to (so if aligning to CAM_A, the RGB camera's cam_to_world matrix should be used).

    OseongKwon the Z_cam of the ColorCamera?

    Not sure I understand.

    Thanks,
    Jaka
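
To make the getCameraIntrinsics point concrete, a sketch of reading the factory intrinsics scaled to a chosen resolution (CAM_A at 1920x1080 is just an example; DepthAI v2 API):

```python
import depthai as dai
import numpy as np

with dai.Device() as device:
    calib = device.readCalibration()
    # Request intrinsics scaled to the resolution you actually process at;
    # fx, fy, cx, cy scale with the requested width/height
    K_rgb = np.array(calib.getCameraIntrinsics(dai.CameraBoardSocket.CAM_A,
                                               1920, 1080))
    print(K_rgb)
```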

    1. The multi-cam calibration document guides you through creating an extrinsic matrix (R, t) using the checkerboard and the intrinsic matrix.
    2. Calc spatials on host walks through how to use the depthFrame to compute camera coordinates (X_cam, Y_cam, Z_cam) for an ROI.
    3. Generally, the colorFrame and depthFrame have different sizes and views; I understand the rgb-depth alignment document shows how to align the depthFrame to the colorFrame so that corresponding pixels point to the same location.
    4. Usually the depthFrame is used as Z_cam. The question is: if I align the StereoDepth node to CAM_A, wouldn't each pixel value of the depthFrame become the Z_cam of CAM_A, i.e. the ColorCamera?
    5. If that assumption is correct, the depthFrame will be the Z_cam of CAM_A (i.e. the ColorCamera); using it, we can generate the camera coordinates (X_cam, Y_cam, Z_cam) for CAM_A following method 2, and then generate the world coordinates (X, Y, Z) using the extrinsic matrix of CAM_A generated by method 1, right? (See the sketch after this list.)
    6. If that assumption is wrong and the depthFrame still represents the Z_cam of CAM_C (the default), then we need some way to map the BBox detected on the colorFrame onto the depthFrame. How should we do that?
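
To illustrate points 4 and 5, a sketch of turning a segmentation mask plus a CAM_A-aligned depth frame into camera coordinates via pinhole deprojection; all names are mine, and K is assumed to be CAM_A's intrinsic matrix at the depth frame's resolution:

```python
import numpy as np

def mask_camera_coords(depth_frame, mask, K):
    """Camera coordinates of a segmented object.

    depth_frame: uint16 millimeters, aligned to CAM_A (assumed);
    mask: boolean array of the same shape; K: 3x3 intrinsics of CAM_A
    at the depth frame's resolution.
    """
    z_mm = np.median(depth_frame[mask & (depth_frame > 0)])
    ys, xs = np.nonzero(mask)
    u, v = xs.mean(), ys.mean()          # mask centroid in pixels
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Pinhole deprojection: p_cam = Z * K^-1 [u, v, 1]^T
    # (y follows the image-down camera convention here)
    x_mm = (u - cx) * z_mm / fx
    y_mm = (v - cy) * z_mm / fy
    return np.array([x_mm, y_mm, z_mm]) / 1000.0   # meters
```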


      OseongKwon Usually the depthFrame is used as Z_cam. The question is: if I align the StereoDepth node to CAM_A, wouldn't each pixel value of the depthFrame become the Z_cam of CAM_A, i.e. the ColorCamera?

      Correct.

      OseongKwon If that assumption is correct, the depthFrame will be the Z_cam of CAM_A (i.e. the ColorCamera); using it, we can generate the camera coordinates (X_cam, Y_cam, Z_cam) for CAM_A following method 2, and then generate the world coordinates (X, Y, Z) using the extrinsic matrix of CAM_A generated by method 1, right?

      Correct.