Getting the camera extrinsics afterwards

Aingu7ae

Hi, does anybody know a tool or have code to get the extrinsics afterwards?

erik

Hi Aingu7ae !
You'd need to calibrate the device to some other coordiante system, perhaps easiest would be to use multi-cam calibration script. Then you could transform eg. pointclouds or 3d object detections to your world coordinate, which fits with the image above. Thoughts?
Thanks,. Erik

Aingu7ae

Hello erik,

thank you for taking an interest 🙂 Yeah, I did the multi-cam calibration which gave me a few matrices.

world_to_cam
[[ 0.61411884 0.7890858 -0.01420041 -0.21579439]
[-0.00386166 0.02099733 0.99977207 0.31440464]
[ 0.78920412 -0.61392403 0.01594204 1.05200151]
[ 0. 0. 0. 1. ]]

cam_to_world
[[ 0.61411884 -0.00386166 0.78920412 -0.6965064 ]
[ 0.7890858 0.02099733 -0.61392403 0.80952764]
[-0.01420041 0.99977207 0.01594204 -0.33416839]
[ 0. 0. 0. 1. ]]

trans_vec
[[-0.21579439]
[ 0.31440464]
[ 1.05200151]]

rot_vec
[[-1.43083528]
[-0.71236433]
[-0.70309224]]

intrinsics_mat
[[1546.18994140625, 0.0, 962.3573608398438]
[0.0, 1546.18994140625, 531.37255859375]
[0.0, 0.0, 1.0]]

I guess first I have to manage to create the pointcloud from the video material. I recorded the scene with gen2-record-replay. Is gen2-pointcloud a good starting place to do so?

How is the workflow to transform the world coordinates once I aquired the pointcloud?

I found this tutorial about geometric camera calibration (Camera Calibration with Example in Python). I presume this is another approach? The math looks a little scary. Given 6 points in the image plane and their corresponding world coordinates, it should be possible to reconstruct the extrinsic matrix. The points have to be independent but I don't know what it means in this context. I can get the world coordinates for 6 points in the image, but how do I check for independence? I thought there are no more then 3 independent vectors in 3D space?! I'm feeling a little lost here and don't know what to learn first to fill my blanks. If you or other readers got a pointer that would be nice!

jakaskerl

Hi Aingu7ae
You need to convert pointcloud coordinates (xyz in camera coordinates) to world coordinates.
For that you use the transformation matrix calculated with multi-cam-example script. Multiplying camera to world matrix with position vector for each point in pointcloud should give you the position vector of that same point in the chessboard's frame of reference. You can that use that real world reference frame to find out the position of objects in the room.

The tutorial you mentioned is just another way of doing the same thing but you need a few points and their positions. Points being independent just means that no 3 points are collinear (which is almost impossible to do in 3D space). Then just solve a system of equations as mentioned in the tutorial and you should get the transform matrix coefficients.

Hope this helps,
Jaka

erik

cc @jakaskerl ^

Aingu7ae

Hi jakaskerl, thank you for your explanation, I really appreciate that.

Here is what I have trouble wrapping my mind around. Although I did the chessboard calibration at a different location afterwards and with a completely different camera pose, the process you describe (if I understand you: multiplying the points/vectors of the pointcloud with the cam-to-world matrix of the chessboard calibration) will yield the real world positions of the pointcloud points in relation to the (first) camera at the ceiling which recorded the original video material and that had a different pose than the calibration camera?

jakaskerl

Hi Aingu7ae
The camera needs to be calibrated at the same location and with the same camera position and orientation. The extrinsic matrix describes the conversion from world coordinate system to camera coordinate system which means it also includes the translation in xyz from world to camera. That changes when you move the camera or the chessboard so results are no longer valid. What stays the same are the camera intrinsics. That means you have to place the chessboard at the scene where you recorded otherwise the transformations are completely wrong.

Hope this helps,
Jaka

Aingu7ae

Hi jakaskerl

Trying to do the chessboard calibration afterwards is not optimal in my situation, unfortunately. I would have to reattach and try to align the camera to the orignal position in the video footage and this will most certainly introduce a calibration error since I never gonna manage to align the camera in exactly the same way. Besides getting access to the location and a big enough time slot to do so poses the even bigger challenge.

This is why I'm looking for a method to get the extrinsics from the video material. Sorry, I just realize that the prior sentence would have been a better title for my question.

It would probably suffice to just reconstruct the rotation matrix of the camera, since all the other variables like intrinsic matrix, translation vector are known (more or less 😉).

What about the calib.json file from the gen2-record-replay scripts I used for the recording? Is something in there what we could use to reconstruct the rotation matrix?

This afternoon I came across another interesting approach which is called vanishing point camera calibration . But a drawback with that method seems that we can compute jaw and pitch from the image but not the roll... so this approach might be not so usefull after all.

If you or anybody else reading this knows other methods worth exploring I'd be happy to try!

erik

Hi Aingu7ae ,
How would you calculate the extrinsics (local coordinate system to global system) from the video? One potential approach would be to use IMU to determine the tilt of the camera, but the camera still wouldn't know "where it is" relative to the floorplan. Thoughts?
Thanks, Erik

Aingu7ae

Hi erik ,

so far I found two approaches, Direct Linear Transform (DLT) and the Projective 3 Point Algorithm (P3P).

With DLT you can approximate the intrinsic and extrinsic camera matrices, given at least six points in the picture / video and their corresponding world coordinates. That's the method the tutorial is using I referenced a few post ago.

With the P3P algorithm you approximate the extrinsic camera matrix, given three points and their corresponding world coordinates and the intrinsic camera matrix. Since we know our intrinsic matrix this would probably be the way to go.

Here are 5 minute intro videos about DLT and P3P.
And these are the extended video lectures about DLT and P3P.

Harnessing the capabilities of the IMU would be a blast! Then the only missing piece is the position of the camera, which we could meassure ourselvs, which could be a big timesaver plus we wouldn't need to carry around (huge?) checkerboard patterns 😉.

I assume that for the depicted scene in my inital posting the checkerboard needs to be quite large for the calibration process to recognize the pattern?! Do you know a formular to estimate the minimum size of the checkerboard squares, given the camera's intrinsic matrix and the distance from the camera?

I asked ChatGPT:

Q: How to get the camera translation matrix from the camera intrinsic matrix and the camera rotation matrix, without knowing the translation vector 't'?

Without knowing the translation vector t, it is not possible to compute the full camera matrix.

However, if you have some information about the scene or the camera setup, you may be able to estimate the translation vector t using techniques such as visual odometry or structure from motion. Once you have an estimate for t, you can compute the full camera matrix.

Maybe it would suffice to use the pictures from the stereo cameras to approximate the position of the camera in relation to features? But I haven't followed this rabbit hole (yet).

Kind regards

erik

Hi Aingu7ae ,
Yep, I think the chessboard approach using multi-cam calibration uses the P3P technique (under the hood), as we already have camera intrinsics known. We have used chessboard printed onto A4 sheet and it worked up to a few meters (looking at Spatail Detection Fusion demo).

Maybe it would suffice to use the pictures from the stereo cameras to approximate the position

It would be, but you need some known point so you can set that as the global coordinate system, so it would be similar to the chessboard approach.

Thanks, Erik

jakaskerl

Hi Aingu7ae

Do you know a formula to estimate the minimum size of the checkerboard squares

Theoretically there isn't a minimum size as long as there is at least one pixel per board square. What happens is that with distance, the accuracy of estimation drastically decreases (the closer the board is to the camera, the better the "pixel per square" , the better the extrinsic matrix approximation).
You can use a small checkerboard, but put it close to the camera at known xyz, and calculate the world coordinates from there, but you might get a rather large error when estimating xyz in the distance (due to camera and perspective distortions).
So yeah, ideally you would put a large checkerboard at the back of the room :/. There are tradeoffs in every approach.