Hi erik,
so far I have found two approaches: the Direct Linear Transform (DLT) and the Perspective-Three-Point algorithm (P3P).
With DLT you can approximate both the intrinsic and the extrinsic camera matrix, given at least six points in the picture/video and their corresponding world coordinates. That's the method used in the tutorial I referenced a few posts ago.
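Here is roughly how I understand the DLT step; a minimal sketch (not the tutorial's exact code), where `world_pts` and `image_pts` stand for the hand-picked correspondences:

```python
import numpy as np

def dlt_projection_matrix(world_pts, image_pts):
    """Estimate the 3x4 projection matrix from >= 6 correspondences.
    world_pts: (N, 3) world coordinates, image_pts: (N, 2) pixel coordinates."""
    A = []
    for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
        # Each correspondence contributes two rows of the homogeneous system A p = 0.
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    # Least-squares solution: right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A))
    P = Vt[-1].reshape(3, 4)
    return P / P[2, 3]  # fix the overall scale (assumes P[2, 3] != 0)
```

The intrinsic and extrinsic matrices can then be recovered from P, e.g. via an RQ decomposition of its left 3x3 block.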
With the P3P algorithm you approximate the extrinsic camera matrix, given the intrinsic camera matrix plus three points and their corresponding world coordinates. Since we already know our intrinsic matrix, this would probably be the way to go.
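We probably wouldn't implement P3P by hand anyway; OpenCV's solvePnP has a P3P mode (it wants exactly four correspondences: three for the solution and a fourth to pick among the up-to-four candidates). All the numbers below are made up for illustration:

```python
import cv2
import numpy as np

# Hypothetical intrinsics -- replace with our calibrated matrix.
K = np.array([[800.,   0., 320.],
              [  0., 800., 240.],
              [  0.,   0.,   1.]])
dist = np.zeros(5)  # assuming negligible lens distortion

# Exactly 4 world/image correspondences for SOLVEPNP_P3P (hypothetical values).
object_pts = np.array([[0., 0., 0.], [1., 0., 0.],
                       [0., 1., 0.], [1., 1., 0.]], dtype=np.float32)
image_pts = np.array([[320., 240.], [400., 238.],
                      [322., 160.], [402., 158.]], dtype=np.float32)

ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, dist,
                              flags=cv2.SOLVEPNP_P3P)
R, _ = cv2.Rodrigues(rvec)        # rotation vector -> 3x3 rotation matrix
extrinsic = np.hstack([R, tvec])  # the extrinsic matrix [R | t]
```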
Here are 5-minute intro videos about DLT and P3P.
And these are the extended video lectures about DLT and P3P.
Harnessing the capabilities of the IMU would be a blast! Then the only missing piece would be the camera's position, which we could measure ourselves. That could be a big time-saver, plus we wouldn't need to carry around (huge?) checkerboard patterns 😉.
I assume that for the scene depicted in my initial post, the checkerboard would need to be quite large for the calibration process to recognize the pattern?! Do you know a formula to estimate the minimum size of the checkerboard squares, given the camera's intrinsic matrix and the distance from the camera?
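I don't know an official formula, but under a simple pinhole model my back-of-envelope guess would be: a square of side S at distance Z spans about fx·S/Z pixels, so requiring at least n pixels per square gives S ≥ n·Z/fx. The numbers below are made up:

```python
# Back-of-envelope under a pinhole model (my assumption, not a rigorous rule).
fx = 800.0    # focal length in pixels, from the intrinsic matrix (hypothetical)
Z = 10.0      # distance camera -> checkerboard in metres (hypothetical)
n_min = 15.0  # pixels per square the detector needs (rule-of-thumb guess)

S_min = n_min * Z / fx
print(f"minimum square size: {S_min:.3f} m")  # -> 0.188 m with these numbers
```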
I asked ChatGPT:
Q: How to get the camera translation matrix from the camera intrinsic matrix and the camera rotation matrix, without knowing the translation vector 't'?
A: Without knowing the translation vector t, it is not possible to compute the full camera matrix.
However, if you have some information about the scene or the camera setup, you may be able to estimate the translation vector t using techniques such as visual odometry or structure from motion. Once you have an estimate for t, you can compute the full camera matrix.
Maybe it would suffice to use the images from the stereo cameras to approximate the camera's position relative to known features? But I haven't gone down that rabbit hole (yet).
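If we do go down it: for a rectified stereo pair, the standard relation Z = fx·baseline/disparity would already give the distance to a matched feature; something like this (all values hypothetical):

```python
# Depth of a matched feature from a rectified stereo pair: Z = fx * B / d.
fx = 800.0        # focal length in pixels, from the intrinsic matrix (hypothetical)
baseline = 0.075  # distance between the two stereo cameras in metres (hypothetical)
disparity = 12.0  # x_left - x_right of the matched feature in pixels (hypothetical)

Z = fx * baseline / disparity
print(f"feature distance: {Z:.2f} m")  # -> 5.00 m with these numbers
```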
Kind regards