Accurate Translational Velocity

KarrsonHeumann

Hey! I'm looking for a highly accurate solution to get the translational velocity
from a standard OAK-D sensor. I know IMUs tend to only be so accurate
whereas visual data can help. The thing about my use case, is that I need it to
not try to relocalize, but to rather give it's best guess from the last timestamp
to the current timestamp. I'm not sure what's the best way to go about this.

jakaskerl

Hi KarrsonHeumann
To use visual data for tracking you would need landmarks (aruco, charuco markers) to achieve anything close to high accuracy.
Perhaps optical flow methods could help you in inferring velocity --> coupling it with IMU.
It would help if you were to lower the report frequency of the IMU and apply statistical filtering to each report batch, but i'm not sure about the accuracy you will get.
This becomes a very difficult task if you need high accuracy/precision.

Thoughts?
Jaka

KarrsonHeumann

Yes, everything you said makes a lot of sense I think. I will try using IMU+Optical Flow, and at a lower frequency.
Funny enough, I was also thinking this would probably be the way to go, so it's great to get a confirmation on this.
Thank you very much, I will try this idea 🙂

KarrsonHeumann

Hey! Can you please give my code a quick review?
It definitely has something going for it, but I feel like it could probably be better. 🙂

jakaskerl

Hi KarrsonHeumann
I'd try to up the report frequency and apply some sort of statistical filtering to the IMU (median most likely) as the values can be quite noisy at times. You might also need to periodically clear the data from IMU because integrating the values (with error) will cause a drift over time.
I'm not too knowledgeable in this topic but I hope it helps.

Thanks,
Jaka

KarrsonHeumann

jakaskerl Thank you very much! And thanks for reminding me about statistical filtering. I completely forgot to do that. Well, I'll let you get back to answering other people's questions, but just know that you have been extremely helpful here and I really value it. 🙂

Simone

KarrsonHeumann Hi,

maybe I have the same need as you, can you upload an example of how you did it?

KarrsonHeumann

The speed of the movement is usually dependent on the use case.
Usually things, even people, move at a fairly constant speed given a specific use case.
So as long as I can get the direction of movement,
I can multiply it by the constant speed to get the velocity.
The direction can be determined using OpenCV's SolvePnP function,
And the 3D corners of a camera model
(In my case I'm literally getting this from a game engine by fitting a plane to the camera)
where once the 3D corners are projected into 2D space, they are the final 2D corners of the view.
These 3D corners of the image frame can be considered the world points and
are constant for a given camera model, near the origin.
The function also requires an intrinsic matrix and distortion coefficients.
Ultimately, I only need a single monocular camera and IMU for this so I just
use the left or right one and either take the data from CalibrationReader or
I could easily estimate it since the direction vector doesn't require a perfect calibration.
The intrinsic matrix is:
fx 0 cx
0 fy cy
0 0 1
where fx is the width and fy is the height
and cx is the center x and cy is the center y. (dimensions divided by two)
The distortion coeffecients can probably just be five zeros in one row.
Given a homography, you can warp the 2D image corners mathematically and put that into SolvePnP.
The correct homography can be estimated using the OpenCV image stitcher.
It's the R matrix from the camera parameters in stitcher.cameras().
If you do it that way you will need C++ for that part as OpenCV doesn't expose that functionality
to the bindings unfortunately.
SolvePnP can be used with DLT as the method (Direct Least Squares)
Which will return only a projected translation.
You can then remove a certain amount of translation based on the detected rotation angles
because the rotation affects this translation and it really shouldn't.
That is a function which you actually will need to calibrate and fit.
Then, you can simply normalize the translation vector and multiply by the constant speed.
I know this isn't code, but I have been working for a long time on this
and my code is proprietary.
However, combined with some smooth linear interpolation (known as Slerp),
You can accumulate a very nice trajectory.
Combine it with pose graph optimization, and you have a good pose for the sensor.
That part is explained in a MATLAB video on YouTube and it's really just soft body physics.

Sorry for the complicated explanation.
I promise it's not rocket science once you get into it xD

I do however have a slightly suboptimal public mapping software I wrote last year
which you can totally use for free, but it was highly experimental.
Maybe you can contribute any improvements?
karrson/OdoToy

KarrsonHeumann

OK… I made the realization that my previous post was a SERIOUS SIMPLIFICATION and was honestly TERRIBLE…

Here is what I added to my engineering notebook just now to remind me (This is actually correct):

Matrix Blunder

Karrson: I just realized...

Matrices are surprisingly like vectors...

You can't take an NxN matrix, put it in an N+1xN+1 matrix and expect it to magically work...

It will not modify the N+1 dimension if that part is identity...

Data doesn't just magically appear but somehow after learning homogenous coords I was under the impression that it can xD

Emily: kinda the other way around - vectors are just 1xN matrices

Karrson: I know, but they are similar in that if you multiply a square matrix by a square matrix the extra dimension stays the same if it was originally identity

so dimenionality-wise they are similar

like multiplying 3, 5, 0 by 6, 2, 0 for example

the 0 part stays the same, same if its a 1

ofc in vectors its element-wise so that parts different

Karrson: Oh I think I get what you mean - yeah, you don't have to think of vectors differently at all if you know matrices

(other than extra work you might have to do)

Karrson: I guess what I'm saying is that a combination of dimensionality, points, vectors, homogenous coordinates, affine, projective, are all like apples and oranges where a matrix gets you from one frame to another. Apples don't multiply with oranges. They are different units.

and really, while homogenous coordinates may be a hack that allows you to multiply, it's not that hacky

so a 4x4 homogenous matrix really is a 3x3 in disguise, just that it has to be 4x4 to actually work.

So yes, I believe I have proven quite well that if you want a 3D pose that actually makes use of the world’s Z coordinates - which a 3x3 projection matrix or homography does NOT do btw because it’s really just homogenous 2D - then you will need a set of 3D affine points, so you really do NOT want to use projections at all for this part. That was my mistake. Actually, this WILL require a real calibration (intrinsics and extrinsics) rather than just using the image size to estimate that. Ofc, to go from raw stereo projected images to 3D affine pointclouds, you will need to work with projections, but after that the entire SLAM pipeline should be affine.

To quote stackoverflow (https://stackoverflow.com/a/56228667),

“First of all, 3 points are too little to recover affine transformation -- you need 4 points. For N-dimensional space there is a simple rule: to unambiguously recover affine transformation you should know images of N+1 points that form a simplex --- triangle for 2D, pyramid for 3D, etc. With 3 points you could only retrieve 2D affine transformation. A good explanation of why this is the case you may find in "Beginner's guide to mapping simplexes affinely".

Regarding some retrieval algorithm. I'm afraid, I don't know Matlab to provide you with appropriate code, but I worked with Python a little bit, maybe this code can help (sorry for bad codestyle -- I'm mathematician, not programmer)

import numpy as np

# input data

ins = [[1, 1, 2], [2, 3, 0], [3, 2, -2], [-2, 2, 3]] # <- points

out = [[0, 2, 1], [1, 2, 2], [-2, -1, 6], [4, 1, -3]] # <- mapped to

# calculations

l = len(ins)

B = np.vstack([np.transpose(ins), np.ones(l)])

D = 1.0 / np.linalg.det(B)

entry = lambda r,d: np.linalg.det(np.delete(np.vstack([r, B]), (d+1), axis=0))

M = [[(-1)**i * D * entry(R, i) for i in range(l)] for R in np.transpose(out)]

A, t = np.hsplit(np.array(M), [l-1])

t = np.transpose(t)[0]

# output

print("Affine transformation matrix:\n", A)

print("Affine transformation translation vector:\n", t)

# unittests

print("TESTING:")

for p, P in zip(np.array(ins), np.array(out)):

image_p = np.dot(A, p) + t

result = "[OK]" if np.allclose(image_p, P) else "[ERROR]"

print(p, " mapped to: ", image_p, " ; expected: ", P, result)

This code demonstrates how to recover affine transformation as matrix and vector and tests that initial points are mapped to where they should. It is based on equation presented in "Beginner's guide to mapping simplexes affinely", matrix recovery is described in section "Recovery of canonical notation". The same authors published "Workbook on mapping simplexes affinely" that contains many practical examples of this kind.“

Once you have a correct and verifiable affine transformation matrix, you should be able to decompose it into the correct rotation and translation of the camera quite easily.

KarrsonHeumann

I also just realized that you might not need to use homogenous coordinates at all for the 3D affine matrix, so it will look a lot like a 3x3 homography which is confusing.

P̶l̶e̶a̶s̶e̶ ̶a̶l̶s̶o̶ ̶k̶e̶e̶p̶ ̶i̶n̶ ̶m̶i̶n̶d̶ ̶t̶h̶a̶t̶ ̶t̶h̶i̶s̶ ̶m̶i̶g̶h̶t̶ ̶a̶s̶s̶u̶m̶e̶ ̶w̶o̶r̶l̶d̶ ̶c̶o̶o̶r̶d̶i̶n̶a̶t̶e̶s̶,̶ ̶b̶u̶t̶ ̶s̶t̶e̶r̶e̶o̶ ̶o̶n̶l̶y̶ ̶g̶i̶v̶e̶s̶ ̶y̶o̶u̶ ̶l̶o̶c̶a̶l̶ ̶3̶D̶.̶I̶'̶m̶ ̶n̶o̶t̶ ̶s̶u̶r̶e̶ ̶w̶h̶a̶t̶ ̶t̶h̶e̶ ̶b̶e̶s̶t̶ ̶w̶a̶y̶ ̶i̶s̶ ̶t̶o̶ ̶m̶a̶k̶e̶ ̶t̶h̶a̶t̶ ̶w̶o̶r̶k̶ ̶a̶t̶ ̶t̶h̶e̶ ̶m̶o̶m̶e̶n̶t̶.̶ ̶I̶ ̶d̶o̶ ̶k̶n̶o̶w̶,̶ ̶h̶o̶w̶e̶v̶e̶r̶,̶ ̶t̶h̶a̶t̶ ̶i̶f̶ ̶y̶o̶u̶ ̶u̶s̶e̶ ̶t̶h̶e̶ ̶I̶M̶U̶ ̶y̶o̶u̶ ̶c̶a̶n̶ ̶g̶e̶t̶ ̶a̶ ̶h̶i̶g̶h̶l̶y̶ ̶a̶c̶c̶u̶r̶a̶t̶e̶ ̶r̶o̶t̶a̶t̶i̶o̶n̶,̶ ̶r̶o̶t̶a̶t̶e̶ ̶t̶h̶e̶ ̶3̶D̶ ̶p̶o̶i̶n̶t̶s̶ ̶b̶y̶ ̶t̶h̶a̶t̶,̶ ̶p̶e̶r̶f̶o̶r̶m̶ ̶t̶h̶e̶ ̶c̶a̶l̶c̶u̶l̶a̶t̶i̶o̶n̶,̶ ̶a̶n̶d̶ ̶t̶h̶e̶n̶ ̶r̶e̶i̶n̶c̶o̶o̶r̶p̶e̶r̶a̶t̶e̶ ̶t̶h̶e̶ ̶I̶M̶U̶ ̶r̶o̶t̶a̶t̶i̶o̶n̶ ̶i̶n̶t̶o̶ ̶t̶h̶e̶ ̶a̶f̶f̶i̶n̶e̶ ̶m̶a̶t̶r̶i̶x̶.̶ ̶T̶h̶e̶n̶ ̶y̶o̶u̶ ̶j̶u̶s̶t̶ ̶h̶a̶v̶e̶ ̶t̶h̶e̶ ̶i̶s̶s̶u̶e̶ ̶o̶f̶ ̶s̶y̶n̶c̶i̶n̶g̶ ̶t̶h̶e̶ ̶I̶M̶U̶ ̶w̶i̶t̶h̶ ̶t̶h̶e̶ ̶c̶a̶m̶e̶r̶a̶s̶.̶

Actually… I just tested it and it definitely works with local 3D points,
so that's nothing to worry about.

KarrsonHeumann

VO -- PLEASE IGNORE THE COMMENTS BELOW… MOST OF WHAT I SAID BEFORE WAS WRONG 🤡
Below is the smart answer 🧠 🤯

But also keep in mind that I forgot the 4th column in the full affine matrix… Oops xD

Firstly, if you are thinking about solving the visual odometry problem with an emphasis on the “visual” part, you are probably doing it wrong. You need to be working in 3D, not 2D. This is the difference between something that looks 3D, but is really 2D (Homography projection) and an actual 3D pose that will be able to transform the Z coordinate, thus allowing full rotations to work properly (Affine). Trust me, even if you understand matrices on an intuitive level, I will now need to provide a little more intuition for you just in case.

An identity matrix… Notice the reasoning why this doesn’t modify the 3D vector.

Now for a homography… Notice how keeping the last row as identity ONLY allows the math to work,

BUT DOES NOT allow changing the Z AT ALL…

That’s highly problematic and since matrices are awesome and this isn’t affine 2D,

it DOES allow for projective warping, but that still doesn’t modify the Z AT ALL…

This is very useful for finding what would be exact point matches in 2D if you for instance use the

cv::Stitcher class as it’s highly accurate and in C++ you can do

Homography = Stitcher.cameras()[1].R * Stitcher.cameras()[0].R.inv()

but again… that is only useful BEFORE you convert the points to 3D (Use it for triangulation)

BUT… With full affine we can do almost ANY transformation on the 3D Vector…

IN FULL 3D (It can modify the Z)

Of course, this specific transformation is probably not the one you want,

And you will need to make sure to re-rigidify each transformation…

For reference, if you want to do this RIGHT, and get a fully decoupled rotation and translation, look here…

But you will NEED 3D points. Hope it helps! (You will also need the 4th column… 🤦‍♀️ )

https://web.archive.org/web/20240206190824/https://stackoverflow.com/questions/56166088/how-to-find-affine-transformation-matrix-between-two-sets-of-3d-points/56228667#56228667