There are a lot of good models out there; just choose one from the model zoo. Use SpatialImgDetections to get the object's x, y, z position in each frame, then take the distance between its positions in two consecutive frames and divide by the difference between those frames' timestamps, and you are set.
You will likely want to average the result over a sliding window of frames to get rid of noise.
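
Here is a rough sketch of that calculation in plain Python, just to show the arithmetic. It assumes you have already pulled one (x, y, z) position and one timestamp in seconds out of each frame's detection; as far as I know DepthAI reports spatial coordinates in millimeters, so the result would be in mm/s, but check that against your setup. The names `speed_between` and `SpeedEstimator` and the default window of 5 are made up for this example and are not part of any library.

```python
import math
from collections import deque


def speed_between(p_prev, p_curr, t_prev, t_curr):
    """Speed between two (x, y, z) samples, in position units per second."""
    dt = t_curr - t_prev
    if dt <= 0:
        raise ValueError("timestamps must be strictly increasing")
    # Euclidean distance travelled, divided by the elapsed time.
    return math.dist(p_prev, p_curr) / dt


class SpeedEstimator:
    """Hypothetical helper: moving average of per-frame speeds."""

    def __init__(self, window=5):
        # Keeps only the most recent `window` speed samples.
        self._speeds = deque(maxlen=window)
        self._prev = None  # last (position, timestamp) seen

    def update(self, position, timestamp):
        """Feed one sample; returns the smoothed speed, or None on the first sample."""
        if self._prev is not None:
            p_prev, t_prev = self._prev
            self._speeds.append(
                speed_between(p_prev, position, t_prev, timestamp)
            )
        self._prev = (position, timestamp)
        if not self._speeds:
            return None
        return sum(self._speeds) / len(self._speeds)
```

Call `update()` once per frame for the object you are following. A bigger window gives a smoother number but reacts more slowly when the object speeds up or slows down. This also assumes each call is for the same object, so with several detections in view you would need some kind of tracking to keep them apart.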