The bounding box accuracy depends on the resolution of the image passed into the NN node. Object detection runs on the video feed from a single camera, which has known intrinsic and extrinsic parameters and distortion coefficients, so it does not involve stereo vision. If you convert the bbox XYZ to the world coordinate system, you should get a fairly accurate result (again, the more pixels in the image, the more precise the result). Wide-angle cameras introduce more distortion, though, so the error would probably be greater than on low-FOV cameras.
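To make the resolution and FOV point more concrete, here is a minimal sketch of the conversion using a plain pinhole model. It assumes the image is already undistorted and that you have a depth value for the bbox center. All function names and the example numbers are hypothetical and chosen only for illustration; they are not part of any camera API.

```python
import math

def focal_length_px(image_width, hfov_deg):
    # Focal length in pixels for an ideal pinhole camera (no distortion assumed).
    return (image_width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    # Back-project a pixel (e.g. the bbox center) at a known depth
    # into the camera coordinate system.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def camera_to_world(p_cam, R, t):
    # Apply the extrinsics: p_world = R * p_cam + t,
    # with R a 3x3 rotation (list of rows) and t a translation.
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i]
        for i in range(3)
    )

def lateral_error_per_pixel(depth, fx):
    # Sideways error, in depth units, caused by a one-pixel
    # error in the bbox position at the given depth.
    return depth / fx

if __name__ == "__main__":
    # Hypothetical setup: bbox center at the image center, object 2 m away.
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    fx = focal_length_px(640, 90)
    p_cam = pixel_to_camera(320, 240, 2.0, fx, fx, 320, 240)
    print(camera_to_world(p_cam, identity, (0.0, 0.0, 0.0)))

    # Compare per-pixel error for different widths and FOVs.
    for width, hfov in [(640, 70), (1280, 70), (640, 120)]:
        err = lateral_error_per_pixel(2.0, focal_length_px(width, hfov))
        print(width, hfov, round(err * 1000, 2), "mm per pixel")
```

In this model, doubling the image width doubles the focal length in pixels and halves the per-pixel error, while a wider FOV at the same width spreads the pixels over a larger angle and increases it. Lens distortion on wide-angle cameras comes on top of that and is not modeled here.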
I hope I answered your question,