Hi mzbt
AFAIK, when the detection part of the node infers a bounding box, a smaller portion of that bbox rectangle (its size is set with setBoundingBoxScaleFactor()) is fed into the spatial part, which is essentially a spatial calculator node. The depth values inside this smaller region (the depth map is aligned with the RGB image) are averaged to get the Z value of the object. X and Y are then computed from Z and the FOV.
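To make that more concrete, here is a rough host-side sketch in plain Python of how I understand the calculation. It is not the actual on-device code. The function names (shrink_bbox, spatial_coords) are made up for illustration, and it assumes a simple pinhole model, a bbox shrunk around its center, zero depth meaning "invalid", and Y pointing up.

```python
import math

def shrink_bbox(xmin, ymin, xmax, ymax, scale):
    """Shrink a pixel bbox around its center (assumed behavior); scale in (0, 1]."""
    cx, cy = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
    half_w = (xmax - xmin) * scale / 2.0
    half_h = (ymax - ymin) * scale / 2.0
    return (int(round(cx - half_w)), int(round(cy - half_h)),
            int(round(cx + half_w)), int(round(cy + half_h)))

def spatial_coords(depth_mm, bbox, scale, hfov_deg):
    """Return (X, Y, Z) in mm for a bbox, or None if the ROI has no valid depth.

    depth_mm: 2D list of depth values in mm, aligned with the RGB frame.
    bbox:     (xmin, ymin, xmax, ymax) in pixels.
    """
    height, width = len(depth_mm), len(depth_mm[0])
    x0, y0, x1, y1 = shrink_bbox(*bbox, scale)
    # Keep the ROI inside the frame.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)

    # Average Z over the ROI, skipping invalid (zero) depth pixels.
    values = [depth_mm[y][x] for y in range(y0, y1) for x in range(x0, x1)
              if depth_mm[y][x] > 0]
    if not values:
        return None
    z = sum(values) / len(values)

    # Focal length in pixels from the horizontal FOV (pinhole assumption,
    # square pixels, so the same value is used for the vertical axis).
    focal_px = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

    # X and Y follow from Z and the ROI center's offset from the image center.
    roi_cx, roi_cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    x = z * (roi_cx - width / 2.0) / focal_px
    y = -z * (roi_cy - height / 2.0) / focal_px  # image rows grow downward
    return x, y, z
```

For example, spatial_coords(depth, (60, 40, 80, 60), 0.5, 90.0) would average the depth over the central half of that box and convert the ROI center's pixel offset into millimetres using the FOV.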
Thanks,
Jaka