I believe both issues described below share a common root cause, but the first one simply goes undetected. Both reproduce interchangeably on Orin Nano Devkits with different sets of 4 cameras after an unspecified time of running the attached repro.py (usually tens of minutes rather than hours, though our record is 8h). The script was built on top of gen2-multiple-devices from depthai-experiments and tries to mimic our normal production interactions with OAK devices.
First variant of the issue
After some time, one of the connected cameras silently stops feeding the XLink queue. The device remains connected via USB, and nothing in the system journal or dmesg indicates any other issue or correlating event. At the moment the issue starts, there is a sudden drop in the Cpu Usage for LeonOS that DepthAI reports when running with DEPTHAI_LEVEL=debug. However, the library reports no error (full log added as an attachment). Once the issue starts, the queue never fills with data again; the only way to recover is to restart the device and start the pipeline again.
$$
[184430105112B00E00] [1.2.1] [1490.429] [system] [info] Cpu Usage - LeonOS 18.15%, LeonRT: 3.91%
[184430105112B00E00] [1.2.1] [1491.430] [system] [info] Cpu Usage - LeonOS 18.17%, LeonRT: 3.88%
[184430105112B00E00] [1.2.1] [1492.431] [system] [info] Cpu Usage - LeonOS 17.90%, LeonRT: 3.88%
[184430105112B00E00] [1.2.1] [1493.432] [system] [info] Cpu Usage - LeonOS 12.05%, LeonRT: 1.86% # <-- issue starts
[184430105112B00E00] [1.2.1] [1494.433] [system] [info] Cpu Usage - LeonOS 10.09%, LeonRT: 1.10%
...
[184430105112B00E00] [1.2.1] [1501.440] [system] [info] Cpu Usage - LeonOS 10.71%, LeonRT: 1.07%
$$
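Because no error is raised in this variant, the stall has to be noticed on the host side. Below is a minimal sketch of how that could be done, assuming the queue is polled with tryGet() and that tryGet() returns None when no message is available. The StallWatchdog class, its timeout_s parameter, and the injectable clock are hypothetical names introduced for illustration; they are not part of DepthAI or of repro.py.

```python
import time
from typing import Callable, Optional


class StallWatchdog:
    """Tracks when a queue last produced a message and flags a stall.

    Hypothetical helper: it only looks at poll results, so it works with
    any source that returns None when no message is available.
    """

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        # Start the timer at construction so a queue that never
        # produces anything is also reported as stalled.
        self._last_seen = clock()

    def observe(self, message: Optional[object]) -> bool:
        """Record one poll result; return True if the queue looks stalled."""
        now = self._clock()
        if message is not None:
            self._last_seen = now
            return False
        return (now - self._last_seen) > self.timeout_s
```

In a polling loop this would be called once per camera as something like watchdog.observe(queue.tryGet()), with the device restarted when it returns True. The timeout value is a judgment call and has to be longer than the longest expected gap between frames.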
Second variant of the issue
Sometimes the issue manifests as a RuntimeError when calling tryGet(), which can be handled but happens too often for that to be a feasible approach:
$$
RuntimeError: Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'rgb' (X_LINK_ERROR)'
$$
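For reference, the kind of handling meant here can be sketched as follows. This is only an illustration under assumptions: safe_try_get and on_link_error are hypothetical names, the try_get argument stands in for a queue's tryGet method, and the check for the X_LINK_ERROR substring is based solely on the message text shown above.

```python
from typing import Callable, Optional

# Substring taken from the error message quoted above.
X_LINK_MARKER = "X_LINK_ERROR"


def safe_try_get(try_get: Callable[[], Optional[object]],
                 on_link_error: Callable[[RuntimeError], None]) -> Optional[object]:
    """Poll once; route X_LINK_ERROR failures to a recovery callback."""
    try:
        return try_get()
    except RuntimeError as exc:
        if X_LINK_MARKER not in str(exc):
            raise  # unrelated runtime errors should still surface
        # Hypothetical hook: the caller decides how to recover,
        # for example by restarting the device and the pipeline.
        on_link_error(exc)
        return None
```

This keeps the polling loop alive, but every caught error still costs a device restart, which is why it does not work as a long-term workaround at the failure rate we see.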
We ran into these issues after switching from NVIDIA Xavier NX Devkits to Orin Nano Devkits as hosts for the OAK cameras. We could reproduce the issue on two different Orin Nano units. All runtime environments were identical, provisioned with the same Ansible playbooks. We've been able to run our system on Xavier NXs for days without any issues, compared to just minutes or hours on Orin Nanos, so we can rule out cabling issues.
Software/hardware
Used cameras: a mix of OAK-D Pro and OAK-D Pro W (USB, 8 cameras total, using 4 at a time)
Working hosts: 2 x Jetson Xavier NX 8GB Devkit (tested Jetpack 5.1.1 and 5.1.2)
Failing hosts: 2 x Jetson Orin Nano 8GB Devkit (tested Jetpack 5.1.1 and 5.1.2)
Python: 3.11
DepthAI: 2.23.0.0
For some reason I'm unable to attach the logs and reproduction script directly to the post: none of .py, .txt, .zip, or .tar.gz was accepted.
I've uploaded the archive to gdrive:
Please let me know if there's another preferred way of sharing files.