I believe that both issues I'm about to describe share a common root cause, but the first one simply passes undetected. Both reproduce interchangeably on Orin Nano Devkits with different sets of 4 cameras after an unspecified time (usually tens of minutes, not hours, but our record is 8h) of running the attached repro.py. The script is built on top of gen2-multiple-devices from depthai-experiments and tries to mimic our normal production interactions with OAK devices.

First variant of the issue

After some time, one of the connected cameras silently stops feeding the XLink queue. The device remains connected via USB, and nothing in the system journal or dmesg indicates any other issue or correlating event. At the moment the issue starts, there's a sudden drop in the LeonOS CPU usage reported by DepthAI when running with DEPTHAI_LEVEL=debug. However, the library reports no error (full log added as an attachment). Once the issue starts, the queue never receives data again; the only way to recover is to restart the device and start the pipeline again.

$$
[184430105112B00E00] [1.2.1] [1490.429] [system] [info] Cpu Usage - LeonOS 18.15%, LeonRT: 3.91%
[184430105112B00E00] [1.2.1] [1491.430] [system] [info] Cpu Usage - LeonOS 18.17%, LeonRT: 3.88%
[184430105112B00E00] [1.2.1] [1492.431] [system] [info] Cpu Usage - LeonOS 17.90%, LeonRT: 3.88%
[184430105112B00E00] [1.2.1] [1493.432] [system] [info] Cpu Usage - LeonOS 12.05%, LeonRT: 1.86% # <-- issue starts
[184430105112B00E00] [1.2.1] [1494.433] [system] [info] Cpu Usage - LeonOS 10.09%, LeonRT: 1.10%
...
[184430105112B00E00] [1.2.1] [1501.440] [system] [info] Cpu Usage - LeonOS 10.71%, LeonRT: 1.07%
$$
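To locate the moment of failure in a long debug log, the drop can be searched for automatically. The following is only a sketch: the regular expression is derived solely from the lines quoted above, and the function name `find_cpu_drops` and the `ratio` threshold are hypothetical choices for illustration, not part of DepthAI.

```python
import re

# Pattern derived from the "Cpu Usage" lines quoted above (assumed log format).
CPU_LINE = re.compile(
    r"\[(?P<mxid>[0-9A-F]+)\] \[[^\]]*\] \[(?P<ts>\d+\.\d+)\] \[system\] \[info\] "
    r"Cpu Usage - LeonOS (?P<os>\d+\.\d+)%, LeonRT: (?P<rt>\d+\.\d+)%"
)


def find_cpu_drops(lines, ratio=0.75):
    """Return (device id, timestamp, previous %, current %) for each sudden drop.

    `ratio` is a hypothetical threshold: a sample counts as a drop when its
    LeonOS usage is below `ratio` times the previous sample of the same device.
    """
    last = {}   # device id -> previous LeonOS percentage
    drops = []
    for line in lines:
        m = CPU_LINE.search(line)
        if not m:
            continue  # not a Cpu Usage line
        mxid, ts, usage = m["mxid"], float(m["ts"]), float(m["os"])
        prev = last.get(mxid)
        if prev is not None and usage < prev * ratio:
            drops.append((mxid, ts, prev, usage))
        last[mxid] = usage
    return drops
```

Calling `find_cpu_drops(open("depthai.log"))` (the file name is an assumption) would list one entry per device and drop, which can then be correlated with host-side timestamps.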

Second variant of the issue

Sometimes the issue manifests as a RuntimeError when calling tryGet(). The exception can be caught and handled, but it happens too often for that to be a feasible workaround:

$$
RuntimeError: Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'rgb' (X_LINK_ERROR)'
$$
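For reference, detecting both variants on the host side could look roughly like the sketch below. It is a workaround outline, not a fix, and it rests on assumptions: `StreamWatchdog`, its status strings, and the 5-second `stall_timeout_s` are hypothetical, and `poll` stands in for a non-blocking read such as an output queue's `tryGet`, which returns None when no message is available.

```python
import time


class StreamWatchdog:
    """Flags a stream as dead after a read error or a long gap between frames."""

    def __init__(self, poll, stall_timeout_s=5.0, clock=time.monotonic):
        self.poll = poll                        # e.g. a DepthAI queue's tryGet
        self.stall_timeout_s = stall_timeout_s  # hypothetical threshold
        self.clock = clock                      # injectable for testing
        self.last_frame_at = clock()

    def check(self):
        """Poll once and return 'ok', 'idle', 'stalled' or 'error'."""
        try:
            frame = self.poll()
        except RuntimeError:
            return "error"    # second variant: the read raised (X_LINK_ERROR)
        now = self.clock()
        if frame is not None:
            self.last_frame_at = now
            return "ok"
        if now - self.last_frame_at > self.stall_timeout_s:
            return "stalled"  # first variant: the queue silently stopped
        return "idle"         # no frame yet, but still within the timeout
```

On "stalled" or "error" the caller would still have to close the device and rebuild the pipeline, which is exactly the restart cost described above.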

We ran into these issues after switching from NVIDIA Xavier NX Devkits to Orin Nano Devkits as hosts for OAK cameras. We could reproduce the issue on two different Orin Nano units. All runtime environments were identical, provisioned with the same Ansible playbooks. We've been able to run our system on Xavier NXs for days without any issues, compared to just minutes or hours on Orin Nanos, so we can rule out cabling issues.

Software/hardware

Used cameras: a mix of OAK-D Pro and OAK-D Pro W (USB, 8 cameras total, using 4 at a time)
Working hosts: 2 x Jetson Xavier NX 8GB Devkit (tested Jetpack 5.1.1 and 5.1.2)
Failing hosts: 2 x Jetson Orin Nano 8GB Devkit (tested Jetpack 5.1.1 and 5.1.2)
Python: 3.11
DepthAI: 2.23.0.0

For some reason I'm unable to attach the logs and reproduction script directly to the post because none of .py, .txt, .zip, or .tar.gz was accepted:

I've uploaded the archive to gdrive:

Please let me know if there's another preferred way of sharing files.

    12 days later

    jakaskerl

    Hi, it's been almost two weeks since the last message in this thread. Have you managed to reproduce our issue? Please let me know if you need any more information or would like to test some hypotheses.

    15 days later

    Hi rf_unitem
    I tried running 3 cameras on the Orin we have at our office. I ran it over the weekend and had no problems. It's the dev kit version that has a 19V power supply; perhaps there is a power issue on your Orins...

    Right now, the device is used by another testing rig, so I can't run further tests, but I will probably plug in 4 cameras over the weekend to see if it makes a difference.

    Thanks,
    Jaka

      jakaskerl We could also observe the issue when running only 3 cameras, but it took much longer to manifest.

      We used the NVIDIA-supplied 19V power adapters as well as a bench PSU and even a Bosch battery to power the Jetsons. To eliminate a power issue on the board itself, we tried using OAK Y Adapters to externally power just the OAK devices, but it didn't help.

      Orins come in many flavors, I'd like to confirm that you're using the same one as we do -- Orin Nano Developer Kit.

        a month later

        jakaskerl

        I haven't heard from you for a while. Meanwhile, we managed to buy two more Orin Nano Devkits and reproduce the issue with the script attached to the first post.

          Hi @rf_unitem
          Sorry for not replying. I had to test the Jetson quickly since it was needed for another task, and I couldn't reproduce the issue.
          The Jetson is now free and ready for testing. If you managed to reproduce the issue on two other devices, perhaps I'm doing something wrong. I will reflash the Jetson with a fresh OS and retry, then forward this to the FW team so we can fix it.

          Thanks,
          Jaka

          Hi @rf_unitem
          Letting you know that I managed to reproduce the issue on a fresh install of the stock NVIDIA OS (it previously had some custom yahboom sw).

          EDIT: Will be doing some tests with different device configurations to see if I can pinpoint what the issue is so our FW team has an easier task patching the issue. Thanks for your patience.

          Thanks,
          Jaka

            5 days later

            Hi rf_unitem
            I have tried running the script with 4, 3 and 2 devices at once. I also ran it with two devices at a time to check for multi-threading issues. It used to crash when more than 2 devices were connected.
            I suspected a power issue so I added a Y splitter to all four devices and connected them to a separate charger. Ran it for 10h and experienced no issues.

            Thanks,
            Jaka

              jakaskerl

              I've already tried using Y splitters, unfortunately without success.

              rf_unitem In order to eliminate power issue on the board itself, we tried using OAK Y Adapters to externally power just the OAK devices but it didn't help.

              How does this:

              jakaskerl I suspected a power issue so I added a Y splitter to all four devices and connected them to a separate charger. Ran it for 10h and experienced no issues.

              seem to solve the problem, while

              jakaskerl Letting you know I managed to reproduce the issue on fresh install of stock NVIDIA OS (previously had some custom yahboom sw).

              the issue didn't manifest on the other OS on the same board? Is this purely a configuration problem then, or was it a power issue from the very beginning?

              I've also already tried disabling USB3 power state support in the kernel to rule that out.

                Hi rf_unitem
                I agree this is strange, and I do not know the answer. I have a feeling it had something to do with the drivers on the weird stock OS being more "permissive".
                I'm currently running the scripts again, watching system logs.

                Thanks,
                Jaka