rf_unitem

  • Mar 1, 2024
  • jakaskerl

    I've already tried using Y splitters, unfortunately without success.

    rf_unitem To rule out a power issue on the board itself, we tried using OAK Y Adapters to externally power just the OAK devices, but it didn't help.

    How does this:

    jakaskerl I suspected a power issue so I added a Y splitter to all four devices and connected them to a separate charger. Ran it for 10h and experienced no issues.

    seem to solve the problem, while

    jakaskerl Letting you know I managed to reproduce the issue on fresh install of stock NVIDIA OS (previously had some custom yahboom sw).

    the issue didn't manifest on the other OS on the same board? Is this purely a configuration problem then, or was it a power issue from the very beginning?

    I've also already tried disabling USB3 power states support in the kernel to rule that out.
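For anyone who wants to double-check the USB power-management state on their own host, here is a minimal sketch that reads the relevant sysfs entries. It assumes that "USB3 power states" refers to the kernel's USB runtime power management (the `usbcore.autosuspend` parameter and the per-device `power/control` file), and that OAK devices enumerate with USB vendor ID `03e7`. The function names are hypothetical and this is not the exact check used in this thread.

```python
from pathlib import Path
from typing import Optional

# Assumption: USB vendor ID that OAK (Movidius) devices enumerate with.
MOVIDIUS_VENDOR_ID = "03e7"


def read_text(path: Path) -> Optional[str]:
    """Return the stripped file content, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def usb_power_report(sysfs_root: Path = Path("/sys")) -> dict:
    """Collect USB runtime power-management settings from sysfs."""
    report = {
        # Global default; -1 means autosuspend is disabled.
        "usbcore_autosuspend": read_text(
            sysfs_root / "module/usbcore/parameters/autosuspend"
        ),
        "devices": {},
    }
    for dev in sorted((sysfs_root / "bus/usb/devices").glob("*")):
        if read_text(dev / "idVendor") != MOVIDIUS_VENDOR_ID:
            continue  # skip hubs and unrelated devices
        report["devices"][dev.name] = {
            # "on" keeps the device powered, "auto" allows runtime suspend.
            "control": read_text(dev / "power/control"),
            # Negotiated link speed in Mbit/s (5000 for USB3 SuperSpeed).
            "speed": read_text(dev / "speed"),
        }
    return report
```

Calling `usb_power_report()` on the Jetson while the cameras are connected should list each OAK device with its `control` and `speed` values, which makes it easy to compare a working Xavier NX against a failing Orin Nano.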

    • jakaskerl

      I haven't heard from you for a while. Meanwhile, we managed to buy two more Orin Nano Devkits and reproduce the issue with the script attached to the first post.

      • jakaskerl We could also observe the issue when running only 3 cameras, but it took much longer to manifest.

        We used the NVIDIA-supplied 19V power adapters as well as a bench PSU and even a Bosch battery to power the Jetsons. To rule out a power issue on the board itself, we tried using OAK Y Adapters to externally power just the OAK devices, but it didn't help.

        Orins come in many flavors, so I'd like to confirm that you're using the same one as we are: the Orin Nano Developer Kit.

        • jakaskerl

          Hi, it's been almost two weeks since the last message in this thread. Have you managed to reproduce our issue? Please let me know if you need any more information or would like to test some hypotheses.

        • I believe that both issues I'm about to describe share a common root cause, but the first one simply goes undetected. Both reproduce interchangeably on Orin Nano Devkits with different sets of 4 cameras after an unpredictable amount of time (usually tens of minutes rather than hours, though our record is 8h) of running the attached repro.py. The script was built on top of gen2-multiple-devices from depthai-experiments and tries to mimic our normal production interactions with OAK devices.

          First variant of the issue

          After some time, one of the connected cameras silently stops feeding the XLink queue. The device remains connected via USB, and neither the system journal nor dmesg shows any related errors or correlating events. At the moment the issue starts, there's a sudden drop in the LeonOS CPU usage reported by DepthAI when running with DEPTHAI_LEVEL=debug. However, the library reports no error (full log added as an attachment). Once the issue starts, the queue never receives data again; the only way to recover is to restart the device and start the pipeline again.

          ```
          [184430105112B00E00] [1.2.1] [1490.429] [system] [info] Cpu Usage - LeonOS 18.15%, LeonRT: 3.91%
          [184430105112B00E00] [1.2.1] [1491.430] [system] [info] Cpu Usage - LeonOS 18.17%, LeonRT: 3.88%
          [184430105112B00E00] [1.2.1] [1492.431] [system] [info] Cpu Usage - LeonOS 17.90%, LeonRT: 3.88%
          [184430105112B00E00] [1.2.1] [1493.432] [system] [info] Cpu Usage - LeonOS 12.05%, LeonRT: 1.86% # <-- issue starts
          [184430105112B00E00] [1.2.1] [1494.433] [system] [info] Cpu Usage - LeonOS 10.09%, LeonRT: 1.10%
          ...
          [184430105112B00E00] [1.2.1] [1501.440] [system] [info] Cpu Usage - LeonOS 10.71%, LeonRT: 1.07%
          ```
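Since the CPU usage drop is the only visible symptom of the first variant, it may help to locate it automatically in long debug logs. Below is a minimal sketch that scans lines in the format shown in the excerpt above and reports the first sample where LeonOS usage falls sharply compared to the previous one. The function name and the 0.75 ratio are my own assumptions chosen to match this excerpt, not anything provided by DepthAI.

```python
import re

# Matches the "Cpu Usage" lines in the format of the log excerpt above.
CPU_LINE = re.compile(
    r"\[(?P<mxid>\w+)\] \[[\d.]+\] \[(?P<t>[\d.]+)\] \[system\] \[info\] "
    r"Cpu Usage - LeonOS (?P<os>[\d.]+)%, LeonRT: (?P<rt>[\d.]+)%"
)


def find_cpu_drop(lines, ratio=0.75):
    """Return (mxid, timestamp, previous, current) for the first sample whose
    LeonOS usage is below `ratio` times the previous sample of the same
    device, or None if no such drop is found."""
    last = {}  # device id -> previous LeonOS percentage
    for line in lines:
        m = CPU_LINE.search(line)
        if not m:
            continue  # not a CPU usage line
        mxid, t, cur = m["mxid"], float(m["t"]), float(m["os"])
        prev = last.get(mxid)
        if prev is not None and cur < ratio * prev:
            return mxid, t, prev, cur
        last[mxid] = cur
    return None
```

Run against the excerpt above, this flags the sample at timestamp 1493.432, where LeonOS usage goes from 17.90% to 12.05%. The ratio would likely need tuning for pipelines with a more variable load.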

          Second variant of the issue

          Sometimes the issue manifests as a RuntimeError when calling tryGet(). It can be caught and handled, but it happens too often for that to be a feasible workaround:

          ```
          RuntimeError: Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'rgb' (X_LINK_ERROR)'
          ```
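To make both variants visible on the host side, a small wrapper around the queue can classify every poll. This is only a sketch under assumptions: `QueueMonitor` is a hypothetical helper (it is not part of DepthAI or of repro.py), it relies only on `tryGet()` returning `None` when no message is available, and the 5 second stall threshold is arbitrary.

```python
import time


class QueueMonitor:
    """Classifies each poll of one output queue as
    'frame', 'idle', 'stalled' or 'error'."""

    def __init__(self, queue, stall_after_s=5.0, clock=time.monotonic):
        self.queue = queue  # any object with a tryGet() method
        self.stall_after_s = stall_after_s
        self.clock = clock  # injectable for testing
        self.last_frame_at = clock()

    def poll(self):
        """Return a (status, frame) tuple for a single tryGet() call."""
        try:
            frame = self.queue.tryGet()
        except RuntimeError:
            # Second variant: the X_LINK_ERROR surfaces as a RuntimeError.
            return "error", None
        now = self.clock()
        if frame is not None:
            self.last_frame_at = now
            return "frame", frame
        if now - self.last_frame_at > self.stall_after_s:
            # First variant: no exception, the queue just stays empty.
            return "stalled", None
        return "idle", None
```

With one monitor per camera, the main loop can treat 'stalled' and 'error' the same way, by tearing down that device and starting its pipeline again. That only works around the symptom, though; it does not explain why the Orin Nano hosts trigger it.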

          We ran into these issues after switching from NVIDIA Xavier NX Devkits to Orin Nano Devkits as hosts for OAK cameras. We could reproduce the issue on two different Orin Nano units. All runtime environments were identical, provisioned with the same Ansible playbooks. We've been able to run our system on Xavier NXs for days without any issues, compared to just minutes or hours on Orin Nanos, so we can rule out cabling issues.

          Software/hardware

          Used cameras: a mix of OAK-D Pro and OAK-D Pro W (USB, 8 cameras total, using 4 at a time)
          Working hosts: 2 x Jetson Xavier NX 8GB Devkit (tested Jetpack 5.1.1 and 5.1.2)
          Failing hosts: 2 x Jetson Orin Nano 8GB Devkit (tested Jetpack 5.1.1 and 5.1.2)
          Python: 3.11
          DepthAI: 2.23.0.0
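To make it easier to compare environments between hosts, the Python-side details of the list above can be collected with a short script. This is a minimal sketch using only the standard library; `environment_report` is a hypothetical helper name, and it does not capture the Jetpack version, which would have to be read separately.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("depthai",)):
    """Return host and package versions as a dict for pasting into a report."""
    report = {
        "python": sys.version.split()[0],
        "machine": platform.machine(),   # e.g. aarch64 on Jetson
        "kernel": platform.release(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report
```

Running `print(environment_report())` on each host gives a quick way to confirm that the working and failing machines really run the same Python and DepthAI versions.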

          For some reason I'm unable to attach the logs and reproduction script directly to the post because none of .py, .txt, .zip, or .tar.gz was accepted:

          I've uploaded the archive to gdrive:

          Please let me know if there's another preferred way of sharing files.