• OAK-D S2 PoE; "ping was missed" error; LeonOS usage increase.

Hello,

I'm experiencing a couple of problems.

  1. "ping was missed" error within 1 hour of usage.
  2. Rare sudden LeonOS usage increase.

Problem 1:

"ping was missed" error with my OAK-D PoE cameras within 1 hour of usage.

My setup:

  • Camera: mainly OAK-D S2 PoE.
  • Connectivity: OAK-D S2 PoE → 1Gb PoE injector → 1Gb switch → Jetson Orin NX 8 GB.
  • Other software: Docker (on Jetson) using --network host and ROS2.
  • Cables: Cat 5e.
  • Data transmitted: 400p rectified raw left, right, disparity, and 1080p RGB MJPEG.
  • Depthai versions: Camera bootloader v0.0.26, Depthai-python v2.22.

What I tried:

  • Removing the switch.
  • Testing on Jetson Xavier.
  • Testing with OAK-D PoE.
  • Updating bootloader - the error occurred within 10 minutes before upgrading, now it occurs within 1 hour.
  • Retrieving a crash dump - the script says no crash dump found on camera.

Observations:

On my laptop, LeonOS CPU usage is ~75%, while on the Jetson Orin it is ~100% with reduced FPS. Raising `ethtool -C eth0 rx-usecs` from its default value of 0 to 400 reduces LeonOS CPU usage to ~93%, and setting it to 2000 reduces it to ~87%. Neither setting seems to have an impact on the error.
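For anyone wanting to reproduce the coalescing change, it can be scripted. This is a minimal sketch, assuming the interface is named `eth0` as in the command above, that `ethtool` is installed, and that the script runs with root privileges. The helper names `coalesce_cmd` and `set_rx_usecs` are illustrative only and not part of any library.

```python
from __future__ import annotations

import subprocess


def coalesce_cmd(interface: str, rx_usecs: int) -> list[str]:
    """Build the ethtool command that sets the RX interrupt coalescing delay."""
    if rx_usecs < 0:
        raise ValueError("rx_usecs must be non-negative")
    return ["ethtool", "-C", interface, "rx-usecs", str(rx_usecs)]


def set_rx_usecs(interface: str, rx_usecs: int) -> None:
    """Apply the setting; needs root and a NIC driver that supports rx-usecs."""
    subprocess.run(coalesce_cmd(interface, rx_usecs), check=True)


if __name__ == "__main__":
    # Values tried in this report: 400 (~93% LeonOS usage) and 2000 (~87%).
    # Only the command is printed here; call set_rx_usecs() to apply it.
    print(" ".join(coalesce_cmd("eth0", 400)))
```

The setting does not persist across reboots, so it would have to be reapplied at startup, for example from the Docker host's boot scripts.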

Error logs:

Attempt 1:

[host]   [warning] Monitor thread (device: 194430108198721300 [169.254.1.222]) - ping was missed, closing the device connection
[system] [info]    Memory Usage - DDR: 49.20 / 337.18 MiB, CMX: 2.41 / 2.50 MiB, LeonOS Heap: 66.96 / 80.03 MiB, LeonRT Heap: 5.00 / 41.06 MiB
[system] [info]    Temperatures - Average: 61.15C, CSS: 62.73C, MSS 59.79C, UPA: 60.84C, DSS: 61.26C
[system] [info]    Cpu Usage - LeonOS 92.82%, LeonRT: 22.25%
[host]   [debug]   Log thread exception caught: Couldn't read data from stream: '__log' (X_LINK_ERROR)
[host]   [debug]   Timesync thread exception caught: Couldn't read data from stream: '__timesync' (X_LINK_ERROR)
[ros_node] Exception in thread Thread-1:
[ros_node] Traceback (most recent call last):
....
[ros_node]     frame = self.disparity_queue.get().getCvFrame()
[ros_node] RuntimeError: Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'disparity' (X_LINK_ERROR)' 

Attempt 2:

[system] [info] Temperatures - Average: 57.23C, CSS: 58.51C, MSS 55.72C, UPA: 57.66C, DSS: 57.01C
[system] [info] Cpu Usage - LeonOS 99.87%, LeonRT: 21.58%
[host] [warning] Monitor thread (device: 194430108198721300 [169.254.1.222]) - ping was missed, closing the device connection
[host] [debug] Log thread exception caught: Couldn't read data from stream: '__log' (X_LINK_ERROR)
[host] [debug] Timesync thread exception caught: Couldn't write data to stream: '__timesync' (X_LINK_ERROR)
[ros_node] Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'rgb' (X_LINK_ERROR)'
[host] [debug] Device about to be closed...
[host] [debug] Device closed, 1997

Problem 2:

Rare sudden LeonOS usage increase.

While the camera is running on the laptop (same setup as in Problem 1), LeonOS CPU usage is ~75%. After it has been running for a while, usage suddenly spikes to ~100% and the FPS drops.

Log:

...
[system] [info] Temperatures - Average: 54.20C, CSS: 55.72C, MSS 53.12C, UPA: 53.99C, DSS: 53.99C
[system] [info] Cpu Usage - LeonOS 76.54%, LeonRT: 16.99%
[system] [info] Memory Usage - DDR: 49.19 / 337.18 MiB, CMX: 2.41 / 2.50 MiB, LeonOS Heap: 66.28 / 80.03 MiB, LeonRT Heap: 4.99 / 41.06 MiB
[system] [info] Temperatures - Average: 54.20C, CSS: 55.07C, MSS 52.68C, UPA: 54.21C, DSS: 54.86C
[system] [info] Cpu Usage - LeonOS 72.79%, LeonRT: 17.81%
[system] [info] Memory Usage - DDR: 49.19 / 337.18 MiB, CMX: 2.41 / 2.50 MiB, LeonOS Heap: 66.28 / 80.03 MiB, LeonRT Heap: 4.99 / 41.06 MiB
[system] [info] Temperatures - Average: 55.07C, CSS: 56.15C, MSS 53.55C, UPA: 55.29C, DSS: 55.29C
[system] [info] Cpu Usage - LeonOS 95.71%, LeonRT: 21.59%
[system] [info] Memory Usage - DDR: 49.19 / 337.18 MiB, CMX: 2.41 / 2.50 MiB, LeonOS Heap: 66.28 / 80.03 MiB, LeonRT Heap: 4.99 / 41.06 MiB
[system] [info] Temperatures - Average: 54.53C, CSS: 55.72C, MSS 53.77C, UPA: 53.99C, DSS: 54.64C
[system] [info] Cpu Usage - LeonOS 100.00%, LeonRT: 21.98%
[system] [info] Memory Usage - DDR: 49.19 / 337.18 MiB, CMX: 2.41 / 2.50 MiB, LeonOS Heap: 66.28 / 80.03 MiB, LeonRT Heap: 4.99 / 41.06 MiB
[system] [info] Temperatures - Average: 54.86C, CSS: 56.37C, MSS 53.77C, UPA: 54.42C, DSS: 54.86C
[system] [info] Cpu Usage - LeonOS 99.70%, LeonRT: 22.61%
[system] [info] Memory Usage - DDR: 49.19 / 337.18 MiB, CMX: 2.41 / 2.50 MiB, LeonOS Heap: 66.28 / 80.03 MiB, LeonRT Heap: 4.99 / 41.06 MiB
...
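To pin down when the jump happens in a longer capture, the `Cpu Usage` lines can be extracted from the log automatically. This is a minimal sketch, assuming the log format shown above. The 90% threshold and the function names `leon_os_usage` and `first_saturated` are arbitrary choices for illustration.

```python
from __future__ import annotations

import re

# Matches lines such as: "[system] [info] Cpu Usage - LeonOS 76.54%, LeonRT: 16.99%"
CPU_RE = re.compile(r"Cpu Usage - LeonOS (\d+(?:\.\d+)?)%, LeonRT: (\d+(?:\.\d+)?)%")


def leon_os_usage(log_text: str) -> list[float]:
    """Return the LeonOS CPU percentages in the order they appear in the log."""
    return [float(m.group(1)) for m in CPU_RE.finditer(log_text)]


def first_saturated(usages: list[float], threshold: float = 90.0) -> int | None:
    """Index of the first sample at or above the threshold, or None if there is none."""
    for i, usage in enumerate(usages):
        if usage >= threshold:
            return i
    return None


# The five CPU samples from the log excerpt above.
SAMPLE = """\
[system] [info] Cpu Usage - LeonOS 76.54%, LeonRT: 16.99%
[system] [info] Cpu Usage - LeonOS 72.79%, LeonRT: 17.81%
[system] [info] Cpu Usage - LeonOS 95.71%, LeonRT: 21.59%
[system] [info] Cpu Usage - LeonOS 100.00%, LeonRT: 21.98%
[system] [info] Cpu Usage - LeonOS 99.70%, LeonRT: 22.61%
"""

if __name__ == "__main__":
    usages = leon_os_usage(SAMPLE)
    print(usages, first_saturated(usages))
```

On the excerpt above, the first sample over the threshold is the third one (index 2), which is where usage jumps from 72.79% to 95.71%.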

We plan on purchasing more cameras for our project, but we need to resolve these issues first. Could anyone provide guidance to address these problems?


    Hi John194 ,
    Thank you for the detailed report; we will investigate this issue further. In the meantime, could you try:

    1. Increasing the watchdog delay.
    2. Decreasing the 3A FPS to reduce CPU usage. I suspect the "ping was missed" issue is caused by high CPU usage.

    Thanks, Erik

      Thanks for the reply. I forgot to mention that I have already set 3A FPS to 13, while the camera FPS is set to 30. I don't want 3A FPS any lower for my application. I haven't tried changing the watchdog delay yet and will report back soon.

      erik

      When I set the watchdog variable to 60s in the docker entry point and ran my program, ping was still missed.

      [system] [info] Memory Usage - DDR: 49.19 / 337.18 MiB, CMX: 2.41 / 2.50 MiB, LeonOS Heap: 66.16 / 80.03 MiB, LeonRT Heap: 4.99 / 41.06 MiB
      [system] [info] Temperatures - Average: 56.15C, CSS: 57.01C, MSS 55.51C, UPA: 55.94C, DSS: 56.15C
      [system] [info] Cpu Usage - LeonOS 95.65%, LeonRT: 21.50%
      [host] [warning] Monitor thread (device: 194430108198721300 [169.254.1.222]) - ping was missed, closing the device connection
      [host] [debug] Timesync thread exception caught: Couldn't read data from stream: '__timesync' (X_LINK_ERROR)
      [host] [debug] Log thread exception caught: Couldn't read data from stream: '__log' (X_LINK_ERROR)
      [WARN] [depthai_camera]: Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'right' (X_LINK_ERROR)'
      [host] [debug] Device about to be closed...
      [host] [debug] Device closed, 1972
      [host] [warning] Watchdog initial delay set to 60000ms

      @John194 do you mind setting the following variable:

      DEPTHAI_WATCHDOG=30000

      The one mentioned in the documentation only affects the initial boot and not the interval pings. (CC: @erik )

      Note: if the host program exits cleanly, the device might not be available again for approximately 30 s with the above set.
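When the program is launched from Python rather than a shell, the variable can also be set in the process environment. This is a minimal sketch, assuming the host library reads `DEPTHAI_WATCHDOG` from the environment when the device connection is opened. Setting it before `depthai` is imported is the conservative choice. The helper name `configure_watchdog` is illustrative only.

```python
import os


def configure_watchdog(interval_ms: int) -> None:
    """Set the watchdog interval (milliseconds) for devices opened later in this process."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    os.environ["DEPTHAI_WATCHDOG"] = str(interval_ms)


# 30000 ms is the value suggested above.
configure_watchdog(30000)

# The depthai import and device creation would follow here (not run in this sketch):
# import depthai as dai
# with dai.Device(pipeline) as device:
#     ...
```

In a Docker setup, the same effect can be had by passing the variable to the container, for example with `-e DEPTHAI_WATCHDOG=30000`.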

        @John194

        Also, if you can provide an MRE, that would help us pin down the issue in case it is crash-related.

          themarpe

          With DEPTHAI_WATCHDOG=30000 set, the camera died within a few seconds. The logs didn't mention a missed ping, and there was no crash dump. I will try to make an MRE later.

          [host] [debug] Timesync thread exception caught: Couldn't read data from stream: '__timesync' (X_LINK_ERROR)
          [host] [debug] Log thread exception caught: Couldn't read data from stream: '__log' (X_LINK_ERROR)
          [depthai_camera]: Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'left' (X_LINK_ERROR)'
          [host] [debug] Device about to be closed...
          [host] [debug] Device closed, 14997
          [host] [warning] Using a custom watchdog value of 30000ms
          [system] [warning] Watchdog was capped from 30000ms to 4500ms
          6 days later

          themarpe

          After further testing with my original OAK-D S2 PoE, a new OAK-D S2 PoE, and an OAK-D PoE, here are some observations:

          • Sometimes "ping was missed" occurs a few times within 10 minutes; other times the camera runs for hours without it.
          • "ping was missed" was very frequent for all cameras on bootloader v0.0.19. On v0.0.26 it happens more rarely.
          • On the new OAK-D S2 PoE and the OAK-D PoE, "ping was missed" seems less frequent than on the old OAK-D S2 PoE.
          • On a laptop without Docker, "ping was missed" seems to happen less frequently than on a Jetson with Docker.
          • "ping was missed" happens on a Jetson Orin Nano too.
          • Changing the network cables to shielded ones did not help.
          • Turning off the RGB camera doesn't help.
          • With the left and right cameras and disparity turned off, "ping was missed" seems to occur less often.

          @John194 does this happen under a specific pipeline or do many examples showcase this behavior?

          If it's the former, an MRE of the pipeline would be great to have to pin down the root cause.

          If it's the latter, what is your networking setup like? We've seen this play a role in such cases as well.

            themarpe

            This happens with many examples, but they have some similarities. I already posted the code in my previous message, it is pretty simple.

            What should I provide about my network setup?
