• Program runs for several hours, then freezes becoming completely unresponsive

Craftonix
I don't see a reason why they would have coded in a probable error like this one in regards to networking issues…this is Python, not Rust.

In essence, if DHCP keeps assigning the same IP, that's all good and predictable. Mine does as well: the 2 cams always get the .213 and .220 addresses. However, one, they are not statically allocated via MAC address, and two, the "lease" on their IP does not last forever.

What I'm pointing out is, this "might" be the issue causing your disconnects. From what you've described the situation as, going from working for several hours to sudden drop, this is what came to mind.

Good luck with your debugging, still need to even get to your level with the connections issues I'm having ^^.

@The_Real_Enrico_Pallazzo thanks for the tips, but we never experience "disconnects". We never get errors. The Visualizer simply freezes, which means the video feed is stuck at the last frame it got from the camera. The application loop is still running because we can still press keys and exit successfully:

while oak.running():
    key = oak.poll()
    if key == ord('q'):  # Exit successfully
        break

What I was trying to ask is: if the DHCP server renews the lease, does the TCP socket connection get dropped? If YES, then we would get an error and the program would exit. I'm assuming that "with OakCamera() as oak:" opens a TCP socket to the PoE camera.

    Craftonix
    It's better that you just go ahead and do that tiny test to see if you can replicate your problem (by fiddling with the DHCP server settings, e.g. setting the lease expiration to 5 min).

    Judging by this post: (https://superuser.com/questions/1217093/how-to-prevent-network-outages-caused-by-renewal-of-dhcp-leasing) …

    …And other random bits on the net…..just troubleshoot m8, do the tweak, if it breaks the same way as after many hours, you've discovered your issue, if not, keep debugging.

    Good luck!

    Hi Craftonix
    Any chance you could try a different router (maybe a home router that has more than one LAN port) so we can debug this further? Are you using a switch or are you connecting to the router via WiFi?

    Thanks,
    Jaka

    9 days later

    The system that is freezing is in production. It is a mobile system that is being used in a different venue in a different city every day. This is why it is connected to the AT&T mobile hotspot. It is not possible for us to connect to the venue's router and some days they don't even have a router.

    You are right that the AT&T hotspot only has one Ethernet port, but this port is connected to a TP-Link TL-SG105 Ethernet switch. The MacBook Air M2 is also connected to this switch. An Ethernet cable goes from this switch to a TL-SG1005P PoE switch. The OAK-D PoE Series 1 is connected to this PoE switch. All devices are using Ethernet and no WiFi is used anywhere. The laptop's WiFi is always turned off.

    I have configured the OAK-D with a static IP address to eliminate any problems due to DHCP lease renewal. However, today it froze again, even with a static IP address of 192.168.1.2. The router is at 192.168.1.1.

    Here is a screenshot of how it froze:

    • Note that the ping in the top right terminal continues successfully.
    • The bottom left terminal continues to scroll giving debug info from the camera.
    • The video feed, Frame number, and Running time are all frozen and not updating.
    • LeonOS is 60-70%, LeonRT is 3.5%

    However, when the system is running properly without freezing:

    LeonOS is around 83% and LeonRT is around 35%.
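
The symptom described above (the polling loop stays alive while the frame counter stops) can be caught on the host by timestamping each frame callback. The following is only a minimal sketch: `FrameWatchdog`, `mark_frame`, `is_stalled`, and the timeout value are hypothetical names chosen for illustration and are not part of depthai or depthai_sdk.

```python
import time


class FrameWatchdog:
    """Tracks when the last frame arrived and reports a stall after a timeout."""

    def __init__(self, timeout_s=10.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_frame = clock()   # construction time counts as the first "frame"

    def mark_frame(self):
        """Call this from the frame callback each time a new packet arrives."""
        self._last_frame = self._clock()

    def is_stalled(self):
        """True when no frame has been marked for longer than timeout_s."""
        return (self._clock() - self._last_frame) > self.timeout_s
```

One possible use: call `mark_frame()` inside the visualizer callback, check `is_stalled()` once per iteration of the polling loop, and break out with a dedicated return code so that a supervisor can restart the application.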

    Hi @Craftonix ,
    Thank you for the thorough report. Could you also tell us your depthai version, and bootloader version flashed on the OAK-D-PoE camera?
    Thanks, Erik

    Could you also provide a minimal repro example, so we can try it locally? We somewhat suspect the pipeline could be getting blocked somewhere (see the Node blocking docs) after some time, due to the blocking behaviour of some of the queues (or pools running out of space).
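
To illustrate the kind of blocking suspected here, the sketch below uses Python's standard `queue.Queue` as a stand-in. It is an analogy only, not depthai's queue implementation, and the helper names `put_blocking` and `put_overwrite` are hypothetical. With a bounded blocking queue, the producer stalls as soon as the consumer stops draining it; with an overwrite policy, the producer keeps running and only the newest items survive.

```python
import queue


def put_blocking(q, item, timeout_s):
    """Blocking put: returns False if the queue stayed full for timeout_s."""
    try:
        q.put(item, timeout=timeout_s)
        return True
    except queue.Full:
        return False


def put_overwrite(q, item):
    """Non-blocking put: when the queue is full, discard the oldest item."""
    while True:
        try:
            q.put_nowait(item)
            return
        except queue.Full:
            try:
                q.get_nowait()   # drop the oldest entry to make room
            except queue.Empty:
                pass             # a consumer emptied it meanwhile; retry the put


# Nobody consumes from this queue, so the third blocking put times out.
frames = queue.Queue(maxsize=2)
results = [put_blocking(frames, n, timeout_s=0.05) for n in range(3)]

# With the overwrite policy the producer never stalls; old items are dropped.
latest = queue.Queue(maxsize=2)
for n in range(5):
    put_overwrite(latest, n)
kept = [latest.get_nowait() for _ in range(latest.qsize())]
```

In this toy setup `results` ends as `[True, True, False]` and `kept` as `[3, 4]`: the blocking producer is the one that stops making progress.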

      12 days later

      erik Minimal repro example can be found at the beginning of this post. The code is taken from the example programs, but I'm adding my 'count' module:

      from count import Count # Our math for counting people

      and calling it from the callback:

      def cb(packet: TrackerPacket):
          ...
          frame = Count.count_people(frame, packet.daiTracklets)
          ...

      The count_people method does some geometry calculations based on the centroid of the tracklets to determine the count. Then it draws the 4 text strings on the frame and returns the frame to the callback to be drawn by the visualizer.

      Bootloader version:

      % ./bootloader_version.py
      Found device with name: 192.168.1.2
      Version: 0.0.26
      NETWORK Bootloader, is User Bootloader: False
      Memory 'Memory.FLASH' size: 16777216, info: JEDEC ID: 01 20 18
      Memory 'Memory.EMMC' size: 15758000128, info:

      depthai version:

      run.sh 2023-11-07 21:58:36 EST (UTC-0500) Running application: main.py
      [2023-11-07 21:58:36.648] [depthai] [debug] Python bindings - version: 2.22.0.0 from 2023-06-13 02:13:35 +0200 build: 2023-06-13 00:55:04 +0000
      [2023-11-07 21:58:36.648] [depthai] [debug] Library information - version: 2.22.0, commit: 82ab07d037c02f56042d1d2d55a718f379651ed9 from 2023-06-13 02:13:14 +0200, build: 2023-06-13 00:55:03 +0000
      [2023-11-07 21:58:36.649] [depthai] [debug] Initialize - finished
      [2023-11-07 21:58:36.722] [depthai] [debug] Resources - Archive 'depthai-bootloader-fwp-0.0.26.tar.xz' open: 1ms, archive read: 72ms
      [2023-11-07 21:58:36] CRITICAL [root.set_log_level:31] Setting LOGLEVEL=DEBUG
      [2023-11-07 21:58:37.045] [depthai] [debug] Resources - Archive 'depthai-device-fwp-f033fd9c7eb0b3578d12f90302e87759c78cfb36.tar.xz' open: 1ms, archive read: 395ms
      [2023-11-07 21:58:37.502] [depthai] [debug] Searching for booted device: DeviceInfo(name=192.168.1.2, mxid=1844301021D93E1300, X_LINK_BOOTLOADER, X_LINK_TCP_IP, X_LINK_MYRIAD_X, X_LINK_SUCCESS), name used as hint only
      [2023-11-07 21:58:37.518] [depthai] [debug] Connected bootloader version 0.0.26
      [2023-11-07 21:58:38.364] [depthai] [debug] DeviceBootloader about to be closed...
      [2023-11-07 21:58:38.364] [depthai] [debug] XLinkResetRemote of linkId: (0)
      [2023-11-07 21:58:39.023] [depthai] [debug] DeviceBootloader closed, 659
      [2023-11-07 21:58:39.030] [depthai] [debug] Searching for booted device: DeviceInfo(name=192.168.1.2, mxid=1844301021D93E1300, X_LINK_BOOTED, X_LINK_TCP_IP, X_LINK_MYRIAD_X, X_LINK_SUCCESS), name used as hint only
      [1844301021D93E1300] [192.168.1.2] [4.767] [system] [info] Memory Usage - DDR: 0.12 / 337.18 MiB, CMX: 2.04 / 2.50 MiB, LeonOS Heap: 25.29 / 80.03 MiB, LeonRT Heap: 2.89 / 41.06 MiB
      [1844301021D93E1300] [192.168.1.2] [4.768] [system] [info] Temperatures - Average: 44.41C, CSS: 46.00C, MSS 43.50C, UPA: 43.27C, DSS: 44.87C
      [1844301021D93E1300] [192.168.1.2] [4.768] [system] [info] Cpu Usage - LeonOS 52.18%, LeonRT: 0.62%

      See here for the exit codes (return error codes) that my main.py is exiting with. There are UNKNOWN error codes which I cannot explain.

      Just to give a bit more detail, I am using a shell script run.sh as a supervisor which restarts the application if it exits abnormally:

      # run.sh
      
      # This is the supervisor. It starts the application and monitors its return code (rc):
      #   If the application exits successfully, the supervisor's loop is done.
      #   If the application aborts, the supervisor waits a DELAY, then restarts it.
      
      export DEPTHAI_LEVEL=debug
      
      
      DELAY="${OAK_SUPERVISOR_DELAY:-10}"
      
      function log() {
        args="$*"
        echo "run.sh $(date +'%Y-%m-%d %H:%M:%S %Z (UTC%z)') ${args}" | tee -a run.log
      }
      
      while true; do
        log "Running application: main.py"
        # python main2.py
        python main.py
        rc=$?
        if [ "$rc" == "0" ]; then
          log "[EXIT] main.py exited successfully."
          exit 0; # Successful exit from app
        elif [ "$rc" == "1" ]; then
          log "[NO_REPLY] main.py exited abnormally (webserver not responding) return code rc=${rc}. Sleeping for $DELAY seconds."
        elif [ "$rc" == "2" ]; then
          log "[SIM_ABORT] main.py simulated an abnormal exit with return code rc=${rc}. Sleeping for $DELAY seconds."
        elif [ "$rc" == "3" ]; then
          log "[NOBODY] main.py exited abnormally (nobody in view) return code rc=${rc}. Sleeping for $DELAY seconds."
        else
          log "main.py exited with an unknown return code rc=${rc}. Sleeping for $DELAY seconds."
        fi
        sleep "$DELAY"
      done
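
When chasing return codes that fall into the "unknown" branch, it may help to decode them before logging. The sketch below is written in Python rather than shell and is only an illustration: `describe_rc` and the `KNOWN` table are hypothetical, with labels 0 to 3 taken from run.sh and 99 being the initial value of rc in main.py. It relies on the shell convention that a process killed by signal N is reported as 128 + N.

```python
import signal

# Labels 0-3 mirror run.sh; 99 is the value rc starts with in main.py.
KNOWN = {
    0: "EXIT",
    1: "NO_REPLY",
    2: "SIM_ABORT",
    3: "NOBODY",
    99: "LOOP_ENDED",
}


def describe_rc(rc):
    """Map a shell return code to a short label for the supervisor log."""
    if rc in KNOWN:
        return KNOWN[rc]
    if rc > 128:
        # Shells report "killed by signal N" as 128 + N.
        try:
            return f"SIGNAL:{signal.Signals(rc - 128).name}"
        except ValueError:
            pass  # not a signal number known on this platform
    return f"UNKNOWN:{rc}"
```

For example, a return code of 139 would decode to a segmentation fault (SIGSEGV) rather than an exit chosen by main.py.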

      and the main.py snippet that generates the exit codes:

      with OakCamera() as oak:
          ...
          oak.start(blocking=False) # Start the pipeline (upload it to the OAK)
      
          rc = 99      # Unknown exit
          KEY_ESC = 27 # Escape key
      
          poll_webserver.start_time = time.time()
          while oak.running():
              # Since we are not in blocking mode, we have to poll oak camera to
              # visualize frames, call callbacks, process keyboard keys, etc.
      
              if not poll_webserver(): # If camera web server is not responding
                  rc = 1         # Signal abort!
                  break
      
              key = oak.poll()
      
              if key == ord('d'): logging_env.set_log_level('DEBUG')
              if key == ord('i'): logging_env.set_log_level('INFO')
              if key == ord('w'): logging_env.set_log_level('WARNING')
              if key == ord('v'): logging_env.toggle_verbose()
              if key == ord('f'): logging_env.toggle_file()
                  
              if key == ord('a'):
                  rc = 2         # Simulate abort!
                  break
              if key == ord('q') or key == KEY_ESC: # Exit successfully
                  count.reset_count()
                  rc = 0                     # Exit successfully
                  break
              
              if key == ord('r'):
                  print('RUN mode.')
              if key == ord('t'):
                  print('TEST mode.')
              
              if key == ord(' '):
                  count.direction += 1 # Rotate the direction
                  if count.direction==4: # Cycle back
                      count.direction=0    # to first direction
                  logging.info(f'{prefix()} Direction is now [{count.DIRECTIONS[count.direction]}]')
                  count.save_direction()
      
          logging_env.log_event_and_exit(rc)

      As you can see from run.log, there are several rc=99 which are unexplained exits of the program.

        Hi Craftonix
        I'll run the script locally, see if I can reproduce the issue.

        Thanks,
        Jaka

        OK, but just to clarify a bit more: rc==99 indicates that the while oak.running() loop exited because oak.running() returned False.

          5 days later

          Hi Craftonix
          Could you post the full MRE of the code you are using? I ran the code you first posted without the "count" module and had no problems.

          Thanks,
          Jaka

          4 months later

          We are also facing the same issue. Our cameras at the client's site freeze after about 6 days of continuous running. The script keeps running without failing, which means the camera is not disconnected. In our code we save the camera frame whenever a specific detection happens, and after the freeze we keep getting the same frame. This shows that the host is still receiving frames, but every one of them is identical to the last frame before the freeze. In other words, the camera keeps sending the same frame to the host after it freezes.

          Is there any solution for this?
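
One way to detect the "same frame over and over" condition described above is to hash each incoming frame and count consecutive repeats. This is a hedged sketch, not a fix for the underlying freeze: `RepeatedFrameDetector` and `max_repeats` are hypothetical names, and it assumes that consecutive live frames normally differ at least slightly, so a long run of byte-identical frames is suspicious.

```python
import hashlib


class RepeatedFrameDetector:
    """Flags a stream once the same frame repeats max_repeats times in a row."""

    def __init__(self, max_repeats=30):
        self.max_repeats = max_repeats
        self._last_digest = None
        self._repeats = 0   # consecutive frames identical to the previous one

    def update(self, frame_bytes):
        """Feed the raw bytes of a frame; returns True when the stream looks frozen."""
        digest = hashlib.blake2b(frame_bytes, digest_size=16).digest()
        if digest == self._last_digest:
            self._repeats += 1
        else:
            self._last_digest = digest
            self._repeats = 0
        return self._repeats >= self.max_repeats
```

With NumPy image arrays, the bytes could be obtained with `frame.tobytes()`. When `update()` returns True, the host could log the event and restart the device connection instead of continuing to save stale frames.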

          Hi @NileshBawiskar
          Interesting. Would you mind creating an issue in a separate thread and maybe posting some images/reproducible code?

          Thanks,
          Jaka