Program runs for several hours, then freezes becoming completely unresponsive

Craftonix · Sep 21, 2023

I'm using a color camera as input to YOLOv8, similar to this code:

import cv2
from depthai_sdk import OakCamera
from depthai_sdk.classes.packets import TrackerPacket

from count import Count # Our math for counting people

def cb(packet: TrackerPacket):
    visualizer = packet.visualizer
    frame = packet.frame
    frame = Count.count_people(frame, packet.daiTracklets)
    frame = visualizer.draw(frame)
    cv2.imshow('My Visualizer', frame)

with OakCamera() as oak:
    cam = oak.create_camera('color')
    nn  = oak.create_nn('yolov8n_coco_640x352', cam, tracker=True)
    nn.config_nn(resize_mode='stretch')
    visualizer = oak.visualize([nn.out.tracker], fps=True, callback=cb)

The program runs fine for several hours, displaying the camera feed in the visualizer and updating the standard output with the count, then freezes. The visualizer is frozen and the console where I typed

python main.py

is also not updating the count, so we're losing the count.

The only remedy is to break the program with Ctrl-C and restart it, which of course is unacceptable.

The program is running on an M2 Macbook Air.

How can I diagnose this? Can I enable some logs? When it's frozen, is there any query that I can run to check what is happening?

jakaskerl · Sep 23, 2023

Hi Craftonix
AFAIK there isn't a way to check if the device/pipeline is frozen. I'd suggest enabling the debug/trace logs to see if the culprit shows up there. Check the device CPU and RAM usage; also check the host process usage for the script while it is frozen (htop or something similar).

Thanks,
Jaka
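
For reference, one way to turn on debug or trace output is the `DEPTHAI_LEVEL` environment variable, which is also what the `run.sh` script later in this thread exports. The sketch below assumes the application lives in `main.py` and launches it in a child process with that variable set. `depthai_env` and `run_with_logs` are hypothetical helper names used only for illustration.

```python
import os
import subprocess
import sys

def depthai_env(level: str = "debug") -> dict:
    """Return a copy of the current environment with DEPTHAI_LEVEL set."""
    env = dict(os.environ)
    env["DEPTHAI_LEVEL"] = level  # e.g. "debug" or "trace"
    return env

def run_with_logs(script: str = "main.py", level: str = "debug") -> int:
    """Run `script` (assumed name) with DepthAI logging enabled; return its exit code."""
    result = subprocess.run([sys.executable, script], env=depthai_env(level))
    return result.returncode
```

Calling `run_with_logs("main.py", "trace")` should give the most verbose output; the parent process environment is left unchanged.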

Craftonix · Sep 23, 2023

I also have a webserver running in a script node very similar to this:

https://github.com/luxonis/depthai-python/blob/main/examples/Script/script_http_server.py

So, when the camera feed freezes (and it happens every day because we're running almost 12 hrs/day), the webserver is not responsive either. This tells me that the device/pipeline is frozen and not the host.

How do I enable debug/trace logs? Please don't tell me that you want to store them to eMMC, because you have still not resolved my inability to write to eMMC, even after we have bought the second camera (Series 1 OAK-D PoE) because you did not populate eMMC in the first one (Series 2 OAK-D PoE) which was supposed to have eMMC according to your datasheet.

jakaskerl · Sep 25, 2023

Hi Craftonix
I see. I would assume the same thing happens even if the device is connected directly to the host machine. That would allow you to run it in debug mode. Since it is likely not the host, it should be fine.

I will also try and run your code locally to see if I get the same outcome.

Thanks and sorry for the inconvenience,
Jaka

Craftonix · Oct 7, 2023

Any news?

Here is something that I discovered today: if I set the Python logging level as follows:

import logging
logging.basicConfig(level='DEBUG', force=True)

then I sometimes get the following in the standard output:

DEBUG:urllib3.connectionpool:https://sentry.luxonis.com:443 "POST /api/3/envelope/ HTTP/1.1" 200 41

So, my question is why is this appearing in the logs?

And what happens if there is no internet connection?

erik · Oct 7, 2023

Craftonix if there are any issues (errors), we send the error log to our Sentry server. I think this message gets logged inside the Sentry library, as I can't seem to find it in the SDK itself. If there is no internet connection, it will just skip sending the error log.

The_Real_Enrico_Pallazzo · Oct 7, 2023

Craftonix

I was having some issues with my PoE multi-cam config. After a lot of learning, one thing stands out with regard to your problem. The DHCP server leases each camera its IP for a limited time (without DHCP they fall back to an address ending in .222). Once that lease time is exceeded, the IP may be dropped or renewed (perhaps to a new IP).

So you might want to check your DHCP settings for the following (the values below are the standard ones):

default-lease-time 600;

max-lease-time 7200;

Craftonix · Oct 9, 2023

@The_Real_Enrico_Pallazzo Are you saying that the socket connection to the camera would be dropped if the DHCP server is not configured properly?

In our case, the DHCP-assigned IP address has been the same for the last 2 months and has never changed.

But if I understand you correctly, is it possible that the DHCP server issues a RENEW request and, even though it gives the same IP address again, it disconnects the socket and reconnects it? But if this is the case, wouldn't we get a Python exception and exit the program?

The_Real_Enrico_Pallazzo · Oct 12, 2023

Craftonix
I don't see a reason why they would have coded in a probable error like this one in regards to networking issues…this is Python, not Rust.

In essence, if the DHCP was assigning same IP that's all good and predictable, mine does as well, always a .213 and .220 IP allocation for the 2 cams. However, one, they are not static allocated via MAC address, and two, they don't have an infinite "license" allocated to their IP.

What I'm pointing out is, this "might" be the issue causing your disconnects. From what you've described the situation as, going from working for several hours to sudden drop, this is what came to mind.

Good luck with your debugging, still need to even get to your level with the connections issues I'm having ^^.

Craftonix · Oct 15, 2023

@The_Real_Enrico_Pallazzo thanks for the tips, but we never experience "disconnects". We never get errors. The Visualizer simply freezes, which means the video feed is stuck at the last frame it got from the camera. The application loop is still running because we can still press keys and exit successfully:

while oak.running():
    key = oak.poll()
    if key == ord('q'):  # Exit successfully
        break

What I was trying to ask is: if the DHCP server renews the lease, does the TCP socket connection get dropped? If YES, then we would get an error and the program would exit. I'm assuming that with OakCamera as oak: opens a TCP socket to the PoE camera.
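
Since the application loop keeps running while the feed is frozen, one possible way to detect the stall on the host is a small watchdog that records when the last packet arrived. This is a minimal sketch under that assumption; `FrameWatchdog` is a hypothetical helper, not part of the DepthAI SDK, and the 10-second timeout is an arbitrary choice.

```python
import time

class FrameWatchdog:
    """Tracks when the last frame arrived so a stall can be detected."""

    def __init__(self, timeout_s: float = 10.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock        # injectable clock, handy for testing
        self._last = clock()

    def touch(self) -> None:
        """Call from the frame callback each time a packet arrives."""
        self._last = self._clock()

    def is_stale(self) -> bool:
        """Return True if no frame has arrived within timeout_s seconds."""
        return self._clock() - self._last > self.timeout_s
```

The idea would be to call `watchdog.touch()` at the top of the `cb` callback and to check `watchdog.is_stale()` inside the `while oak.running():` loop, breaking out with a distinct exit code so that a supervisor can restart the program.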

The_Real_Enrico_Pallazzo · Oct 16, 2023

Craftonix
It's better that you just go ahead and do that tiny test to see if you can replicate your problem (by fiddling with the DHCP server settings, e.g. setting the lease expiration to 5 min).

Judging by this post: (https://superuser.com/questions/1217093/how-to-prevent-network-outages-caused-by-renewal-of-dhcp-leasing) …

…And other random bits on the net…..just troubleshoot m8, do the tweak, if it breaks the same way as after many hours, you've discovered your issue, if not, keep debugging.

Good luck!

Craftonix · Oct 16, 2023

@The_Real_Enrico_Pallazzo what is m8 ?

The DHCP is provided by an AT&T mobile hotspot router: the Netgear Nighthawk M6 Pro:

https://www.downloads.netgear.com/files/GDC/MR6500/MR6500_MR6110_UM_EN.pdf

It does not provide a method for setting the DHCP lease time. I read the entire manual and checked all the settings in the web interface.

jakaskerl · Oct 17, 2023

Hi Craftonix
Any chance you could try a different router (maybe a home router that has more than one LAN port) so we can debug this further? Are you using a switch or are you connecting to the router via WiFi?

Thanks,
Jaka

Craftonix · Oct 27, 2023

The system that is freezing is in production. It is a mobile system that is being used in a different venue in a different city every day. This is why it is connected to the AT&T mobile hotspot. It is not possible for us to connect to the venue's router and some days they don't even have a router.

You are right that the AT&T hotspot only has one Ethernet port, but this port is connected to a TP-Link TL-SG105 Ethernet switch. The Macbook Air M2 is also connected to this switch. An Ethernet cable goes from this switch to a TL-SG1005P PoE switch. The OAK-D PoE Series 1 is connected to this PoE switch. All devices are using Ethernet and no WiFi is used anywhere. The laptop's WiFi is always turned off.

I have configured the OAK-D with a static IP address to eliminate any problems due to DHCP lease renewal. However, today it froze again, even with a static IP address of 192.168.1.2. The router is at 192.168.1.1.

Here is a screenshot of how it froze:

Note that the ping in the top right terminal continues successfully.
The bottom left terminal continues to scroll giving debug info from the camera.
The video feed, Frame number, and Running time are all frozen and not updating.
LeonOS is 60-70%, LeonRT is 3.5%

However, when the system is running properly without freezing:

LeonOS is around 83% and LeonRT is around 35%.
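
The drop in LeonRT load could in principle be picked up automatically from the device log. The sketch below assumes log lines of the form `Cpu Usage - LeonOS 52.18%, LeonRT: 0.62%` and treats a LeonRT value under a threshold as a sign of a stalled pipeline. The function names and the 10% threshold are assumptions chosen for illustration; note that LeonRT is also low right after boot, so start-up lines would need to be ignored.

```python
import re

# Matches e.g. "... [system] [info] Cpu Usage - LeonOS 52.18%, LeonRT: 0.62%"
CPU_RE = re.compile(r"Cpu Usage - LeonOS:?\s*([\d.]+)%,\s*LeonRT:?\s*([\d.]+)%")

def parse_cpu_usage(line: str):
    """Return (leon_os, leon_rt) percentages, or None if the line does not match."""
    m = CPU_RE.search(line)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2))

def looks_stalled(line: str, rt_threshold: float = 10.0) -> bool:
    """Heuristic: LeonRT far below its normal load suggests the pipeline stopped."""
    usage = parse_cpu_usage(line)
    return usage is not None and usage[1] < rt_threshold
```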

erik · Oct 27, 2023

Hi @Craftonix ,
Thank you for the thorough report. Could you also tell us your depthai version, and bootloader version flashed on the OAK-D-PoE camera?
Thanks, Erik

erik · Oct 27, 2023

Could you also provide minimal repro example, so we can try it locally? We somewhat suspect it could also be blocking the pipeline somewhere (see Node blocking docs) after some time due to blocking behaviours of some of the queues (or pools running out of space).
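
To illustrate what blocking behaviour of a queue means in general, here is a plain-Python sketch. It is not DepthAI code and makes no claim about how the device implements its queues: it only contrasts a bounded queue that stalls its producer once full with one that drops the oldest entry instead. Both function names are made up for this example.

```python
import queue
from collections import deque

def produce_blocking(n_frames: int, max_size: int) -> int:
    """Push frames into a bounded blocking queue that nobody reads.

    Returns how many frames were accepted before the producer would stall.
    """
    q = queue.Queue(maxsize=max_size)
    accepted = 0
    for frame in range(n_frames):
        try:
            q.put(frame, block=False)  # a blocking put would wait here forever
        except queue.Full:
            break
        accepted += 1
    return accepted

def produce_non_blocking(n_frames: int, max_size: int) -> list:
    """Push frames into a bounded queue that drops the oldest entry when full."""
    q = deque(maxlen=max_size)
    for frame in range(n_frames):
        q.append(frame)  # never stalls; the oldest frames are discarded
    return list(q)
```

With a slow or stopped consumer, the first variant halts everything upstream after `max_size` frames, while the second keeps running and only retains the newest frames.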

Craftonix · Nov 8, 2023

erik Minimal repro example can be found at the beginning of this post. The code is taken from the example programs, but I'm adding my 'count' module:

from count import Count # Our math for counting people

and calling it from the callback:

def cb(packet: TrackerPacket):
    ...
    frame = Count.count_people(frame, packet.daiTracklets)
    ...

The count_people method does some geometry calculations based on the centroid of the tracklets to determine the count. Then it draws the 4 text strings on the frame and returns the frame to the callback to be drawn by the visualizer.

Craftonix · Nov 8, 2023

Bootloader version:

% ./bootloader_version.py
Found device with name: 192.168.1.2
Version: 0.0.26
NETWORK Bootloader, is User Bootloader: False
Memory 'Memory.FLASH' size: 16777216, info: JEDEC ID: 01 20 18
Memory 'Memory.EMMC' size: 15758000128, info:

depthai version:

run.sh 2023-11-07 21:58:36 EST (UTC-0500) Running application: main.py
[2023-11-07 21:58:36.648] [depthai] [debug] Python bindings - version: 2.22.0.0 from 2023-06-13 02:13:35 +0200 build: 2023-06-13 00:55:04 +0000
[2023-11-07 21:58:36.648] [depthai] [debug] Library information - version: 2.22.0, commit: 82ab07d037c02f56042d1d2d55a718f379651ed9 from 2023-06-13 02:13:14 +0200, build: 2023-06-13 00:55:03 +0000
[2023-11-07 21:58:36.649] [depthai] [debug] Initialize - finished
[2023-11-07 21:58:36.722] [depthai] [debug] Resources - Archive 'depthai-bootloader-fwp-0.0.26.tar.xz' open: 1ms, archive read: 72ms
[2023-11-07 21:58:36] CRITICAL [root.set_log_level:31] Setting LOGLEVEL=DEBUG
[2023-11-07 21:58:37.045] [depthai] [debug] Resources - Archive 'depthai-device-fwp-f033fd9c7eb0b3578d12f90302e87759c78cfb36.tar.xz' open: 1ms, archive read: 395ms
[2023-11-07 21:58:37.502] [depthai] [debug] Searching for booted device: DeviceInfo(name=192.168.1.2, mxid=1844301021D93E1300, X_LINK_BOOTLOADER, X_LINK_TCP_IP, X_LINK_MYRIAD_X, X_LINK_SUCCESS), name used as hint only
[2023-11-07 21:58:37.518] [depthai] [debug] Connected bootloader version 0.0.26
[2023-11-07 21:58:38.364] [depthai] [debug] DeviceBootloader about to be closed...
[2023-11-07 21:58:38.364] [depthai] [debug] XLinkResetRemote of linkId: (0)
[2023-11-07 21:58:39.023] [depthai] [debug] DeviceBootloader closed, 659
[2023-11-07 21:58:39.030] [depthai] [debug] Searching for booted device: DeviceInfo(name=192.168.1.2, mxid=1844301021D93E1300, X_LINK_BOOTED, X_LINK_TCP_IP, X_LINK_MYRIAD_X, X_LINK_SUCCESS), name used as hint only
[1844301021D93E1300] [192.168.1.2] [4.767] [system] [info] Memory Usage - DDR: 0.12 / 337.18 MiB, CMX: 2.04 / 2.50 MiB, LeonOS Heap: 25.29 / 80.03 MiB, LeonRT Heap: 2.89 / 41.06 MiB
[1844301021D93E1300] [192.168.1.2] [4.768] [system] [info] Temperatures - Average: 44.41C, CSS: 46.00C, MSS 43.50C, UPA: 43.27C, DSS: 44.87C
[1844301021D93E1300] [192.168.1.2] [4.768] [system] [info] Cpu Usage - LeonOS 52.18%, LeonRT: 0.62%

Craftonix · Nov 8, 2023

See here for the exit codes (return error codes) that my main.py is exiting with. There are UNKNOWN error codes which I cannot explain.

Just to give a bit more detail, I am using a shell script run.sh as a supervisor which restarts the application if it exits abnormally:

# run.sh

# This is the supervisor. It starts the application and monitors its return code (rc):
#   If the application exits successfully, the supervisor's loop is done.
#   If the application aborts, the supervisor waits a DELAY, then restarts it.

export DEPTHAI_LEVEL=debug


DELAY="${OAK_SUPERVISOR_DELAY:-10}"

function log() {
  args="$*"
  echo "run.sh $(date +'%Y-%m-%d %H:%M:%S %Z (UTC%z)') ${args}" | tee -a run.log
}

while true; do
  log "Running application: main.py"
  # python main2.py
  python main.py
  rc=$?
  if [ "$rc" == "0" ]; then
    log "[EXIT] main.py exited successfully."
    exit 0; # Successful exit from app
  elif [ "$rc" == "1" ]; then
    log "[NO_REPLY] main.py exited abnormally (webserver not responding) return code rc=${rc}. Sleeping for $DELAY seconds."
  elif [ "$rc" == "2" ]; then
    log "[SIM_ABORT] main.py simulated an abnormal exit with return code rc=${rc}. Sleeping for $DELAY seconds."
  elif [ "$rc" == "3" ]; then
    log "[NOBODY] main.py exited abnormally (nobody in view) return code rc=${rc}. Sleeping for $DELAY seconds."
  else
    log "main.py exited with an unknown return code rc=${rc}. Sleeping for $DELAY seconds."
  fi
  sleep "$DELAY"
done

and the main.py snippet that generates the exit codes:

with OakCamera() as oak:
    ...
    oak.start(blocking=False) # Start the pipeline (upload it to the OAK)

    rc = 99      # Unknown exit
    KEY_ESC = 27 # Escape key

    poll_webserver.start_time = time.time()
    while oak.running():
        # Since we are not in blocking mode, we have to poll oak camera to
        # visualize frames, call callbacks, process keyboard keys, etc.

        if not poll_webserver(): # If camera web server is not responding
            rc = 1         # Signal abort!
            break

        key = oak.poll()

        if key == ord('d'): logging_env.set_log_level('DEBUG')
        if key == ord('i'): logging_env.set_log_level('INFO')
        if key == ord('w'): logging_env.set_log_level('WARNING')
        if key == ord('v'): logging_env.toggle_verbose()
        if key == ord('f'): logging_env.toggle_file()
            
        if key == ord('a'):
            rc = 2         # Simulate abort!
            break
        if key == ord('q') or key == KEY_ESC: # Exit successfully
            count.reset_count()
            rc = 0                     # Exit successfully
            break
        
        if key == ord('r'):
            print('RUN mode.')
        if key == ord('t'):
            print('TEST mode.')
        
        if key == ord(' '):
            count.direction += 1 # Rotate the direction
            if count.direction==4: # Cycle back
                count.direction=0    # to first direction
            logging.info(f'{prefix()} Direction is now [{count.DIRECTIONS[count.direction]}]')
            count.save_direction()

    logging_env.log_event_and_exit(rc)

As you can see from run.log, there are several rc=99 which are unexplained exits of the program.
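
Reading the snippet above, `rc` can only stay at 99 when the `while` loop ends because its condition became false, that is, `running()` returned False, rather than through one of the `break` statements. Separately, an uncaught exception makes Python exit with status 1, which `run.sh` would log as `[NO_REPLY]`. The sketch below gives both cases their own exit code; `run_loop` and the codes 4 and 5 are hypothetical and do not appear in the original scripts.

```python
RC_OK = 0
RC_PIPELINE_STOPPED = 4  # hypothetical new code, not handled by the original run.sh
RC_EXCEPTION = 5         # hypothetical new code, not handled by the original run.sh

def run_loop(is_running, poll) -> int:
    """Poll until the user quits or the pipeline stops; return an exit code.

    is_running and poll stand in for oak.running and oak.poll.
    """
    try:
        while is_running():
            key = poll()
            if key == ord('q'):
                return RC_OK
        # The loop condition became false without a break: the pipeline stopped.
        return RC_PIPELINE_STOPPED
    except Exception:
        # An uncaught exception would otherwise make Python exit with status 1.
        return RC_EXCEPTION
```

Logging the exception with its traceback before returning would presumably also help to explain some of the unexplained exits.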

jakaskerl · Nov 9, 2023

Hi Craftonix
I'll run the script locally, see if I can reproduce the issue.

Thanks,
Jaka