Program runs for several hours, then freezes becoming completely unresponsive
I was having some issues with my PoE…multi-cam config. After a lot of learning, one thing comes to mind regarding your problem. The DHCP server leases each camera its IP address for a limited time (without a lease, the cameras fall back to a default address ending in .222). Once that lease time is exceeded, the IP may be dropped or renewed (perhaps to a new IP).
So you might want to check your DHCP settings for the following (the values below are the usual defaults):
default-lease-time 600;
max-lease-time 7200;
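For reference, a DHCP client normally does not wait for the lease to run out. Per RFC 2131 it tries to renew at half the lease time (T1) and to rebind at 87.5% of it (T2), unless the server overrides those timers. A minimal sketch of that arithmetic for the two values above (the function name is made up for illustration):

```python
def dhcp_timers(lease_seconds: int) -> dict:
    """Return the default RFC 2131 renewal (T1) and rebinding (T2) times in seconds."""
    return {
        "lease": lease_seconds,
        "t1_renew": lease_seconds * 0.5,     # client starts asking its server to renew
        "t2_rebind": lease_seconds * 0.875,  # client starts asking any server to rebind
    }


for name, lease in (("default-lease-time", 600), ("max-lease-time", 7200)):
    timers = dhcp_timers(lease)
    print(f"{name}: renew after {timers['t1_renew']:.0f} s, "
          f"rebind after {timers['t2_rebind']:.0f} s")
```

So with a 600-second lease, renewal traffic would be expected roughly every 5 minutes, which is far more often than a freeze that only shows up after several hours.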
@The_Real_Enrico_Pallazzo Are you saying that the socket connection to the camera would be dropped if the DHCP server is not configured properly?
In our case, the DHCP-assigned IP address has been the same for the last 2 months and has never changed.
But if I understand you correctly, is it possible that the DHCP server issues a RENEW request and, even though it gives the same IP address again, it disconnects the socket and re-connects it? But if this is the case, wouldn't we get a Python exception and exit the program?
Craftonix
I don't see a reason why they would have coded in a probable error like this one in regards to networking issues…this is Python, not Rust.
In essence, if the DHCP is assigning the same IP, that's all good and predictable. Mine does as well: always a .213 and a .220 IP allocation for the 2 cams. However, one, they are not statically allocated via MAC address, and two, they don't have an infinite lease on their IP.
What I'm pointing out is that this "might" be the issue causing your disconnects. From how you've described the situation, going from working for several hours to a sudden drop, this is what came to mind.
Good luck with your debugging, I still need to even get to your level with the connection issues I'm having ^^.
@The_Real_Enrico_Pallazzo thanks for the tips, but we never experience "disconnects". We never get errors. The Visualizer simply freezes, which means the video feed is stuck at the last frame it got from the camera. The application loop is still running because we can still press keys and exit successfully:
    while oak.running():
        key = oak.poll()
        if key == ord('q'):  # Exit successfully
            break
What I was trying to ask is: if the DHCP server renews the lease, does the TCP socket connection get dropped? If YES, then we would get an error and the program would exit. I'm assuming that the line "with OakCamera() as oak:" opens a TCP socket to the PoE camera.
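One thing worth keeping in mind here: a TCP connection that simply goes quiet does not raise anything on its own. A blocking read just waits, and an exception only shows up if a timeout is set or the peer actually closes or resets the connection. A small sketch of that with plain Python sockets (a local socketpair stands in for the camera link; nothing here is depthai API, and the function name is made up):

```python
import socket


def read_with_timeout(sock: socket.socket, timeout_s: float) -> str:
    """Try one read; report whether data arrived, the peer closed, or nothing happened."""
    sock.settimeout(timeout_s)
    try:
        data = sock.recv(1024)
    except socket.timeout:
        return "stalled"  # still connected, but no data within the timeout
    return "closed" if data == b"" else "data"


host_side, camera_side = socket.socketpair()
camera_side.sendall(b"frame")
print(read_with_timeout(host_side, 0.2))  # data is waiting to be read
print(read_with_timeout(host_side, 0.2))  # peer is silent but still connected
camera_side.close()
print(read_with_timeout(host_side, 0.2))  # peer has closed the connection
host_side.close()
```

In other words, the absence of an error does not by itself rule out a problem on the link; a silent stall and a healthy idle connection look the same unless something is timing the reads.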
Craftonix
It's better that you just go ahead and do that tiny test to see if you can replicate your problem (by fiddling with the DHCP server settings, e.g. setting the lease expiration to 5 min).
Judging by this post: (https://superuser.com/questions/1217093/how-to-prevent-network-outages-caused-by-renewal-of-dhcp-leasing) …
…And other random bits on the net…..just troubleshoot m8, do the tweak, if it breaks the same way as after many hours, you've discovered your issue, if not, keep debugging.
Good luck!
@The_Real_Enrico_Pallazzo what is m8 ?
The DHCP is provided by an AT&T mobile hotspot router: the Netgear Nighthawk M6 Pro:
https://www.downloads.netgear.com/files/GDC/MR6500/MR6500_MR6110_UM_EN.pdf
It does not provide a method for setting the DHCP lease time. I read the entire manual and checked all the settings in the web interface.
The system that is freezing is in production. It is a mobile system that is being used in a different venue in a different city every day. This is why it is connected to the AT&T mobile hotspot. It is not possible for us to connect to the venue's router and some days they don't even have a router.
You are right that the AT&T hotspot only has one Ethernet port, but this port is connected to a TP-Link TL-SG105 Ethernet switch. The Macbook Air M2 is also connected to this switch. An Ethernet cable goes from this switch to a TL-SG1005P PoE Switch. The OAK-D PoE Series 1 is connected to this PoE Switch. All devices are using Ethernet and no WiFi is used anywhere. The laptop's WiFi is always turned off.
I have configured the OAK-D with a static IP address to eliminate any problems due to DHCP lease renewal. However, today, it froze again even with a static IP address of 192.168.1.2. The router is at 192.168.1.1
Here is a screenshot of how it froze:
- Note that the ping in the top right terminal continues successfully.
- The bottom left terminal continues to scroll giving debug info from the camera.
- The video feed, Frame number, and Running time are all frozen and not updating.
- LeonOS is 60-70%, LeonRT is 3.5%
However, when the system is running properly without freezing:
LeonOS is around 83% and LeonRT is around 35%.
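Since the poll loop keeps running while the frames stop, one way to catch this state is a small watchdog: note the time whenever the callback delivers a frame, and have the main loop check how long ago that was. A rough sketch, assuming mark_frame() is called from your own packet callback and is_stalled() from the polling loop. The class and the 10-second limit are made up for illustration and are not part of depthai_sdk:

```python
import time


class FrameWatchdog:
    """Tracks when the last frame arrived and reports a stall after `limit_s` seconds."""

    def __init__(self, limit_s: float = 10.0, clock=time.monotonic):
        self.limit_s = limit_s
        self.clock = clock          # injectable so the logic can be tested without waiting
        self.last_frame = clock()

    def mark_frame(self) -> None:
        """Record that a fresh frame has just been received."""
        self.last_frame = self.clock()

    def is_stalled(self) -> bool:
        """True if no frame has been marked for longer than the limit."""
        return self.clock() - self.last_frame > self.limit_s


watchdog = FrameWatchdog(limit_s=10.0)
watchdog.mark_frame()          # would go inside the packet callback
print(watchdog.is_stalled())   # would be checked on every pass of the polling loop
```

If it reports a stall, the loop can break out and exit with a distinct return code, so that whatever restarts the app can tell a frozen feed apart from other exits.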
Hi @Craftonix ,
Thank you for the thorough report. Could you also tell us your depthai version, and bootloader version flashed on the OAK-D-PoE camera?
Thanks, Erik
Could you also provide a minimal repro example, so we can try it locally? We suspect that something could be blocking the pipeline after some time (see the Node blocking docs), due to the blocking behaviour of some of the queues (or pools running out of space).
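To illustrate what blocking means here, a generic sketch with Python's standard queue module (this is not depthai code, just the same idea, and the helper names are made up): with a bounded blocking queue, a consumer that stops reading eventually stalls the producer, while a non-blocking queue drops the oldest item and keeps going.

```python
import queue


def push_blocking(q: queue.Queue, item, timeout_s: float = 0.1) -> bool:
    """Blocking-style push: returns False if the queue stayed full (producer would stall)."""
    try:
        q.put(item, timeout=timeout_s)
        return True
    except queue.Full:
        return False


def push_drop_oldest(q: queue.Queue, item) -> None:
    """Non-blocking-style push: discard the oldest item to make room for the new one."""
    while True:
        try:
            q.put_nowait(item)
            return
        except queue.Full:
            try:
                q.get_nowait()  # throw away the oldest entry and retry
            except queue.Empty:
                pass


blocking_q = queue.Queue(maxsize=2)
print([push_blocking(blocking_q, n) for n in range(4)])  # nobody is reading

dropping_q = queue.Queue(maxsize=2)
for n in range(4):
    push_drop_oldest(dropping_q, n)
print(list(dropping_q.queue))  # only the newest items remain
```

depthai exposes comparable blocking and queue-size settings on node inputs and host queues, which is what the Node blocking docs cover.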
erik Minimal repro example can be found at the beginning of this post. The code is taken from the example programs, but I'm adding my 'count' module:
from count import Count # Our math for counting people
and calling it from the callback:
    def cb(packet: TrackerPacket):
        ...
        frame = Count.count_people(frame, packet.daiTracklets)
        ...
The count_people method does some geometry calculations based on the centroid of the tracklets to determine the count. Then it draws the 4 text strings on the frame and returns the frame to the callback to be drawn by the visualizer.
Bootloader version:
% ./bootloader_version.py
Found device with name: 192.168.1.2
Version: 0.0.26
NETWORK Bootloader, is User Bootloader: False
Memory 'Memory.FLASH' size: 16777216, info: JEDEC ID: 01 20 18
Memory 'Memory.EMMC' size: 15758000128, info:
depthai version:
run.sh 2023-11-07 21:58:36 EST (UTC-0500) Running application: main.py
[2023-11-07 21:58:36.648] [depthai] [debug] Python bindings - version: 2.22.0.0 from 2023-06-13 02:13:35 +0200 build: 2023-06-13 00:55:04 +0000
[2023-11-07 21:58:36.648] [depthai] [debug] Library information - version: 2.22.0, commit: 82ab07d037c02f56042d1d2d55a718f379651ed9 from 2023-06-13 02:13:14 +0200, build: 2023-06-13 00:55:03 +0000
[2023-11-07 21:58:36.649] [depthai] [debug] Initialize - finished
[2023-11-07 21:58:36.722] [depthai] [debug] Resources - Archive 'depthai-bootloader-fwp-0.0.26.tar.xz' open: 1ms, archive read: 72ms
[2023-11-07 21:58:36] CRITICAL [root.set_log_level:31] Setting LOGLEVEL=DEBUG
[2023-11-07 21:58:37.045] [depthai] [debug] Resources - Archive 'depthai-device-fwp-f033fd9c7eb0b3578d12f90302e87759c78cfb36.tar.xz' open: 1ms, archive read: 395ms
[2023-11-07 21:58:37.502] [depthai] [debug] Searching for booted device: DeviceInfo(name=192.168.1.2, mxid=1844301021D93E1300, X_LINK_BOOTLOADER, X_LINK_TCP_IP, X_LINK_MYRIAD_X, X_LINK_SUCCESS), name used as hint only
[2023-11-07 21:58:37.518] [depthai] [debug] Connected bootloader version 0.0.26
[2023-11-07 21:58:38.364] [depthai] [debug] DeviceBootloader about to be closed...
[2023-11-07 21:58:38.364] [depthai] [debug] XLinkResetRemote of linkId: (0)
[2023-11-07 21:58:39.023] [depthai] [debug] DeviceBootloader closed, 659
[2023-11-07 21:58:39.030] [depthai] [debug] Searching for booted device: DeviceInfo(name=192.168.1.2, mxid=1844301021D93E1300, X_LINK_BOOTED, X_LINK_TCP_IP, X_LINK_MYRIAD_X, X_LINK_SUCCESS), name used as hint only
[1844301021D93E1300] [192.168.1.2] [4.767] [system] [info] Memory Usage - DDR: 0.12 / 337.18 MiB, CMX: 2.04 / 2.50 MiB, LeonOS Heap: 25.29 / 80.03 MiB, LeonRT Heap: 2.89 / 41.06 MiB
[1844301021D93E1300] [192.168.1.2] [4.768] [system] [info] Temperatures - Average: 44.41C, CSS: 46.00C, MSS 43.50C, UPA: 43.27C, DSS: 44.87C
[1844301021D93E1300] [192.168.1.2] [4.768] [system] [info] Cpu Usage - LeonOS 52.18%, LeonRT: 0.62%
See here for the exit codes (return error codes) that my main.py is exiting with. There are UNKNOWN error codes which I cannot explain.
Just to give a bit more detail, I am using a shell script run.sh as a supervisor which restarts the application if it exits abnormally:
    # run.sh
    # This is the supervisor. It starts the application and monitors its return code (rc):
    # If the application exits successfully, the supervisor's loop is done.
    # If the application aborts, the supervisor waits a DELAY, then restarts it.
    export DEPTHAI_LEVEL=debug
    DELAY="${OAK_SUPERVISOR_DELAY:-10}"
    function log() {
        args="$*"
        echo "run.sh $(date +'%Y-%m-%d %H:%M:%S %Z (UTC%z)') ${args}" | tee -a run.log
    }
    while true; do
        log "Running application: main.py"
        # python main2.py
        python main.py
        rc=$?
        if [ "$rc" == "0" ]; then
            log "[EXIT] main.py exited successfully."
            exit 0  # Successful exit from app
        elif [ "$rc" == "1" ]; then
            log "[NO_REPLY] main.py exited abnormally (webserver not responding) return code rc=${rc}. Sleeping for $DELAY seconds."
        elif [ "$rc" == "2" ]; then
            log "[SIM_ABORT] main.py simulated an abnormal exit with return code rc=${rc}. Sleeping for $DELAY seconds."
        elif [ "$rc" == "3" ]; then
            log "[NOBODY] main.py exited abnormally (nobody in view) return code rc=${rc}. Sleeping for $DELAY seconds."
        else
            log "main.py exited with an unknown return code rc=${rc}. Sleeping for $DELAY seconds."
        fi
        sleep "$DELAY"
    done
and the main.py snippet that generates the exit codes:
    with OakCamera() as oak:
        ...
        oak.start(blocking=False)  # Start the pipeline (upload it to the OAK)
        rc = 99       # Unknown exit
        KEY_ESC = 27  # Escape key
        poll_webserver.start_time = time.time()
        while oak.running():
            # Since we are not in blocking mode, we have to poll oak camera to
            # visualize frames, call callbacks, process keyboard keys, etc.
            if not poll_webserver():  # If camera web server is not responding
                rc = 1  # Signal abort!
                break
            key = oak.poll()
            if key == ord('d'): logging_env.set_log_level('DEBUG')
            if key == ord('i'): logging_env.set_log_level('INFO')
            if key == ord('w'): logging_env.set_log_level('WARNING')
            if key == ord('v'): logging_env.toggle_verbose()
            if key == ord('f'): logging_env.toggle_file()
            if key == ord('a'):
                rc = 2  # Simulate abort!
                break
            if key == ord('q') or key == KEY_ESC:  # Exit successfully
                count.reset_count()
                rc = 0  # Exit successfully
                break
            if key == ord('r'):
                print('RUN mode.')
            if key == ord('t'):
                print('TEST mode.')
            if key == ord(' '):
                count.direction += 1      # Rotate the direction
                if count.direction == 4:  # Cycle back
                    count.direction = 0   # to first direction
                logging.info(f'{prefix()} Direction is now [{count.DIRECTIONS[count.direction]}]')
                count.save_direction()
        logging_env.log_event_and_exit(rc)
As you can see from run.log, there are several rc=99 which are unexplained exits of the program.
ok, but just to clarify a bit more, when rc==99, this indicates that the while oak.running() loop exited because oak.running() returned False.
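For return codes that the supervisor cannot label, it may help to remember that a shell reports 128 + N when the child was killed by signal N (for example 139 for SIGSEGV), so such values were not chosen by main.py itself. A small sketch of a decoder using the codes from the posts above; the function name and the wording of the labels are my own:

```python
import signal

# Return codes that main.py sets itself, as described in the thread.
APP_CODES = {
    0: "EXIT: exited successfully",
    1: "NO_REPLY: webserver not responding",
    2: "SIM_ABORT: simulated abort",
    3: "NOBODY: nobody in view",
    99: "loop ended because oak.running() returned False",
}


def describe_rc(rc: int) -> str:
    """Turn a shell-reported return code into a human-readable explanation."""
    if rc in APP_CODES:
        return APP_CODES[rc]
    if rc > 128:
        # Shell convention: 128 + signal number means the process was killed by that signal.
        try:
            return f"killed by signal {signal.Signals(rc - 128).name}"
        except ValueError:
            return f"killed by unknown signal {rc - 128}"
    return f"unknown return code {rc}"


for rc in (0, 99, 137, 139, 42):
    print(rc, "->", describe_rc(rc))
```

Logging the decoded reason next to rc in run.log would make it easier to separate crashes of the Python process from the rc=99 case.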
We are also facing the same issue. Our cameras at the client's site freeze after about 6 days of continuous running. The script keeps running without failure, which means the camera is not disconnected. In our code we save the camera frame whenever a specific detection happens, and after the freeze we kept getting the same frame. This shows that the host is still receiving frames, but it is the same frame every time; that is, the camera keeps sending the same frame to the host after it freezes.
Is there any solution for this?
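One way to at least detect this on the host, as a rough sketch: hash each frame's bytes and count how many consecutive frames are identical. A live, uncompressed camera feed usually does not repeat byte-for-byte because of sensor noise, so a long run of identical hashes is a strong hint that the stream is frozen. The class name and threshold are made up; with numpy frames you would pass frame.tobytes().

```python
import hashlib


class RepeatDetector:
    """Counts consecutive byte-identical frames and flags a freeze at `max_repeats`."""

    def __init__(self, max_repeats: int = 30):
        self.max_repeats = max_repeats
        self.last_digest = None
        self.repeats = 0

    def update(self, frame_bytes: bytes) -> bool:
        """Feed one frame; returns True once enough identical frames arrived in a row."""
        digest = hashlib.sha256(frame_bytes).digest()
        if digest == self.last_digest:
            self.repeats += 1
        else:
            self.last_digest = digest
            self.repeats = 0
        return self.repeats >= self.max_repeats


detector = RepeatDetector(max_repeats=3)
frames = [b"a", b"b", b"b", b"b", b"b"]
print([detector.update(f) for f in frames])
```

This only detects the freeze; it does not explain it, but it gives the host a trigger for restarting the pipeline or logging device state at the moment it happens.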
Hi @NileshBawiskar
Interesting, would you mind creating an issue in a separate thread and maybe post some images/reproducible code?
Thanks,
Jaka