A warning came up while running the camera: [upl-image-preview url=https://discuss.luxonis.com/assets/files/2024-09-11/1726037861-124891-depthai-prob-001.jpg] The warning advises upgrading the camera firmware, but there aren't any docs on how to upgrade it from 0.0.26 to 0.0.28. Any advice? Here are more details about the stack used:
* depthai: 2.27.0.0
* Python: 3.10.0

Hi @"syg"#2954 , we'd suggest following this guide: https://docs.luxonis.com/software/depthai-components/bootloader/#Danger%20Zone

Hello, following up on this. Other than the camera firmware not being updated, are there any factors that could’ve caused the ping to be missed for the cameras?

For context, we have 24 OAK-D Pro PoE cameras connected to a switch with 24 Ethernet ports, which lets us connect all the cameras. Each of the ports is a dual-speed 100/1000 Mbps SFP slot and can deliver 30W of power, as the switch is configured for the IEEE 802.3at PoE+ standard.

The service we’re deploying only activates 13 cameras, which are booted in sequence. However, the issue starts when we wait for the cameras to boot: out of the 13 cameras, up to 2\~3 will randomly fail to boot. So far, here’s our remedy:

1. Kill the service
2. Wait for a while (30s)
3. Boot up the service
4. Hope that all the cameras have successfully booted w/o any pinging issues

We repeat steps 1\~4 until all the cameras are booted. This is quite time-consuming and problematic, as the connection might drop without warning while the cameras are running.

@"syg"#p25815 It's generally a bootloader issue, so updating it is the best you can do. Also, as another safety check:

```
from time import sleep

import depthai as dai

# Check every second for new devices
while True:
    device_infos = dai.Device.getAllAvailableDevices()
    for device_info in device_infos:
        # Skip devices that are not waiting in the bootloader state
        if device_info.state != dai.XLinkDeviceState.X_LINK_BOOTLOADER:
            continue
        # Start thread here
    sleep(1)
```

Thanks, Jaka

We've updated the camera bootloader to 0.0.28, which is the latest version. However, we're facing another issue: an error thrown by one camera eventually caused all cameras to stop pinging and our service to stop. The attached image contains the error. [upl-image-preview url=https://discuss.luxonis.com/assets/files/2024-09-18/1726638563-320612-depth-ai-prob-002.jpg] Specifically, `F: [global] [ 606417] [Scheduler02Thr] dispatcherResponseServe:928 [19443010C172801300] [192.168.1.240] [1726580608.248] [host] [warning] Monitor thread (device: 19443010C172801300 [192.168.1.240]) - ping was missed, closing the device connection` What's the likely cause of this, and is there any way to catch this global error that's being thrown by the camera?

@"syg"#p25924 Explained [here](https://docs.luxonis.com/hardware/platform/deploy/poe-deployment-guide/#PoE%20deployment%20guide-Runtime-Debugging). Could be high usage. I'd also check the connections. Even a small disturbance in the TCP connection (a drop of packets) could cause this issue. Thanks, Jaka

[upl-image-preview url=https://discuss.luxonis.com/assets/files/2024-09-22/1726995776-227075-depthai-prob-003.jpg] [upl-image-preview url=https://discuss.luxonis.com/assets/files/2024-09-22/1726995782-138616-depthai-prob-004.png] Each of our cameras consumes less than 10 Mbps, which results in a total usage of 13 \* 10 = 130 Mbps, well within our network's maximum bandwidth of 1000 Mbps. Is our only remaining choice to use auto reconnect or standalone mode? Or are there any debugging options we can take? If we're in standalone mode, is it possible to stream the frames from the camera to our server? If so, are there any tutorials on it?

@"syg"#p25924 Hi Syg. Could you share how you updated the firmware? I followed the instructions in the link provided by jakaskerl (https://docs.luxonis.com/software/depthai-components/bootloader/#Danger%20Zone) but I can't get past 0.0.26 or depthai version 2.24.0.0. I've deleted the depthai and depthai-python folders, then cloned them from the current repo master. I followed the docs again as though starting with a clean install, and installed depthai and all requirements. The utility flashed the camera, but it's still at 0.0.26. I'm at a loss to understand why my depthai version won't get past 2.24.0.0 when using the latest repo clone.

@"syg"#p26020 Can you send me a link to the PoE switch you are using? It could be that the power requirements exceed the specs (despite saying each port can output 30W). The devices die in quick succession, which could point to a power issue.

For standalone mode streaming:
https://github.com/luxonis/depthai-experiments/tree/master/gen2-mjpeg-streaming
https://github.com/luxonis/depthai-experiments/tree/master/gen2-poe-tcp-streaming

@"mikegardner"#p26023 Pulling the latest repo is not enough. The installed depthai version has to be the latest, because the depthai version is the carrier of the bootloader.

Thanks, Jaka

Thanks Jaka. I've deleted all of the folders and re-cloned them at the latest level and re-installed depthai. Please consider my query on this thread closed now. Regarding the drop-out I'm experiencing- I'll continue on my original thread.

Hey @"jakaskerl"#1112, thanks for being patient with this. To give you a better overview of how our camera architecture is set up, here is a diagram: [upl-image-preview url=https://discuss.luxonis.com/assets/files/2024-09-23/1727072713-413484-pasted-graphic-1.png]

Our server is connected to the cameras via two switches. One switch has all of its 1000 Mbps ports fully utilized by the OAK cameras; it extends to another switch via a 1000Base SFP module, and the other switch is also rated for 1000 Mbps. See details of both switches here:
* Main network switch:
* Unmanaged switch:

The power should be sufficient according to the calculations that we did. As mentioned, we are only actively using 13 cameras, but the rest will still be consuming power even though we don't connect to them (we may try disconnecting them to see if it improves the situation).

This issue only appears on boot and seems to happen more often when we connect to 10 or more cameras simultaneously. Usually one or two cameras will have their ping missed. However, on the off chance that all cameras successfully connect and boot the pipeline, the cameras are able to run for a few days non-stop without interruption.

How to update outdated OAK-D Pro PoE firmware

hotd

jakaskerl

syg's colleague here, the gpu server has a 1Gbps rated port, everything else sharing the port also should not be using much bandwidth as confirmed by iftop.

we'll do another test by connecting the gpu server directly to the 24 port switch by disconnecting some of the backup cameras.

On an unrelated question, we've followed the depthai guide on syncing frames from multiple streams, however we discovered some skipping in videos recorded (people and objects moving across frames much faster than 5fps). Could bandwidth issues be causing this too?

jakaskerl

Hi hotd
Yes, that could be the culprit.

The reason why is when the bandwidth is saturated, the frames from the cameras get dropped when not read in time.
This would mean there are fewer frames than the set framerate expects for proper playback, so the resulting videos look like they have been sped up.

P.S. When recording, make sure the CPU and disk are fast enough for real-time writing. If the host-side loop is blocked for too long, the same thing could happen. In this case the PC would be the bottleneck.
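
To put a number on that effect, here is a minimal sketch that estimates the speed-up from the capture timestamps of the frames that actually made it into the file. The helper name `playback_speedup` is hypothetical, and it assumes the video is played back at the nominal frame rate:

```python
def playback_speedup(timestamps_s, nominal_fps):
    """Estimate how much faster a video plays when frames were dropped.

    timestamps_s: capture times (in seconds) of the frames that were written.
    nominal_fps: frame rate the video file is played back at.
    """
    if len(timestamps_s) < 2:
        raise ValueError("need at least two frames")
    duration = timestamps_s[-1] - timestamps_s[0]
    # Frames per second that actually reached the file
    effective_fps = (len(timestamps_s) - 1) / duration
    return nominal_fps / effective_fps
```

For example, `playback_speedup([0.0, 0.2, 0.4, 0.8, 1.4, 2.0], 5)` gives 2.0: five frame intervals cover two seconds of real time but play back in one second at 5 fps.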

Thanks,
Jaka

hotd

jakaskerl

Is there a more systematic way to tell if there are any power or bandwidth or frame dropping issues without having to observe through the side effects?

Currently our server is quite powerful and handling this should not actually be an issue, the server is currently not doing anything other than streaming frames from the camera and recording them into video files

jakaskerl

hotd
You could check the sequence numbers of messages when they arrive on the host side. If the numbers don't increment by one from message to message, it's likely some frames are dropped.
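
A minimal sketch of that check, assuming the sequence numbers have already been collected on the host in arrival order (in depthai v2, image messages expose `getSequenceNum()`). The helper name `find_dropped` is hypothetical:

```python
def find_dropped(seq_nums):
    """Return (previous, current, missing_count) for every gap in the sequence."""
    gaps = []
    for prev, cur in zip(seq_nums, seq_nums[1:]):
        if cur - prev > 1:
            gaps.append((prev, cur, cur - prev - 1))
    return gaps
```

With `[0, 1, 2, 4, 5, 8]` as input, this reports one missing frame between 2 and 4 and two missing frames between 5 and 8. Logging these gaps per camera gives a direct count of drops instead of inferring them from the recorded video.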

syg

An update, we’re still facing some issues with the cameras having connection issues, and dropped frames from the videos.

Here are some of the things we’ve tried to reduce the bandwidth saturation:

* Disconnecting the backup cameras (13) from the main switch (24 port). This is to ensure that the connected backup cameras aren’t saturating the bandwidth available to the active cameras (11).
* Having a direct connection between the GPU server & the main switch (24 port) instead of going through the 5 port switch. This is to reduce the bandwidth saturation on the 5 port switch, as it’s used to connect the main switch & our file server.

However, upon booting our application the camera ping issue isn’t resolved. Out of the 13 cameras, only 7 ~ 10 managed to start successfully. So for now we’ve updated our configuration to only support 10 cameras and it seems to work, but we still need to restart a few times before the cameras manage to connect.

We’d like to schedule a call with @jakaskerl & @erik to share more details in order to better troubleshoot the issue we’re facing. Let us know what’s required to make this happen.

erik

Hi @syg , could you also share the MRE that leads to failed connects after you connect >10 devices to POE switch, so we can test it locally? I'd also suggest Standalone mode as an alternative, as there's no XLink disconnects (because there's no XLink). In general, live debugging over call isn't very fruitful.

syg

Hey @erik, thanks for the prompt reply.

I want to clarify does MRE mean Minimum Reproducible Example? If so, I'd have to discuss with my colleague on how to come up with one.

Additionally, the call I've mentioned isn't to perform a live debugging on our system rather it's us providing you guys a more comprehensive overview of our architecture and specs of our server, network switches etc.

We're hoping that with a much more complete overview it'll be much easier to point out any potential setup issues or flaws in our system.

jakaskerl

syg I want to clarify does MRE mean Minimum Reproducible Example? If so, I'd have to discuss with my colleague on how to come up with one.

Yes.

syg

Hey @erik @jakaskerl ,

Some updates on our end.

Currently, we are actively using 13 cameras and we’re still facing the "ping was missed" issue at two points in our Docker application:

1. When we are starting our Docker application.
2. In the middle of the week.
   * As our solution is deployed on-prem, we check on it every week. During our checks we discovered that a few cameras had a "ping was missed" error, which caused our Docker application to hang.
   * As an example, we started our Docker service on 07/10/2024 and the "ping was missed" error showed up on 09/10/2024, causing our service to hang.

Here are some troubleshooting steps we’ve tried:

We’ve used a few scripts from your repository for us to obtain some metrics, this enabled us to benchmark against the desired result shown below:

Benchmark

Scripts

Poe Link test result

Here’s a sample result of a few cameras

# Cam 1
Connecting to  10.0.0.11 ...
mxid: 184430101110E8F400 (OK)
speed: 1000 (OK)
full duplex: 1 (OK)
boot mode: 3 (OK)

# Cam 3
Connecting to  10.0.0.13 ...
mxid: 18443010F16EE6F400 (OK)
speed: 1000 (OK)
full duplex: 1 (OK)
boot mode: 3 (OK)

# Cam 6
Connecting to  10.0.0.16 ...
mxid: 18443010517BE7F400 (OK)
speed: 1000 (OK)
full duplex: 1 (OK)
boot mode: 3 (OK)

As seen, our link test managed to hit the benchmark which requires a speed of 1000 Mbps

OAK bandwidth test result

Here’s a sample result of a few cameras:

# Cam 1
Downlink 883.4 mbps
Uplink 219.4 mbps

# Cam 3
Downlink 890.0 mbps
Uplink 218.3 mbps

# Cam 6
Downlink 890.6 mbps
Uplink 218.4 mbps

As seen, our downlink and uplink are within the acceptable range of 800 Mbps downlink & 200 Mbps uplink

OAK Latency test

Here’s a sample result of a few cameras:

# Cam 1
Average latency 2.48 ms, Std: 0.4

# Cam 3
Average latency 2.73 ms, Std: 0.3

# Cam 6
Average latency 3.06 ms, Std: 0.3

As seen, our average latency across the cameras is less than 10 ms, which is required to pass the benchmark

After running the test, we can conclude that we’re not limited by our network equipment as the resulting latency, bandwidth and link speed are within the benchmark range stated above.

Additionally, we can confirm that enough power is supplied to the cameras based on the following

As seen from the image, each port can deliver up to 30W to the cameras and each camera is only consuming up to 4~5W, so the supplied power is sufficient for the cameras to function

CPU Usage of cameras

To ensure that our pipeline isn’t too complex which can cause a high CPU usage of the LeonOS and potentially lead to a ping was missed error, we’ve enabled logging on our setup through DEPTHAI_LEVEL=info to observe the CPU usage.

The result shows that our pipeline is only consuming up to 12% of the CPU . Proving that our pipeline isn’t complex enough to cause a high CPU usage.

Allowing longer boot time for the cameras

On the debugging guide, a section indicates that some network equipment might not work well with the default timeout they’ve set.

To ensure that there’s enough time for the camera to boot & the watchdog doesn’t disconnect too early, we’ve introduced two environment variables:

DEPTHAI_BOOTUP_TIMEOUT
DEPTHAI_WATCHDOG_INITIAL_DELAY

And set their values to 60000 (60s), allowing more time for the cameras to ping the server.

However, this didn’t work, as the "ping was missed" error still occurred upon boot
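
For reference, a minimal sketch of how the two variables could be set from Python. It assumes they need to be in the process environment before `depthai` is imported (the commented-out `DEPTHAI_LEVEL` line in the MRE follows the same pattern); the helper name `set_depthai_timeouts` is hypothetical:

```python
import os


def set_depthai_timeouts(timeout_ms=60000):
    """Export the boot-up timeout and initial watchdog delay, both in milliseconds."""
    values = {
        "DEPTHAI_BOOTUP_TIMEOUT": str(timeout_ms),
        "DEPTHAI_WATCHDOG_INITIAL_DELAY": str(timeout_ms),
    }
    os.environ.update(values)
    return values


set_depthai_timeouts(60000)
# import depthai as dai  # assumption: import only after the variables are set
```

The same values can also be passed through the container environment (for example in the Docker service definition), which avoids relying on import order.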

Additional Context

Based on our above troubleshooting steps, we’re still facing the ping was missed issue. Hence, we’re providing our hardware architecture to provide a better overview of our setup

Lastly, here’s a Minimal Reproducible Example (MRE) that you guys can run on your end to simulate our environment:

import os
import re
import cv2
import math
import time
import yaml
import numpy as np
import socket
#os.environ["DEPTHAI_LEVEL"] = "debug"
import depthai as dai
from collections import deque
from threading import Thread
from datetime import datetime, timedelta
from loguru import logger
from typing import List, Any
from pydantic import BaseModel



class Frame(BaseModel):
    has_rgb: bool
    has_depth: bool
    rgb_frame: Any = None
    depth_frame: Any = None
    timestamp: timedelta 

class OakDProPOE(Thread):

    def __init__(
        self,
        buffer: deque,
        mxid: str,
        server_ip: str,
        server_port: int,
        fps: int = 30,
    ):
        super().__init__()
        self.fps = fps
        self.buffer = buffer
        self.recording = False
        self.ip = server_ip
        self.port = server_port
        self.mxid = mxid
        self.info = self.lookup_camera(mxid)
        self.recording = False
        self.connection: socket.socket = None

    def lookup_camera(self, mxid: str):

        for i in range(3):
            ret, info = dai.Device.getDeviceByMxId(mxid)
            if ret:
                self.mxid = mxid
                return info
        else:
            raise Exception(f"{mxid} not found!")
        
    def start_camera(self):

        self.recording = True

    def stop_camera(self):

        if self.connection:
            self.connection.close()
        self.recording = False

    def poe_pipeline(self):

        logger.info(f"{self.mxid} pipeline initiated")
        pipeline = dai.Pipeline()

        camRgb = pipeline.createColorCamera()
        camRgb.setIspScale(2,3)

        videoEnc = pipeline.create(dai.node.VideoEncoder)
        videoEnc.setDefaultProfilePreset(30, dai.VideoEncoderProperties.Profile.MJPEG)
        camRgb.video.link(videoEnc.input)

        script = pipeline.create(dai.node.Script)
        script.setProcessor(dai.ProcessorType.LEON_CSS)
        videoEnc.bitstream.link(script.inputs['frame'])

        script.setScript(
            f"""
HOST_IP = '{self.ip}'
HOST_PORT = {self.port}
                 
import socket
import time

node.warn(f'>Going to connect to {{HOST_IP}}:{{HOST_PORT}}<')
sock = socket.socket()
sock.connect((HOST_IP, HOST_PORT))

while True:

    pck = node.io["frame"].get()
    data = pck.getData()
    ts = pck.getTimestamp()
    header = f"ABCDE " + str(ts.total_seconds()).ljust(18) + str(len(data)).ljust(8)
    sock.send(bytes(header, encoding='ascii'))
    sock.send(data)
"""
        )
        logger.info(self.mxid)
        device_info = dai.DeviceInfo(self.mxid)
        logger.info(device_info)

        try:

            with dai.Device(pipeline, device_info) as device:
                logger.info(f"Pipeline running on {self.mxid}")
                while True:
                    time.sleep(1)

        except Exception as e:
            logger.error(e)
            raise

    def run(self):

        def get_frame(sock, size):
            # Read exactly `size` bytes so the next header is never consumed
            data = b""
            while len(data) < size:
                chunk = sock.recv(min(4096, size - len(data)))
                if not chunk:
                    raise ConnectionError("socket closed mid-frame")
                data += chunk
            return data

        server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        server.bind(("0.0.0.0", self.port))

        cam_thread = Thread(target=self.poe_pipeline)
        cam_thread.daemon = True
        logger.info("Pipeline started")

        server.listen()

        logger.info("Starting cam thread")
        cam_thread.start()
        self.connection, client = server.accept()
        try:
            logger.info(f"{self.mxid} connected")
            while self.recording:
                header = str(self.connection.recv(32), encoding="ascii")
                chunks = re.split(" +", header)
                if chunks[0] == "ABCDE":
                    # print(f">{header}<")
                    ts = float(chunks[1])
                    imgSize = int(chunks[2])
                    img = get_frame(self.connection, imgSize)
                    buf = np.frombuffer(img, dtype=np.byte)
                    # print(buf.shape, buf.size)
                    frame = cv2.imdecode(buf, cv2.IMREAD_COLOR)

                    frame = Frame(
                        has_rgb=True,
                        has_depth=True,
                        rgb_frame=cv2.cvtColor(frame, cv2.COLOR_BGR2RGB),
                        depth_frame=None,
                        timestamp=timedelta(seconds=ts),
                    )

                    self.buffer.append(frame)

        except Exception as e:
            # TODO: Handle Server Error
            logger.error(f"{self.mxid}: " + str(e))
            raise

        server.close()


class Controller:


    def __init__(self, yaml_file: str, fps: int, server_ip: str):

        self.fps = fps
        self.sync_threshold = timedelta(milliseconds=math.ceil(5000 / self.fps))
        self.server_ip = server_ip
        self.load_camera_mapping_file(yaml_file)
    
    def load_camera_mapping_file(self, yaml_file):

        logger.info(f"Controller reading yaml file: {yaml_file}")
        with open(yaml_file) as f:
            self.camera_mapping = yaml.load(f, Loader=yaml.FullLoader)

        # Display the yaml config
        logger.info(
            yaml.dump(self.camera_mapping, default_flow_style=False, sort_keys=False)
        )

        camera_type = self.camera_mapping["CameraType"]

        self.active_cameras = {}
        for cam in self.camera_mapping["ActiveCameras"]:
            # Create a buffer for the camera
            buffer = deque([], maxlen=self.fps * 50) # buffer holds up to 50 seconds worth of frames
            main_camera_key = cam["main_camera"]["camera_id"]

            if camera_type == "OakDProPoE":
                main_camera = OakDProPOE(
                    fps=self.fps,
                    buffer=buffer,
                    mxid=cam["main_camera"]["mxid"],
                    server_ip=self.server_ip,
                    server_port=cam["main_camera"]["server_port"],
                )


            else:
                raise Exception(f"Camera type {camera_type} is invalid!")
            
            self.active_cameras[main_camera_key] = {
                "buffer": buffer,
                "synced_buffer": deque([], maxlen=500),
                "main_camera": main_camera,
            }

    def get_active_cameras(self):

        return self.active_cameras

    def start_cameras(self):

        for cam_idx in self.active_cameras:
            camera = self.active_cameras[cam_idx]["main_camera"]
            camera.daemon = True
            camera.start_camera()
            camera.start()


        logger.info("Cameras started")

    def stop_cameras(self):

        logger.info("Controller stopping cameras")

        for cam_idx in self.active_cameras:
            camera = self.active_cameras[cam_idx]["main_camera"]

            camera.stop_camera()
            camera.join()

        logger.info("Cameras stopped")

    def check_sync(self, timestamp: timedelta):

        matching_frame_indexes = []
        # Try to find matching frame in each queue
        for active_cam in self.active_cameras.values():
            for i, frame in enumerate(active_cam["synced_buffer"]):
                time_diff = abs(frame.timestamp - timestamp)
                if time_diff <= self.sync_threshold:
                    # We now have the synced frame index for this particular camera
                    matching_frame_indexes.append(i)
                    break

        # When synced frames are found, clear all unused/out-of-sync frames
        if len(matching_frame_indexes) == len(self.active_cameras):

            for i, q in enumerate(self.active_cameras.values()):
                for j in range(0, matching_frame_indexes[i]):
                    q["synced_buffer"].popleft()

            return True

        else:

            return False


    def get_synced_frames(self) -> dict[int, Frame]:

        # Iterate to try and get new frame from any buffer
        start = time.perf_counter()
        for cam in self.active_cameras:

            if self.active_cameras[cam]["buffer"]:
                # Get the frame from camera's buffer to controller's sync buffer
                # this ensures we can synchronously process the frames in sync buffer without worrying about thread safety from the camera buffer
                frame = self.active_cameras[cam]["buffer"].popleft()
                self.active_cameras[cam]["synced_buffer"].append(frame)
                # Check sync to see if we have a group of synchronized frames across all sync buffers
                if self.check_sync(frame.timestamp):

                    data = {}
                    for cam in self.active_cameras:
                        data[cam] = self.active_cameras[cam]["synced_buffer"].popleft()
                    logger.debug(f"Sync took {time.perf_counter() - start} seconds")
                    start = time.perf_counter()
                    return data
                

if __name__ == "__main__":
    
    controller = Controller(
        yaml_file="configs/camera_mapping_office.yaml",
        fps=5,
        server_ip="192.168.1.236"
    )

    controller.start_cameras()

    while True:

        synced_frame = controller.get_synced_frames()

To run the MRE you’d need to install the packages listed in the import statements of the file and have a camera-config.yaml file which reflects the configuration of the cameras.

Here’s what a sample camera-config.yaml looks like:

CameraType: OakDProPoE
ActiveCameras:
  - main_camera: 
      camera_id: 25
      mxid: 18443010117CE9F400
      server_port: 10025
    backup:
  - main_camera: 
      camera_id: 26
      mxid: 18443010D1F3E7F400
      server_port: 10026
    backup:
  - main_camera: 
      camera_id: 27
      mxid: 18443010C1D6E1F400
      server_port: 10027
    backup:
  - main_camera: 
      camera_id: 28
      mxid: 19443010C172801300
      server_port: 10028
    backup:

Some notes on the camera-config.yaml

On the actual file, it should contain the configuration for 13 cameras
camera_id & backup: is not required
Each mxid should contain a unique camera mxid

It’d be filled in at line 303 of mre.py:

Here’s what a potential directory structure could look like:

.
└── (some-folder)/
    ├── mre.py
    ├── camera-config.yaml
    └── requirements.txt

jakaskerl

syg
Thanks for the thorough investigation. This seems to have checked all probable points of failure. The issue is depthai's overall stability, which we cannot fix as the issues stem from deeper in Intel's FW.

So I suggest writing a reconnect script that quickly restarts the pipeline in case of missed pings. Simple try-catch would work, but you can make it more sophisticated.
We have implemented reconnect functionality in V3 as well if you wish to use that when it comes out.
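
A minimal sketch of such a restart loop, not an official depthai helper. It assumes the failure reaches the host thread as a `RuntimeError`; in the MRE the host loop only sleeps, so `run_pipeline` would have to raise one itself when it notices the device is gone (for example by polling `device.isClosed()`). The names `run_with_reconnect` and `run_pipeline` are hypothetical:

```python
import time


def run_with_reconnect(run_pipeline, max_restarts=5, delay_s=1.0, sleep=time.sleep):
    """Call run_pipeline() and restart it whenever it raises RuntimeError.

    Returns the number of restarts that were needed before a clean exit.
    """
    restarts = 0
    while True:
        try:
            run_pipeline()
            return restarts
        except RuntimeError as exc:
            if restarts >= max_restarts:
                raise  # give up and let the caller decide what to do
            restarts += 1
            print(f"pipeline failed ({exc}); restart {restarts}/{max_restarts}")
            sleep(delay_s)
```

Here `run_pipeline` would wrap the `with dai.Device(pipeline, device_info) as device:` block from `poe_pipeline`, so each restart opens a fresh device connection.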

Thanks,
Jaka

syg

Hey @jakaskerl ,

Thanks for the suggestion.

As we've provided an MRE, can you take a look at it and point out any flaws / improvements that we can make (other than implementing a reconnect)?

Thanks!

jakaskerl

syg
The code looks perfectly fine, no redundant code. I'd maybe add a mechanism to make sure all devices are connected to make sure you don't need to restart the script if one device fails to connect for some reason. As for depthai side, I've nothing to add.
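
One way such a mechanism could look, as a rough sketch: retry only the devices that failed instead of restarting everything. `connect_all` and `try_connect` are hypothetical names; `try_connect(mxid)` is assumed to return `True` once that camera's pipeline is up:

```python
def connect_all(mxids, try_connect, max_rounds=3):
    """Retry only the devices that failed, for up to max_rounds rounds.

    Returns (connected, failed) as two lists of MXIDs.
    """
    pending = list(mxids)
    connected = []
    for _ in range(max_rounds):
        still_pending = []
        for mxid in pending:
            if try_connect(mxid):
                connected.append(mxid)
            else:
                still_pending.append(mxid)
        pending = still_pending
        if not pending:
            break
    return connected, pending
```

Cameras left in the second list after the last round could be logged or swapped for a backup camera rather than blocking the whole service.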

Thanks,
Jaka

syg

Thanks for the prompt reply @jakaskerl .

Is it possible to save the output logs of **DEPTHAI_LEVEL=info** to a file?

jakaskerl

syg
Like done here should work I think:
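
The linked example is not preserved in this thread. As one generic option (not necessarily what the link shows), the script can be launched as a child process with its output redirected to a file. This sketch assumes the depthai log lines go to the process's stdout or stderr; `run_and_log` is a hypothetical helper:

```python
import os
import subprocess
import sys


def run_and_log(cmd, log_path, level="info"):
    """Run cmd with DEPTHAI_LEVEL set, writing stdout and stderr to log_path."""
    env = dict(os.environ, DEPTHAI_LEVEL=level)
    with open(log_path, "w") as log_file:
        result = subprocess.run(
            cmd, env=env, stdout=log_file, stderr=subprocess.STDOUT
        )
    return result.returncode
```

For example, `run_and_log([sys.executable, "mre.py"], "depthai.log")` would run the MRE and collect everything it prints in `depthai.log`.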

Thanks,
Jaka
