So, we have a bit of a plumbing nightmare: a body of code that probes for PoE cameras, then dispatches one thread per camera. Each thread dynamically loads the appropriate code for its camera and performs callbacks to the ultimate consumer based on queue activity.

The vast majority of the time this works just fine, but occasionally, when more than one camera is initializing, things blow up deep in XLink land. We basically start the new threads as fast as we can, so there's a probability of one that we're hitting things in the guts of things while booting all the cameras. My guess is that the code path for associating a pipeline and booting a device, in our case this little snippet:

// Connect to device and start pipeline
dai::Device device(pipeline,
                   drongoCameras[DRONGO_CAM_HASH(startBlock->uid)].devInfo);

may not be entirely thread safe.

Because it works the vast majority of the time, it's been difficult to characterize beyond "every now and then it dies horribly during initialization". Some examples:

[Thread 0x7fffc2f73700 (LWP 467188) exited]
[New Thread 0x7fffc2772700 (LWP 467191)]
[New Thread 0x7fffb9b8b700 (LWP 467192)]
[2023-03-15 13:04:01.353] [warning] Monitor thread (device: 1844301031455C1200 [192.168.88.129]) - ping was missed, closing the device connection
F: [global] [    643353] [EventRead00Thr] tcpipPlatformRead:272	Cannot find file descriptor by key: 58
[Thread 0x7fffbab8d700 (LWP 467190) exited]
terminate called after throwing an instance of 'dai::XLinkWriteError'
  what():  Couldn't write data to stream: '__bootloader' (X_LINK_ERROR)
---
[Thread 0x7fffc2772700 (LWP 468300) exited]
[New Thread 0x7fffc1f71700 (LWP 468303)]
[New Thread 0x7fffb9b8b700 (LWP 468304)]
F: [global] [    787036] [Scheduler00Thr] dispatcherResponseServe:925	no request for this response: XLINK_WRITE_RESP 1

[2023-03-15 13:39:54.242] [warning] Monitor thread (device: 1844301031455C1200 [192.168.88.129]) - ping was missed, closing the device connection
F: [global] [    796242] [EventRead00Thr] tcpipPlatformRead:272	Cannot find file descriptor by key: 58
[Thread 0x7fffbab8d700 (LWP 468302) exited]
terminate called after throwing an instance of 'dai::XLinkWriteError'
  what():  Couldn't write data to stream: '__bootloader' (X_LINK_ERROR)
---
[Thread 0x7fffc3774700 (LWP 468706) exited]
terminate called after throwing an instance of 'dai::XLinkWriteError'
  what():  Couldn't write data to stream: '__bootloader' (X_LINK_ERROR)

Thread 28 "testharness" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffbb38e700 (LWP 468701)]
0x00007ffff75b700b in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) where
#0  0x00007ffff75b700b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff7596859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffff7971a31 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff797d5dc in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff797d647 in std::terminate() ()
   from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff797d8e9 in __cxa_throw ()
   from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x0000555555c6ffbe in dai::XLinkStream::writeSplit (
    this=0x7fffb166b7e0, d=0x7fffaa662010, size=23491155, split=5242880)
    at /home/chris/luxonis/depthai-core/src/xlink/XLinkStream.cpp:143
#7  0x0000555555b11afa in dai::DeviceBootloader::bootMemory (
    this=0x7fffbb36e110, embeddedFw=...)
    at /home/chris/luxonis/depthai-core/src/device/DeviceBootloader.cpp:1318
#8  0x0000555555a530ec in dai::DeviceBase::init2 (this=0x7fffbb36e910, 
    cfg=..., pathToMvcmd=..., pipeline=...)
    at /home/chris/luxonis/depthai-core/src/device/DeviceBase.cpp:596
#9  0x0000555555a515b8 in dai::DeviceBase::init (this=0x7fffbb36e910, 
    version=dai::OpenVINO::VERSION_2022_1, 
    maxUsbSpeed=dai::UsbSpeed::SUPER, pathToMvcmd=...)
    at /home/chris/luxonis/depthai-core/src/device/DeviceBase.cpp:479
#10 0x0000555555a4f285 in dai::DeviceBase::DeviceBase (
    this=0x7fffbb36e910, version=dai::OpenVINO::VERSION_2022_1, 
    devInfo=..., maxUsbSpeed=dai::UsbSpeed::SUPER)
    at /home/chris/luxonis/depthai-core/src/device/DeviceBase.cpp:326
#11 0x0000555555a409ed in dai::DeviceBase::DeviceBase<bool, true> (
    this=0x7fffbb36e910, version=dai::OpenVINO::VERSION_2022_1, 
    devInfo=..., usb2Mode=false)
    at /home/chris/luxonis/depthai-core/include/depthai/device/DeviceBase.hpp:257
#12 0x0000555555a3c62d in dai::Device::Device (this=0x7fffbb36e910, 
    pipeline=..., devInfo=...)
    at /home/chris/luxonis/depthai-core/src/device/Device.cpp:44
#13 0x00007ffff7f525b4 in rawCameraModel (
    startInfo=0x7ffff7fc59b8 <drongoCameras+248>)
    at /home/chris/farmwave/drongo-core/src/sensor/sensor.cpp:367
#14 0x00007ffff7ebe609 in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007ffff7693133 in clone () from /lib/x86_64-linux-gnu/libc.so.6

I suppose I can go ahead and serialize access to the constructor for dai::Device and see what happens, but I was curious if this was a known problem.

Ugh. Sorry for the formatting, that's what I get for C&P'ing stuff.

So, FWIW, adding serialization around the call to the Device constructor has symptomatically fixed the problem. We'll see if it sticks.
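For anyone who lands here later, this is roughly the shape of the serialization, as a minimal sketch rather than our actual code. The helper names (deviceInitMutex, constructSerialized, maxConcurrentConstructions) and the FakeDevice stand-in are made up for illustration; FakeDevice is not a depthai type, it only exists so the sketch compiles without the library and can show that constructors never overlap.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// One process-wide lock, held only while a device object is being constructed.
inline std::mutex& deviceInitMutex() {
    static std::mutex m;
    return m;
}

// Hypothetical helper: build a T on the heap while holding the lock, so at
// most one constructor of any guarded type runs at a time.
template <typename T, typename... Args>
std::unique_ptr<T> constructSerialized(Args&&... args) {
    std::lock_guard<std::mutex> lock(deviceInitMutex());
    return std::make_unique<T>(std::forward<Args>(args)...);
}

// Stand-in for a camera object (NOT a depthai type). Its constructor records
// how many constructors were running at the same moment.
struct FakeDevice {
    static std::atomic<int> active;
    static std::atomic<int> maxActive;
    int id;

    explicit FakeDevice(int id_) : id(id_) {
        int now = ++active;
        int seen = maxActive.load();
        // Raise maxActive to `now` unless another thread already stored more.
        while (now > seen && !maxActive.compare_exchange_weak(seen, now)) {
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(20));  // pretend to boot
        --active;
    }
};
std::atomic<int> FakeDevice::active{0};
std::atomic<int> FakeDevice::maxActive{0};

// Start threadCount threads as fast as possible, each constructing one
// FakeDevice through the helper, and report the peak constructor overlap.
int maxConcurrentConstructions(int threadCount) {
    FakeDevice::maxActive = 0;
    std::vector<std::thread> workers;
    for (int i = 0; i < threadCount; ++i) {
        workers.emplace_back([i] {
            auto dev = constructSerialized<FakeDevice>(i);
            assert(dev->id == i);
        });
    }
    for (auto& w : workers) w.join();
    return FakeDevice::maxActive.load();
}
```

In the per-camera thread, the snippet from the top of the post would then become something like constructSerialized<dai::Device>(pipeline, devInfo), assuming the same constructor arguments. The lock is released as soon as the constructor returns, so the queue callbacks still run in parallel across cameras.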


    Hi wiley42 ,
    In general, initialization is not thread safe, which is why we suggest waiting e.g. 500 ms between connecting to multiple cameras - demo here. Could you try adding that as well?
    Thanks, Erik

    Ah, yeah, that's consistent with what I was observing, both the failure and the effect of the serialization I put around the constructor call for Device. Now I'm wondering if I need to stick serialization around the calls to all constructors that might be called on distinct threads. If problems persist I'll try adding a delay between launching each thread, but that's dodgy just because I don't know how long thread creation is going to take or when the thread will actually first get scheduled 😛

    Anyhow, thanks for confirming; at least now I know I'm not any more crazy than usual 😉
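One way around the scheduling worry is to enforce the gap inside each thread, right before it connects, instead of between thread launches. A minimal sketch, assuming a made-up StaggerGate class and releaseTimes helper (neither is part of depthai); the gap is Erik's suggested 500 ms or whatever else you pass in:

```cpp
#include <algorithm>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical gate: callers are released one at a time, and no two releases
// are closer together than `gap`, regardless of when each thread was created
// or first scheduled.
class StaggerGate {
public:
    using Clock = std::chrono::steady_clock;

    explicit StaggerGate(std::chrono::milliseconds gap) : gap_(gap) {}

    // Blocks until at least `gap` has passed since the previous release and
    // returns the time at which this caller was released.
    Clock::time_point wait() {
        // Sleeping while holding the lock is deliberate: later callers queue
        // up behind the one that is currently waiting out the gap.
        std::lock_guard<std::mutex> lock(mutex_);
        auto now = Clock::now();
        if (used_ && now < next_) {
            std::this_thread::sleep_until(next_);
            now = Clock::now();
        }
        used_ = true;
        next_ = now + gap_;
        return now;
    }

private:
    std::chrono::milliseconds gap_;
    std::mutex mutex_;
    bool used_ = false;
    Clock::time_point next_{};
};

// Demo helper: start `threads` workers at once and collect their release
// times in ascending order.
std::vector<StaggerGate::Clock::time_point> releaseTimes(
    int threads, std::chrono::milliseconds gap) {
    StaggerGate gate(gap);
    std::mutex resultsMutex;
    std::vector<StaggerGate::Clock::time_point> results;
    std::vector<std::thread> workers;
    for (int i = 0; i < threads; ++i) {
        workers.emplace_back([&] {
            auto released = gate.wait();
            // The device would be constructed here, after the gate.
            std::lock_guard<std::mutex> lock(resultsMutex);
            results.push_back(released);
        });
    }
    for (auto& w : workers) w.join();
    std::sort(results.begin(), results.end());
    return results;
}
```

Note that this only spaces out the moments at which connections start; it does not wait for the previous constructor to finish. If a boot takes longer than the gap, two constructors can still overlap, so it seems safer alongside a lock around the constructor than instead of one.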

      Hi wiley42 ,
      Yeah, it's something in the XLink protocol, which was written by the SoC vendor... our FW team has been contemplating rewriting it for months now, perhaps we really should start working on it 🙂
      Thanks, Erik

      Send up a flare if you need community support. We have a silly number of these things on order and it's in our interest to see that the underlying stack is as robust as possible -- otherwise I really did pick the wrong week to quit sniffing glue 😉


        Hi wiley42 ,
        In the case of a single device, it's robust, but yes, for multiple devices it's lacking. Thanks for the offered help, what's the best email address to use so we can reach out if/when needed? 🙂
        Thanks, Erik

        Hey Erik,

        For the moment, I think christian@farmwave.io.

        If I'm getting the message from our finance people right, we have hundreds of these things on order with global shutters, so we're kinda predisposed to help out in whatever way we can -- including slinging firmware if it comes to that -- since we're apparently more than a little committed at this point :P