sbellini

  • May 16, 2024
  • Dear Jaka,

    Thank you for your quick reply!

    jakaskerl Then do preview.setKeepAspectRatio(False) which will keep the FOV and downscale only.

    Yes, I hadn't thought of that; I could do that (maybe with some tuning of the neural network), thank you!
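
    If I understood the suggestion correctly, it would look something like this (a minimal sketch; I'm assuming setPreviewKeepAspectRatio is the camera-side call you are referring to):

    import depthai as dai

    pipeline = dai.Pipeline()
    cam = pipeline.create(dai.node.ColorCamera)
    cam.setResolution(dai.ColorCameraProperties.SensorResolution.THE_4_K)
    cam.setInterleaved(False)
    # Squeeze the full FOV directly into the NN input size instead of
    # letterboxing with ImageManip.setResizeThumbnail:
    cam.setPreviewSize(384, 384)
    cam.setPreviewKeepAspectRatio(False)  # keep full FOV, allow distortion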

    Just for my understanding, why do some setResizeThumbnail sizes make the pipeline crash?

    After some tests changing OUTPUT_SIZE, here's what I found:

    OUTPUT_SIZE = 360, 361, 362, 363, 364 OK

    OUTPUT_SIZE = 365 to 392 crash (Fatal error. Please report to developers. Log: 'ResourceLocker' '358')

    OUTPUT_SIZE = 393, 394 OK

    OUTPUT_SIZE = 395, 396, 397, 398 crash

    OUTPUT_SIZE = 399, 400, 401, 402 OK

    I don't get why some work and some don't; I would expect them to take almost the same amount of memory (especially with very similar sizes like 398 and 399).
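
    For reference, this is roughly the loop I used for the sweep (build_pipeline is a hypothetical helper that wraps the MRE from my first post, parameterized by OUTPUT_SIZE):

    import depthai as dai

    # Sweep OUTPUT_SIZE and record which values crash the device.
    for size in range(360, 403):
        try:
            with dai.Device(build_pipeline(size)) as device:
                q = device.getOutputQueue("image", maxSize=3, blocking=False)
                for _ in range(30):  # grab a few frames to trigger the crash
                    q.get()
            print(size, "OK")
        except RuntimeError as e:  # a device crash surfaces as a RuntimeError on the host
            print(size, "crash:", e)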

    Thank you again for your help

    Best regards

    Simone

  • Dear Jaka,
    Thank you for your reply.

    jakaskerl You are essentially running out of memory when using large images.

    I don't get why it runs out of memory with setResizeThumbnail(384, 384) but not with setResizeThumbnail(800, 800), which would generate bigger images.
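
    A quick back-of-the-envelope for the output frame buffers (interleaved BGR, 3 bytes per pixel) only makes this more puzzling to me:

    print(384 * 384 * 3)  # 442368 bytes, ~0.42 MiB
    print(800 * 800 * 3)  # 1920000 bytes, ~1.83 MiB (over 4x larger, yet it works)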

    jakaskerl Why do this? The resolution has no effect on the NN input.

    The MRE I provided is the first step of a bigger pipeline, in which a Script node receives object coordinates from a MobileNetDetectionNetwork and extracts the objects from the original image; extracting them from a 4K image instead of a 1080p one would give me crops with four times the pixel count (twice the linear resolution).
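
    This is roughly what that later stage looks like (a sketch, not the exact code: the stream names 'det_in' and 'cfg_out' are placeholders, and pipeline, cam and detection_nn are the objects from the MRE below):

    # A Script node turns each detection into a crop config for an ImageManip
    # that crops the object out of the original full-resolution frame.
    script = pipeline.create(dai.node.Script)
    detection_nn.out.link(script.inputs['det_in'])
    script.setScript("""
    while True:
        dets = node.io['det_in'].get().detections
        for det in dets:
            cfg = ImageManipConfig()
            cfg.setCropRect(det.xmin, det.ymin, det.xmax, det.ymax)  # normalized coords
            node.io['cfg_out'].send(cfg)
    """)
    crop = pipeline.create(dai.node.ImageManip)
    crop.inputConfig.setWaitForMessage(True)  # only produce output when a config arrives
    script.outputs['cfg_out'].link(crop.inputConfig)
    cam.preview.link(crop.inputImage)  # crops are taken from the full-FOV frame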

    jakaskerl You only get full FOV with isp output at 12MP.

    Sorry, I probably expressed myself badly; I meant that I cannot use resize.initialConfig.setResize(384, 384) and instead need resize.initialConfig.setResizeThumbnail(384, 384), because I need to keep the FOV of the 1080p (or 4K) image.
    Thank you again for your help

    Kind regards

    Simone

    • Hi everyone!

      I'm trying to change my pipeline's ColorCamera resolution from 1080P to 4K to get bigger images. I also need to keep the full FOV of the sensor, so I opted to use an ImageManip node with setResizeThumbnail. While this works properly with 1080P images, with 4K images the code crashes with these logs:

      [system] [critical] Fatal error. Please report to developers. Log: 'ResourceLocker' '358'

      [host] [warning] Device crashed, but no crash dump could be extracted.

      I've noticed that everything works if I resize the image to a bigger one; for instance, resize.initialConfig.setResizeThumbnail(800, 800, 0, 0, 0) does not give any problem (but the output is then too big for my NN).

      Can you provide any help please?

      Here is the MRE using an OpenVINO neural network. I'm using DepthAI 2.25.0 and an OAK-1 device.

      import depthai as dai
      import blobconverter
      import cv2
      
      OUTPUT_SIZE = 384  # NN input size; some values crash the device, see below
      
      pipeline = dai.Pipeline()
      
      # 4K color camera with a full-resolution preview output
      cam = pipeline.create(dai.node.ColorCamera)
      cam.setResolution(dai.ColorCameraProperties.SensorResolution.THE_4_K)
      cam.setInterleaved(False)
      cam.setPreviewSize(3840, 2160)  # 4K
      
      # Letterbox the full FOV down to the NN input size
      resize = pipeline.create(dai.node.ImageManip)
      resize.initialConfig.setResizeThumbnail(OUTPUT_SIZE, OUTPUT_SIZE, 0, 0, 0)  # <- crashes; with OUTPUT_SIZE = 800 it works (but 800 is too big for the NN)
      resize.initialConfig.setKeepAspectRatio(False)
      resize.setMaxOutputFrameSize(OUTPUT_SIZE * OUTPUT_SIZE * 3)
      cam.preview.link(resize.inputImage)
      
      # Stream the resized frames to the host
      x_out = pipeline.create(dai.node.XLinkOut)
      x_out.setStreamName("image")
      resize.out.link(x_out.input)
      
      # Person detector from the OpenVINO model zoo
      detection_nn = pipeline.create(dai.node.MobileNetDetectionNetwork)
      detection_nn.setConfidenceThreshold(0.7)
      detection_nn.setBlobPath(blobconverter.from_zoo(name="person-detection-0201"))
      resize.out.link(detection_nn.input)
      
      # Stream the detections to the host
      x_out1 = pipeline.create(dai.node.XLinkOut)
      x_out1.setStreamName("det")
      detection_nn.out.link(x_out1.input)
      
      with dai.Device(pipeline) as device:
          qImg = device.getOutputQueue(name="image", maxSize=3, blocking=False)
          qDet = device.getOutputQueue(name="det", maxSize=3, blocking=False)
          cv2.namedWindow("Image", cv2.WINDOW_NORMAL)
          while True:
              frame = qImg.get().getCvFrame()
              dets = qDet.get().detections
              cv2.imshow("Image", frame)
              if cv2.waitKey(1) == ord('q'):  # exit when 'q' is pressed
                  break

      Here is the pipeline graph

      Thank you for your help!

      Best regards

      Simone

      • So if we get the best performance with AVAILABLE_SHAVES/2 shaves, and each NN uses two threads with one NCE and AVAILABLE_SHAVES/2 shaves each, it means that a single running NN is already using all the available resources. In that case we are not actually exploiting the advantages of a pipeline when using more than one network, since the resources are saturated from the start.
        Do you have any advice on how to speed up pipelines with multiple NNs, besides using less computation-intensive networks? If I modified the face-detection NN to take 5 body images together (a 5x3xwidthxheight tensor), that would help increase the FPS when multiple bodies are detected, but how could I create such matrices (see the sketch below)? And would ImageManip nodes be able to handle them? Do you have any demo exploiting this technique?
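
        Something like this on the host side is what I have in mind (a rough sketch; the layer name "data", the crop size, and the XLinkIn plumbing are my assumptions, not working code):

        import numpy as np
        import depthai as dai

        # Hypothetical host-side batching: stack 5 planar body crops (3xHxW each)
        # into one 5x3xHxW tensor and feed it to the NN through an XLinkIn queue.
        H = W = 64                                   # assumed face-NN input size
        crops = [np.zeros((3, H, W), dtype=np.uint8) for _ in range(5)]
        batch = np.stack(crops)                      # shape (5, 3, H, W)

        nn_data = dai.NNData()
        nn_data.setLayer("data", batch.flatten().tolist())  # layer name is a guess
        # q_in.send(nn_data)  # q_in would be device.getInputQueue of an XLinkIn node
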
        Thank you again!
        Kind regards
        Simone

        • Thank you for the quick reply!
          We tried analyzing the HW resources as you suggested; here are the results.

          With a single NN (body detector, compiled with 5 SHAVEs) we have:
          NeuralNetwork allocated resources: shaves: [0-12] cmx slices: [0-12]
          ColorCamera allocated resources: no shaves; cmx slices: [13-15]
          ImageManip allocated resources: shaves: [15-15] no cmx slices.
          DetectionNetwork(5) - Needed resources: shaves: 5, ddr: 9142272
          DetectionNetwork(5) - Inference thread count: 2, number of shaves allocated per thread: 5, number of Neural Compute Engines (NCE) allocated per thread: 1
          As also shown in the documentation you linked, the NN has 13 SHAVEs and 13 CMX slices allocated. In total, 14 SHAVEs, 16 CMX slices and 2 NCEs are allocated (even though the DetectionNetwork only uses 10 SHAVEs: 2 threads x 5).

          I also tested the one-stage pipeline using the face detector (the one running at 30 FPS, compiled with 4 SHAVEs) and here are the results:
          NeuralNetwork allocated resources: shaves: [0-12] cmx slices: [0-12]
          ColorCamera allocated resources: no shaves; cmx slices: [13-15]
          ImageManip allocated resources: shaves: [15-15] no cmx slices.
          DetectionNetwork(5) - Needed resources: shaves: 4, ddr: 2728832
          DetectionNetwork(5) - Inference thread count: 2, number of shaves allocated per thread: 4, number of Neural Compute Engines (NCE) allocated per thread: 1
          The allocated resources seem to be the same as before, even though the FPS is higher with this NN.

          With multiple NNs (body, face, age/gender) we have:
          NeuralNetwork allocated resources: shaves: [0-12] cmx slices: [0-12]
          ColorCamera allocated resources: no shaves; cmx slices: [13-15]
          ImageManip allocated resources: shaves: [15-15] no cmx slices.
          DetectionNetwork(5) - Needed resources: shaves: 5, ddr: 9142272
          NeuralNetwork(18) - Needed resources: shaves: 4, ddr: 368640
          DetectionNetwork(11) - Needed resources: shaves: 4, ddr: 2728832
          DetectionNetwork(5) - Inference thread count: 2, number of shaves allocated per thread: 5, number of Neural Compute Engines (NCE) allocated per thread: 1
          NeuralNetwork(18) - Inference thread count: 2, number of shaves allocated per thread: 4, number of Neural Compute Engines (NCE) allocated per thread: 1
          DetectionNetwork(11) - Inference thread count: 2, number of shaves allocated per thread: 4, number of Neural Compute Engines (NCE) allocated per thread: 1

          So, is the key point here that we are using all the NCEs and (almost) all the SHAVEs even with a single NN? With all three networks we would need 26 SHAVEs (2 threads x (5 + 4 + 4)) and 6 NCEs (3 NNs x 2). If so, then I understand why it is not behaving as a pipeline when using more than one.

          I tried compiling all the NNs with 2 SHAVEs in order to keep the total number of allocated SHAVEs below the available amount (2 SHAVEs x 2 threads per NN x 3 NNs = 12 < 13), but performance got worse.
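
          For reference, this is how I recompiled the blobs (using blobconverter's shaves argument, the same way as in my MRE):

          import blobconverter

          # Recompile the zoo models with 2 SHAVEs instead of the 5/4 used before
          body_blob = blobconverter.from_zoo(name="person-detection-0201", shaves=2)
          face_blob = blobconverter.from_zoo(name="face-detection-retail-0004", shaves=2)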

          Also, is the body-detection NN slower than the face-detection one because of the internal layout of the network?
          We tried running benchmark_app as you suggested, but we couldn't make it run on the Myriad X; on CPU, the face-detection FPS is more than twice that of the body-detection NN.

          Thank you again and kind regards

          Simone

          • Hi everyone!
            We have some questions regarding the pipeline.
            We have tested the body detector from OpenVINO (person-detection-0201) alone in a single-NN pipeline (camera -> ImageManip -> NN -> output) and measured an average of 21 FPS; the face detector (face-detection-retail-0004) alone, in the same simple pipeline, runs at 30 FPS. We then linked the two neural networks together (one after the other: each detected body is cropped from the original image and the crop is sent to the face detector) and measured the resulting FPS. With no bodies on screen the FPS stays at 21, but having one body (and thus "enabling" the face detector) brings the average down to 14 FPS. We don't understand this behavior: the second network should be faster than the first, and, as we understand pipelining, the overall throughput should be (almost) equal to that of the slowest node, i.e. min(21, 30) = 21 FPS.

            From a previous (private) reply from the team, it seems this happens because we are now running two NNs that share the same resources. Does this mean we have already used up all the resources the device has? We tried following https://docs.luxonis.com/projects/api/en/latest/tutorials/debugging/ with the two NNs, and here is the result:

            Memory Usage - DDR: 235.12 / 340.43 MiB, CMX: 2.50 / 2.50 MiB, LeonOS Heap: 24.41 / 77.32 MiB, LeonRT Heap: 20.64 / 41.23 MiB
            Temperatures - Average: 40.39 C, CSS: 41.66 C, MSS 39.81 C, UPA: 40.04 C, DSS: 40.04 C
            Cpu Usage - LeonOS 21.19%, LeonRT: 56.91%

            From this it seems only the CMX is full, while the other memories and the CPUs still have headroom. Also, the two OpenVINO networks only require 1.768 and 1.067 GFLOPs per inference respectively, so the device should keep up (it's an OAK-1, which should have 1.4 TOPS for AI). Are we doing something wrong?
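
            For completeness, this is how we raised the device log level to get the report above (following the linked debugging tutorial):

            import depthai as dai

            with dai.Device(pipeline) as device:  # 'pipeline' is our two-NN pipeline
                # Print device-side logs, including the memory/CPU/temperature report
                device.setLogLevel(dai.LogLevel.DEBUG)
                device.setLogOutputLevel(dai.LogLevel.DEBUG)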

            Thank you for your time
            Kind regards

            Simone
