• What is the fastest way to diff 2 depth frames on device?

I am working on people tracking using OpenVINO models. There are many sensors covering a large indoor area with different hallways/rooms/etc.

I am using these models to help track a person.
1. Using person detection to detect a person
2. Using person reidentification to reid a person across sensors
3. Using face detection to detect a person (person detection doesn't work for sitting people or with hands in the air)

These algorithms are still not enough to have confident person tracking. The person detection does not get people sitting down or with their hands in the air. The face detection doesn't work if people's heads are facing down, looking at their phones. So it is still very leaky for knowing when people are in a room.

These sensors are looking at a static scene, so each one has a "control" depth reading of the environment (when no one is in it). When someone moves through that environment, there is obviously a change in the depth reading.

I want to store a "control" depth frame on the device (using RAM) and then compare the incoming depth frame to that "control" depth frame, and see what the diff is between the two depth frames.

If I have a person detection that "stops" detecting (because they sat down, raised their arms, or are looking at their phones), I want to be able to check against the "control" depth frame to see if there is still a person there: a big difference from the "control" frame in the region where the person was last detected would mean they are still there but might have sat down.

Is there a "best" approach to diff depth frames on device? I don't want to use OpenCV because that would be very slow processing and I was wondering if there was any hardware accelerated approach that could be used? I know that to calculate disparity there is a lot of analysis between mono cameras so is that something I can tap into to diff a depth frame stored in memory from a new depth frame?

    Hi AdamPolak
    Good idea to check the difference between frames to detect a person, but I don't think this is possible (on device), nor would it be very efficient. The problem with the approach is that the person in the lighting would have to be constant unless you use some averaging algorithm, which would make the whole thing even slower. You could (maybe) do it if you feed your control image through XLinkIn to the StereoDepth node, along with one of your mono cameras, and look at the disparity. It should be 0 (ideally) everywhere except where the person is (you would probably get invalid pixels there). Let me know if you try it and it somehow works.
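
    Untested, but roughly the pipeline I have in mind, using the Python API (node and stream names are just placeholders):

```python
import depthai as dai

pipeline = dai.Pipeline()

# "control" frame sent in from the host over XLink
control_in = pipeline.create(dai.node.XLinkIn)
control_in.setStreamName("control_in")

# one live mono camera as the other input
mono = pipeline.create(dai.node.MonoCamera)
mono.setBoardSocket(dai.CameraBoardSocket.RIGHT)
mono.setResolution(dai.MonoCameraProperties.SensorResolution.THE_400_P)

stereo = pipeline.create(dai.node.StereoDepth)
control_in.out.link(stereo.left)   # stored control frame as the "left" image
mono.out.link(stereo.right)        # live frame as the "right" image

xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName("disparity")
stereo.disparity.link(xout.input)
```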

    Alternatively, an easier option would be to find a more generic person detection model that works on people standing, sitting, lying down, etc. You can also add your own data to the dataset to improve it.

    Hope this helps,
    Jaka

      jakaskerl

      What do you mean by "in the lighting"?

      I plan on using this when a detected person "disappears" (isn't detected). I would want to check the depth frame where they disappeared and compare it to the "control" depth frame. If the depth frame in that local region (where they were last detected) deviates far enough from the "control" depth frame, we can reasonably assume the person is still there.

      I think the confusion is that I want to diff depth images (so lighting wouldn't be a factor).

      I want to do a quick "diff" of 2 different depth images. Can the StereoDepth node do that?

      If I put 2 depth frames into it, would the StereoDepth node calculate the difference in pixel values at each pixel? (That is what I am looking for, and deviations can be used to add confidence that a person is there.)

        Hi AdamPolak
        Depth probably won't work with StereoDepth due to the already noisy nature of depth maps. You can, however, run a subtraction of two depth frames with numpy -- it would not run on the device, but it is relatively fast and could work in your case.
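
        A rough host-side sketch of what I mean (the function, threshold, and ROI are just placeholders; the ROI would be wherever the person was last detected):

```python
import numpy as np

def roi_change_fraction(control, current, roi, thresh_mm=300):
    """Fraction of pixels inside roi=(x, y, w, h) whose depth changed by more
    than thresh_mm between the control and current uint16 depth frames."""
    # cast to int32 to avoid uint16 wrap-around when subtracting
    diff = np.abs(current.astype(np.int32) - control.astype(np.int32))
    x, y, w, h = roi
    return float((diff[y:y + h, x:x + w] > thresh_mm).mean())
```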

        Regards,
        Jaka

          jakaskerl

          I can't send the depth frames to the host because I have a couple dozen sensors and it would blow up the bandwidth.

          Is there no way I can configure the StereoDepth pipeline to just subtract the 2 frames provided from each other?

          I could send it 2 different disparity maps; I am looking for a resulting frame that is the per-pixel difference of the two.

          This model used for face detection post-processing seems to do some simple calculations and takes advantage of the accelerated hardware to do it: https://github.com/geaxgx/depthai_yunet/blob/fdc53e50dc46c491973daba6ab0b43de48ed024d/YuNetEdge.py#L264

          Does a similar "ML model that is not really a model but does computation" exist to find the difference in pixel values between 2 frames?
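
          For illustration, what I am imagining is something along the lines of the concat example: a tiny PyTorch module that only subtracts its two inputs, exported to ONNX so it can be compiled to a blob and run on the device. Untested sketch, with the sizes and names as placeholders:

```python
import torch
import torch.nn as nn

class Diff(nn.Module):
    def forward(self, current, control):
        # absolute per-pixel difference between the two frames
        return torch.abs(current - control)

# single-channel frames; 640x400 is used only as an example size
H, W = 400, 640
torch.onnx.export(
    Diff(),
    (torch.zeros(1, 1, H, W), torch.zeros(1, 1, H, W)),
    "diff.onnx",
    input_names=["current", "control"],
    output_names=["diff"],
    opset_version=12,
)
# diff.onnx could then be compiled to a MyriadX blob, e.g. with
# blobconverter.from_onnx("diff.onnx", data_type="FP16", shaves=6)
```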

          erik

          This is great, thank you. I will see if I can edit it to take depth frames as input. Is doing a diff this way low latency?

          erik

          Also, a more important question: how do we guarantee the frames are synced? There doesn't seem to need to be a script sync node in front of the NN for this (the concat example has them just streaming to the NN).
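
          If a sync step does turn out to be needed, one pattern I have seen is pairing messages by sequence number inside a Script node before they reach the NN. Untested sketch (the input/output names are placeholders and would have to match how the streams are linked):

```python
import depthai as dai

pipeline = dai.Pipeline()

script = pipeline.create(dai.node.Script)
script.setScript("""
msgs = {}
def add(name, msg):
    seq = msg.getSequenceNum()
    msgs.setdefault(seq, {})[name] = msg
    if len(msgs[seq]) == 2:
        # both messages with this sequence number arrived -> forward the pair
        node.io['out_current'].send(msgs[seq]['current'])
        node.io['out_control'].send(msgs[seq]['control'])
        del msgs[seq]

while True:
    add('current', node.io['current'].get())
    add('control', node.io['control'].get())
""")
```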

          erik

          Very helpful thank you.

          I am about at my wits' end here trying to get the results from ->getFirstLayerFp16() back into a depth frame I can display.

          I just put it through the NN and get back floats which are in the expected range, 0-65535.

          For some reason, every method of getting this into a cv::Mat and displaying it does not work. It is either scrambled or a black frame.

          Sometimes a float from getFirstLayerFp16 is inf, and I wonder if that is a clue.
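
          For reference, this is the recipe I am trying to reproduce (sketched in Python; my actual code is in C++, and show_diff, in_nn, H, and W are just placeholder names):

```python
import cv2
import numpy as np

def show_diff(in_nn, H, W):
    """Turn the flat FP16 NN output into a viewable 8-bit frame."""
    data = np.array(in_nn.getFirstLayerFp16(), dtype=np.float32)
    data = data.reshape(H, W)      # a wrong H/W here would explain a "scrambled" frame
    data[~np.isfinite(data)] = 0   # a stray inf breaks min/max normalization -> black frame
    vis = cv2.normalize(data, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    cv2.imshow("diff", vis)
    cv2.waitKey(1)
```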