Track any moving object? (classification not needed)

Craftonix

Hi!

We have a requirement to count any moving object with the highest possible accuracy. An object is counted when its centroid crosses a line, e.g.

if centroid_x > line_x:
    count += 1

(I know that the above code has no hysteresis and would count the same object again on every frame after it passes the line, but I'm just providing it as a simple example)
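For illustration, a line-crossing counter with hysteresis could look like the sketch below. It is a minimal sketch, not part of any Luxonis API: the `LineCounter` class, the `margin` band width, and the sample centroid values are all hypothetical, and it assumes the tracker supplies a stable ID per object.

```python
from dataclasses import dataclass, field


@dataclass
class LineCounter:
    """Counts a track when its centroid moves from clearly left of
    line_x to clearly right of it (hypothetical helper)."""
    line_x: float
    margin: float = 10.0  # assumed hysteresis band, in pixels
    count: int = 0
    _side: dict = field(default_factory=dict)  # track_id -> "left" / "right"

    def update(self, track_id: int, centroid_x: float) -> None:
        prev = self._side.get(track_id)
        if centroid_x < self.line_x - self.margin:
            side = "left"
        elif centroid_x > self.line_x + self.margin:
            side = "right"
        else:
            side = prev  # inside the band: keep the last known side
        if prev == "left" and side == "right":
            self.count += 1
        if side is not None:
            self._side[track_id] = side


counter = LineCounter(line_x=100)
# One object jittering around the line before finally crossing it.
for x in [80, 95, 105, 98, 104, 120, 125]:
    counter.update(track_id=1, centroid_x=x)
print(counter.count)  # prints 1
```

The naive comparison would have incremented the count on four of those seven frames; here the jitter inside the band is ignored and the crossing is counted once.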

We do not need classification, i.e., we do not need to know if the object is a person, bike, car, or anything else.

Actually, we are counting people of all ages, sizes, and ethnic backgrounds: babies in strollers, children, old people, etc. However, it seems that YoloV8n with ObjectTracker is failing to detect many people and therefore not tracking them. So, since we know for sure that any moving object in the camera view is 100% definitely a person (this is indoors and we know that there are no cars, bikes, animals, etc.), I'm wondering if skipping the classification would improve our count. The system is counting about 1000 people per day, when the ground truth is 5000-8000 people per day. When observing the Visualizer, I noticed that if a person is detected as a dog or cat and crosses the line, they are still counted. This is fine and is the desired behavior; it is what I mean by classification not being needed. However, if a person is not detected at all, neither as a person nor as any other object, even though they are moving, then we are losing the count because they are not being tracked.

If using something like this:

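from depthai_sdk import OakCamera

with OakCamera() as oak:
    cam = oak.create_camera('color')
    # tracker=True attaches an ObjectTracker to the detection network
    nn = oak.create_nn('yolov8n_coco_640x352', cam, tracker=True)
    nn.config_nn(resize_mode='stretch')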

1. What is the best tracker to use for this use case (SHORT_TERM_KCF, SHORT_TERM_IMAGELESS, ZERO_TERM_COLOR_HISTOGRAM, ZERO_TERM_IMAGELESS)? I read all the docs and all the linked pages but I still don't understand fully how to choose this parameter.
2. Are there any other parameters of YoloV8n or ObjectTracker that can improve the accuracy?
3. Where can I find sample videos with lots of people of different demographics? (Googling only shows the most popular videos that everyone seems to be using in CV/AI, which are only a handful of people in an office or uniform demographic people in a retail mall, mostly adults.)
4. In order to reduce the load on the OAK-D PoE device, is YoloV8n the best model for this use case?
5. Does it matter if the camera is directly above the people and pointing vertically down, or should we place the camera at a height of about 6 to 8 feet (3 or 4 meters) and aim it horizontally at the people?
6. When the camera is above people's heads and pointing vertically downwards, if people are close to each other (shoulders touching), then 2 or 3 people are detected and tracked as 1 person. Is there a way to improve this and detect and track them all?
7. The number of people in the FOV is about 10 people at a time, maximum 20 people. Is the OAK-D PoE Series 1 powerful enough to handle this load (given the limitation explained here)?

Thanks!

jakaskerl

Hi Craftonix

This will require some experimentation to see what provides the best overall performance. You are basically trading off speed against accuracy.

1. SHORT_TERM_KCF (Kernelized Correlation Filters) might offer good performance in tracking individuals but may consume more resources than, say, SHORT_TERM_IMAGELESS, which might be lighter but potentially less accurate.
2. This mostly depends on the model. Since you said the model is not detecting people, perhaps the confidence threshold is too high, or NMS is set up incorrectly for the use case.
3. I'd say look for pedestrian datasets (like https://paperswithcode.com/dataset/citypersons), since they usually have more samples and people of different demographics.
4. I think v8 has the highest accuracy, but v6 is the fastest. If you find v8 takes too much of a toll on the device, you can switch the model or train it for smaller inputs. https://docs.luxonis.com/projects/hardware/en/latest/pages/rvc/rvc2/#rvc2-nn-performance
5. Directly above will require datasets that feature the same orientation. Not sure how many you will be able to find. The accuracy will likely be terrible if you train on a horizontal view and then test on a vertical one.
6. Train on datasets that have this scenario as well, then the model will know how to deal with it.
7. Not sure off the top of my head. I think performance will go down the more people are present in the frame. You need to make sure the pipeline doesn't crash (by tweaking the NN model and tracker type).
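To make the confidence threshold and NMS point concrete, here is a small, device-independent sketch of how those two settings decide which detections survive. It is plain Python, not the on-device decoder: the function names, boxes, and scores are made up for illustration, and greedy NMS is assumed as the suppression method.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(dets, conf_threshold, iou_threshold):
    """Drop low-confidence boxes, then apply greedy NMS.
    dets is a list of (box, confidence) pairs."""
    candidates = sorted(
        (d for d in dets if d[1] >= conf_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, conf in candidates:
        # Keep a box only if it does not overlap a stronger kept box too much.
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, conf))
    return kept


# Hypothetical frame: two people side by side (boxes overlap, IoU = 1/3)
# plus one isolated person detected with low confidence.
dets = [
    ((0, 0, 100, 100), 0.80),
    ((50, 0, 150, 100), 0.45),
    ((300, 0, 400, 100), 0.35),
]
print(len(filter_detections(dets, conf_threshold=0.5, iou_threshold=0.3)))  # prints 1
print(len(filter_detections(dets, conf_threshold=0.3, iou_threshold=0.3)))  # prints 2
print(len(filter_detections(dets, conf_threshold=0.3, iou_threshold=0.5)))  # prints 3
```

With these made-up numbers, lowering the confidence threshold recovers the isolated person, and loosening the IoU threshold stops the weaker of the two adjacent boxes from being suppressed. The trade-off is more false or duplicate boxes, so both values need tuning on real footage.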

Thanks,
Jaka