Back when I was tasked with building interactive voice response (IVR) systems with speech recognition, it was often said that humans understand only about 50% of words when they do not know the context.

In dreaming of my Raspberry Pi 5 powered autonomous robot Dave understanding what he sees, I have been planning to (eventually) use transfer learning to enhance visual neural nets on objects he has “discovered”. But the name “Grok Vision” in the announcement by Elon Musk’s xAI company reminded me that integrating context into the neural net would likely improve object recognition significantly.

Since the YOLOv4 object recognition program running on Oak cameras uses only 8% of the RPi5, and the Oak-D-W-97 camera can perform the recognition at 30 FPS (without image transfers), there should be capacity for the next generation of vision recognition algorithms.

I think RTABmap already uses a basic form of context, derived from image features and pose, to enhance visual localization.

Is there an “object recognition with context” model blob I can run on my Oak camera?

    cycob

    Can you please explain in more detail what you mean?

    RTABmap works differently from, say, standard object detection neural networks like YOLOv4. While the latter don't necessarily allow context to be passed in, there are ways of enhancing detection performance. But I'd need to know more about what you are trying to achieve or what you are looking for.

      Matija: Can you please explain in more detail what you mean?

      YOLOv4 does not know that it is less likely to find a bicycle, a bird, or a bed "in a kitchen". These are examples of using external context.

      YOLOv4 does not use object co-occurrence context either. Some objects are commonly found together - a table with chairs, or people inside a rectangle (a framed family picture).

      I'm not disrespecting YOLOv4 - it is a demonstration that works very well. When I learn how to build my own “objects found in a home” model, I am hoping to find techniques to squeeze more objects, with more confident recognitions, into my autonomous home robot's limited processing (Raspberry Pi 5 and Oak-D-W-97).

        cycob

        Hey, thanks, I see what you mean now. Yes, unfortunately YoloV4 doesn't support this explicitly, but it should be implicitly encoded in the model to some degree: the model should be deep enough to "see" the whole picture and infer some context. I think there are more interesting approaches that could be explored here:

        • You could try training the model with an additional classification head that predicts the "scene class". This should force the model to pay more attention to the scene itself, which could help with "context injection". At inference time you could even prune the head away, so throughput stays the same (see the first sketch after this list).
        • There are works like https://arxiv.org/pdf/1704.04224 that try to inject context based on spatial layout, but note that the improvements there are minimal (+2% compared to the baseline).
        • I think a better paper is https://arxiv.org/pdf/2402.17207v1, where they directly inject priors in the form of a matrix. It's used on a different type of architecture, so you would have to adapt the methodology to Yolo to support it efficiently on OAKs. Note again that the accuracy boost is not huge either (around +2% mAP, though mAP@50 improves by +6%, which is more meaningful).
        • Alternatively, you could encode these priors in the post-processing itself and suppress certain objects based on others that are present and more common. To adapt your example: if you see plants, cups, and a spoon, you could add logic that decreases the confidence of a bicycle detection in post-processing (see the second sketch after this list).
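
        To make the first bullet (the auxiliary "scene class" head) more concrete, here is a minimal PyTorch-style sketch. It is not an official Yolo or DepthAI implementation; `backbone`, `detect_head`, the channel count, and the loss weighting are placeholders you would swap for your actual detector and tune.

        ```python
        # Sketch: detector with an auxiliary scene-classification head.
        # The head is only used during training and can be dropped at inference,
        # so deployed throughput is unchanged.
        import torch
        import torch.nn as nn

        class DetectorWithSceneHead(nn.Module):
            def __init__(self, backbone: nn.Module, detect_head: nn.Module,
                         feat_channels: int, num_scenes: int):
                super().__init__()
                self.backbone = backbone        # placeholder: yields a [B, C, H, W] feature map
                self.detect_head = detect_head  # placeholder: the usual detection head
                # Auxiliary head: global-pool the features and classify the scene
                # (kitchen, bedroom, living room, ...).
                self.scene_head = nn.Sequential(
                    nn.AdaptiveAvgPool2d(1),
                    nn.Flatten(),
                    nn.Linear(feat_channels, num_scenes),
                )

            def forward(self, x, with_scene: bool = True):
                feats = self.backbone(x)
                dets = self.detect_head(feats)
                if with_scene:
                    return dets, self.scene_head(feats)
                return dets  # inference path: scene head pruned away

        # During training the total loss would be something like:
        #   loss = detection_loss + 0.2 * cross_entropy(scene_logits, scene_labels)
        # where the 0.2 weight is a guess that needs tuning.
        ```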
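
        And for the last bullet, a rough Python sketch of what a hand-rolled co-occurrence prior in post-processing could look like. The class names, prior values, and thresholds below are made up for illustration and would need tuning against real detections.

        ```python
        # Sketch: down-weight detections that are unlikely given other, more
        # confidently detected objects in the same frame. Values are illustrative only.
        KITCHEN_HINTS = {"cup", "spoon", "potted plant", "microwave", "sink"}
        UNLIKELY_IN_KITCHEN = {"bicycle": 0.3, "bed": 0.2, "bird": 0.5}  # confidence multipliers

        def apply_scene_prior(detections, hint_conf=0.6, keep_thresh=0.25):
            """detections: list of dicts like {"label": str, "conf": float, "bbox": ...}."""
            # Is there strong evidence that we are looking at a kitchen?
            kitchen_evidence = any(d["label"] in KITCHEN_HINTS and d["conf"] >= hint_conf
                                   for d in detections)
            if not kitchen_evidence:
                return detections
            # Scale down the confidence of objects that rarely co-occur with kitchens,
            # then re-apply the confidence threshold.
            for d in detections:
                d["conf"] *= UNLIKELY_IN_KITCHEN.get(d["label"], 1.0)
            return [d for d in detections if d["conf"] >= keep_thresh]
        ```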

        I'd say there are probably more options out there that would have to be explored. As mentioned, the performance boost might be questionable, since objects that are likely to appear together in the real world also appear together in the training data. I think performance on OAKs would currently benefit more if we were to inject "past" context, so that the models could use past information when making new predictions. This should smooth out the detections and also make them more confident (a rough sketch of this idea follows).
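
        As a very rough illustration of that "past context" idea at the post-processing level (this is not a DepthAI feature, and the `alpha` and threshold values are arbitrary), a per-class exponential smoother could look like the sketch below; a real version would also associate boxes across frames, e.g. with IoU-based tracking.

        ```python
        # Sketch: exponentially smooth per-class confidences across frames so that
        # one-off misses or false positives have less impact on the final output.
        from collections import defaultdict

        class TemporalSmoother:
            def __init__(self, alpha=0.6, keep_thresh=0.3):
                self.alpha = alpha              # weight given to the newest frame
                self.keep_thresh = keep_thresh  # minimum smoothed confidence to report
                self.class_conf = defaultdict(float)

            def update(self, detections):
                """detections: list of dicts like {"label": str, "conf": float}."""
                newest = defaultdict(float)
                for d in detections:
                    newest[d["label"]] = max(newest[d["label"]], d["conf"])
                # Blend the newest confidences with the running estimate per class.
                for label in set(self.class_conf) | set(newest):
                    self.class_conf[label] = (self.alpha * newest[label]
                                              + (1 - self.alpha) * self.class_conf[label])
                return {label: conf for label, conf in self.class_conf.items()
                        if conf >= self.keep_thresh}
        ```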

        Note that there are also more modern object detectors than YoloV4 that should be able to achieve better performance.

        But I do think this is an interesting concept and topic, so if you find related work that looks promising, we'd be happy to explore it further to some degree.