Yeah, there is more than one way to approach it. If you just use the OAK as a camera and do the processing on the Jetson, you could implement something like this square detector.
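To give a rough idea of the geometric check at the heart of such a detector, here is a minimal sketch in plain Python. It assumes you have already reduced each contour to a polygon on the Jetson (for example with OpenCV's `cv2.approxPolyDP`) and only decides whether a 4-point polygon looks like a square. The function names and the thresholds (`max_cos`, `max_side_ratio`) are my own placeholders, not values from any particular sample, so tune them for your target.

```python
import math

def angle_cosine(p0, p1, p2):
    """Absolute cosine of the angle at p1 between the edges p1->p0 and p1->p2."""
    ax, ay = p0[0] - p1[0], p0[1] - p1[1]
    bx, by = p2[0] - p1[0], p2[1] - p1[1]
    denom = math.hypot(ax, ay) * math.hypot(bx, by)
    if denom == 0:
        return 1.0  # degenerate corner (repeated point): treat as "not a right angle"
    return abs((ax * bx + ay * by) / denom)

def looks_like_square(quad, max_cos=0.1, max_side_ratio=1.2):
    """Return True if `quad` (4 (x, y) points in order) is roughly a square.

    max_cos and max_side_ratio are assumed tolerances, not tuned values.
    """
    if len(quad) != 4:
        return False
    # Side lengths between consecutive corners, wrapping around to the first one.
    sides = [math.dist(quad[i], quad[(i + 1) % 4]) for i in range(4)]
    if min(sides) == 0:
        return False
    # Every corner must be close to 90 degrees (cosine close to 0).
    corners_ok = all(
        angle_cosine(quad[i - 1], quad[i], quad[(i + 1) % 4]) <= max_cos
        for i in range(4)
    )
    # All four sides must have similar length, otherwise it is just a rectangle.
    return corners_ok and max(sides) / min(sides) <= max_side_ratio
```

You would call `looks_like_square` on each 4-vertex polygon you get from the contour step and keep the ones that pass, probably combined with a color or minimum-area filter.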
On the other hand, a neural network should also be able to detect this, given that the shape and color are very distinct. In that case you would need a relatively good and diverse dataset. I would say you'd need at least 100 images for the first iteration, then train the model. After training, you should inspect the results you get on the "test" data (data not used for training). If performance is not good enough, you should collect and annotate more images and repeat.
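For the "test" data part, a minimal sketch of holding out some of the annotated images before training could look like this. The 20% test fraction, the fixed seed, and the file names are just assumptions for illustration; most training tools have their own way of defining splits.

```python
import random

def split_dataset(image_paths, test_fraction=0.2, seed=0):
    """Shuffle the image list reproducibly and hold out a test set.

    Returns (train_paths, test_paths). test_fraction and seed are assumed defaults.
    """
    paths = sorted(image_paths)          # sort first so the split does not depend on input order
    random.Random(seed).shuffle(paths)   # fixed seed keeps the same split between runs
    n_test = max(1, round(len(paths) * test_fraction))
    return paths[n_test:], paths[:n_test]

# Hypothetical file names for a first iteration of 100 annotated images.
images = [f"img_{i:03d}.jpg" for i in range(100)]
train, test = split_dataset(images)
```

Keep the test images fixed when you add more data later, so the results from each iteration stay comparable.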