Ok, I have done considerable research on this very topic. I'll share what I've found.
First, images are read from the sensor in an RGGB pattern through a Bayer mosaic filter. This is what's called "raw". It has a bit depth of 8 bits per channel by default, so 24 bits per pixel. Optionally, you can choose 10 bits per channel, or 30 bits per pixel, but then you have to "pack" neighboring words (contiguous blocks of memory), such that you're using up a block and a half of memory for each image instead of a single block. Ok, now because the bit positions in a word carry weights of 2^0, 2^1, 2^2, 2^3, …, 2^n, a couple of things happen. The first word gets filled from bit position 0 to 15. Then the second word gets filled halfway, from 0 to 7. But think about this for a second…the values stored there aren't actually representing 2^0, 2^1, 2^2, etc.…they are really continuing on from 2^15…so 2^16, 2^17, and so on…so they read back as very large numbers. And then the bits landing in positions 8-15 of the second word are really digits that belong in positions 0-7…so there's an algorithm to manage this, where you swap most and least significant bits, juggle little-endian and big-endian formatting, and use a neat little bit-shift trick so you can unpack everything correctly later. So now that this is clear as mud, I'll move on to the next stages.
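If that made your head spin, here's a minimal sketch of the general unpacking idea, NOT the actual firmware code. It assumes 10-bit samples are packed back-to-back, LSB-first, into a little-endian byte stream; the real layout on the OAK depends on the sensor and firmware, so verify against your own raw dumps before trusting it.

```python
# Sketch only: unpack 10-bit samples packed contiguously (LSB-first) into bytes.
# Assumes little-endian packing; the OAK's actual layout may differ.

def unpack_raw10(packed: bytes, num_pixels: int) -> list:
    pixels = []
    bit_pos = 0  # running bit offset into the stream
    for _ in range(num_pixels):
        byte_idx, bit_off = divmod(bit_pos, 8)
        # Grab 3 bytes so the 10-bit sample is covered even when it straddles
        # a byte (or 16-bit word) boundary; pad with zeros at the very end.
        chunk = int.from_bytes(packed[byte_idx:byte_idx + 3].ljust(3, b"\x00"), "little")
        pixels.append((chunk >> bit_off) & 0x3FF)  # shift down, mask off 10 bits
        bit_pos += 10
    return pixels

# Two samples, 1023 and 1, packed back-to-back -> bytes 0xFF 0x07 0x00
print(unpack_raw10(bytes([0xFF, 0x07, 0x00]), 2))  # [1023, 1]
```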
As you can probably tell, the raw images would be the largest for the memory to deal with. I don't really understand how people are actually successfully using raw out of the OAK when it only has 512 MB of RAM to work with. More on that in a bit.
This is where the ISP comes in…actually, not really. Because it's not really the function of the ISP to downsample image frames, but rather the PSP, or post signal processor. But everybody seems to just call it the ISP, and the API uses "isp", so I'll continue with isp here. And if you're a little weirded out about how the isp could be doing all the PID (proportional integral derivative) control for auto white balance and auto exposure, applying all of the filters requested in the pipeline definition, and still somehow re-encoding frames that rapidly, it appears to be a separate process. That's not terribly important, except that it brings to light the entire crux of the image issues. So the "isp" can be set to do scaling, as you know…and there are rules…lots of rules…generally related to convolutional strides and video encoding. But what's important to know here is that when the isp frame is created from the raw, it is a yuv420p pixel-formatted image at 12 bits per pixel. Instead of using a red channel, a green channel, and a blue channel in non-linear colorspace (which is then typically gamma corrected, or estimated, back to linear lighting), it still uses 3 channels, but one of them is luminance (grayscale) and the other two are chroma (color), so that each pixel ends up with 8 bits of luminance and 4 bits of color. That's because human sight has been found to be more sensitive to changes in luminance than to changes in color.

So if you use the isp output from the camera, it will scale based on the numerator and denominator you input, FROM 4056x3040…however, for fun, set the scale to 1/1 and I'll bet your resultant image size will be 4032x3040. But then if you use the very next ratio on the spreadsheet, it goes back to using the full sensor. What I do know is that this is because every downstream process you might want to use will require stride-32 compliance. That means the width needs to be evenly divisible by 32…so it seems that these folks do that first one internally for you…but then you're on your own for the rest. Also, one of those aspect ratios is 0.75 and the other is 0.754…not much to worry about, but enough to mention so you include it in your forward work.
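For reference, here's roughly what that looks like in the DepthAI Python API, plus a quick stride-32 sanity check of your own. Treat the 2/3 scale as a placeholder; the pairs that are actually legal come from Luxonis' scaling spreadsheet for your sensor, and you should double-check getIspSize() against your installed depthai version.

```python
import depthai as dai

pipeline = dai.Pipeline()
cam = pipeline.create(dai.node.ColorCamera)
cam.setResolution(dai.ColorCameraProperties.SensorResolution.THE_12_MP)  # 4056x3040 sensor

# Downscale the "isp" output by numerator/denominator.
# Only certain pairs are valid -- consult the scaling spreadsheet.
cam.setIspScale(2, 3)

# Do your own stride-32 homework before wiring this into downstream nodes.
w, h = cam.getIspSize()
if w % 32 != 0:
    print(f"isp width {w} is not divisible by 32 -- expect padding or rejections downstream")
```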
All of the remaining image formats are derivatives of the isp image. So let's talk about preview next. Pay attention here, because the preview is center-cropped from the isp frame, and so is the video output. As such, both of those outputs are limited to 4K size, well, more specifically 3840x2160. The preview output has an rgb-style pixel format, or you can specify the color order as bgr if you're using OpenCV tools. HOWEVER, don't forget that this is not true rgb. Remember, rgb is 8 bits per channel, for 3 channels, resulting in 24 bits per pixel. And remember that the preview output is a derivative of the isp image, which was only 12 bits per pixel in yuv420p format. So where does it come up with the data to fill in all those extra bits? Yep, you know…it's all cap…made up…fake…fugazi…interpolated. However, if you're using neural networks like YOLO, or using OpenCV tools, they will expect either rgb or bgr format, so you'd end up needing to convert to this pixel format anyway…just understand that it's not providing any better image quality.
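So in practice, when you're feeding OpenCV or a network, the preview setup looks something like this (the 416x416 size is just a placeholder for whatever your network's input layer wants):

```python
import depthai as dai

pipeline = dai.Pipeline()
cam = pipeline.create(dai.node.ColorCamera)

cam.setPreviewSize(416, 416)   # match your network's input resolution
cam.setColorOrder(dai.ColorCameraProperties.ColorOrder.BGR)  # BGR for OpenCV, RGB otherwise
cam.setInterleaved(False)      # planar layout, which most NN blobs expect

xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName("preview")
cam.preview.link(xout.input)
```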
Then we have the video output. Video output goes a step further with the image data by converting the yuv420p pixel format to NV12, which is what enables efficient transport and encoding of frames. NV12 keeps the same 4:2:0 chroma subsampling but rearranges it: one full luminance plane plus one interleaved chroma plane, with each 2x2 block of pixels sharing its chroma samples. Remember, it's already starting out with half as many chroma bits as luminance bits, and those chroma bits are shared between neighboring pixels, which is what lets the video stream be compressed but also what limits the color fidelity. Anyway, if you do want to stream frames, the video encoder node expects frames in either NV12 or GRAY8 format…more on gray8 in a bit. So if you want to use the video encoder with color images, you'll want to use the video output (which is already NV12), or add an ImageManip node to convert to NV12.
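Here's the shape of that wiring, as I understand it on recent depthai releases (check setDefaultProfilePreset's signature against your version, since it changed at some point):

```python
import depthai as dai

pipeline = dai.Pipeline()
cam = pipeline.create(dai.node.ColorCamera)

enc = pipeline.create(dai.node.VideoEncoder)
enc.setDefaultProfilePreset(30, dai.VideoEncoderProperties.Profile.H265_MAIN)

# The "video" output is already NV12, so it can feed the encoder directly.
cam.video.link(enc.input)

xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName("h265")
enc.bitstream.link(xout.input)
```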
The "still" output is pretty much like the isp output, only it's generated only after it receives a control message to do so. You trigger a still message by running the "setCaptureStill()" method, which sets a flag to True, and then the next time around the loop, the camera looks and aees that the flag is high, so it pushes a single isp frame out the "still" output and then lowers that flag down to False again.
ImageManip nodes each require another thread and will consume a SHAVE core. You have 16 SHAVE cores available to start with. By default, neural networks are set to use 6 SHAVE cores and run 2 inference processes in parallel, so that's 12 SHAVE cores. Assuming you'll need an ImageManip node before feeding a neural network is probably a fair guess, so make that 13 SHAVE cores. I don't care what you or I may read about being able to run multiple neural networks in parallel while they magically share resources. If you're using Python, it's not happening, sorry. Python has a really cool (sarcasm) feature called the GIL. Do y'all's research on the Global Interpreter Lock before you resume overwhooping to folks about running multiple networks at the same time. Running multiple threads in Python and thinking they're running concurrently is crazy. Let's start there…
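For what it's worth, the knobs for that budget look roughly like this. Note that the SHAVE count itself is baked in when the blob is compiled, not set at runtime; the numbers below are just the defaults I described above, so verify them against your depthai version.

```python
import depthai as dai

pipeline = dai.Pipeline()
nn = pipeline.create(dai.node.NeuralNetwork)
nn.setBlobPath("model.blob")        # hypothetical blob, compiled with e.g. shaves=6

nn.setNumInferenceThreads(2)        # the 2 parallel inference processes mentioned above
nn.setNumNCEPerInferenceThread(1)   # neural compute engines per thread

# The 6-SHAVE figure comes from blob compilation, e.g. with blobconverter:
#   blobconverter.from_onnx(model="model.onnx", shaves=6)
```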
As far as unused outputs holding up resources: YES…YES they do…YES YES YES they do. You gotta do your homework on frame pooling. If you don't believe me, which you really shouldn't…I mean, don't believe anything some random guy on the Internet says about anything…do your research and validate everything I say. You'll find that I'm right, but still…good practice to verify. So here's how you check this out for yourself. There's a really cool (not sarcasm) method, Pipeline.serializeToJson(). Call it at the end of your pipeline definition and dump the result out to a JSON file. Then use your favorite pretty formatter and see what you see. I know…I know…it hurts…there are ways around it though. There's a lot that can be done…you can adjust the frame pools…you can add additional OAKs into the circuit…you can reduce the fps…you can go low-latency and pass references…you can commit to using the Script node and write everything in pure Python to make your own processes. But that's for another day.
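Concretely, something like this (depending on your depthai version, serializeToJson() may hand you back a dict or a JSON string, hence the hedge):

```python
import json
import depthai as dai

pipeline = dai.Pipeline()
# ... build your nodes and links here ...

dump = pipeline.serializeToJson()
if isinstance(dump, str):           # some versions return a string instead of a dict
    dump = json.loads(dump)

with open("pipeline_dump.json", "w") as f:
    json.dump(dump, f, indent=2)
# Now open pipeline_dump.json and look at each node's frame pool settings.
```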
So there's quite a bit more about this topic that's available to learn, and I've oversimplified a lot of it here to fit the constraints and context of this message thread. But I hope this can be of some use to someone. Feel free to hit me up with any further comments, questions, or concerns. No, I don't work for Luxonis, and my opinions are my own…but I am a customer, supporter, and fan.