Ok, I have done considerable research on this very topic. I'll share what I've found.
First, images are read from the sensor in an RGGB pattern through a Bayer mosaic filter. This is what's called "raw". It has a bit depth of 8 bits per channel by default, so 24 bits per pixel. Optionally, you can choose 10 bits per channel, or 30 bits per pixel, but then you have to "pack" neighboring words (contiguous blocks of memory), such that you're using up a block and a half of memory for each image instead of a single block. Ok, now because the bit positions in a word carry weights of 2^0, 2^1, 2^2, 2^3, …, 2^n, a couple of things happen. The first word gets filled from bit position 0 to 15. Then the second word gets filled halfway, from 0 to 7. But think about this for a second…the values stored there aren't actually representing 2^0, 2^1, 2^2, etc.…they are really continuing on from 2^15…so 2^16, 2^17, and so on…so they read back as very large numbers. And then the bits landing in positions 8-15 of the second word are really digits that belong in positions 0-7…so there's an algorithm to manage this, where you swap most and least significant bits, juggle little-endian and big-endian formatting, and use a neat little bit-shift trick so you can unpack everything correctly later. So now that this is clear as mud, I'll move on to the next stages.
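If that made your head spin, here's a minimal sketch of the general unpacking idea, NOT the actual firmware code. It assumes 10-bit samples are packed back-to-back, LSB-first, into a little-endian byte stream; the real layout on the OAK depends on the sensor and firmware, so verify against your own raw dumps before trusting it.

```python
# Sketch only: unpack 10-bit samples packed contiguously (LSB-first) into bytes.
# Assumes little-endian packing; the OAK's actual layout may differ.

def unpack_raw10(packed: bytes, num_pixels: int) -> list:
    pixels = []
    bit_pos = 0  # running bit offset into the stream
    for _ in range(num_pixels):
        byte_idx, bit_off = divmod(bit_pos, 8)
        # Grab 3 bytes so the 10-bit sample is covered even when it straddles
        # a byte (or 16-bit word) boundary; pad with zeros at the very end.
        chunk = int.from_bytes(packed[byte_idx:byte_idx + 3].ljust(3, b"\x00"), "little")
        pixels.append((chunk >> bit_off) & 0x3FF)  # shift down, mask off 10 bits
        bit_pos += 10
    return pixels

# Two samples, 1023 and 1, packed back-to-back -> bytes 0xFF 0x07 0x00
print(unpack_raw10(bytes([0xFF, 0x07, 0x00]), 2))  # [1023, 1]
```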
As you can probably tell, the raw images would be the largest for the memory to deal with. I don't really understand how people are actually successfully using raw out of the OAK when it only has 512 MB of RAM to work with. More on that in a bit.
This is where the ISP comes in…actually, not really. Because it's not really the function of the ISP to downsample image frames, but rather the PSP, or post signal processor. But everybody seems to just call it the ISP, and the API uses "isp", so I'll continue with isp here. And if you're a little weirded out about how the isp could be doing all the PID (proportional integral derivative) control for auto white balance and auto exposure, applying all of the filters requested in the pipeline definition, and still somehow re-encoding frames that rapidly, it appears to be a separate process. That's not terribly important, except that it brings to light the entire crux of the image issues. So the "isp" can be set to do scaling, as you know…and there are rules…lots of rules…generally related to convolutional strides and video encoding. But what's important to know here is that when the isp frame is created from the raw, it is a yuv420p pixel-formatted image at 12 bits per pixel. Instead of using a red channel, a green channel, and a blue channel in non-linear colorspace (which is then typically gamma corrected, or estimated, back to linear lighting), it still uses 3 channels, but one of them is luminance (grayscale) and the other two are chroma (color), so that each pixel ends up with 8 bits of luminance and 4 bits of color. That's because human sight has been found to be more sensitive to changes in luminance than to changes in color.

So if you use the isp output from the camera, it will scale based on the numerator and denominator you input, FROM 4056x3040…however, for fun, set the scale to 1/1 and I'll bet your resultant image size will be 4032x3040. But then if you use the very next ratio on the spreadsheet, it goes back to using the full sensor. What I do know is that this is because every downstream process you might want to use will require stride-32 compliance. That means the width needs to be evenly divisible by 32…so it seems that these folks do that first one internally for you…but then you're on your own for the rest. Also, one of those aspect ratios is 0.75 and the other is 0.754…not much to worry about, but enough to mention so you include it in your forward work.
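For reference, here's roughly what that looks like in the DepthAI Python API, plus a quick stride-32 sanity check of your own. Treat the 2/3 scale as a placeholder; the pairs that are actually legal come from Luxonis' scaling spreadsheet for your sensor, and you should double-check getIspSize() against your installed depthai version.

```python
import depthai as dai

pipeline = dai.Pipeline()
cam = pipeline.create(dai.node.ColorCamera)
cam.setResolution(dai.ColorCameraProperties.SensorResolution.THE_12_MP)  # 4056x3040 sensor

# Downscale the "isp" output by numerator/denominator.
# Only certain pairs are valid -- consult the scaling spreadsheet.
cam.setIspScale(2, 3)

# Do your own stride-32 homework before wiring this into downstream nodes.
w, h = cam.getIspSize()
if w % 32 != 0:
    print(f"isp width {w} is not divisible by 32 -- expect padding or rejections downstream")
```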
All of the remaining image formats are derivatives of the isp image. So let's talk about preview next. Pay attention here, because the preview is center-cropped from the isp frame, and so is the video output. As such, both of those outputs are limited to 4K size, well, more specifically 3840x2160. The preview output has an rgb-style pixel format, or you can specify the color order as bgr if you're using OpenCV tools. HOWEVER, don't forget that this is not true rgb. Remember, rgb is 8 bits per channel, for 3 channels, resulting in 24 bits per pixel. And remember that the preview output is a derivative of the isp image, which was only 12 bits per pixel in yuv420p format. So where does it come up with the data to fill in all those extra bits? Yep, you know…it's all cap…made up…fake…fugazi…interpolated. However, if you're using neural networks like YOLO, or using OpenCV tools, they will expect either rgb or bgr format, so you'd end up needing to convert to this pixel format anyway…just understand that it's not providing any better image quality.
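So in practice, when you're feeding OpenCV or a network, the preview setup looks something like this (the 416x416 size is just a placeholder for whatever your network's input layer wants):

```python
import depthai as dai

pipeline = dai.Pipeline()
cam = pipeline.create(dai.node.ColorCamera)

cam.setPreviewSize(416, 416)   # match your network's input resolution
cam.setColorOrder(dai.ColorCameraProperties.ColorOrder.BGR)  # BGR for OpenCV, RGB otherwise
cam.setInterleaved(False)      # planar layout, which most NN blobs expect

xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName("preview")
cam.preview.link(xout.input)
```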
Then we have the video output. Video output goes a step further with the image data by converting the yuv420p pixel format to NV12, which is what enables efficient transport and encoding of frames. NV12 keeps the same 4:2:0 chroma subsampling but rearranges it: one full luminance plane plus one interleaved chroma plane, with each 2x2 block of pixels sharing its chroma samples. Remember, it's already starting out with half as many chroma bits as luminance bits, and those chroma bits are shared between neighboring pixels, which is what lets the video stream be compressed but also what limits the color fidelity. Anyway, if you do want to stream frames, the video encoder node expects frames in either NV12 or GRAY8 format…more on gray8 in a bit. So if you want to use the video encoder with color images, you'll want to use the video output (which is already NV12), or add an ImageManip node to convert to NV12.
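Here's the shape of that wiring, as I understand it on recent depthai releases (check setDefaultProfilePreset's signature against your version, since it changed at some point):

```python
import depthai as dai

pipeline = dai.Pipeline()
cam = pipeline.create(dai.node.ColorCamera)

enc = pipeline.create(dai.node.VideoEncoder)
enc.setDefaultProfilePreset(30, dai.VideoEncoderProperties.Profile.H265_MAIN)

# The "video" output is already NV12, so it can feed the encoder directly.
cam.video.link(enc.input)

xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName("h265")
enc.bitstream.link(xout.input)
```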
The "still" output is pretty much like the isp output, only it's generated only after it receives a control message to do so. You trigger a still message by running the "setCaptureStill()" method, which sets a flag to True, and then the next time around the loop, the camera looks and aees that the flag is high, so it pushes a single isp frame out the "still" output and then lowers that flag down to False again.
ImageManip nodes each require another thread and will consume a SHAVE core. You have 16 SHAVE cores available to start with. By default, neural networks are set to use 6 SHAVE cores and run 2 inference processes in parallel, so that's 12 SHAVE cores. Assuming you'll need an ImageManip node before feeding a neural network is probably a fair guess, so make that 13 SHAVE cores. I don't care what you or I may read about being able to run multiple neural networks in parallel while they magically share resources. If you're using Python, it's not happening, sorry. Python has a really cool (sarcasm) feature called the GIL. Do y'all's research on the Global Interpreter Lock before you resume overwhooping to folks about running multiple networks at the same time. Running multiple threads in Python and thinking they're running concurrently is crazy. Let's start there…
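For what it's worth, the knobs for that budget look roughly like this. Note that the SHAVE count itself is baked in when the blob is compiled, not set at runtime; the numbers below are just the defaults I described above, so verify them against your depthai version.

```python
import depthai as dai

pipeline = dai.Pipeline()
nn = pipeline.create(dai.node.NeuralNetwork)
nn.setBlobPath("model.blob")        # hypothetical blob, compiled with e.g. shaves=6

nn.setNumInferenceThreads(2)        # the 2 parallel inference processes mentioned above
nn.setNumNCEPerInferenceThread(1)   # neural compute engines per thread

# The 6-SHAVE figure comes from blob compilation, e.g. with blobconverter:
#   blobconverter.from_onnx(model="model.onnx", shaves=6)
```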
As far as unused outputs holding up resources: YES…YES they do…YES YES YES they do. You gotta do your homework on frame pooling. If you don't believe me, which you really shouldn't…I mean, don't believe anything some random guy on the Internet says about anything…do your research and validate everything I say. You'll find that I'm right, but still…good practice to verify. So here's how you check this out for yourself. There's a really cool (not sarcasm) method, Pipeline.serializeToJson(). Call it at the end of your pipeline definition and dump the result out to a JSON file. Then use your favorite pretty formatter and see what you see. I know…I know…it hurts…there are ways around it though. There's a lot that can be done…you can adjust the frame pools…you can add additional OAKs into the circuit…you can reduce the fps…you can go low-latency and pass references…you can commit to using the Script node and write everything in pure Python to make your own processes. But that's for another day.
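Concretely, something like this (depending on your depthai version, serializeToJson() may hand you back a dict or a JSON string, hence the hedge):

```python
import json
import depthai as dai

pipeline = dai.Pipeline()
# ... build your nodes and links here ...

dump = pipeline.serializeToJson()
if isinstance(dump, str):           # some versions return a string instead of a dict
    dump = json.loads(dump)

with open("pipeline_dump.json", "w") as f:
    json.dump(dump, f, indent=2)
# Now open pipeline_dump.json and look at each node's frame pool settings.
```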
So there's quite a bit more about this topic that's available to learn, and I've oversimplified a lot of it here to fit the constraints and context of this message thread. But I hope this can be of some use to someone. Feel free to hit me up with any further comments, questions, or concerns. No, I don't work for Luxonis, and my opinions are my own…but I am a customer, supporter, and fan.