Anyone know what kind of interpolation is performed for the ISP scaling? Is it nearest-neighbor, bilinear, bicubic, or something else?
robotaiguy

- Apr 22, 2024
I've been doing considerable research on this over the past week, myself. And unfortunately, I was reminded of something I learned in my Physics (Optics) class about a phenomenon called the "diffraction limit" and "Airy disks".
Generally speaking, the diffraction limit is reached when the Airy disk exceeds 2 times the pixel pitch of the sensor. But it also varies with the wavelength of the light, so most calculations I see seem to use 520nm green light for examples. I wonder if using a lens assembly with a larger aperture would help with some of our image issues. On that topic, I think the more greenish tint in the IMX582 images has more to do with this diffraction limitation exacerbated by the RGGB pattern of the Bayer filter: with twice as many green pixels in the pattern, the diffraction limit hits first for the green-to-red end of the spectrum. Cool light filtering that leans more toward blue light should also help.
But please do your own homework on what I'm remembering here; there are some good articles available.
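If anyone wants to sanity-check that rule of thumb, the back-of-the-envelope math is just the Airy disk diameter (roughly 2.44 x wavelength x f-number) against twice the pixel pitch. The f-number and pixel pitch below are placeholders for illustration, not measured values for any OAK module:

def airy_disk_diameter_um(wavelength_nm, f_number):
    # First-minimum diameter of the Airy disk: d = 2.44 * lambda * N
    return 2.44 * (wavelength_nm / 1000.0) * f_number  # nm -> um

wavelength_nm = 520    # green light, as in most example calculations
f_number = 2.8         # assumed lens aperture (check your module's spec)
pixel_pitch_um = 0.8   # assumed pixel pitch (check your sensor's datasheet)

d = airy_disk_diameter_um(wavelength_nm, f_number)
print(f"Airy disk ~{d:.2f} um vs. 2x pixel pitch {2 * pixel_pitch_um:.2f} um")
print("Diffraction-limited" if d > 2 * pixel_pitch_um else "Not diffraction-limited yet")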
- In IMX582 HDR
If there's any update on this, I'd love to know as well. I'm experimenting with your IMX582 also.
I think the best you can do for rebooting the OAK in standalone mode is, if your PoE switch has a web API, to use the Script node to send an HTTP message that power-cycles the port it's connected to. I've done that with a few different switches from Omnitron, Ubiquiti, and MikroTik.
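For anyone curious, the shape of it is below. This is an untested sketch: the switch IP, endpoint path, and missing auth are all placeholders (every vendor's API is different), and it assumes http.client is available in the Script node on your PoE device, running on the LEON_CSS side:

import depthai as dai

pipeline = dai.Pipeline()
script = pipeline.create(dai.node.Script)
script.setProcessor(dai.ProcessorType.LEON_CSS)  # network access on PoE devices
script.setScript("""
import http.client
# Hypothetical web API exposed by the PoE switch -- replace host/path/auth for your switch
conn = http.client.HTTPConnection('192.168.254.1', 80)
conn.request('POST', '/api/poe/port/5/cycle')
resp = conn.getresponse()
node.warn('Switch replied: ' + str(resp.status))
""")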
UPDATE AND CORRECTION:
We received 2 new units from Amazon and Mouser, one with 0.0.22 and the other with 0.0.25, and both worked as expected.
Then we went ahead and upgraded the bootloader on both units to 0.0.26, and they each continued to work as expected. So the problem is NOT the bootloader version, which is great news.
We have not identified the source of the issue. But we did find that within 5 new OAK-D-POE-PRO-AF units there were 3 different AF modules, although each sensor identifies as the Sony IMX378. The first and third look like mirror opposites of each other, but if you look closely you can see that one looks like it wants to spin clockwise and the other counterclockwise. The second one is completely different, but was the most frequently found. After I get back to the installation location, I can report back which one had the locked lens. The one on my bench I have to retest, because I failed to separate it from the pack and I'm unsure which of them it was now. But the one that's installed is the only one powered up on-site, so it should be easy to tell after the 1st.
Yes, we have used that same script. We even stripped it down to only have focus in it, to make it simpler, and still had issues. Are you saying that you are able to move the lens position manually with an OAK-D-POE-PRO-AF on bootloader 0.26? How are you determining that it's moving the lens: by the image getting blurry and then clear, or by the value changing?
How about this: how can I get that bootloader 0.22 that does work for us? We can't find it published anywhere on the web; can't find it in the Artifactory or anything. I can only find it on the units that came from the factory with it. I have a couple.
Connected to OAK at 192.168.254.112 with MX id: 18443010C1B79C0F00
Bootloader version: 0.0.22
bootloader type: Type.NETWORK
NETWORK Bootloader, is User Bootloader: False
Memory 'Memory.FLASH' size: 67108864, info: JEDEC ID: C2 25 3A
Application name: , firmware version: f033fd9c7eb0b3578d12f90302e87759c78cfb36
Memory 'EMMC' not available...
How can I extract this version 0.0.22 from this OAK and save it to a file?
Because at this point I'll give up the IMU sensor drivers for a short period until we can work this out, but every OAK-D-POE-PRO-AF that I have with this bootloader works; the others that I upgraded to 0.0.26 don't.
Yesterday, we tested exactly what you suggested. We went back through every release that had a bootloader change. We tried 0.21, 0.23, 0.24, and 0.26, and they all have the same issue. However, I bought 4 new OAK-D-POE-PROs with AF and they arrived with 0.22 bootloaders, and I can call setManualFocus(lens_pos) and it will trigger the MOVE_LENS command to the lens_pos value. If I update the bootloader to any version aside from 0.22, it will autofocus when it starts, but using the setManualFocus(lens_pos) method only updates the lens position value; it doesn't trigger the MOVE_LENS command, and you don't hear the click.
Do you understand what I'm describing?
I have tested it with an OAK-1-POE, an OAK-D-POE, and 2 OAK-D-POE-PROs with AF color cameras. After updating the bootloaders via device_manager.py (the OAK-D-POE and OAK-1-POE were on 0.21 previously and worked as expected), the AF functionality is very strange. The camera will autofocus one time: the first time it wakes up, you can see it going back and forth in and out of focus until it locks onto a lens position. After that first time, you never see it do that again. You can call setManualFocus(lens_pos) and ctrl.setAutoFocusMode(dai.CameraControl.AutoFocusMode.OFF), and then if you try to set ANY lens position, it will record the change in the lens_pos variable, but it doesn't seem to actually call the MOVE_LENS function embedded within the wrapper. This repeated for each of the 4 cameras: once updated, they no longer react to any lens position change, other than in the metadata.
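For reference, the stripped-down test we're running is essentially the shape below (stream names and the lens position value are just illustrative): an XLinkIn control queue into inputControl, a single setManualFocus(), and a check of the lens position reported back in the frame metadata.

import depthai as dai

pipeline = dai.Pipeline()
camRgb = pipeline.create(dai.node.ColorCamera)
controlIn = pipeline.create(dai.node.XLinkIn)
controlIn.setStreamName('control')
controlIn.out.link(camRgb.inputControl)
xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName('isp')
camRgb.isp.link(xout.input)

with dai.Device(pipeline) as device:
    controlQueue = device.getInputQueue('control')
    ispQueue = device.getOutputQueue('isp', maxSize=4, blocking=False)

    ctrl = dai.CameraControl()
    ctrl.setAutoFocusMode(dai.CameraControl.AutoFocusMode.OFF)
    ctrl.setManualFocus(150)   # 0..255; should trigger MOVE_LENS (you can hear the click)
    controlQueue.send(ctrl)

    frame = ispQueue.get()
    print('Reported lensPosition:', frame.getLensPosition())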
What's the easiest way to downgrade the bootloaders? I have found that 0.21 and 0.22 work fine. 0.26 definitely isn't working as expected for me.
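In case it helps anyone else searching: the route I've been experimenting with (an unverified sketch, not an official procedure) is to pip-install an older depthai release, check which bootloader it embeds, and flash that embedded bootloader. Whether any published release actually embeds the version you want is exactly what I haven't been able to confirm.

import depthai as dai

# Check what bootloader this depthai release would flash before touching anything
print('Embedded bootloader:', dai.DeviceBootloader.getEmbeddedBootloaderVersion())

(found, info) = dai.DeviceBootloader.getFirstAvailableDevice()
if found:
    bl = dai.DeviceBootloader(info, True)   # True = allow (re)flashing the bootloader
    progress = lambda p: print(f'Flashing progress: {p * 100:.1f}%')
    bl.flashBootloader(progress)            # flashes the bootloader embedded in this depthai version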
Out of curiosity, what time cost are you seeing for a crop? I noticed I was having difficulty initially simply because the pipelined nature of this device requires that it hold more than one image frame at a time, so that there's always an ingress/egress flow going on. I use ISP scaling of 3/4 because there's very little degradation, yet the frame is only about 56% of the original size.
I need to correct something after I conducted a study today with various setNumFramesPool settings. My pipeline is a little different from yours, as I'm going to a VideoEncoder after the manip and you're going into a NN, but the memory consumption up to this point should be similar enough to make sense. So this goes from the sensor, to 3/4 ISP scaling, to the isp output, to an ImageManip for conversion to NV12 for the VideoEncoder input, where I also set various frame pool values.
Not setting setNumFramesPool() at all defaults to setNumFramesPool(raw=3, isp=3, preview=4, video=4, still=4). I expected that since it was creating those frame pools it would consume resources, but I was incorrect: the amount of DDR consumed for (3, 3, 4, 4, 4) was the same as for (3, 3, 0, 0, 0) when I use the isp output. Notice, however, that raw has to be similar to isp, or the isp will be starved, because the isp frames come from the raw frames. I got the most efficient results with (3, 3, 0, 0, 0) and setting numFramesPool=3 on the VideoEncoder.
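Roughly, the configuration from that study looked like the sketch below. The scale, FPS, and max output frame size are illustrative for my 3/4-scaled 12 MP pipeline; adjust them for yours.

import depthai as dai

pipeline = dai.Pipeline()

camRgb = pipeline.create(dai.node.ColorCamera)
camRgb.setResolution(dai.ColorCameraProperties.SensorResolution.THE_12_MP)
camRgb.setIspScale(3, 4)                    # 3/4 scaling from 4056x3040
camRgb.setNumFramesPool(3, 3, 0, 0, 0)      # (raw, isp, preview, video, still)

manip = pipeline.create(dai.node.ImageManip)
manip.initialConfig.setFrameType(dai.ImgFrame.Type.NV12)   # yuv420p -> NV12 for the encoder
manip.setMaxOutputFrameSize(3040 * 2280 * 3 // 2)          # NV12 budget for the scaled frame
camRgb.isp.link(manip.inputImage)

videoEnc = pipeline.create(dai.node.VideoEncoder)
videoEnc.setDefaultProfilePreset(10, dai.VideoEncoderProperties.Profile.MJPEG)
videoEnc.setNumFramesPool(3)                # the encoder-side pool mentioned above
manip.out.link(videoEnc.input)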
But I highly recommend adding this to your pipeline; it starts to make you think of all kinds of options:
import time
from json import dump
import depthai as dai

time_now = int(time.time())   # added so the filename below resolves; use your own timestamp

# Set logging level
device.setLogLevel(level=dai.LogLevel.DEBUG)
device.setLogOutputLevel(dai.LogLevel.DEBUG)

# Serialize pipeline to JSON for visualization
with open(f"output_data/pipeline_{time_now}.json", "w", encoding="utf-8") as f:
    dump(pipeline.serializeToJson(), f, ensure_ascii=False, indent=2)
So you have your sensor resolution set at THE_12_MP and the ISP scale at 1/2 (or none at all), and then you link the isp output as the manip input image and crop the ROI there? I would expect this to pass a 2028x1520 pixel image of the full sensor to the manip, which should then crop from those dimensions and output your ROI.
And are you saying that preview out downsamples the whole frame instead of cropping from center? Preview out and ISP scaling are not mutually exclusive, for anyone reading this. So you can use ISP scaling to scale the full frame to a lower resolution, and then still use preview to center-crop an RGB/BGR image from that newly scaled resolution. Use setPreviewSize to set the crop size of the preview output relative to the ISP-scaled size.
If I understood your first post, you want to capture the central 1920x1080 pixels, but output the same quality as the isp output, and do so as efficiently as possible?
If that's the case, setting the sensor resolution to 1080P, setIspScale(1, 1), and linking the camRgb.isp output should get you a center-cropped 1080P frame from the full sensor, with no interpolation, at yuv420p. However, where you go next will determine what happens with your image quality. If you go to a manip and set the pixel format to BGR, then you're interpolating up to 24 bits per pixel from 12, which might result in a fuzzier image. If you have a host connected, another option is to use the Script node to output to the host, perform Pillow bicubic transforms instead of nearest-neighbor as DepthAI does, and then pipe the image back into the OAK for inference. Image quality will be superior, but there is a round trip to consider time-wise. If you're on USB3, it can be less of an issue than PoE. I'm still trying to figure out how to hack the library and recompile with bicubic resizes (just kidding guys...but for real, help a brutha out).
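Putting that first suggestion into code, the sketch below is the whole idea (stream names are arbitrary): 1080P sensor resolution for the center crop, no ISP scaling, and the isp output straight out.

import depthai as dai

pipeline = dai.Pipeline()
camRgb = pipeline.create(dai.node.ColorCamera)
camRgb.setResolution(dai.ColorCameraProperties.SensorResolution.THE_1080_P)  # center crop off the sensor
camRgb.setIspScale(1, 1)        # no scaling, so no interpolation
xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName('isp')
camRgb.isp.link(xout.input)     # yuv420p, 12 bits per pixel

with dai.Device(pipeline) as device:
    q = device.getOutputQueue('isp', maxSize=4, blocking=False)
    frame = q.get()
    print(frame.getWidth(), frame.getHeight(), frame.getType())   # expecting 1920x1080, YUV420p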
** Answering the question from 6 days ago in the original post (sorry I never actually addressed this):
ISP scaling downsamples the image, such that the resultant image has the same FOV as the original, but with lower pixel density. Setting the sensor resolution center-crops from the full frame at that selected resolution. So 1080P outputs the 1920x1080 pixels closest to the horizontal and vertical center of the 4056x3040 frame, thus discarding everything outside of that region.
Aside from that, the sensor resolution doesn't change the pixel format: it will be yuv420p, 12 bits per pixel. The output you select does directly affect the pixel format:
preview = fake (interpolated) RGB/BGR, posing as 24 bits per pixel, but really still just half of that with guesswork in between.
isp = yuv420p, 12 bits per pixel (best possible image quality aside from raw)
video = NV12, 12 bits per pixel; the same 4:2:0 chroma subsampling as yuv420p, just rearranged into the semi-planar layout the encoder wants
still = NV12, 12 bits per pixel just like the video output, but it's allowed to be full frame size, whereas video is limited to 3840x2160; a single shot that requires a boolean flag (setCaptureStill) to be raised to trigger it
raw = raw10. This packs 4 pixels across 5 bytes. This is considerably more complex, involving little-endian vs. big-endian bit ordering considerations and bit shifting to unpack. If you want more info on this I'd be glad to go deeper, but I don't think you want to deal with it, as it would add much more overhead to an already taxed system. An example of the unpacking is sketched below.
In case anyone gets really curious about what's happening at the block level, it's important to keep in mind that DepthAI uses 16-bit words, and while some functions will populate bits 0-7 and then 8-15, some populate 0-7 in one word and then go to the next word entirely, filling 0-7 for the next piece of data, so bits 8-15 of one word might hold data completely unrelated to what's in bits 0-7 of that same word. So be careful if using bit shifting, and only do so when it's documented that the pixels in 8-15 are related to 0-7, as with raw10 unpacking.
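If you do decide to touch raw10, here's a host-side unpacking sketch in numpy, assuming the common MIPI-style packing (4 bytes of MSBs followed by 1 byte holding the four pixels' 2-bit LSBs). Verify the byte/bit order against your own frames before trusting it:

import numpy as np

def unpack_raw10(buf, width, height):
    # 5 packed bytes -> 4 pixels of 10 bits each
    data = np.frombuffer(buf, dtype=np.uint8).reshape(-1, 5).astype(np.uint16)
    msb = data[:, :4]     # upper 8 bits of each of the 4 pixels
    lsb = data[:, 4]      # the four 2-bit remainders packed into the 5th byte
    pixels = (msb << 2) | np.stack([(lsb >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    return pixels.reshape(height, width)   # uint16 Bayer mosaic, values 0..1023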
Ok, I have done considerable research on this very topic. I'll share what I've found.
First, images are read from the sensor in an RGGB pattern because of the Bayer mosaic filter. This is what's called "raw". Each pixel carries a single color sample, 8 bits deep by default; optionally you can use 10 bits per pixel, but then you have to "pack" neighboring words (contiguous blocks of memory) such that you're using up a block and a half of memory for each image instead of a single block. Now, because bit positions carry weights like 2^0, 2^1, 2^2, 2^3, ..., 2^n, a couple of things happen. The first word gets filled from position 0 to position 15. Then the second word gets filled halfway, from 0 to 7. But think about this for a second: the values stored there aren't actually representing 2^0, 2^1, 2^2, etc. They're really continuing from bit 15, so 2^16, 2^17, and so on, which are very large numbers. And the bits populating positions 8-15 of the second word are actually representing digits that should be in positions 0-7. So there's an algorithm to manage this, where you swap most and least significant bytes, using little-endian vs. big-endian formatting and a neat little bit-shift trick, and then you can unpack everything correctly later. Now that this is clear as mud, I'll move on to the next stages.
As you can probably tell, raw images would be the largest for the memory to deal with. I don't really understand how people are actually successfully using raw out of the OAK when it only has 512MB of RAM to work with. More on that in a bit.
This is where the ISP comes in... actually, not really. It's not really the function of the ISP to downsample image frames, but rather the PSP, or post signal processor. But everybody just calls it the ISP, and the API uses isp, so I'll continue with isp here. If you're a little weirded out about how the ISP is doing all the PID (proportional-integral-derivative) control for auto white balance and auto exposure, applying all the filters requested in the pipeline definition, and still somehow re-encoding frames that rapidly, it seems to be a separate process. That's not terribly important, except that it brings to light the entire crux of the image issues.

So the "isp" can be set to do scaling, as you know, and there are rules... lots of rules... generally related to convolutional strides and video encoding. What's important to know here is that when the isp frame is created from the raw, it is a yuv420p-formatted image at 12 bits per pixel. Instead of using a red channel, a blue channel, and a green channel in a non-linear colorspace (which is then typically gamma corrected, or estimated, to linear lighting), it also uses 3 channels, but one of them is luminance (grayscale) and the other two are chroma (color), so that each pixel averages 8 bits of luminance and 4 bits of color. This works because human sight has been found to be more sensitive to changes in luminance than to changes in color.

So if you use the isp output from the camera, it will scale based on the numerator and denominator you input, FROM 4056x3040. However, for fun, set the scale to 1/1 and I'll bet your resultant image size will be 4032x3040. But then if you use the very next ratio on the spreadsheet, it goes back to using the full sensor width. What I do know is that this is because every downstream process you might want to use requires stride-32 compliance: the width needs to be evenly divisible by 32. So it seems these folks handle that first one internally for you, but then you're on your own for the rest. Note that one of those aspect ratios is 0.75 and the other is 0.754: not much to worry about, but enough to mention so you include it in your forward work.
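If you want to sanity-check a scale ratio before building the pipeline, the arithmetic is trivial. The sketch below just shows the raw scaled width, whether it's already divisible by 32, and what a stride-32-aligned width would look like (the 1/1 row reproduces that 4056 -> 4032 observation); the documented list of accepted ratios is still the source of truth:

def isp_scaled_size(num, den, w=4056, h=3040):
    sw, sh = (w * num) // den, (h * num) // den
    aligned = sw - (sw % 32)            # nearest stride-32 width at or below the raw result
    return sw, sh, sw % 32 == 0, aligned

for num, den in [(1, 1), (3, 4), (2, 3), (1, 2)]:
    sw, sh, ok, aligned = isp_scaled_size(num, den)
    print(f"{num}/{den}: {sw}x{sh}, stride-32 already: {ok}, aligned width: {aligned}")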
All of the remaining image formats are derived from the isp image. So let's talk about preview next. The preview is center-cropped from the isp image, as is the video output. As such, both of these outputs are limited to 4K size, or more specifically 3840x2160. The preview output has a pixel format like RGB, or you can specify the color order to be BGR if you're using OpenCV tools. HOWEVER, don't forget that this is not true RGB. Remember, RGB is 8 bits per channel across 3 channels, resulting in 24 bits per pixel, and the preview output is a derivative of the isp image, which was only 12 bits per pixel in yuv420p format. So where does it come up with the data to fill in all those extra bits? Yep, you know... it's all cap... made up... fake... fugazi... interpolated. However, if you're using neural networks like YOLO, or using OpenCV tools, they will expect either RGB or BGR format, so you'd end up needing to convert to this pixel format anyway; just understand that it's not providing any better image quality.
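If a downstream tool genuinely needs RGB/BGR (imshow, a YOLO blob, etc.), the usual preview setup looks like the sketch below; just go in knowing the extra bits are interpolated, not real new data. The sizes here are illustrative.

import depthai as dai

pipeline = dai.Pipeline()
camRgb = pipeline.create(dai.node.ColorCamera)
camRgb.setResolution(dai.ColorCameraProperties.SensorResolution.THE_12_MP)
camRgb.setIspScale(1, 2)                                        # scale the full FOV first
camRgb.setPreviewSize(640, 640)                                 # center crop for the NN/imshow
camRgb.setColorOrder(dai.ColorCameraProperties.ColorOrder.BGR)  # OpenCV-friendly order
camRgb.setInterleaved(False)                                    # planar, as most NN blobs expect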
Then we have the video output. Video output goes a step further by converting the yuv420p pixel format to NV12, to enable efficient transport of frames into the encoder. NV12 keeps the same 4:2:0 chroma sharing as yuv420p (each 2x2 block of pixels already shares one set of chroma samples); the difference is that the U and V values are interleaved into a single semi-planar layout, which is what the hardware encoder wants. That's what lets the video stream be compressed efficiently. If you do want to stream frames, the VideoEncoder node expects frames to be either NV12 or GRAY8 format (more on GRAY8 in a bit). So if you want to use the VideoEncoder with color images, you'll want to use the video output, which is already NV12, or add an ImageManip node to convert to NV12.
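A minimal sketch of the no-ImageManip route, since the video output is already NV12 (the profile and FPS are arbitrary here):

import depthai as dai

pipeline = dai.Pipeline()
camRgb = pipeline.create(dai.node.ColorCamera)
camRgb.setResolution(dai.ColorCameraProperties.SensorResolution.THE_4_K)
videoEnc = pipeline.create(dai.node.VideoEncoder)
videoEnc.setDefaultProfilePreset(30, dai.VideoEncoderProperties.Profile.H265_MAIN)
camRgb.video.link(videoEnc.input)        # NV12 straight from the video output, no conversion node
xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName('bitstream')
videoEnc.bitstream.link(xout.input)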
The "still" output is pretty much like the isp output, only it's generated only after it receives a control message to do so. You trigger a still message by running the "setCaptureStill()" method, which sets a flag to True, and then the next time around the loop, the camera looks and aees that the flag is high, so it pushes a single isp frame out the "still" output and then lowers that flag down to False again.
ImageManip nodes each require another thread and will consume a SHAVE core. You have 16 SHAVE cores available to start with. By default, neural networks are set to use 6 SHAVE cores with 2 processes in parallel, so that's 12 SHAVE cores. Assuming you'll need an ImageManip prior to inputting into a neural network is probably a fair guess, so make that 13 SHAVE cores. I don't care what you or I may read about being able to run multiple neural networks in parallel and magically sharing resources; if you're using Python, it's not happening, sorry. Python has a really cool (sarcasm) feature called the GIL. Do y'all's research on the Global Interpreter Lock before you resume overwhooping to folks about multiple networks at the same time. Running multiple threads in Python thinking they're running concurrently is crazy. Let's start there...
As far as unused outputs holding up resources: YES... YES they do... YES YES YES they do. You gotta do your homework on frame pooling. If you don't believe me, which you really shouldn't... I mean, don't believe anything some random guy on the Internet says about anything. Do your research and validate everything I say. You'll find that I'm right, but still... good practice to verify. So here's how you check this out for yourself. There's a really cool (not sarcasm) method, Pipeline.serializeToJson(). Add that to the end of your pipeline and dump it out to a JSON file. Then use your favorite pretty formatter and see what you see. I know... I know... it hurts... there are ways around it, though. There's a lot that can be done: you can adjust the frame pools, you can add additional OAKs into the circuit, you can reduce the FPS, you can go low-latency and pass references, you can commit to using the Script node and write everything in pure Python and make your own processes. But that's for another day.
So there's quite a bit more about this topic that's available to learn, and I've oversimplified a lot of it here to fit the constraints and context of this message thread. But I hope this can be of some use to someone. Feel free to hit me up with any further comments, questions, or concerns. No, I don't work for Luxonis, and my opinions are my own... but I am a customer, supporter, and fan.

Hitching along with Erik's instinct on this, it made me recall something similar early on, before I learned how to deal with it. If I was on my laptop with a WiFi connection for internet, and then connected an Ethernet cable, either to an onboard or USB or Thunderbolt Ethernet adapter, then the WiFi and the wired LAN connection would both keep alternately dropping, and it would seem like my internet was horrible and my LAN more complex than it was. The bottom line was that they can't work like that. For my MacBook, I had to learn about "Set Service Order" to ensure that WiFi was always listed at the top of the list, and for Linux I had to find out how to set up my wired LAN connection to "only use for resources on this network" (if I remember correctly, it's in the "routes" section in Kubuntu and similar; in Ubuntu I can't remember exactly, because I do most everything in the terminal these days). For that, it's a modification via NetworkManager:
nmcli connection modify <connection name> ipv4.never-default true
alexv what OAK device are you using?
Is there any way to do this? ImageManip is able to convert to YUV400, but it repeats the same values across 3 channels instead of squeezing them into 1 channel. I'm looking for throughput out of the device, so it doesn't help me to do it on a connected host. This is an OAK-D-POE-PRO with the Script node in standalone mode, with an additional TCP socket connection to a host.
Is there a specific reason why you're not using a VideoEncoder node? If you create that node and set it to MJPEG, you can stream YUV420 (either NV12 from the video output or yuv420p from the isp output), and that stream can be set to output to a container, which is just a file that continues to grow as you write to it. You then need to consume from it, either by stopping the pipe and converting to mp4/mkv, or streaming from it, or piping it to a neural network.
Of course, this is 12 bits per pixel instead of RGB's 24 bits per pixel, so if that extra bit depth is absolutely necessary for you, then you can still pipe out frames, but at a much lower rate, as you're probably experiencing.
FYI: I have performed detections on custom data on YUV420 and YUV400 frames (12 bits and 8 bits per pixel, respectively) that I would have bet wouldn't have been detected. And these were input images where some had the sun blooming the sensor, some had dust all over, some had terrible chroma noise at night because a lamp went out, and some were taken during rain.
I can confirm that I can still produce these artifacts in the current version of depthai. It is easiest to see this if you set up a "frame forwarding" application that uses the Script node to oscillate between an ImageManip node for YUV400 and an ImageManip node for YUV420, keeping everything else as close to identical as possible. To mitigate any smoothing effect my eyes/brain might do (because my old eyes often play tricks on me), I reduced FPS to 1 and used an even/odd modulo operation to determine which ImageManip to use.
If you do this, you will see that the offset shift does not appear to be linearly distributed throughout the image. Some objects in view are offset more, and some hardly seem to be offset at all. My colleague and I were looking at this a couple of weeks ago, and I can't recall if we determined that it was dark-edged objects that appeared to shift, or something else. At my next opportunity, I'll set up an MRE with frame forwarding.
Now, if you can handle some noise in your system and can manage your own QA of each frame, you can try transmitting over UDP instead of TCP. In that case, you can forget anything else I said about latency, because UDP requires NO acks at all. Jumbo frames FLY with UDP. But you have no way of knowing if your frames are going to come through or not. And if you start missing packet frames, your software may very well end up stitching packet frames from one image frame in with packet frames from another image frame. I know a little about this... see attached image, lol
And looking at your elapsed times for each frame (I should clarify that I'm referring to a "packet frame", not an image frame here... HUGE difference): when you're running with a 1500-byte packet size it's 49ms, but when you go up to 9K (actually 15K in your circumstance), you're seeing 325ms each. At least you're only trying to push 1200x800, right? I'm assuming 800 signifies 800p? That's 960k pixels, and if you're using the preview output for BGR pixel-formatted frames to use with imshow, for instance, then we're talking about 24 bits per pixel, so about 2.9MB per frame, x 3 cameras = 8.7-ish MB per frame set, x 20 fps = 174 MB/s one way. That's 1.4Gbps right there. If you don't have a 10GBaseT uplink connection to the host, there's another source of delay.
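Spelling that math out (assuming 24-bit BGR preview frames; adjust if you're sending something else):

width, height, bytes_per_pixel = 1200, 800, 3      # 24-bit BGR preview frames
cameras, fps = 3, 20
frame_bytes = width * height * bytes_per_pixel     # ~2.88 MB per frame
total_Bps = frame_bytes * cameras * fps            # ~173 MB/s one way
print(f"{frame_bytes / 1e6:.2f} MB/frame, {total_Bps / 1e6:.0f} MB/s, {total_Bps * 8 / 1e9:.2f} Gb/s")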
Now, it would definitely seem attractive to split this up into 14kB chunks rather than 2kB chunks, but the problem comes with the TCP protocol's need to strictly adhere to ACK requirements. To keep perspective on this, we're generally talking about between a quarter and half a second of delay for every misaligned ACK. There's simply no way to recover from that.
Take my situation, for instance. I'm streaming out full-frame images at 12.3MP, but I'm compressing to 12 bits per pixel (NV12), and then I'm going further and compressing the Y and U channels to be semi-planar for transport purposes, which brings that down to maybe around 9 bits per pixel on average (I shouldn't even quote that, because YUV throughput is heavily dependent on the brightness of the pixels in the image; at night, it's a rocket). But 12.3MP times 9 bits per pixel is about 110Mb per frame, or 13.9MB per frame. Now, if I'm using a switch with a 10Gb uplink and 1Gb camera ports, 10 fps would put me over the "theoretical" maximum, let alone the practical one. So I have to employ even further specialized compression techniques, or lower my framerate expectations, or lower my resolution expectations, or break out my wallet in a pretty substantial way.
Even if we were only talking about theoretical maximums, the physics doesn't work out. And even if your network were the perfectly ideal candidate for jumbo frames, it's still only about a 6% increase in throughput; but if your setup is even slightly less than ideal for the Nagle algorithm, the performance hit is unrecoverable. Let that sink in.
This is a good article that gives even more granularity on your packet overhead, what it's comprised of, and why it shows as 66 bytes in your Wireshark capture:
https://www.cablefree.net/wireless-technology/maximum-throughput-gigabit-ethernet/