This work aims to mathematically quantify the image quality and performance of different real-time upscaling algorithms.
I want to make it clear, though, that you shouldn't take the results as gospel. Mathematical image quality metrics do not always correlate perfectly with how humans perceive image quality, and your personal preference is entirely subjective. Take this page for what it is, research that produces numbers, but don't take these numbers at face value before understanding what they actually mean. In the context of image resampling, the primary problem is usually sharpness, as reconstructing fine details accurately isn't a trivial task. For that matter, a sharper resampling filter will generally be numerically closer to the reference, even taking the usual side effects into consideration (ringing, blocking, aliasing, etc).
Being "numerically closer" to the reference does not always mean that it'll look closer to the reference from our perspective, and that's exactly why "image quality metrics" is still a research field to this day. The "state of the art" has even abandoned classic distortion-based metrics in favour of complex perception-based metrics in recent years, which are usually based on CNNs for image classification (computing the distance between the activations). I include LPIPS as a perceptual metric here just to make sure the results aren't too biased towards distortion, but keep in mind that these metrics were not originally designed to be measuring resampling quality.
If all you want is to look at the results, follow this link.
Like its name implies, upscaling is simply the act of increasing the scale of something. In the digital signal processing case, it's equivalent to increasing the number of samples for a given signal (which is usually called "resampling").
The easiest and most classic way of doing this is through simple linear interpolation: if you want to find a value between two points, you can simply draw a line between them. Linear interpolation can be done in a plane, along both axes, creating what we call "bilinear" interpolation. Bilinear interpolation is the simplest interpolation algorithm, the easiest to compute and probably the most widespread one.
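For the curious, the whole idea fits in a few lines of numpy. This is only a sketch (the names and the border clamping are mine, not any player's actual implementation):

```python
import numpy as np

def lerp(a, b, t):
    # Draw a line between a and b and sample it at t in [0, 1].
    return a + (b - a) * t

def bilinear_sample(img, x, y):
    # Sample a greyscale image at the fractional coordinates (x, y) by
    # interpolating along one axis first and then along the other.
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, img.shape[1] - 1), min(y0 + 1, img.shape[0] - 1)
    tx, ty = x - x0, y - y0
    top = lerp(img[y0, x0], img[y0, x1], tx)
    bottom = lerp(img[y1, x0], img[y1, x1], tx)
    return lerp(top, bottom, ty)
```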
But can something as simple as drawing a line between 2 points give us good results? Sometimes it does, sometimes it doesn't; it really depends on the signal. Instead of taking 2 points and drawing a line, we could take more than 2 points and draw a curve. The shape of the curve depends on the weights used in the calculation, and these weights depend on the chosen filter. The number of points in the calculation depends on the length/radius/support of the filter. If you want to understand how this is actually done, I suggest simply reading this explanation.
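Here's a hedged sketch of what that looks like in practice for a 1D signal, using the Lanczos kernel as the example (the weight normalisation and border clamping are common choices, not the only ones):

```python
import numpy as np

def lanczos_kernel(x, a=3):
    # Lanczos weights: sinc(x) * sinc(x/a) within the support |x| < a.
    return np.where(np.abs(x) < a, np.sinc(x) * np.sinc(x / a), 0.0)

def upscale_1d(samples, factor, a=3):
    # For each output position, weight the 2*a surrounding input samples
    # with the kernel and sum them up.
    samples = np.asarray(samples, dtype=np.float64)
    out = np.empty(int(len(samples) * factor))
    for i in range(len(out)):
        center = i / factor                       # position in input space
        taps = np.arange(np.floor(center) - a + 1, np.floor(center) + a + 1)
        weights = lanczos_kernel(taps - center, a)
        idx = np.clip(taps.astype(int), 0, len(samples) - 1)  # clamp borders
        out[i] = np.dot(samples[idx], weights / weights.sum())
    return out
```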
If you're interested in this topic I highly recommend reading my blog post about ML upscaling. There are multiple different machine learning based image upscalers, we'll stick to those that can be easily used as mpv shaders (RAVU, FSRCNNX and Anime4K).
We'll test the following shaders alongside the built-in filters found in mpv:
Quality measurements were done with an anime test image. Debanding was turned off to prevent loss of fine detail.
In order to measure how "good" each algorithm is, the test images were first converted to greyscale with ImageMagick:
magick convert image.png -colorspace gray image-gray.png
They were then downscaled to half their original dimensions (0.5x scaling factor, a quarter of the pixel count) using the Catmull-Rom filter:
magick convert image-gray.png -filter catrom -resize 50% downscaled.png
The images were then upscaled back by mpv:
mpv --no-config --profile=gpu-hq --deband=no --no-hidpi-window-scale --window-scale=2.0 --pause=yes --screenshot-format=png --no-scaler-resizes-only --scale=filter --glsl-shader=shader downscaled.png
Anime4K has a modular design and you can find instructions here. I used the "Remain as faithful to the original while enhancing details" configuration.
Since mpv writes PNGs in the rgb24 format, all screenshots were converted to greyscale with ImageMagick as described above. This doesn't change how they're shown on screen, it simply reduces the number of channels from 3 to 1.
Performance was evaluated simply in frames per second, while quality was evaluated in PSNR, SSIM, MS-SSIM, IW-SSIM, PSNR-HMA, VIF and LPIPS.
All metrics were computed using their reference implementations. PSNR-HMA has its wstep set to 7.
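PSNR in particular is simple enough to sketch in a few lines of numpy. The snippet below is only an illustration of what the metric computes (mean squared error against the reference, on a logarithmic scale), not the reference implementation used for the tables:

```python
import numpy as np

def psnr(reference, distorted, max_value=255.0):
    # PSNR = 10 * log10(MAX^2 / MSE), in decibels. Higher is better,
    # and two identical images would score infinity.
    ref = np.asarray(reference, dtype=np.float64)
    dis = np.asarray(distorted, dtype=np.float64)
    mse = np.mean((ref - dis) ** 2)
    return float("inf") if mse == 0.0 else 10.0 * np.log10(max_value**2 / mse)
```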
This entire process should be easily reproducible.
The following test image was used for the line art upscaling tests:
Before we continue to the results, you can find all images in this repository, including the greyscale variant and its downscaled counterpart.
We can see the results below:
We can clearly see that in general, machine learning based scalers tend to give better quality, though as expected they're also more computationally expensive. We can see that FSRCNNX is the highest scoring mpv scaler, followed by Anime4K and then RAVU. FSR seems particularly bad in the distortion metrics, but it scores well enough in LPIPS.
Among the simple scaling filters, Lanczos seems to be the best-scoring orthogonal filter, while EWA_Robidouxsharp and EWA_Lanczossharp are the best-scoring elliptical ones. I personally like EWA_Lanczossharp way more than EWA_Robidouxsharp though, as it seems to be able to draw diagonal lines much better, with far less aliasing (jagged edges). EWA_Lanczossharp is also better than orthogonal Lanczos at this, by the way, but Lanczos is sharper, and this probably causes it to score higher on most metrics (since the main problem with classic resampling is blurriness).
The following image shows the result of a survey conducted by Don P. Mitchell and Arun N. Netravali together with 9 other digital image processing experts, with the intention of classifying bicubic filters via their B and C parameters.
The variant known as "Mitchell-Netravali" nowadays is the filter you get with B=C=1/3, and it is mpv's default downscaler with profile=gpu-hq.
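For reference, the entire family can be evaluated with a small function like the one sketched below. The piecewise polynomial is the standard Mitchell-Netravali formulation; the function name is mine, and this isn't mpv's actual code:

```python
def bc_spline(x, b=1/3, c=1/3):
    # Mitchell-Netravali BC-spline family. B = C = 1/3 gives "Mitchell",
    # B = 0, C = 0.5 gives Catmull-Rom, B = 1, C = 0 gives the cubic B-spline.
    x = abs(x)
    if x < 1:
        return ((12 - 9*b - 6*c) * x**3 + (-18 + 12*b + 6*c) * x**2 + (6 - 2*b)) / 6
    if x < 2:
        return ((-b - 6*c) * x**3 + (6*b + 30*c) * x**2 + (-12*b - 48*c) * x + (8*b + 24*c)) / 6
    return 0.0
```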
I didn't exactly expect the "numerically better" filters to actually match subjective choices, but ImageMagick's default orthogonal resampling filter is Lanczos for images without transparency. Likewise, ImageMagick uses EWA_Robidoux by default for distortion operations.
Performance tests were done with a 720p 4:2:0 8-bit H.264/AVC input video:
mpv --no-config --profile=gpu-hq --vo=gpu-next --gpu-api=vulkan --audio=no --untimed=yes --video-sync=display-desync --vulkan-swap-mode=immediate --no-hidpi-window-scale --window-scale=2.0 --scale=filter --glsl-shader=shader input.mkv
I've decided to go ahead and benchmark using libplacebo since it provides better performance and because I believe it'll eventually replace vo=gpu.
We can take a few things from this: orthogonal resampling is still faster than elliptical resampling by a significant margin. RAVU, FSR and the small variant of Anime4K are extremely fast, beating or matching ewa_lanczossharp. FSRCNNX_16 and the biggest variant of Anime4K are still reasonably easy to run as long as you have a semi-decent discrete GPU. I benchmarked using an RX 470, which is far from being "fast" by today's standards.
When it comes to the built-in filters, you can expect polar resampling to be slower than orthogonal resampling, and you can also expect larger filters to be slower.
Anime productions are still done at arbitrary resolutions between 1280x720 and 1920x1080 (with most below 1600x900, check Anibin for details), and most users are still watching them on FHD 1920x1080 displays. This means the most common scenario in which the player performs any luma upscaling (or RGB upscaling, if you want to nitpick, since mpv merges luma and chroma right after doubling chroma) is arguably 720p->1080p, which corresponds to a 1.5x scaling factor. With the recent addition of ravu-zoom to our ever-growing arsenal of user shaders, and the need to find out which dscales work better alongside the doublers, adding a new series of measurements seems reasonable.
The methodology is straightforward. First, violet-gray.png was downscaled to 1280x720 with ImageMagick:
magick convert violet-gray.png -filter catrom -resize 1280x720 downscaled.png
This is the equivalent of downscaling with mpv using --dscale=catmull_rom.
The downscaled image was then brought back up to 1920x1080 using the various shaders. Since the doublers first take the image to 2560x1440, the subsequent downscale back to 1920x1080 was performed with mpv's dscale option.
The following tables display the results. The filter specified after the doubler is the downscaling filter used from 2560x1440 to 1920x1080.
You can find all the images in this repository.
Well, we can see that VIF clearly likes RAVU more than the other metrics do. We can also see that the mitchell filter scores poorly when compared to lanczos or catrom regardless of the doubler, and this can easily be attributed to the fact that it is significantly blurrier than the other 2. Lanczos is the sharpest downscaling filter among the 3 sane options tested here, and at the 0.75x scaling factor (1440p->1080p) its pronounced ringing artifacts are not "bad" enough to make it rank lower than catrom. Catrom is almost as sharp but introduces less ringing to the output.
Overall, the tables tell a story that clearly puts FSRCNNX on top, more specifically the LineArt variant. RAVU-zoom trades blows with RAVU-lite depending on the metric, and SSSR also shows up as a good 3rd-place contender. I didn't feel the need to include all the simple filters; Lanczos scored the best in the 2x upscaling test, so it is the only one showing up again here.
The sRGB colour space does not have a linear gamma curve; it's actually (approximately) a power 2.2 function, which compensates for how the human visual system works by giving more quantisation steps to the darker tones. This means scaling in sRGB gamma "blends" perceptual values instead of doing the arithmetic in linear light and then converting back. This is not always bad, but it can create accentuated ringing artifacts towards the brighter tones. Doing it in linear light accentuates dark ringing artifacts instead, which is the exact opposite and usually looks even worse in my opinion.
Taking this into consideration, the gimmicky yet clever technique of scaling in sigmoidal light was "developed". Sigmoidal light treats dark and bright overshoots the same, as the quantisation precision is equal at both extremes.
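To make the idea concrete, here's a rough numpy sketch of the round trip. The curve has the same shape as ImageMagick's ±sigmoidal-contrast, the function names are mine, and the 6.5/0.5 defaults mirror the 6.5,50% used in the commands further below:

```python
import numpy as np

def desigmoidize(u, slope=6.5, center=0.5):
    # S-shaped contrast curve mapping [0, 1] onto [0, 1]; its flat ends
    # squash overshoots, which is what dampens the ringing.
    def g(v):
        return 1.0 / (1.0 + np.exp(slope * (center - v)))
    return (g(u) - g(0.0)) / (g(1.0) - g(0.0))

def sigmoidize(u, slope=6.5, center=0.5):
    # Exact inverse of the curve above: pulls the extremes towards the
    # middle before resampling, leaving headroom for the overshoots.
    g0 = 1.0 / (1.0 + np.exp(slope * center))
    g1 = 1.0 / (1.0 + np.exp(slope * (center - 1.0)))
    v = g0 + u * (g1 - g0)
    return center - np.log(1.0 / v - 1.0) / slope

# Conceptual pipeline, operating on linear-light values in [0, 1]:
# upscaled = desigmoidize(resample(sigmoidize(linear_image)))
```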
The following test compares linear light, sRGB light and sigmoidal light in a simple 2x orthogonal upscale with the Lanczos filter.
A new test image, Kanao, was used. It was downscaled to 960x540 so we can upscale it back with a 2x scaling factor. The box filter was used to avoid creating pixels outside the original range and to avoid ringing artifacts in the downscaling process.
The sRGB light upscaling was done with:
magick convert kanao_box.png -filter lanczos -resize 200% kanao_srgb.png
The linear light upscaling was done with:
magick convert kanao_box.png -colorspace RGB -filter lanczos -resize 200% -colorspace sRGB kanao_linear.png
The sigmoidal light upscaling was done with:
magick convert kanao_box.png -colorspace RGB +sigmoidal-contrast 6.5,50% -filter lanczos -resize 200% -sigmoidal-contrast 6.5,50% -colorspace sRGB kanao_sigmoid.png
You can find all the images in this repository.
As we can see, all three relevant metrics seem to agree that sigmoidal light is better. The gpu-hq profile comes with sigmoid-upscaling=yes, so mpv already does this by default with that profile.
The following image shows the difference (images have been scaled up with nearest neighbour for easier visualisation, open in a new tab at 100% scaling to see it).
Chroma subsampling is a technique utilised to save bitrate/filesize without strongly sacrificing perceived quality, by taking into account how our eyes biologically work and how our brains interpret the information they're receiving.
With the rise of television broadcasting, the recurring problem of bandwidth-heavy innovations rapidly making the electromagnetic spectrum "crowded" started to become apparent. The practical bands with low long-range attenuation were becoming scarce, and that would undoubtedly limit the amount of things we could have "over the air". This led the way to new bandwidth-saving techniques that aimed to reduce the required bandwidth as much as reasonably possible without affecting the perceived quality of the service to the same extent.
Since we're more likely to perceive contrast/luminosity differences than chromatic details, chroma subsampling is simply the most basic way of throwing away some colouring information without sacrificing luminosity resolution.
To make this simple to understand, we can see a normal image below alongside its RGB planes:
Now, the same image alongside its YUV/YCbCr planes:
We can easily simulate what the image would look like without its full chromatic information by simply downscaling the chroma planes to 1/4 of their original resolution (0.5x scaling factor) and then upscaling them back (2x scaling factor).
It's important to see how much "blurrier" the chromatic planes look after this process, in other words, how much high frequency information they lost.
Can't notice the difference when you put it back to RGB? Well, that's the point. In any case, to make it possible for us to see, you can look at the image below, which is simply 128 plus the difference between the reference and the chroma subsampled version:
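If you'd like to reproduce this simulation yourself, a possible Pillow sketch follows. The filenames are hypothetical, and the choice of bicubic resampling for the chroma planes is mine:

```python
import numpy as np
from PIL import Image

# Hypothetical input filename.
ref = Image.open("reference.png").convert("YCbCr")
y, cb, cr = ref.split()

# Downscale both chroma planes to half their width/height (a quarter of
# the pixels) and bring them back up, simulating 4:2:0 subsampling.
half = (cb.width // 2, cb.height // 2)
cb_sub = cb.resize(half, Image.BICUBIC).resize(cb.size, Image.BICUBIC)
cr_sub = cr.resize(half, Image.BICUBIC).resize(cr.size, Image.BICUBIC)

subsampled = Image.merge("YCbCr", (y, cb_sub, cr_sub)).convert("RGB")
subsampled.save("subsampled.png")

# The visualisation above: 128 plus the difference between the reference
# and the chroma-subsampled version, clipped to the valid range.
a = np.asarray(ref.convert("RGB"), dtype=np.int16)
b = np.asarray(subsampled, dtype=np.int16)
diff = np.clip(128 + (a - b), 0, 255).astype(np.uint8)
Image.fromarray(diff).save("difference.png")
```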
From this simulation it's probably reasonable to understand why chroma upscaling isn't usually a big concern: you're simply unlikely to notice the difference on real content.
With that in mind, we can proceed with the actual comparison. SSIM measurements can be ignored this time since the luma plane remains intact and therefore there are almost no significant gains to be made in structural similarity from switching the scaler.
Rosetta served as a somewhat unfaithful representation of what you should expect from anime, since, generally speaking, actual anime footage is far from being as detailed, and therefore the scalers would score even closer overall. We can proceed with a live action example.
It's important to notice how the numerical difference between the scalers is much smaller when doing chroma measurements, and this is precisely why using a heavier scaler is usually not warranted unless the performance impact is negligible.
On my system, running KrigBilateral on Vulkan makes no difference whatsoever since I have a CPU decoding bottleneck at around ~1200 FPS with or without it on a 720p 8bit AVC file. Running on ra_d3d11 with d3d11va hardware decoding, I go from ~1800 FPS to ~1150 FPS going from lanczos to KrigBilateral. Running both FSRCNNX-8-0-4-1 and KrigBilateral at the same time gives me ~260 FPS on both d3d11 and Vulkan, with or without hwdec.
My personal recommendation is that you should test it yourself to see if you can notice KrigBilateral's difference during playback, and use it if it doesn't really hurt resource-wise (i.e., if your GPU handles it fine either way). Another sensible decision would be to use it when you don't need luma upscaling, since then the only gains to be made are in chroma.
Well, we've been talking about upscaling up to now, but is there any difference when it comes to downscaling? Intrinsically, downscaling images should be an easier task, considering all we need to do is draw the same curves with fewer points. A normal person would find it much harder to differentiate downscalers; all reasonable filters are usually good enough, and it becomes a subjective problem of avoiding the artifacts you dislike the most. Different filters will produce different amounts of ringing, blocking and aliasing. Some prefer smooth images, others prefer sharp images... The problem here is that we can't compare the results to a reference and calculate which filter managed to get closest to the ground truth like we did before for upscaling.
I chose to use Catmull-Rom for all my downscales up to this point, and some of you might be asking yourselves why. Well, Catmull-Rom is the sharpest BC-spline that satisfies the B + 2C = 1 recommendation by Don P. Mitchell and Arun N. Netravali. Catrom can be seen as a sharper version of Mitchell, which is mpv's default downscaler with profile=gpu-hq.
Using Catrom in the upscaling tests was just my (reasonably arbitrary) personal preference at the time; Mitchell produces even fewer artifacts, but the end result is noticeably blurrier. This Stackoverflow answer from 2008 indicates some sort of survey was conducted within Hollywood, and they ended up choosing windowed sincs for downscaling and Mitchell for upscaling.
For more detailed recommendations, please read this special section of the ImageMagick documentation written by Nicolas Robidoux.
While downscaling evaluation is more of a qualitative thing, MATLAB comes with 3 no-reference quality metrics: BRISQUE (Blind/Referenceless Image Spatial Quality Evaluator), NIQE (Natural Image Quality Evaluator) and PIQE (Perception based Image Quality Evaluator). BRISQUE correlates well with human perception because it's trained with an SVM (Support Vector Machine) on human subjective scores. NIQE and PIQE, however, are opinion-unaware and do not depend on human subjectivity.
NIQE learns its model from a corpus of pristine images, while PIQE does not need any training. Let's drop PIQE and focus on the other 2. Since humans usually like sharp images, I expect BRISQUE to rank the scalers almost solely on sharpness. NIQE should theoretically be more objective and actually rank them from "less distorted" to "more distorted", provided it gets trained properly with pristine anime screenshots.
The problem with this approach is that we're assuming anime screenshots are pristine, and in fact they're not. Anime is usually delivered with 4:2:0 chroma-subsampling and a lot of lossy compression.
With all that in mind we can still try to evaluate this mathematically, despite the shortcomings.
I'll be using Chitanda for downscaling. This keyframe represents anime content very well: you can see it's not as sharp as fanart (likely because it's an upscale from some arbitrary resolution between 1280x720 and 1920x1080), but it's not overly blurry either.
The following image was downscaled using all the available filters in mpv:
As always, you can find all the resulting images in this repository.
We can see the results below:
As expected, BRISQUE really did rank them almost perfectly from sharpest to blurriest. The trained NIQE also put SSimDownscaler at the top, but the filters that follow aren't ranked solely from sharpest to blurriest.
Out of curiosity, I decided to calculate the structural similarity between the outputs of SSimDownscaler, Lanczos and Catmull-Rom.
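A possible way of doing this with scikit-image is sketched below (the filenames are hypothetical placeholders for the three outputs):

```python
from itertools import combinations
from skimage.io import imread
from skimage.metrics import structural_similarity

# Hypothetical filenames for the three downscaled outputs.
outputs = ["ssimdownscaler.png", "lanczos.png", "catrom.png"]

for a, b in combinations(outputs, 2):
    # as_gray=True returns floats in [0, 1], hence data_range=1.0.
    score = structural_similarity(imread(a, as_gray=True),
                                  imread(b, as_gray=True),
                                  data_range=1.0)
    print(f"{a} vs {b}: SSIM = {score:.4f}")
```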
My original idea was to use Waifu2x's training dataset to train NIQE, but I found out it isn't public. Then I figured I could simply use my anime screenshots folder (which had 162 screenshots from numerous different shows). The resulting model not only gives all the scalers better results, it also correlates better with the BRISQUE results. I should probably train NIQE with pristine fanarts to make it more accurate.
Remember that Lanczos scored slightly better than Catrom in the NIQE chart from the fractional upscaling tests when both were coupled with a doubler (FSRCNNX or RAVU); the scaling factor there was 0.75x, while it is 0.5x here. The further the scaling factor gets from 1x (in either direction), the more visible the artifacts a filter produces become.
The following image displays the difference between the Mitchell, Catrom and Lanczos filters (open it in a new tab at 100% scaling to see the ringing difference more easily):
I should probably repeat this for an image with more high-frequency components like Rosetta, but for now Chitanda will suffice. It's important to stress that the optimal filter depends heavily on several factors, such as the image itself and the scaling factor. To make any reasonably accurate claim I'd need to repeat this experiment on several images and average their results, which I probably won't do anytime soon considering the differences between the filters are honestly minimal.
My personal recommendation is: stick to Mitchell as the default if you really dislike ringing, switch to Lanczos if you want more perceived sharpness, and if you don't want either extreme, Catrom might be a decent compromise. Sigmoidal downscaling could potentially dampen the ringing, but as far as I'm aware mpv only supports sigmoidal light upscaling.
27/05/2023: Adding a disclaimer.
20/05/2023: Minor updates because I wasn't very happy with the text. Violet's average chart also updated to be done using normalised metrics (it used standardised metrics previously). I think this makes more sense because it treats all scores equally. Previously we'd have outliers carrying the average for some filters (like EWA_Robidouxsharp).
26/03/2022: Adding a new performance chart.
22/03/2022: Replacing the geometric mean with the normalised average of the standardised metrics. The geometric mean is great at averaging different metrics when they don't have similar ranges, as a given percentage improvement in any metric warrants an identical improvement in the final number. However, a 20% improvement in PSNR (as an example) is not the same as a 20% improvement in SSIM. Standardising the scores (mean = 0 and standard deviation = 1) before averaging them puts all metrics on the same playing field. The final numbers were normalised into [0, 1] to prevent the table from having negative numbers (due to the standardisation, badly performing scalers end up with negative scores).
21/03/2022: Replacing NIQE with LPIPS in the upscaling test, LPIPS is a perception-based full-reference image quality metric. Gaussian75 was removed.
20/03/2022: Adding FSR, CAS and NVScaler. Fractional comparisons still missing (and I'm not exactly sure if I care enough to add them). I'm also hiding the live-action results from the page since these haven't been updated in ages.
23/05/2021: Adding Anime4K. Fractional comparisons still missing.
01/05/2021: Adding sigmoidal light and linear light comparisons.
18/03/2021: Updating FSRCNNX-8-0-4-1 to the 05/03/2021 version.
17/01/2021: Adding geometric mean for Violet.
06/12/2020: Downscaling outside linear light again for the upscaling tests.
01/12/2020: Fractional upscaling section updated with the newest versions of all the prescalers and more downscaling options. Some qualitative comparisons added to the downscaling section.
30/11/2020: All shaders updated to their current versions in the Violet upscaling tests. Waifu2x and NGU dropped: comparing against Waifu2x is unfair due to the time it takes to actually compute a result, and I simply don't care about MadVR. Dropping these allows me to greatly simplify the table-generating process, and I don't intend to add them back. NNEDI3 also dropped due to poor performance. PSNR-HA dropped due to being redundant alongside PSNR-HMA. VIF, MS-SSIM and NIQE added as new metrics. I still need to update Shuri, and I'll likely do it soon. But for now, all old Violet tables can be found here.
28/11/2020: Improving scaling explanations considerably. Adding a new and improved downscaling section with no reference image quality metrics and a custom model fit for anime content.
26/11/2020: Changelog moved to the end of the page. Mitchell survey and some clarifications regarding downscaling added.
11/05/2019: KrigBilateral updated.
22/04/2019: FSRCNNX updated, removing the downscaling section for now since people aren't exactly going out of their way to fix shitty upscales and there's already some information about it on the fractional tables.
21/04/2019: Preliminary fractional upscaling evaluations added, focusing on the most relevant algorithms first.
18/04/2019: Ravu-zoom-r4 added, r2 and r3 were updated.
17/04/2019: PSNR-HA/HMA measurements are now done with wstep = 7 as recommended by igv.
16/04/2019: New FPS measurement methodology: files now play for 4 seconds and I calculate how many frames were displayed based on which frame playback stopped at after that time-frame elapsed. --scaler-resizes-only was also removed since the half-pixel shift is not noticeable during playback. Previous results had different numbers, but the order didn't significantly change.
06/04/2019: Tables updated with newer versions of FSRCNNX again.
31/03/2019: MS-SSIM tables were removed. All remaining luma related tables were updated. New methodology consists of basically utilising a full-range greyscale PNG as input (which results in lower quantisation errors when compared to using limited range YCbCr). Ravu-zoom was added, FSRCNNX_x2_16-0-4-1 and Ravu-lite were updated.
27/03/2019: Updated PSNR-HA/HMA, you can expect to see ravu-ar and ravu-zoom soon. PSNR-HVS-M was removed since it's superseded by PSNR-HA.
15/03/2019: New versions of FSRCNNX.
08/03/2019: Adding another test image for chroma, and changed all chroma tests to upscale from a yuv4mpegpipe y4m file, which makes me more confident they're accurate.
07/03/2019: Adding Ravu-lite and fixing NGU-AA's half-pixel shift. Huge thanks to Bjin for providing a user shader to accomplish the latter =). FPS chart should be updated soon.
03/03/2019: Ravu-chroma and Ewa_Robidoux/Ewa_Robidouxsharp added.
27/02/2019: PSNR-HMA and PSNR-HA tables added.
25/02/2019: Initial version of the Chroma section is added. I'll take some time to slowly polish it with more information and different tests, and I'm also probably going to add PSNR-HMA to the RGB tests soon.
20/02/2019: IW-SSIM tables added.
19/02/2019: MS-SSIM tables fixed, code had issues with RGB -> greyscale conversion.
18/02/2019: Lanczos added, and tables now display results from best to worst.
16/02/2019: Other EWA scalers were added.
22/12/2018: I found a mistake in previous results and have once again updated all quality comparisons. I'm not certain what was causing the problems, but turning off PNG filtering and debanding seems to have fixed it; I'll investigate which of the two was the culprit once I have the time. Since I had to redo everything anyway, I took the opportunity to improve the testing methodology: debanding was turned off to prevent loss of fine detail, and I used a lossless YUV444p AVC encode as the upscaling source instead of a YUV444p JPEG.
18/12/2018: Results have been updated with latest versions of FSRCNNX and Waifu2x. On top of that, I've included MS-SSIM and PSNR-HVS-M measurements. I've also included all NGU algorithms from MadVR since people are interested in them.