This work aims to mathematically quantify image quality and performance of different real-time upscaling algorithms.
If all you want is to look at the results, follow this link.
As its name implies, upscaling is simply the act of increasing the scale of something. In digital signal processing, that means taking discrete information at a given scale and calculating points between the existing samples. The easiest and most classic way of doing this is simple linear interpolation: if you want to find a value between two points, you can draw a line between them and read any value off that line. If you wanted the value placed exactly in the middle of those 2 discrete points, you could simply take the arithmetic mean of their values.
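To make the idea concrete, here is a minimal Python sketch of 1D linear interpolation. The function names `lerp` and `upscale_linear` are mine, and the sample alignment (new points placed evenly between existing samples, endpoints kept) is an assumption chosen for simplicity; real image scalers usually align on pixel centres instead.

```python
def lerp(a: float, b: float, t: float) -> float:
    """Linearly interpolate between samples a and b; t=0 gives a, t=1 gives b."""
    return a + (b - a) * t


def upscale_linear(samples: list[float], factor: int) -> list[float]:
    """Insert (factor - 1) interpolated values between each pair of neighbours."""
    out = []
    for left, right in zip(samples, samples[1:]):
        for step in range(factor):
            out.append(lerp(left, right, step / factor))
    out.append(samples[-1])  # keep the final original sample
    return out


print(lerp(10.0, 20.0, 0.5))                # midpoint is the arithmetic mean: 15.0
print(upscale_linear([0.0, 10.0, 4.0], 2))  # [0.0, 5.0, 10.0, 7.0, 4.0]
```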
Linear interpolation can be done in a plane, along both axes, creating what we call "bilinear" interpolation. Bilinear interpolation is the simplest interpolation algorithm, the easiest to calculate and unironically the most widespread one because it's extremely simple to implement.
But can something as simple as just drawing a line between 2 known discrete points give us good results? It depends entirely on what the original information looked like. We can, however, increase the complexity of our upscaling algorithm by increasing the amount of information it takes into consideration to find values between the original discrete points. In this page we'll evaluate what we can do differently, how much better our results can get and how it affects our performance.
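As a rough sketch of what that means in practice, bilinear sampling is just two horizontal linear interpolations followed by a vertical one. The `bilinear` helper below is hypothetical, and clamping at the borders is my own choice rather than anything a particular player does.

```python
def bilinear(img, x, y):
    """Sample a 2D grid (list of rows) at fractional coordinates (x, y)."""
    h, w = len(img), len(img[0])
    x0, y0 = int(x), int(y)  # top-left neighbour (x and y assumed non-negative)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)  # clamp at the borders
    fx, fy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows...
    top = img[y0][x0] + (img[y0][x1] - img[y0][x0]) * fx
    bottom = img[y1][x0] + (img[y1][x1] - img[y1][x0]) * fx
    # ...then along y between those two intermediate values.
    return top + (bottom - top) * fy


grid = [[0.0, 10.0],
        [20.0, 30.0]]
print(bilinear(grid, 0.5, 0.5))  # centre of the four samples: 15.0
```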
A relatively simple way of taking more information into consideration is fitting a curve to more than 2 discrete points. If we connect 2 points directly and only have their values to work with, our end result is always going to be a straight line connecting both, but if we can also look at the other neighbouring elements we might be able to fit actual curves and therefore calculate the in-between points more accurately. This is implemented through convolution, using a specific kernel that represents a corresponding filter.
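A small illustration of why extra neighbours help: the sketch below evaluates a Catmull-Rom cubic through 4 samples and compares it with the straight line between the middle two. The samples are taken from y = x² at x = -1, 0, 1, 2, which is purely my choice of test data; the `catmull_rom` name is also just illustrative.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Cubic through p1 (t=0) and p2 (t=1), shaped by the outer neighbours."""
    return 0.5 * (
        2 * p1
        + (-p0 + p2) * t
        + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
        + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3
    )


samples = [1.0, 0.0, 1.0, 4.0]           # y = x^2 sampled at x = -1, 0, 1, 2
linear = (samples[1] + samples[2]) / 2   # only uses the two middle points
cubic = catmull_rom(*samples, 0.5)       # uses all four points
print(linear, cubic)  # 0.5 0.25 (the true value of x^2 at x = 0.5 is 0.25)
```

Here the line misses the real curve by a factor of two while the cubic lands on it, which is the whole argument for wider kernels in a nutshell.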
The math behind this comes from digital signal processing: to perfectly recover the analogue counterpart of a digitised signal you need an ideal low-pass filter of gain Ts whose cutoff frequency lies between the signal's bandwidth B and fs - B, typically placed at the Nyquist frequency (half the sampling frequency fs). An ideal low-pass filter looks like a literal rectangle in the frequency domain, which translates into a sinc in the time domain. Multiplying signals in the frequency domain is equivalent to convolving them in the time domain, and we usually choose to use convolutions because, for small kernels, they're faster than calculating the Discrete Fourier Transform of the signal.
With that in mind, when it comes to images we don't have a "time" axis, but rather simply x and y. Everything else works the same, though. Since the "ideal" reconstruction filter would require us to perform a convolution using an infinite sinc (which means taking into consideration every single point in the image, and others that don't even exist), several approximations and windows have been proposed. A window is nothing more than a way of limiting how big the kernel is, and since the kernel itself can't be a perfect infinite sinc, tweaks have to be made in order to achieve good results. The Lanczos filter is a popular approximation of the sinc function: it is simply the sinc function windowed by the central lobe of a larger sinc function.
For a more in-depth explanation of scaling filters, you can check ImageMagick's documentation.
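The Lanczos definition above fits in a few lines of Python. This is only a 1D sketch under my own assumptions (3 lobes, edge samples repeated at the borders, weights normalised to sum to 1); the names `sinc`, `lanczos` and `resample_at` are illustrative and this is not how mpv or ImageMagick actually structure their code.

```python
import math


def sinc(x):
    """Normalised sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    if x == 0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)


def lanczos(x, a=3):
    """sinc(x) windowed by the central lobe of the wider sinc(x / a)."""
    if abs(x) >= a:
        return 0.0
    return sinc(x) * sinc(x / a)


def resample_at(samples, pos, a=3):
    """Evaluate the signal at fractional index pos by convolving with the kernel."""
    total = weight_sum = 0.0
    for i in range(math.floor(pos) - a + 1, math.floor(pos) + a + 1):
        w = lanczos(pos - i, a)
        clamped = min(max(i, 0), len(samples) - 1)  # repeat the edge samples
        total += w * samples[clamped]
        weight_sum += w
    return total / weight_sum  # normalise so flat areas stay flat


step = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
print(resample_at(step, 2.5))  # halfway up the edge
print(resample_at(step, 1.5))  # slightly negative: an undershoot, i.e. ringing
```

The negative lobes of the kernel are what make it sharper than a linear filter, and they are also where the ringing next to hard edges comes from.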
Machine learning comes as a powerful successor to classical methods due to the fact that you do not need to rely solely on what you can currently retrieve from the image, but also on a set of trained parameters generated from analysing multiple images and "learning" what it should do to go from "low resolution" to "high resolution".
If you're interested in this topic I highly recommend reading my blog post about ML upscaling.
There are multiple different machine learning based image upscalers; we'll stick to those that are easily testable as mpv user shaders.
We'll test the following shaders:
Performance measurements were done by upscaling an animation video encoded as 8-bit AVC from 1280x720 to 2560x1440, using mpv with a benchmarking profile: all settings from profile=gpu-hq, with Vulkan as the renderer and hardware decoding turned off. The machine used has an Ivy Bridge i5 3470 and a Polaris RX 470, and user shaders were always compute shaders when possible. The methodology consists of letting mpv play the file for 4 seconds and then seeing at which frame it stopped.
Quality measurements were done on 2 test images, one from animation and another one from live action. Debanding was turned off to prevent loss of fine detail.
The reasoning behind this comes from the fact that the live action image has more high frequency components that are harder to "restore".
In order to measure how "good" each algorithm is, the test images were first converted to grayscale with ImageMagick:
magick convert image.png -colorspace gray image-gray.png
They were then downscaled to a quarter of their pixel count (0.5x scaling factor on each axis) using the Catmull Rom filter:
magick convert image-gray.png -filter catrom -resize 50% downscaled.png
The images were then upscaled back by mpv:
mpv --no-config --profile=gpu-hq --deband=no --no-hidpi-window-scale --window-scale=2.0 --pause=yes --screenshot-format=png --no-scaler-resizes-only --scale=filter --glsl-shader=shader downscaled.png
Anime4K has a modular design and you can find instructions here. I used the "Remain as faithful to the original while enhancing details" configuration.
Since mpv writes PNGs in the rgb24 format, all screenshots were converted to greyscale using ImageMagick as described above. This doesn't change how they're shown on screen, but changes them from 1920x1080x3 to 1920x1080x1.
Performance was evaluated simply in frames per second, while quality was evaluated in PSNR, SSIM, MS-SSIM, IW-SSIM, PSNR-HMA, VIF and NIQE.
With the exception of NIQE, all metrics are full-reference metrics and they compare the "distorted" image to what we call the "ground truth". For more information about NIQE please refer to the downscaling section.
All metrics were calculated in MATLAB using their reference implementations. PSNR-HMA has its wstep set to 7.
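For anyone unfamiliar with full-reference metrics, PSNR is the simplest one on the list and shows the general idea: compare every pixel of the upscaled image against the ground truth and condense the error into one number. The Python below is only an illustrative sketch for 8-bit greyscale images stored as lists of rows; it is not the MATLAB reference implementation used for the tables, and the `psnr` function name is mine.

```python
import math


def psnr(reference, distorted, peak=255.0):
    """PSNR in dB between two equally sized greyscale images (lists of rows)."""
    squared_error = 0.0
    count = 0
    for ref_row, dist_row in zip(reference, distorted):
        for r, d in zip(ref_row, dist_row):
            squared_error += (r - d) ** 2
            count += 1
    mse = squared_error / count  # mean squared error over all pixels
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


truth = [[0, 255], [128, 64]]
off_by_one = [[1, 254], [129, 63]]
print(round(psnr(truth, off_by_one), 2))  # MSE of 1 on an 8-bit scale: 48.13
```

Higher is better, and the more perceptual metrics (SSIM and friends) follow the same "reference versus distorted" pattern with a smarter notion of error.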
This entire process should be easily reproducible.
The following test image was used for the line art upscaling tests:
Before we continue to the results, you can find all images in this repository. Including the greyscale variant and its downscaled counterpart.
We can see the results below:
The following image was used for the live-action upscaling testing:
Again, you can find all the images in this repository.
We can see the results below:
We can clearly see that in general, machine learning based scalers tend to give better quality, though as expected they're also more computationally expensive. We can see that FSRCNNX is the highest scoring mpv scaler, followed by RAVU. A previous version of this page included NNEDI3 in the charts, but it has been dropped due to poor performance. NNEDI3 can be considered pretty much deprecated; FSRCNNX and RAVU are both better.
Among the simple scaling filters, Lanczos performs the best in almost every single table. There are a few things to consider though: sigmoidal upscaling can reduce ringing artifacts, and elliptical/cylindrical scaling is supposed to considerably reduce aliasing at the expense of being slightly blurrier. These phenomena become more visible as we increase the scaling factor, so this test is not conclusive in any way whatsoever.
While I'm providing numerical test results to mathematically evaluate how those different algorithms perform, I strongly advise taking a look at how the upscaled images actually end up looking. Viewers have their own preferences regarding different drawbacks like aliasing, ringing or blurring. Some people might prefer an algorithm that scores relatively poorly when compared to sharper choices, but you should please yourself while consuming your media.
The following image is the result of a survey conducted by Don P. Mitchell and Arun N. Netravali, alongside 9 other digital image processing experts with the intention of classifying the bicubic filters through the B and C parameters.
The variant known as "Mitchell-Netravali" nowadays is the filter you get with B=C=1/3, and it is mpv's
Anime productions are still done at arbitrary resolutions between 1280x720 and 1920x1080 (with most below 1600x900, check Anibin for details), and most users are still watching them on FHD 1920x1080 displays. The most common scenario in which the player performs any luma upscaling (or rgb upscaling if you want to nitpick about it, since mpv merges luma and chroma right after doubling chroma) is therefore arguably 720p->1080p, which corresponds to a 1.5x scaling factor. With the recent addition of ravu-zoom to our ever growing arsenal of user shaders, and the need to find out which downscalers (dscale) work better alongside the doublers, adding a new series of measurements seems reasonable.
The methodology is straightforward: violet-gray.png was downscaled to 1280x720 with ImageMagick:
magick convert violet-gray.png -filter catrom -resize 1280x720 downscaled.png
This is the equivalent of downscaling with mpv using
The downscaled image was then brought back up to 1920x1080 using the several shaders. For the doublers, the following downscale was
The following tables display the results. The filter specified after the doubler is the downscaling filter used from 2560x1440 to 1920x1080.
You can find all the images in this repository.
Well, we can see that VIF clearly likes RAVU more than the other metrics do. We can also see that the mitchell filter scores poorly when compared to lanczos or catrom regardless of doubler, and this can be easily attributed to the fact that it is significantly blurrier than the other 2. Lanczos is the sharpest downscaling filter of the 3 sane options tested here, and at the 0.75x scaling factor (1440p->1080p) its pronounced ringing artifacts are not "bad" enough to make it rank lower than catrom. Catrom is almost as sharp but it introduces less ringing to the output.
Overall the tables tell a story that clearly puts FSRCNNX on top, and more specifically the LineArt variant. RAVU-zoom is trading blows with RAVU-lite depending on the metric, and SSSR also shows up as a good "3rd place" contender. I didn't feel the need to include all the simple filters; Lanczos scored the best in the 2x upscaling test so it is the only one showing up again here.
I currently can't conduct any performance measurements since the quarantine left me without access to my desktop computer. Doing it on my laptop's Intel iGPU is not worth it.
The sRGB colour space does not have a linear gamma curve; it's actually (approximately) a power 2.2 function, meant to compensate for how the human visual system works by giving more quantisation steps to the darker tones. This means scaling in sRGB gamma "blends" perceptual values instead of doing the arithmetic in linear light and then converting it back. This is not always bad, but it can create accentuated ringing artifacts towards the brighter tones. Doing it in linear light accentuates dark ringing artifacts though, which is the exact opposite and usually looks even worse in my opinion.
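To see how much the choice of light matters, the sketch below blends a black and a white pixel 50/50, once directly on the sRGB-encoded values and once in linear light. It uses the standard piecewise sRGB transfer function with values normalised to the 0..1 range; the function names are mine.

```python
def srgb_to_linear(c):
    """Decode an sRGB-encoded value (0..1) to linear light."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def linear_to_srgb(l):
    """Encode a linear-light value (0..1) back to sRGB."""
    if l <= 0.0031308:
        return l * 12.92
    return 1.055 * l ** (1 / 2.4) - 0.055


black, white = 0.0, 1.0
gamma_mix = (black + white) / 2  # arithmetic done on encoded values
linear_mix = linear_to_srgb((srgb_to_linear(black) + srgb_to_linear(white)) / 2)
print(round(gamma_mix, 3), round(linear_mix, 3))  # 0.5 0.735
```

The same two pixels end up as two visibly different greys, and every weighted sum inside a scaling kernel is affected in the same way.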
Taking this into consideration, the gimmicky yet clever technique of scaling in sigmoidal light was "developed". Sigmoid light treats dark and bright overshoots the same, as the quantisation precision is equal in both extremes.
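For the curious, here is a sketch of the curve itself, assuming the rescaled logistic function that ImageMagick's documentation describes for sigmoidal-contrast, with the 6.5 contrast and 50% midpoint used in the commands on this page. I can't promise mpv's shader code matches this exactly, and the function names are mine.

```python
import math

CONTRAST, MIDPOINT = 6.5, 0.5  # same values as the commands on this page


def _logistic(x):
    return 1.0 / (1.0 + math.exp(CONTRAST * (MIDPOINT - x)))


_LO, _HI = _logistic(0.0), _logistic(1.0)


def sigmoid_contrast(x):
    """Logistic curve rescaled so that 0 maps to 0 and 1 maps to 1."""
    return (_logistic(x) - _LO) / (_HI - _LO)


def inverse_sigmoid_contrast(y):
    """Exact inverse of sigmoid_contrast, applied before resampling."""
    scaled = y * (_HI - _LO) + _LO
    return MIDPOINT - math.log(1.0 / scaled - 1.0) / CONTRAST


# Pretend the scaler overshot by 5% on both ends while in sigmoidal space.
bright = sigmoid_contrast(1.05) - 1.0
dark = 0.0 - sigmoid_contrast(-0.05)
print(round(bright, 4), round(dark, 4))  # both about 0.0109
```

The image is pushed through the inverse curve, resampled, then pushed through the forward curve again. Because the forward curve flattens out near both extremes, a 5% overshoot shrinks to roughly 1%, and it does so symmetrically for dark and bright halos.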
The following test compares linear light, sRGB light and sigmoidal light on a simple 2x orthogonal upscaling test with the Lanczos filter.
A new test image, Kanao, was used. It was downscaled to 960x540 so we can upscale it back with a 2x scaling factor. The box filter was used to avoid creating pixels out of the original range and to avoid ringing artifacts in the downscaling process.
The sRGB light upscaling was done with:
magick convert kanao_box.png -filter lanczos -resize 200% kanao_srgb.png
The linear light upscaling was done with:
magick convert kanao_box.png -colorspace RGB -filter lanczos -resize 200% -colorspace sRGB kanao_linear.png
The sigmoidal light upscaling was done with:
magick convert kanao_box.png -colorspace RGB +sigmoidal-contrast 6.5,50% -filter lanczos -resize 200% -sigmoidal-contrast 6.5,50% -colorspace sRGB kanao_sigmoid.png
You can find all the images in this repository.
As we can see, all three relevant metrics seem to agree sigmoidal light is better. The gpu-hq profile comes with
The following image shows the difference (images have been scaled up with nearest neighbour for easier visualisation, open in a new tab at 100% scaling to see it).
Chroma subsampling is a technique utilised to save bitrate/filesize without strongly sacrificing perceived quality, by taking into account how our eyes biologically work and how our brains interpret the information they're receiving.
With the rise of television broadcasting, bandwidth-heavy innovations were rapidly making the electromagnetic spectrum "crowded". The practical bands with low long-range attenuation were becoming scarce, and that would undoubtedly limit the amount of things we could have "over the air". This led the way to new bandwidth-saving techniques that aimed to reduce the required bandwidth as much as reasonably possible without affecting the perceived quality of the service to the same extent.
Since we're more likely to perceive contrast/luminosity differences than chromatic details, chroma-subsampling is simply the most basic way of throwing away some colouring information without sacrificing on luminosity resolution.
To make this simple to understand, we can see a normal image below alongside its RGB planes:
Now, the same image alongside its YUV/YCbCr planes:
We can easily simulate how it would look if it didn't have its entire chromatic information available by simply downscaling the chroma planes to 1/4 of their original pixel count (0.5x scaling factor on each axis) and then upscaling them back (2x scaling factor).
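A rough Python sketch of that simulation follows. Several things here are my assumptions rather than what was done for the images on this page: BT.709 coefficients, full-range float values, even image dimensions, and a crude box average plus pixel repetition standing in for proper down/upscaling filters. All function names are illustrative.

```python
KR, KB = 0.2126, 0.0722  # BT.709 luma weights (assumed; BT.601 differs)
KG = 1.0 - KR - KB


def rgb_to_ycbcr(r, g, b):
    """Full-range float RGB -> (Y, Cb, Cr), with chroma centred on 0."""
    y = KR * r + KG * g + KB * b
    return y, (b - y) / (2 * (1 - KB)), (r - y) / (2 * (1 - KR))


def ycbcr_to_rgb(y, cb, cr):
    r = y + 2 * (1 - KR) * cr
    b = y + 2 * (1 - KB) * cb
    g = (y - KR * r - KB * b) / KG
    return r, g, b


def blur_chroma(plane):
    """Average every 2x2 block and repeat it: a crude 0.5x down plus 2x up."""
    h, w = len(plane), len(plane[0])  # assumes even width and height
    out = [[0.0] * w for _ in range(h)]
    for y0 in range(0, h, 2):
        for x0 in range(0, w, 2):
            mean = (plane[y0][x0] + plane[y0][x0 + 1]
                    + plane[y0 + 1][x0] + plane[y0 + 1][x0 + 1]) / 4
            for dy in (0, 1):
                for dx in (0, 1):
                    out[y0 + dy][x0 + dx] = mean
    return out


def simulate_420(rgb):
    """Keep luma at full resolution, quarter the number of chroma samples."""
    ycc = [[rgb_to_ycbcr(*px) for px in row] for row in rgb]
    cb = blur_chroma([[p[1] for p in row] for row in ycc])
    cr = blur_chroma([[p[2] for p in row] for row in ycc])
    return [
        [ycbcr_to_rgb(ycc[y][x][0], cb[y][x], cr[y][x])
         for x in range(len(row))]
        for y, row in enumerate(ycc)
    ]
```

Calling `simulate_420` on an image stored as rows of `(r, g, b)` tuples returns an image whose luma is untouched while each 2x2 block shares a single chroma value.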
It's important to see how much "blurrier" the chromatic planes look after this process, in other words, how much high frequency information they lost.
Can't notice the difference when you put it back to RGB? Well, that's the point. In any case, to make it possible for us to see, you can look at the image below, which is simply 128 plus the difference between the reference and the chroma subsampled version:
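The difference image is a one-liner in spirit. A minimal sketch for 8-bit planes stored as lists of rows, assuming the sign convention "reference minus subsampled" and clamping to the valid range (both are my assumptions, as is the `difference_image` name):

```python
def difference_image(reference, distorted):
    """Per-pixel 128 + (reference - distorted), clamped to the 8-bit range."""
    return [
        [max(0, min(255, 128 + r - d)) for r, d in zip(ref_row, dist_row)]
        for ref_row, dist_row in zip(reference, distorted)
    ]


print(difference_image([[100, 50]], [[90, 50]]))  # [[138, 128]]
```

Identical pixels come out as mid-grey (128), so anything that isn't flat grey is information the subsampling threw away.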
From this simulation it's probably easy to understand why chroma upscaling isn't usually a big concern: you're simply unlikely to notice the difference on real content.
With that in mind, we can proceed with the actual comparison. SSIM measurements can be ignored this time since the luma plane remains intact and therefore there are almost no significant gains to be made in structural similarity from switching the scaler.
Rosetta served as a somewhat unfaithful representation of what you should expect from anime, since usually speaking actual anime footage is far from being as detailed and therefore the scalers would score even closer overall. We can proceed with a live action example.
It's important to notice how the numerical difference between the scalers is much smaller when doing chroma measurements, and this is precisely why using a heavier scaler is usually not warranted unless the performance impact is negligible.
On my system, running KrigBilateral on Vulkan makes no difference whatsoever since I have a CPU decoding bottleneck at ~1200 FPS with or without it on a 720p 8-bit AVC file. Running on ra_d3d11 with d3d11va hardware decoding, I go from ~1800 FPS to ~1150 FPS going from lanczos to KrigBilateral. Running both FSRCNNX-8-0-4-1 and KrigBilateral at the same time gives me ~260 FPS on both d3d11 and Vulkan, with or without hwdec.
My personal recommendation is that you should test it yourself to see if you can notice KrigBilateral's difference during playback, and use it if it doesn't really matter resource-wise (in the case your GPU is fine either way). Another sensible decision would be to use it when you don't need luma upscaling, since then the only gains you can make are in chroma.
Well, we've been talking about upscaling up to now, but is there any difference when it comes to downscaling? Intrinsically, downscaling images should be an easier task considering all we need to do is draw the same curves with fewer points. A normal person would find it much harder to differentiate downscalers, all reasonable filters are usually good enough and it becomes a subjective problem of avoiding the artifacts you dislike the most. Different filters will produce different amounts of ringing, blocking and aliasing. Some prefer smooth images, others prefer sharp images... The problem here is that we can't compare the results to a reference and calculate which filter managed to get the closest to the ground truth like we did before for upscaling.
I chose to use Catmull Rom for all my downscales up to this point, and some of you might be asking yourselves why. Well, Catmull Rom is the sharpest BC-Spline that satisfies the B + 2C = 1 recommendation by Don P. Mitchell and Arun N. Netravali. Catrom can be seen as a sharper version of Mitchell, which is mpv's default downscaler with profile=gpu-hq.
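The whole BC-Spline family comes from one piecewise cubic, so the relationship between the two filters is easy to check numerically. The sketch below uses the kernel formula from Mitchell and Netravali's paper as I understand it; the `bc_spline` function name and the choice of a 0.5 pixel offset are mine.

```python
def bc_spline(x, b, c):
    """Mitchell-Netravali cubic kernel with parameters B and C."""
    x = abs(x)
    if x < 1:
        return ((12 - 9 * b - 6 * c) * x ** 3
                + (-18 + 12 * b + 6 * c) * x ** 2
                + (6 - 2 * b)) / 6
    if x < 2:
        return ((-b - 6 * c) * x ** 3
                + (6 * b + 30 * c) * x ** 2
                + (-12 * b - 48 * c) * x
                + (8 * b + 24 * c)) / 6
    return 0.0


filters = {"catrom": (0.0, 0.5), "mitchell": (1 / 3, 1 / 3)}
for name, (b, c) in filters.items():
    assert abs(b + 2 * c - 1) < 1e-12  # both sit on the B + 2C = 1 line
    # Weights of the 4 neighbours when sampling halfway between two pixels.
    weights = [bc_spline(d, b, c) for d in (-1.5, -0.5, 0.5, 1.5)]
    print(name, [round(w, 4) for w in weights], round(bc_spline(0, b, c), 4))
```

Catrom's outer weights are more negative and its centre value is exactly 1, so it keeps original samples intact and sharpens more (with more ringing), while Mitchell's centre value is below 1, which is where the extra blur comes from.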
Using Catrom in the upscaling tests was just my (reasonably arbitrary) personal preference at the time. Mitchell does produce even fewer artifacts, but the end result is noticeably blurrier. This Stack Overflow answer from 2008 indicates some sort of survey was conducted within Hollywood and they ended up choosing windowed sincs for downscaling and Mitchell for upscaling.
For more detailed recommendations please read this special section in the ImageMagick documentation written by Nicolas Robidoux.
While downscaling evaluation is more of a qualitative thing, MATLAB comes with 3 no-reference quality metrics, BRISQUE (Blind/Referenceless Image Spatial Quality Evaluator), NIQE (Natural Image Quality Evaluator) and PIQE (Perception based Image Quality Evaluator). BRISQUE correlates well with human perception because it's trained taking into consideration human subjectivity with a SVM (Support Vector Machine). NIQE and PIQE, however, are opinion-unaware and do not depend on human subjectivity.
NIQE learns its model from pristine images only, while PIQE does not need any training. Let's drop PIQE and focus on the other 2. Since humans usually like sharp images, I expect BRISQUE to rank the scalers based almost solely on sharpness. NIQE should theoretically be more objective and actually rank them in an order that goes from "fewer distortions" to "more distortions" if it gets trained properly with pristine anime screenshots.
The problem with this approach is that we're assuming anime screenshots are pristine, and in fact they're not. Anime is usually delivered with 4:2:0 chroma-subsampling and a lot of lossy compression.
With all that in mind we can still try to evaluate this mathematically, despite the shortcomings.
I'll be using Chitanda for downscaling. This keyframe represents anime content very well: you can see it's not as sharp as fanart (likely because it's an upscale from some arbitrary resolution between 1280x720 and 1920x1080), but it's not overly blurry either. I decided to use
The following image was downscaled using all the available filters in mpv:
As always, you can find all the resulting images in this repository.
We can see the results below:
As expected, BRISQUE really did almost perfectly rank them from sharpest to blurriest. The trained NIQE also put SSimDownscaler at the top, but the following filters aren't solely ranked from sharpest to blurriest.
Out of curiosity, I decided to calculate the structural similarity between the outputs of SSimDownscaler, Lanczos and Catmull Rom.
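For reference, the core SSIM formula compares means, variances and covariance of two images. The sketch below computes it once over the whole image, which is a simplification I'm making for brevity: the reference implementation evaluates it in local Gaussian-weighted windows and averages the results, so these numbers would not match the tables. The `global_ssim` name is mine.

```python
def global_ssim(a, b, peak=255.0):
    """Single-window SSIM between two greyscale images (lists of rows)."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2  # stabilising constants
    return ((2 * mean_x * mean_y + c1) * (2 * cov + c2)) / (
        (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    )


image = [[10, 200], [90, 30]]
inverted = [[245, 55], [165, 225]]
print(global_ssim(image, image))     # identical images score 1 (up to rounding)
print(global_ssim(image, inverted))  # inverted structure scores far lower
```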
My original idea was using Waifu2x's training dataset to train NIQE, but I found out it isn't public. Then I figured out that I could simply use my anime screenshots folder (which had 162 anime screenshots inside from numerous different shows). The resulting model does not only give all the scalers better results but it also correlates better to BRISQUE results. I should probably train NIQE with pristine fanarts to make it more accurate.
Let's remember Lanczos scored slightly better than Catrom in the NIQE chart from the fractional upscaling tests when they were both coupled with a doubler (FSRCNNX or RAVU), but the scaling factor there was 0.75x, whereas it is 0.5x here. As you increase the scaling factor (or decrease it, in the case of downscaling) you'll also make the artifacts a filter produces more visible.
The following image displays the difference between the Mitchell, Catrom and Lanczos filters (open it in a new tab at 100% scaling to see the ringing difference more easily):
I should probably repeat this for an image with more high frequency components like Rosetta, but for now Chitanda will suffice. It's important to insist the optimal filter highly depends on several factors such as the image itself and the scaling factor. To make any reasonably accurate claim I'd need to repeat this experiment on several images and average their results, which I probably won't do anytime soon considering the differences between the filters are honestly minimal.
My personal recommendation is: stick to Mitchell as the default if you really dislike ringing. Switch to Lanczos if you want more perceived sharpness. And if you don't want either extreme, Catrom might be a decent compromise. Sigmoid downscaling could potentially dampen the ringing but as far as I'm aware mpv only supports sigmoidal light upscaling.
23/05/2021: Adding Anime4K. Fractional comparisons still missing.
01/05/2021: Adding sigmoidal light and linear light comparisons.
18/03/2021: Updating FSRCNNX-8-0-4-1 to the 05/03/2021 version.
17/01/2021: Adding geometric mean for Violet.
06/12/2020: Downscaling outside linear light again for the upscaling tests.
01/12/2020: Fractional upscaling section updated with the newest versions of all the prescalers and more downscaling options. Some qualitative comparisons added to the downscaling section.
30/11/2020: All shaders updated to their current versions in the Violet upscaling tests. Waifu2x and NGU dropped. Comparing against Waifu2x is unfair due to the time it takes to actually compute a result, and I simply don't care about MadVR. Dropping these things allows me to greatly simplify the table generating process and I don't intend to add them back. NNEDI3 also dropped due to poor performance. PSNR-HA dropped due to being redundant alongside PSNR-HMA. VIF, MS-SSIM and NIQE added as new metrics. I still need to update Shuri, and I'll likely do it soon. But for now all old Violet tables can be found here.
28/11/2020: Improving scaling explanations considerably. Adding a new and improved downscaling section with no reference image quality metrics and a custom model fit for anime content.
26/11/2020: Changelog moved to the end of the page. Mitchell survey and some clarifications regarding downscaling added.
11/05/2019: KrigBilateral updated.
22/04/2019: FSRCNNX updated, removing the downscaling section for now since people aren't exactly going out of their ways to fix shitty upscales and there's already some information about it on the fractional tables.
21/04/2019: Preliminary fractional upscaling evaluations added, focusing on the most relevant algorithms first.
18/04/2019: Ravu-zoom-r4 added, r2 and r3 were updated.
17/04/2019: PSNR-HA/HMA measurements are now done with wstep = 7 as recommended by igv.
16/04/2019: New FPS measurement methodology, files now play for 4 seconds and then I calculate how many frames were displayed based on which frame it stopped at after the time-frame has elapsed. --scaler-resizes-only was also removed since the half pixel shift is not noticeable during playback. Previous results had different numbers, but the order didn't change significantly.
06/04/2019: Tables updated with newer versions of FSRCNNX again.
31/03/2019: MS-SSIM tables were removed. All remaining luma related tables were updated. New methodology consists of basically utilising a full-range greyscale PNG as input (which results in lower quantisation errors when compared to using limited range YCbCr). Ravu-zoom was added, FSRCNNX_x2_16-0-4-1 and Ravu-lite were updated.
27/03/2019: Updated PSNR-HA/HMA, you can expect to see ravu-ar and ravu-zoom soon. PSNR-HVS-M was removed since it's deprecated by the insertion of PSNR-HA.
15/03/2019: New versions of FSRCNNX.
08/03/2019: Adding another test image for chroma, and changed all chroma tests to upscale from a yuv4mpegpipe y4m file which makes me more confident they're accurate.
07/03/2019: Adding Ravu-lite and fixing NGU-AA's half pixel shift. Huge thanks to Bjin for providing a user-shader to accomplish the latter =). FPS chart should be updated soon.
03/03/2019: Ravu-chroma and Ewa_Robidoux/Ewa_Robidouxsharp added.
27/02/2019: PSNR-HMA and PSNR-HA tables added.
25/02/2019: Initial version of the Chroma section is added. I'll take some time to slowly polish it with more information and different tests, and I'm also probably going to add PSNR-HMA to the RGB tests soon.
20/02/2019: IW-SSIM tables added.
19/02/2019: MS-SSIM tables fixed, code had issues with RGB -> greyscale conversion.
18/02/2019: Lanczos added, and tables now display results from best to worst.
16/02/2019: Other EWA scalers were added.
22/12/2018: I found out a mistake in previous results, and have once again updated all quality comparison results.
I'm not certain about what was causing the problems, but turning off PNG filtering and debanding seems to have fixed it. I'll investigate which of the two was the culprit once I have the time.
Well, since I had to redo everything, I took this opportunity to improve the testing methodology: debanding was turned off to prevent loss of detail, and I used a lossless YUV444p AVC encode as upscaling source instead of a YUV444p JPEG.
18/12/2018: Results have been updated with latest versions of FSRCNNX and Waifu2x. On top of that, I've included MS-SSIM and PSNR-HVS-M measurements. I've also included all NGU algorithms from MadVR since people are interested in them.