KandiSuperRes Flash – an updated model for increasing image resolution

In April of this year, we released the Kandinsky 3.1 model, which supports many different modes, including the ability to generate 4K images using the KandiSuperRes diffusion model. You can read more about its architecture and results in this article. The model produces sharper high-resolution images, but it does not remove artifacts introduced at the generation stage by Kandinsky 3.1. To address these shortcomings, we developed the KandiSuperRes Flash model, which improves the image, makes it more aesthetically pleasing, and at the same time doubles its resolution.

Example of resolution increase using KandiSuperRes Flash

Pipeline description

The updated KandiSuperRes Flash pipeline contains two components: a distilled version of KandiSuperRes and the distilled Kandinsky 3.0 Flash model. We have already written in detail about the distillation of diffusion models in our previous article, where we described how we trained Kandinsky 3.0 Flash. By analogy, we trained a distilled version of the resolution-enhancement model. We used the downsampling half of the U-Net from the KandiSuperRes model as the discriminator, with the Wasserstein loss as the objective. We noticed that this approach not only speeds up image generation but also produces sharper and more detailed images, thanks to the discriminator and GAN-style training. Below is a comparison of the KandiSuperRes model and its distilled version.
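To make the adversarial setup concrete, here is a minimal sketch of the loss structure described above. All module and function names (`TinyUNetEncoder`, the toy layer sizes) are illustrative stand-ins, not the actual KandiSuperRes code; only the idea is the same: reuse a U-Net downsampling encoder as a critic and train with the Wasserstein objective.

```python
import torch
import torch.nn as nn

class TinyUNetEncoder(nn.Module):
    """Toy stand-in for the downsampling half of a U-Net,
    reused here as the discriminator (critic) backbone."""
    def __init__(self, channels=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1),
            nn.SiLU(),
        )
        self.head = nn.Conv2d(channels, 1, 1)  # per-patch critic score

    def forward(self, x):
        # Average patch scores into one scalar score per image.
        return self.head(self.net(x)).mean(dim=(1, 2, 3))

def wasserstein_d_loss(real_scores, fake_scores):
    # Critic maximizes E[D(real)] - E[D(fake)]; we minimize the negative.
    return fake_scores.mean() - real_scores.mean()

def wasserstein_g_loss(fake_scores):
    # The distilled student tries to raise the critic's score on its outputs.
    return -fake_scores.mean()

critic = TinyUNetEncoder()
real = torch.randn(2, 3, 32, 32)   # high-resolution targets
fake = torch.randn(2, 3, 32, 32)   # student (distilled model) outputs
d_loss = wasserstein_d_loss(critic(real), critic(fake))
g_loss = wasserstein_g_loss(critic(fake))
```

In practice a Wasserstein critic also needs a Lipschitz constraint (weight clipping or a gradient penalty); that detail is omitted here for brevity.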

Comparison of the KandiSuperRes model and its distilled version

Through experimentation, we found that the distilled Kandinsky 3.0 Flash model can be used as a refiner, running for 1 or 2 steps depending on how much we want to change the image.

The result of Kandinsky 3.0 Flash as a refiner

We combined these two models into a single pipeline called KandiSuperRes Flash. The first stage doubles the resolution in 4 pixel-diffusion steps, and the second stage refines the image in 1 latent-diffusion step (noising up to step 229 and denoising back). The inference time of the entire pipeline is 6 seconds on an H100 when upscaling from 1K to 2K.
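The two-stage control flow above can be sketched schematically. The callables `sr_step`, `refine_step`, and `add_noise` are hypothetical placeholders for the real model calls; only the step structure (4 pixel-diffusion steps, then noise to step 229 and one denoising step) mirrors the text.

```python
def upscale_flash(image, sr_step, refine_step, add_noise):
    """Two-stage KandiSuperRes Flash pipeline (schematic)."""
    # Stage 1: pixel-space diffusion, 4 steps, 2x resolution.
    x = image
    for t in (4, 3, 2, 1):          # illustrative timestep schedule
        x = sr_step(x, t)
    # Stage 2: latent-space refinement in a single step:
    # noise up to step 229, then denoise back.
    return refine_step(add_noise(x, 229), 229)

# Toy demo: a float stands in for an image tensor.
out = upscale_flash(
    1.0,
    sr_step=lambda x, t: x * 2 if t == 4 else x,  # "doubles resolution" once
    refine_step=lambda x, t: x - 0.1,
    add_noise=lambda x, t: x + 0.1,
)
```

The real pipeline operates on image and latent tensors, of course; the point of the sketch is only the order of operations.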

Results and examples

Examples of generations using KandiSuperRes Flash

To understand how much the KandiSuperRes Flash model improves images, we conducted a side-by-side (SBS) comparison with the Kandinsky 3.1 model. The SBS was run on a fixed query basket of 2,100 prompts (100 prompts for each of 21 categories). Each pair of generations was evaluated for visual quality ("which of the two images do you like better?"). You can read more about the SBS methodology in the Kandinsky 3.0 article. The results are shown in the graphs below. Annotators were almost twice as likely to select images after upscaling with KandiSuperRes Flash as before upscaling (28% vs. 15%). In 19% of cases, annotators liked both images, and in 37% of cases neither image was selected; the latter can be explained by strong artifacts in those images that the upscaling model could not correct.

Comparison of models by visual image quality

Comparison of visual quality of images by topic

Infinite super resolution

Since KandiSuperRes Flash now not only increases the resolution and sharpness of the image but also adds fine detail, it became possible to enlarge an image almost without limit (up to x16 and beyond) by applying the model repeatedly.
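The idea reduces to chaining 2x passes: since each pass doubles the resolution, an x16 enlargement is four consecutive passes. A small sketch, where `upscale2x` is a hypothetical stand-in for one KandiSuperRes Flash pass:

```python
def iterative_sr(image_size, upscale2x, factor=16):
    """Apply a 2x super-resolution pass repeatedly to reach `factor`.

    `factor` must be a power of two; log2(factor) passes are needed.
    """
    passes = factor.bit_length() - 1  # log2 for powers of two
    size = image_size
    for _ in range(passes):
        size = upscale2x(size)
    return size

# Toy demo: track only the side length, doubling it per pass.
final = iterative_sr(1024, upscale2x=lambda s: s * 2, factor=16)  # 1024 -> 16384
```

In the real setting each pass also repaints fine detail, which is what keeps the result sharp instead of merely interpolated.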

Conclusion

We have introduced KandiSuperRes Flash, a new version of our resolution-enhancement model that generates markedly better images. KandiSuperRes Flash not only increases sharpness but also corrects artifacts, fills in details, and improves the aesthetics of the image. One of its most important advantages is the ability to use the model in "infinite super resolution" mode. The code and weights can be found on GitHub and Hugging Face.

Team of authors: Anastasia Maltseva, Vladimir Arkhipkin, Nikolay Gerasimenko, Andrey Kuznetsov and the head of the scientific group Sber AI Research Denis Dimitrov.
