Last month, Stability AI released Stable Diffusion XL 1.0 (SDXL) and open-sourced it, with no special permissions required to access it.
The release went largely unnoticed because the hype around generative AI has cooled a bit, and everyone in the field is busy with text-generating AI like ChatGPT. Notably, SDXL is one of the first open-source models that can natively generate 1024×1024-resolution images without fiddling, allowing much more detail to be rendered. SDXL actually consists of two models: a base model and an optional refiner model that greatly improves detail, and since the refiner adds little overhead, I strongly recommend using it if possible.
The lack of hype around SDXL doesn’t mean it’s boring. The model now has full support in Hugging Face’s diffusers Python library, with the appropriate performance optimizations, and the SDXL demos in diffusers are simple to set up:
```python
import torch
from diffusers import DiffusionPipeline, AutoencoderKL

# load the SDXL base model and the refiner
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix",
                                    torch_dtype=torch.float16)
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae, torch_dtype=torch.float16, variant="fp16", use_safetensors=True)
_ = base.to("cuda")
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2, vae=base.vae,
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True)
_ = refiner.to("cuda")

# generation using both models: the base model denoises the first 80%,
# then hands its latents off to the refiner to finish
high_noise_frac = 0.8
prompt = "an astronaut riding a horse"
negative_prompt = "blurry, bad hands"
image = base(prompt=prompt, negative_prompt=negative_prompt,
             denoising_end=high_noise_frac, output_type="latent").images
image = refiner(prompt=prompt, negative_prompt=negative_prompt,
                denoising_start=high_noise_frac, image=image).images[0]
```
I booted up a cloud VM with a new midrange L4 GPU (only $0.24 per hour as a Spot instance on Google Cloud Platform) and got to work. On the L4, generating each 1024×1024 image takes about 22 seconds. Unlike with previous Stable Diffusion models, you can only generate one image at a time on a midrange GPU, because SDXL uses essentially 100% of the GPU’s capacity, so a little more patience is needed. You can generate lower-resolution images faster, but I strongly discourage it because the results are much worse.
diffusers has also implemented support for two new features I didn’t experiment with in my previous Stable Diffusion articles: prompt weighting, and Dreambooth LoRA training and inference. Prompt weighting in diffusers relies on the compel Python library for more mathematical weighting of prompts. You can append any number of “+”s or “-”s to a given word to increase or decrease its “importance” in the resulting positional text embeddings, and therefore in the final generation. You can also wrap phrases: for example, if you prompt for “San Francisco landscape by Salvador Dalí, oil on canvas” but instead get a photorealistic San Francisco, you can wrap the artistic part of the prompt, as in “San Francisco landscape by Salvador Dalí (oil on canvas)+++”, to make Stable Diffusion behave as expected. In my testing, these wraps eliminate most of the prompt-adherence problems introduced in Stable Diffusion 2.0 and above, especially at higher classifier-free guidance values (the default guidance_scale is 7.5; I prefer to use 13). All LoRA-generated examples in this article use guidance_scale=13.
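To make the “+”/“-” suffix idea concrete, here is a toy, self-contained sketch of how such marks could map to per-phrase weights. This is only an illustration, not compel’s actual parser: compel supports a much richer syntax and produces weighted text embeddings for the pipeline rather than bare numbers, and the 1.1/0.9 multipliers here are assumptions for illustration.

```python
import re

def parse_weights(prompt, up=1.1, down=0.9):
    """Toy illustration of compel-style +/- suffix weighting.

    Each trailing '+' multiplies a fragment's weight by `up`, each '-'
    by `down`. Parenthesized phrases are treated as one fragment.
    (compel's real parser is richer and returns embeddings, not weights.)
    """
    weights = []
    for match in re.finditer(r"\(([^)]*)\)([+-]*)|(\S+?)([+-]*)(?=\s|$)", prompt):
        if match.group(1) is not None:      # "(phrase)+++" form
            text, marks = match.group(1), match.group(2)
        else:                               # bare "word--" form
            text, marks = match.group(3), match.group(4)
        w = 1.0
        for m in marks:
            w *= up if m == "+" else down
        weights.append((text, round(w, 4)))
    return weights

print(parse_weights("a (oil on canvas)+++ wolf-"))
# → [('a', 1.0), ('oil on canvas', 1.331), ('wolf', 0.9)]
```

Each extra mark compounds multiplicatively, which is why “+++” boosts a phrase noticeably more than a single “+”.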
LoRA the Explorer
But the most important addition is support for Dreambooth LoRA, which makes it possible to create “custom” Stable Diffusion models. Dreambooth is a method of fine-tuning Stable Diffusion on a very small set of source images plus a trigger keyword, allowing the “concept” from those images to be used in other contexts via that keyword.
Training Stable Diffusion itself, even smaller versions of the model, requires many expensive GPUs running for hours. That’s where LoRA comes in: instead, a small adapter for the vision model is trained, which can be done on a single cheap GPU in about 10 minutes, and the quality of the final model + LoRA is comparable to a full fine-tune (colloquially, when people refer to Stable Diffusion fine-tuning, they usually mean creating a LoRA). A trained LoRA is a small discrete binary file, making it easy to share with others or on repositories such as Civitai. A slight downside of LoRAs is that you can only have one style per LoRA adapter: you can combine multiple LoRAs to get the benefits of all of them, but that’s a delicate science.
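The parameter savings behind LoRA come from low-rank factorization: rather than updating a full weight matrix W, you train two skinny matrices B and A and apply W + BA at inference. A minimal numpy sketch (the layer width and rank below are made-up numbers for illustration, not SDXL’s actual dimensions):

```python
import numpy as np

# Why a LoRA adapter is tiny: instead of fine-tuning a full d x d weight
# matrix W, train two low-rank factors B (d x r) and A (r x d) and apply
# W' = W + B @ A at inference time.
d, r = 1024, 8                      # hypothetical layer width and LoRA rank
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))     # frozen pretrained weight
B = np.zeros((d, r))                # LoRA convention: B starts at zero...
A = rng.standard_normal((r, d))     # ...so W' == W before any training
W_adapted = W + B @ A

full_params = W.size
lora_params = B.size + A.size
print(f"full fine-tune: {full_params:,} params; LoRA adapter: {lora_params:,} "
      f"params ({full_params // lora_params}x fewer)")
```

Because only B and A are trained and shipped, the adapter stays a small standalone file, which is exactly what makes sharing on sites like Civitai practical.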
Before Stable Diffusion LoRAs became widespread, there was textual inversion, which lets the text encoder learn a concept, but it takes hours to train and the results can be clunky. In a previous post, I trained a textual inversion embedding on a comic dataset, Ugly Sonic, since it wasn’t in the original Stable Diffusion training data and would therefore be unique. The results from the final model were mixed.
I thought training a LoRA on the Ugly Sonic dataset would be a good test of SDXL’s potential. Luckily, Hugging Face provides a train_dreambooth_lora_sdxl.py script for training a LoRA on top of the base SDXL model, which works out of the box, although I tweaked the parameters a bit. The “Ugly Sonic” images generated by the trained LoRA are, to put it mildly, much better and more coherent across a variety of prompts.
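For reference, a sketch of how that script might be invoked. The dataset folder, instance prompt, and hyperparameters below are hypothetical placeholders, not the exact settings used for this article (check the script’s --help for the authoritative flag list):

```shell
# Hypothetical invocation: paths, prompt, and hyperparameters are
# illustrative placeholders, not the settings used in this article.
accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --instance_data_dir="./ugly_sonic_images" \
  --instance_prompt="a photo of Ugly Sonic" \
  --output_dir="./ugly_sonic_lora" \
  --resolution=1024 \
  --train_batch_size=1 \
  --learning_rate=1e-4 \
  --max_train_steps=500 \
  --mixed_precision="fp16"
```

The resulting adapter can then be loaded into a diffusers pipeline with load_lora_weights() before generating.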
Encouraged by this success, I decided to repeat another of my textual inversion experiments by training a LoRA on heavily distorted, junk images whose training prompt was simply “wrong”, in the hope that the LoRA could exploit the negative condition and steer generation away from such images, producing better ones. I wrote a Jupyter Notebook to generate the synthetic “wrong” images using SDXL itself, this time using different prompt weights to get clearer examples of bad image types such as “bad hands”. It’s ironic that we need to use SDXL to create low-quality high-resolution images.
I compared three pipelines:

- Base model + refiner, without the LoRA (the baseline)
- The pipeline without the LoRA, using the word “wrong” as a negative prompt (to make sure there is no placebo effect)
- The pipeline with the LoRA, using the word “wrong” as a negative prompt (our target result)

Each generation uses the same seed, so the composition of the images should be the same across all three, and the effect of both the “wrong” negative prompt and the LoRA relative to the baseline should be very obvious.
Let’s start with a simple prompt from the SDXL 0.9 demos:
The “wrong” negative prompt on the base model adds some foliage and depth to the forest, but the LoRA adds much more: sharper lighting and shadows, more detailed foliage, and it changes the wolf’s perspective so it looks at the camera, which is more interesting.
We can get a different perspective on the wolf with a similar photo composition by adding “extreme closeup” and reusing the same seed.
In this generation, the LoRA produced much better texture, brightness, and sharpness than the others. But it’s remarkable that simply adding the “wrong” negative prompt changes the perspective.
Another good test case is food photography, especially weird food photos like these. Can SDXL + the “wrong” LoRA handle a non-Euclidean hamburger, with some prompt weighting to make sure it’s weird?
The answer is that it can’t, even after a few attempts at prompt engineering. Still, this result is interesting: the base SDXL seems to have taken the “alien” part of the prompt more literally than expected (and gave it a cute bun hat!), but the LoRA captures the spirit of the prompt better, creating an “alien” burger that people would have a hard time eating, plus a shinier serving aesthetic.
A notable improvement in Stable Diffusion 2.0 was text legibility. Can SDXL and the “wrong” LoRA make text even more legible, say on text-heavy newspaper covers?
Text legibility has definitely improved over Stable Diffusion 2.0, but it seems about the same in all three cases. Notably, the LoRA improved the typesetting on the covers: the page layout is more “modern”, with varied article layouts, and the headlines have the correct relative font sizes. Meanwhile, the base model, even with the “wrong” negative prompt, has a boring layout and, for some reason, aged yellowish paper.
What about people? Does the “wrong” LoRA fix AI’s infamous problem with hands, especially since we included many such examples in the LoRA training data? Let’s revisit the Taylor Swift presidential prompt from my first attempt with Stable Diffusion 2.0:
Look at Taylor’s right hand: in default SDXL it’s extremely unrealistic, and it actually gets worse when you add the “wrong” negative prompt, but the LoRA fixes it! The color grading with the LoRA is also much better, with the shirt a more distinct white rather than yellowish white. Still, don’t look too closely at her hands in any of the generations: generating people with SDXL 1.0 is still difficult and unreliable!
By now it’s clear that “wrong” + LoRA makes every generation more interesting than the “wrong” negative prompt alone, so from here on we will just compare the base result with the LoRA result. Here are a few more examples comparing the base model with the “wrong” LoRA, although I can’t guarantee its effectiveness in interfaces other than diffusers. All the notebooks used to create these images are available in this GitHub repo, including a base SDXL 1.0 + refiner + “wrong” LoRA notebook for Colab, which you can run on a free T4 GPU. And if you want to see the generated images from this post at higher resolution, you can view them in the post’s source code.
What’s wrong with being wrong?
I’m actually not 100% sure what’s going on here. I thought the “wrong” LoRA would simply improve the quality and sharpness of the generated images, but it turns out the LoRA makes SDXL behave more intelligently and adhere more closely to the meaning of the prompt. On a technical level, the negative prompt sets the region of latent space where the diffusion process starts; that region is the same for the base model using the “wrong” negative prompt as for the LoRA using the “wrong” negative prompt. My intuition is that the LoRA reshapes this undesirable region of the vast high-dimensional latent space to look more like the starting region, so normal generations are unlikely to land in it and are therefore improved.
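The mechanics of why the negative prompt matters at all come from classifier-free guidance: at each denoising step, the model’s prediction is extrapolated away from the negative-prompt (unconditional) prediction, scaled by guidance_scale. A toy numpy sketch of that update rule (real pipelines apply it to U-Net noise predictions over full latents, not 3-vectors):

```python
import numpy as np

# Classifier-free guidance: push the prediction away from the
# negative-prompt direction, scaled by guidance_scale.
def cfg(noise_uncond, noise_cond, guidance_scale):
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

uncond = np.array([0.0, 1.0, 2.0])   # toy prediction for the negative prompt
cond = np.array([1.0, 1.0, 0.0])     # toy prediction for the actual prompt
print(cfg(uncond, cond, 7.5))        # the diffusers default guidance_scale
print(cfg(uncond, cond, 13.0))       # the higher value I prefer
```

At guidance_scale = 1 the negative prompt has no effect; the higher the scale, the harder each step is pushed away from whatever the negative prompt, and here the “wrong” LoRA, represents.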
Training SDXL on bad images in order to improve it is technically a form of reinforcement learning from human feedback (RLHF): the same technique used to make ChatGPT as powerful as it is. While OpenAI uses reinforcement learning to improve the model from positive user interactions and implicitly reduce negative behavior, here I use negative user interactions (i.e., deliberately choosing bad images) to implicitly increase positive behavior. And with Dreambooth LoRA, you don’t need nearly as much input data as large language models require.
There is still plenty of room for improving these “negative LoRAs”: my synthetic-dataset generation parameters could be refined considerably, and the LoRA could be trained for longer. But so far I’m very happy with the results, and I’d be happy to test more negative LoRAs, such as merging them with other LoRAs to see whether that improves those too (especially the “wrong” LoRA + the Ugly Sonic LoRA!).
Believe it or not, this is just the tip of the iceberg. SDXL also now has ControlNet support for strict control over the overall shape and composition of generated images:
ControlNet can also be used together with LoRA, but that’s a topic for another article.