I made Stable Diffusion XL smarter by training it on bad AI-generated images

Last month, Stability AI released Stable Diffusion XL 1.0 (SDXL) and open-sourced it without requiring any special permissions to access it.

The release went largely unnoticed because the hype around generative AI has cooled off a bit, and everyone in the field is too busy with text-generating AI like ChatGPT. Notably, SDXL is one of the first open-source models that can natively generate 1024×1024 resolution images without fiddling, allowing much more detail to be rendered. SDXL actually consists of two models: a base model and an additional refiner model that greatly improves detail, and since running the refiner adds little extra time, I strongly recommend using it if possible.

Comparison of the relative quality of SD models. Note the significant increase when using the refiner. (Link)

The lack of hype around SDXL doesn’t mean it’s boring. Now that the model is fully supported in Hugging Face’s diffusers Python library, with the appropriate performance optimizations, it’s a good time to play with it, since the SDXL demo code in diffusers is simple and easy to set up:

import torch
from diffusers import DiffusionPipeline, AutoencoderKL

# load SDXL and the refiner
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix",
                                    torch_dtype=torch.float16)
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
_ = base.to("cuda")

refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
_ = refiner.to("cuda")

# generate using both models
high_noise_frac = 0.8
prompt = "an astronaut riding a horse"
negative_prompt = "blurry, bad hands"

image = base(
    prompt=prompt,
    negative_prompt=negative_prompt,
    denoising_end=high_noise_frac,
    output_type="latent",
).images

image = refiner(
    prompt=prompt,
    negative_prompt=negative_prompt,
    denoising_start=high_noise_frac,
    image=image,
).images[0]

I booted up a cloud VM with a new midrange L4 GPU (only $0.24 per hour as a Spot instance on Google Cloud Platform) and got to work. With the L4, generating each 1024×1024 image takes about 22 seconds, and unlike previous Stable Diffusion models you can only generate one image at a time on a midrange GPU because generation uses 100% of the GPU, so a little more patience is needed. You can generate lower-resolution images faster, but this is strongly discouraged because the results will be much worse.

diffusers has also added support for two new features that I didn’t experiment with in my previous Stable Diffusion articles: prompt weighting, and Dreambooth LoRA training and inference. Prompt weighting in diffusers relies on the compel Python library for a more mathematical treatment of prompt weights. You can append any number of + or - characters to a given word to increase or decrease its “importance” in the resulting positional text embeddings and therefore in the final generation. You can also wrap phrases: for example, if you prompt for a Salvador Dali landscape of San Francisco, oil on canvas and instead get a photorealistic San Francisco, you can wrap the artistic part of the prompt, e.g. a Salvador Dali landscape of San Francisco (oil on canvas)+++, to make Stable Diffusion behave as expected. In my testing, these weightings eliminate most of the prompt-following problems introduced in Stable Diffusion 2.0 and later, especially at higher classifier-free guidance values (the default guidance_scale is 7.5; I prefer to use 13).
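
For reference, here is a minimal sketch of how this prompt weighting can be wired up with compel for SDXL’s two text encoders, reusing the base pipeline from above. It follows compel’s documented SDXL setup, but the exact arguments may differ between compel versions:

from compel import Compel, ReturnedEmbeddingsType

# SDXL uses two text encoders, so compel needs both tokenizers and encoders
compel = Compel(
    tokenizer=[base.tokenizer, base.tokenizer_2],
    text_encoder=[base.text_encoder, base.text_encoder_2],
    returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED,
    requires_pooled=[False, True],
)

# each "+" upweights the wrapped phrase in the resulting text embeddings
prompt = "a Salvador Dali landscape of San Francisco (oil on canvas)+++"
conditioning, pooled = compel(prompt)

image = base(
    prompt_embeds=conditioning,
    pooled_prompt_embeds=pooled,
    guidance_scale=13,
).images[0]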

All LoRA-generated examples in this article use guidance_scale = 13.

LoRA the Explorer

But what matters most is support for Dreambooth LoRA, which makes it possible to create “custom” Stable Diffusion models. Dreambooth is a method of fine-tuning Stable Diffusion on a very small set of source images and a trigger keyword, allowing the “concept” from those images to be used in other contexts via that keyword.

An example of how Dreambooth works. Source

Fine-tuning Stable Diffusion, even the smaller versions of the model, requires many expensive GPUs training for hours. This is where LoRAs come in: instead, a small adapter for the vision model is trained, which can be done on a single cheap GPU in about 10 minutes, and the quality of the final model + LoRA is comparable to a full fine-tune (colloquially, when people talk about fine-tuning Stable Diffusion, they usually mean creating a LoRA). A trained LoRA is a small, discrete binary file, making it easy to share with others or on repositories such as Civitai. A slight downside of LoRAs is that you can only have one style per LoRA adapter: you can combine multiple LoRAs to get the benefits of all of them, but that’s a delicate science.

Before Stable Diffusion LoRAs became widespread, there was textual inversion, which lets the text encoder learn a concept but takes hours to train, and the results can be clunky. In a previous post, I trained a textual inversion on a comical dataset, Ugly Sonic, since it was not in the original Stable Diffusion dataset and would therefore be unique. The results from the final model were mixed.

Ugly Sonic, but not ugly enough.

I figured training a LoRA on the Ugly Sonic dataset would be a good test of SDXL’s potential. Luckily, Hugging Face provides a train_dreambooth_lora_sdxl.py script for training a LoRA on top of the base SDXL model, and it works out of the box, although I tweaked the parameters a bit. The images of Ugly Sonic generated from the trained LoRA are, to put it mildly, much better and more coherent across a variety of prompts.

Ugly Sonic, but with teeth.

WRONG!

With that success, I decided to repeat another of my textual inversion experiments by training a LoRA on heavily distorted, junk images whose captions were simply wrong, in the hope that the LoRA could be used as a negative condition, steering generation away from such images and producing better ones. I wrote a Jupyter Notebook to generate synthetic “wrong” images using SDXL itself, this time using different prompt weights to get clearer examples of bad image types such as blurry and bad hands. It’s ironic that we have to use SDXL to create low-quality, high-resolution images.
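
As a rough illustration of the idea (the actual prompts and parameters live in the linked Notebook; the ones below are simplified placeholders), the junk-image generation boils down to sampling SDXL with deliberately bad, weighted prompts, reusing the base pipeline and the compel object from the earlier snippets:

import random

# Illustrative placeholder prompts; the real Notebook uses a larger,
# more varied set of "bad image" descriptions and weights.
bad_prompts = [
    "(blurry)++, out of focus, low quality photo",
    "(bad hands)++, (extra fingers)++, distorted portrait of a person",
    "(jpeg artifacts)++, oversaturated, washed out photo",
]

generator = torch.Generator("cuda")
for i in range(100):
    conditioning, pooled = compel(random.choice(bad_prompts))
    image = base(
        prompt_embeds=conditioning,
        pooled_prompt_embeds=pooled,
        generator=generator.manual_seed(i),  # reproducible junk
    ).images[0]
    image.save(f"wrong_{i:03d}.png")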

Examples of synthetic "wrong" images that unintentionally resemble punk rock album covers from the 2000s.

Other examples of synthetic "wrong" images that focus on the uncanny valley aspect of modern AI-generated images: they look normal at first glance, but a growing horror is revealed upon closer inspection. That's why it's important to generate examples at full resolution (1024×1024).

I trained the LoRA and loaded it into the base Stable Diffusion XL model (the refiner doesn’t need a LoRA), then wrote a Jupyter Notebook to compare results for a given prompt across:

  • Base model + Refiner without LoRA (baseline)

  • Pipeline without LoRA, using the word wrong as a negative prompt (to make sure there is no placebo effect)

  • Pipeline with LoRA, using the word wrong as a negative prompt (our target result)

Each generation uses the same seed, so the composition of the images should be the same across all three generations, and the influence of both the wrong negative prompt and the LoRA relative to the baseline should be very obvious, as sketched below.
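
Here is a minimal sketch of that comparison, reusing the base and refiner pipelines from earlier. The LoRA path is a placeholder for the repository linked at the end of this post, and the prompt and seed are just examples:

prompt = "A wolf in Yosemite National Park, chilly nature documentary film photography"
seed = 42
high_noise_frac = 0.8

def generate(negative_prompt=None):
    # a fresh generator per call keeps the composition identical across runs
    generator = torch.Generator("cuda").manual_seed(seed)
    latents = base(
        prompt=prompt,
        negative_prompt=negative_prompt,
        denoising_end=high_noise_frac,
        output_type="latent",
        guidance_scale=13,
        generator=generator,
    ).images
    return refiner(
        prompt=prompt,
        negative_prompt=negative_prompt,
        denoising_start=high_noise_frac,
        image=latents,
        generator=generator,
    ).images[0]

baseline = generate()                          # 1. base + refiner, no LoRA
wrong_neg = generate(negative_prompt="wrong")  # 2. no LoRA, "wrong" negative prompt

# 3. load the wrong LoRA into the base model only (the refiner stays untouched)
base.load_lora_weights("path/to/wrong-lora")   # placeholder path
wrong_lora = generate(negative_prompt="wrong")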

Let’s start with a simple prompt from SDXL 0.9 demos:

A wolf in Yosemite National Park, chilly nature documentary film photography

The wrong negative prompt on the base model adds some foliage and depth to the forest, but the LoRA adds a lot more: sharper lighting and shadows, more detailed foliage, and it changes the perspective so the wolf looks at the camera, which is more interesting.

We can get a different perspective on the wolf with a similar photo composition by adding “extreme closeup” and reusing the same seed.

An extreme close-up of a wolf in Yosemite National Park, chilly nature documentary film photography

In this generation, the LoRA produces much better texture, brightness and sharpness than the others. It’s also notable that simply adding the wrong negative prompt changes the perspective.

Another good test case is food photography, especially weird food photos like these. Can SDXL + the wrong LoRA handle non-Euclidean hamburgers, with some prompt weighting to make sure they’re weird?

a large delicious hamburger (in the shape of five-dimensional alien geometry)++++, professional food photography

The answer is that it can’t, even after a few quick attempts at prompt engineering. Still, this result is interesting: the base SDXL seems to have taken the “alien” part of the prompt more literally than expected (and gave the burger a cute bun hat!), but the LoRA better understands the spirit of the prompt, creating an “alien” burger that humans would have a hard time eating, plus a shinier serving aesthetic.

A notable improvement over Stable Diffusion 2.0 is text legibility. Can SDXL and the wrong LoRA make text even more legible, as on text-heavy newspaper front pages?

lossless PDF scan of the front page of the January 2038 issue of the Wall Street Journal featuring a cover story about (evil robot world domination)++

Text legibility has definitely improved over Stable Diffusion 2.0, though it seems about the same in all three cases. Notably, the LoRA has improved the typesetting on the covers: the page layout is more “modern” with varied article layouts, and the headlines have the correct relative font sizes. Meanwhile, the base model, even with the wrong negative prompt, has a boring layout and is, for some reason, printed on aged yellowish paper.

What about people? Does the wrong LoRA fix the infamous AI problem with hands, especially since we included many such examples in the LoRA training data? Let’s adapt the President Taylor Swift prompt from my first attempt with Stable Diffusion 2.0:

USA President Taylor Swift (signing papers)++++, photo taken by the Associated Press

Look at Taylor’s right hand: in default SDXL it’s extremely unrealistic and actually gets worse when you add the wrong negative prompt, but the LoRA fixes it! The color grading with the LoRA is also much better, with the shirt a more distinct white rather than yellowish white. That said, don’t look too closely at her hands in any of the generations: generating people with SDXL 1.0 is still difficult and unreliable!

It’s now clear that the wrong negative prompt + LoRA is more interesting in every generation than the wrong negative prompt alone, so from here on we’ll just compare the baseline with the LoRA result. Here are some more examples comparing the base model with the wrong LoRA:

realistic human Shrek blogging at a computer workstation, hyperrealistic award-winning photo for vanity fair - Hands are better, lighting is better, the clothes are more detailed, and the background is more interesting.

pepperoni pizza in the shape of a heart, hyperrealistic award-winning professional food photography - The pepperoni is more detailed and has bubbles, there is less pepperoni around the edges, and the crust is more crispy(?)

presidential painting of realistic human Spongebob Squarepants wearing a suit, (oil on canvas)+++++ - SpongeBob has a nose again and more buttons on his suit

San Francisco panorama attacked by (one massive kitten)++++, hyperrealistic award-winning photo by the Associated Press - LoRA is actually trying to follow the prompt.

hyperrealistic death metal album cover featuring edgy moody realistic (human Super Mario)++, edgy and moody - Mario’s proportions are more accurate, and character lighting is sharper and clearer.

The wrong LoRA is available here, although I can’t guarantee its effectiveness in interfaces other than diffusers. All of the Notebooks used to create these images are available in this GitHub repo, including a base SDXL 1.0 + refiner + wrong LoRA Colab Notebook that you can run on a free T4 GPU. And if you want to see the generated images used in this post at higher resolution, you can view them in the post’s source.

What’s wrong with being wrong?

I’m actually not 100% sure what’s going on here. I thought the wrong LoRA would simply improve the quality and sharpness of the generated images, but it turns out the LoRA makes SDXL behave more intelligently and follow the meaning of the prompt more closely. On a technical level, the negative prompt sets the region of latent space where the diffusion process begins; that region is the same for both the base model using the wrong negative prompt and the LoRA using the wrong negative prompt. My intuition is that the LoRA modifies this unwanted region of the vast, high-dimensional latent space to resemble the starting region more, so a normal generation is unlikely to land in it and is therefore improved.

Training SDXL on bad images in order to improve it is, technically, a form of reinforcement learning from human feedback (RLHF): the same technique used to make ChatGPT as powerful as it is. While OpenAI uses reinforcement learning to improve the model from positive user interactions and implicitly reduce negative behaviors, here I use negative signals (i.e., deliberately choosing bad images) to implicitly increase positive behavior. And with Dreambooth LoRA, you need nowhere near as much input data as large language models require.

There is still a lot of room to develop these “negative LoRAs”: my synthetic dataset generation parameters could be significantly improved, and the LoRA could be trained for longer. But so far I am very happy with the results, and I would love to test more negative LoRAs, such as merging them with other LoRAs to see whether that improves them (especially the wrong LoRA + the Ugly Sonic LoRA!).
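
To give a sense of what that merging experiment might look like, here is a hypothetical sketch using the multi-adapter API in newer versions of diffusers (it requires the PEFT backend; the paths, adapter names, and weights are all placeholders):

# assuming a freshly loaded base pipeline
base.load_lora_weights("path/to/wrong-lora", adapter_name="wrong")
base.load_lora_weights("path/to/ugly-sonic-lora", adapter_name="ugly_sonic")

# blend the two adapters; the relative weights would need tuning
base.set_adapters(["wrong", "ugly_sonic"], adapter_weights=[1.0, 0.8])

image = base(
    prompt="Ugly Sonic hiking in Yosemite National Park",
    negative_prompt="wrong",
    guidance_scale=13,
).images[0]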

Believe it or not, this is just the tip of the iceberg. SDXL also now has ControlNet support for strict control over the overall shape and composition of generated images:

Examples of SDXL generation using ControlNet showing the (former) Twitter/X logo.

ControlNets can also be used with LoRAs, but that’s enough material for another article.
