Training a Stable Diffusion Model with Textual Inversion Using Diffusers

While browsing the Internet for interesting work at the intersection of neural networks and art, I came across a post on Twitter in which Suraj Patil announced that the Stable Diffusion model can be taught new concepts via textual inversion using only 3-5 images.

The news spread very quickly through the English-speaking community (although not everyone there knows what it actually is), but in the Russian-speaking community there was not a single mention of it for several days. So I decided to write about it and share code that you can test yourself. Some common questions are answered at the end of the article.

What is textual inversion?

I will try to explain it in plain terms.

Textual inversion lets you teach the model a new “concept” and associate it with a “word” without changing the model’s weights; instead, only the text embedding vector for that word is fine-tuned.

So, for example, say I added a bunch of photos of my plush penguin. I can then ask the model to create a drawing of that particular plush penguin, or to generate a photo of it sitting on top of a mountain. The goal is to let you import a specific idea or concept from photographs or images the model never generated and teach it to represent them. It can also be used to capture styles: upload a bunch of drawings by a certain artist, and you can then ask the model to generate images in that particular style.

Using textual inversion

Now I will show how you can train Stable Diffusion with textual inversion. The code can also be found in the diffusers GitHub repository, or you can use this colab.

Before starting, I advise you to register on HuggingFace and get an access token with the “write” permission. Now you can proceed.

In the colab, the code is organized into cells that can be run one by one, but I will go over the main points that most often cause problems.

  • First you need to install the necessary libraries and log in to HuggingFace; this is what the access token is for. Signing in to HuggingFace will let you save your trained model and, if you wish, share it with the Stable Diffusion concepts library. A minimal login sketch is shown below.
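If you are running the notebook yourself, the login step boils down to something like this (a minimal sketch using huggingface_hub; the exact install cell in the colab may list slightly different packages):

# Libraries the textual inversion notebook relies on (versions are illustrative):
#   pip install diffusers transformers accelerate huggingface_hub

from huggingface_hub import notebook_login

# Opens a prompt where you paste the "write" access token from your HuggingFace settings
notebook_login()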

  • In the section “Settings for teaching your new concept” you need to select the checkpoint that all training will be based on. By default it is “CompVis/stable-diffusion-v1-4”, but you can specify any other. A rough sketch of how that checkpoint is used follows below.
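Under the hood, the notebook loads the individual components of that checkpoint; roughly, it looks like this (a sketch following the official colab, so variable names in your copy may differ):

from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel

# The checkpoint that all training will be based on
pretrained_model_name_or_path = "CompVis/stable-diffusion-v1-4"

# Stable Diffusion components that textual inversion reuses as-is
tokenizer = CLIPTokenizer.from_pretrained(pretrained_model_name_or_path, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(pretrained_model_name_or_path, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(pretrained_model_name_or_path, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(pretrained_model_name_or_path, subfolder="unet")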

  • Next comes work with the dataset. The official textual inversion colab lets you use direct links to images, but for my project of reproducing the style of the artist Ilya Kuvshinov I used Google Drive, where I keep about 1000 images of this artist’s work (I used only 30 of them). To do this, I mounted my Google Drive in the colab and copied the folder using the shutil module:

from google.colab import drive
import os
import shutil
from PIL import Image

# Mount Google Drive and copy the folder with the training images into the colab filesystem
drive.mount('/content/drive')
shutil.copytree('/content/drive/My Drive/ForSD/', '/content/kuvshinov')

path = "kuvshinov/"

def load_image(filename):
  return Image.open(path + filename).convert("RGB")

# Load every image from the copied folder and save numbered JPEG copies
# into the folder that the training cells read from
images = [load_image(filename) for filename in os.listdir(path)]
save_path = "./my_concept"
if not os.path.exists(save_path):
  os.mkdir(save_path)
for i, image in enumerate(images):
  image.save(f"{save_path}/{i}.jpeg")
  • Now we can move on to setting up the model. In the “what_to_teach” field you choose either an object (to teach the model a new object) or a style (as in my case, the style of an artist’s images). In “placeholder_token” you must specify the name by which you will later refer to the concept in prompts. In “initializer_token” you need to specify a single word that describes what your concept is about; in my case it was kuvshinov. A sketch of these settings follows below.
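For reference, those settings boil down to three variables, roughly like this (a sketch; the placeholder token below is only an example, use whatever name you chose):

# "style" because I am teaching the model an artist's style; use "object" for a specific thing
what_to_teach = "style"

# The made-up token you will later type in your prompts (example value, not the one from my run)
placeholder_token = "<my-style>"

# A single existing word that roughly describes the concept
initializer_token = "kuvshinov"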

  • Now we can start training the model. To do this, simply run the cells in the section “Teach the model a new concept (fine-tuning with textual inversion)”. You can also see what parameters are used there and, if you are interested, even edit them. At this step you may hit an error if your initializer_token is not a single token. Most often this error is thrown by one and the same cell, so you can comment out the offending check as I did. The cell is named “Get token ids for our placeholder and initializer token. This code block will complain if initializer string is not a single token”. After that, you can continue to run the code:

# Encode the initializer word; commenting out the single-token check below means
# only the first sub-token's embedding will be used to initialize the new concept
token_ids = tokenizer.encode(initializer_token, add_special_tokens=False)
#if len(token_ids) > 1:
#    raise ValueError("The initializer token must be a single token.")

initializer_token_id = token_ids[0]
placeholder_token_id = tokenizer.convert_tokens_to_ids(placeholder_token)
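For context: earlier in that section the notebook also adds placeholder_token to the tokenizer as a brand-new token, and right after the cell above it copies the initializer token’s embedding into the new token’s slot. That copy is the heart of textual inversion, and it looks roughly like this (a sketch assuming the tokenizer and text_encoder loaded from the checkpoint earlier):

# The placeholder was added to the vocabulary earlier via tokenizer.add_tokens(placeholder_token),
# so the embedding matrix has to grow by one row to make room for it
text_encoder.resize_token_embeddings(len(tokenizer))

# Start the new token's embedding from the initializer token's embedding;
# training then updates only this single vector, never the model weights
token_embeds = text_encoder.get_input_embeddings().weight.data
token_embeds[placeholder_token_id] = token_embeds[initializer_token_id]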
  • Model training takes roughly 3 hours on average. After that, you can test the resulting model and publish it to the Stable Diffusion concepts library.

  • To test and publish the model, go to the section “Run the code with your newly trained model”. In the cell “Save your newly created concept to the library of concepts” you can specify the name of your concept and choose whether or not to publish the model. If you decide to publish it, then in the hf_token_write line you must specify the access token you created above when signing up for HuggingFace. Be sure to check which permission that token has: there are only two, read and write, and we need a write token.

  • After that, you can run all the remaining cells one by one. In the last cell, in the prompt line, you must provide your text prompt in English and include your placeholder_token. For example: a graffiti on a wall with your concept token on it, as in the sketch below.
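Put together, the final generation cell is roughly the following (a sketch; output_dir is a hypothetical path standing in for wherever the training cells saved the pipeline, and the prompt just reuses the placeholder_token variable from the settings):

import torch
from diffusers import StableDiffusionPipeline

# Hypothetical path: point this at the folder the training cells saved the pipeline to
output_dir = "path/to/your/trained/model"

pipe = StableDiffusionPipeline.from_pretrained(output_dir, torch_dtype=torch.float16).to("cuda")

# The prompt must contain your placeholder_token so the new concept is actually used
prompt = f"a graffiti on a wall with a {placeholder_token} on it"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("result.png")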

Questions and answers

  1. How many resources are required to train the model?

    To train my model, I used 30 images at 1024×1024, but you can also use fewer. The developers of this technique chose 3 to 5 images for their tests.

  2. Does the training speed depend on the size of the dataset?

    No, it doesn’t. I tried both 5 and 30 images, and on average the training speed remained the same.

  3. “To save this concept for reuse, download the learned_embeds.bin file or save it to the concept library.” Does this mean I can use this on my local Stable Diffusion install that I’m already running? How do I do this, where does this .bin file go, and how do I tell the program to use it?

    Yes, you can run your model on your local computer, but make sure you have enough video memory: at the heart of all this is the Stable Diffusion model, which requires about 8 GB of VRAM. To see how it works, use this colab; a rough sketch of loading learned_embeds.bin locally is also shown after this Q&A list.

  4. Can this be used to add more/new pop-culture material, both to improve results for things that need more data in the existing set and to add things that aren’t in the dataset at all, so that SD can generate the specified things?

    Yes, it can.
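As a follow-up to question 3, loading a downloaded learned_embeds.bin into your own pipeline looks roughly like this (a sketch; the file path is a placeholder, and the checkpoint must match the one the concept was trained on):

import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "CompVis/stable-diffusion-v1-4"
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

# learned_embeds.bin maps your placeholder token to its trained embedding vector
learned_embeds = torch.load("learned_embeds.bin", map_location="cpu")
trained_token, embedding = next(iter(learned_embeds.items()))

# Register the token and copy the trained vector into the text encoder
tokenizer.add_tokens(trained_token)
text_encoder.resize_token_embeddings(len(tokenizer))
token_id = tokenizer.convert_tokens_to_ids(trained_token)
text_encoder.get_input_embeddings().weight.data[token_id] = embedding.to(
    text_encoder.get_input_embeddings().weight.dtype
)

# Build a pipeline that uses the patched tokenizer and text encoder
pipe = StableDiffusionPipeline.from_pretrained(
    model_id, text_encoder=text_encoder, tokenizer=tokenizer
).to("cuda")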

Thank you for reading this article. This is my first post, so please don’t judge too harshly. If you want to read more about this, check out the diffusers GitHub repository and the notebook. You can also ask me questions; I am not a qualified specialist and cannot tell you everything exactly, but I will help where I can.
