Drawing with CLIP Guided Diffusion HQ

UPD: this article was written before the release of the most interesting material about the neural network


… We decided to publish it anyway, so that readers can compare images generated by domestic and foreign networks. The rest of the text is published unchanged.

I remember a quote from the days of the old Bash:

Can anyone suggest a program for converting books from txt to mp3?
^^^^^ No comment. Why not straight to 3gp or XviD?
So what format are audiobooks in, then?
Or do you think some fool sits and reads them into a microphone?

Well, if we do not demand too much realism from the result, we can say that such “programs” exist today. We are talking, of course, about neural networks that can generate almost any kind of content.

A few years ago, we laughed at psychedelic pictures with eyes or dog faces sticking out of everything. At the beginning of this year, we were amazed by the capabilities of OpenAI’s DALL-E network, which turned rather complex text queries into images worthy of the brush (or stylus) of a professional illustrator. Today, anyone can try networks that work on a similar principle, but for some reason few people know about it. I will try to fix that with this article, in which I describe my experience with the CLIP Guided Diffusion HQ neural network.

Let me say right away that I am no expert in neural networks. For those looking for more details, there is a wonderful piece by Dirac on how the CLIP half of this network works. My article is nothing more than an attempt to popularize an interesting tool, an invitation to creativity and to reflection on whether a robot can turn a piece of canvas into a masterpiece of art… It is written mostly to entertain; the useful part is a set of instructions for creating your own neural network masterpieces in just a couple of clicks.

For those who are not ready to read a long article at the link, I will briefly say that CLIP (Contrastive Language-Image Pre-Training) is a multimodal neural network trained on 400 million text-image pairs from the Internet. It suggests the most likely captions for a given picture, and it also copes well with objects that were not in its training set.
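To make the “most likely caption” idea more concrete, here is a minimal sketch of how such a ranking can work. It assumes the picture and every candidate caption have already been turned into vectors in a shared space. The three-dimensional vectors, the caption texts and the temperature value are made up for illustration; they are not real CLIP outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_captions(image_vec, caption_vecs, temperature=0.01):
    """Turn similarities into probabilities with a softmax over the captions.

    The temperature is a hypothetical value chosen for this toy example.
    """
    sims = {text: cosine(image_vec, vec) for text, vec in caption_vecs.items()}
    top = max(sims.values())  # subtract the maximum for numerical stability
    exps = {text: math.exp((s - top) / temperature) for text, s in sims.items()}
    total = sum(exps.values())
    return {text: e / total for text, e in exps.items()}


# Made-up embeddings: the "image" lies close to the first caption.
image = [0.9, 0.1, 0.2]
captions = {
    "a photo of a dog": [1.0, 0.0, 0.1],
    "a photo of a cat": [0.1, 1.0, 0.0],
    "a photo of a car": [0.0, 0.2, 1.0],
}

probabilities = rank_captions(image, captions)
best_caption = max(probabilities, key=probabilities.get)
print(best_caption)
```

The caption whose vector points in nearly the same direction as the picture's vector gets almost all of the probability; the others get almost none.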

Diffusion HQ is a diffusion image generation model trained on the ImageNet dataset. It gradually removes noise from the “seed” picture, making it clearer and more detailed with every pass. CLIP acts as a kind of critic for Diffusion HQ: it checks how well each intermediate picture matches the input text and nudges the generator in one direction or another. This way, in a few hundred iterations, detailed images emerge even from a completely random set of pixels.
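The interplay between generator and critic can be illustrated with a toy loop. This is not a diffusion model and it does not use CLIP: the “image” is a list of four numbers, the critic is a hypothetical stand-in that simply knows the target, and the step sizes are arbitrary. It only shows the shape of the process: start from noise, and on every iteration nudge the picture where the critic points while the injected noise fades.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equally long lists of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def critic_direction(image, target):
    """Stand-in critic: points from the current image toward the target."""
    return [t - x for x, t in zip(image, target)]


def guided_refinement(target, steps=200, guidance=0.05, seed=0):
    """Refine random noise step by step, following the critic's hints."""
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in target]  # the random "seed" picture
    start = distance(image, target)
    for step in range(steps):
        # Fresh noise shrinks linearly and reaches zero on the last step.
        noise_level = 1.0 - (step + 1) / steps
        direction = critic_direction(image, target)
        image = [
            x + guidance * d + 0.1 * noise_level * rng.gauss(0.0, 1.0)
            for x, d in zip(image, direction)
        ]
    return image, start, distance(image, target)


result, start_error, final_error = guided_refinement([1.0, -0.5, 0.25, 0.0])
print(f"distance to target: {start_error:.3f} at the start, {final_error:.3f} at the end")
```

In the real network the critic's hint comes from comparing the picture with the text query, and the denoising itself is done by a trained model rather than by a fixed formula.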

Typical seed. If only you knew from what rubbish …

The easiest way to try CLIP Guided Diffusion HQ is the Google Colab notebook prepared by Katherine Crowson. A “notebook” in this case is a ready-made environment for executing Python code. In principle, the model can be run on a local computer, but that takes a powerful video card with at least 8 GB of video memory, and not everyone can boast of such a treasure today. So it is easier to use a remote virtual machine.

The steps are as follows:

1. Log in here with a Google account.

2. Click “Connect” in the upper right corner. Most likely, you will see data on the workload of the GPU-based virtual machine.

If you’re unlucky (or if you’ve been abusing Google’s bounty before), you might get a message that only the CPU is available. In this case, only your grandchildren will see the result of the work.

3. Check which GPU you were provided with. To do this, place the cursor on the first block of code and press Ctrl + Enter to execute it. Information about the graphics accelerator of the virtual machine will appear on the screen.

Most likely, you will get a Tesla K80, on which a good-quality picture takes from half an hour to an hour to compute. But if you’re lucky (most often this happens at night), you can get a Tesla T4, which means a speedup of almost an order of magnitude, plus some additional features that I will talk about at the end.

4. Set the parameters in the block “Settings for this run”.

Write your text query in the prompts line, keeping the brackets and quotes. If you want the neural network to start not from scratch but from your own starting image, put its URL in single quotes in place of None here:

init_image = None

For the first try, I recommend the parameters skip_timesteps = 300 and init_scale = 1000.

The number of iterations is set by the timestep_respacing parameter in the Model settings block. The value must be in single quotes.

I would set 500 to start with, especially if you got a weak accelerator. 1000 is needed for maximum elaboration and crystal-clear detail.

5. Press Ctrl + F9 (or Runtime -> Run All). It will take time to install the necessary packages and download the model itself, usually within ten minutes. Next, the actual calculation process will begin, which is displayed by the progress bar below:

After a while, depending on the number of iterations you set, you will get an image of 256 × 256 pixels. That is not much, but those are the limits of the dataset used and of the available hardware. I will explain how to raise the resolution at the end of the article.

6. To create a new image within the same session, you do not need to restart all the code. Place the cursor where the changes were made (Settings for this run or Model settings), and press Ctrl + F10 (Runtime -> Run Below).
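The settings from step 4 can be sketched as plain Python assignments. The parameter names and recommended values come from the steps above; the list form of prompts and the check_settings helper are my own assumptions for illustration and are not part of the notebook. The helper also assumes that timestep_respacing is a plain number in quotes and that init_image is a URL, as described in this article.

```python
# Values suggested for a first run.
prompts = ['Harry Potter by Wassily Kandinsky']  # assumed to be a list of strings
init_image = None            # or a URL in single quotes
skip_timesteps = 300
init_scale = 1000
timestep_respacing = '500'   # number of iterations, as a string


def check_settings(prompts, init_image, skip_timesteps, init_scale, timestep_respacing):
    """Hypothetical sanity check; returns a list of problems (empty if none)."""
    problems = []
    if not (isinstance(prompts, list) and prompts
            and all(isinstance(p, str) and p.strip() for p in prompts)):
        problems.append("prompts must be a non-empty list of non-empty strings")
    if init_image is not None and not (
            isinstance(init_image, str)
            and init_image.startswith(("http://", "https://"))):
        problems.append("init_image must be None or a URL string")
    if not (isinstance(skip_timesteps, int) and skip_timesteps >= 0):
        problems.append("skip_timesteps must be a non-negative integer")
    if not (isinstance(init_scale, (int, float)) and init_scale >= 0):
        problems.append("init_scale must be a non-negative number")
    if not (isinstance(timestep_respacing, str) and timestep_respacing.isdigit()):
        problems.append("timestep_respacing must be a number in quotes, e.g. '500'")
    return problems


print(check_settings(prompts, init_image, skip_timesteps, init_scale, timestep_respacing))
```

A forgotten pair of quotes around the number of iterations is the kind of slip such a check would catch before a long run starts.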

While your pictures are being computed, let’s look at what I got.

For the query “Habr”, the network could not depict anything definite. Apparently, its training sample contained very few pictures with that caption.

I tried to help it by adding more familiar words: “Habr community of IT specialists”. The result resembles the skyscrapers of an Asian metropolis, and on one of them, if you wish, you can read the inscription “Habr”. In general, I noticed that when CLIP Guided Diffusion HQ fails to portray something well, it at least tries to label it.

The network handles abstract landscapes much more confidently. For example, this illustration for the series “A Song of Ice and Fire” could go on the cover of a metal album right now.

Queries based on popular media franchises almost always give good results. Let’s say Mad Max:

At some point, I became curious whether the training sample included Cyrillic texts. I came up with the simplest possible query, one that would make it obvious whether the network understood me or not: “blue triangle”, written in Russian. In short, do not repeat my mistakes!

Now this guy will appear to me in nightmares.

If you write the same text in Latin letters, the result is no better. Hoba!

This means that you can forget about the Russian language.

For the query “steampunk”, we got intricate patterns of brass pipes and valves, a kind of Tsar saxophone.

But in general, the neural network is not on friendly terms with machinery. No matter how many times I asked it to draw an airplane, a steam locomotive or a tank, the result was some kind of nonsense.

Do not expect it to understand exactly what you mean. For example, when I entered the query “Heroes III”, I was sure I would get some variation on screenshots from the game. And I got this:

Well, one can vaguely make out figures in armor and with weapons … Apparently, these are the heroes. Three of them!

Curious pictures come out when the network encounters words with multiple meanings. Take “Red Square”: it is both the famous square in Moscow and simply a square that is red. The neural network could not know which one was wanted, so, just in case, it generated a picture that covers both: a red square, but with the texture of paving stones.

A bit of creepiness for the query “The Hound of the Baskervilles”.

“It’s me, Sir Henry. Help me get out of the dog!”

The queries “mad [french, german, russian] scientist” produced a wonderful triptych:

So we learned that the essential attributes of a mad scientist are a halo around the head and glasses; beyond that, variations are possible.

And here is an attempt to create a cover for the book “Do Androids Dream of Electric Sheep?” The result is an image of two smartphones plugged into the mains (apparently running Android), with sheep shown on their screens.

– What’s the complaint? Androids? Check. Sheep? Check. Electric? Well, they’re obviously not alive! Get out of here, meatbag, you don’t know what you want yourself!

The neural network honestly tries to interpret every word you include in the query, and if it knows something similar, the result can be interesting. There are no formal rules; just write whatever comes to mind. For example, by adding “by” and an artist’s name to the query, you can get “pictures” in that artist’s style.

Harry Potter by Wassily Kandinsky and Robinson Crusoe by Claude Monet

If you need a specific color, you can also specify it – it will most likely work. I will repeat a fragment of the picture from the announcement:

Scientific certainity by Salvador Dali in blue

The rest of the illustrations from there are “Alchemist by Boris Vallejo”, “The thinking ocean of the planet Solaris”, “The Picture of Dorian Gray by Giuseppe Arcimboldo”, “The Lord of the Rings by Arnold Böcklin” and “Extent of impact of deep-sea nodule mining midwater plumes is influenced by sediment loading, turbulence and thresholds” (I fed the network the title of a scientific article).

If you want to finish painting an existing picture (or make it trippier), add its URL to the init_image line. The skip_timesteps parameter determines how far the neural network may stray from the proposed image in its fantasies, and clip_guidance_scale indicates how strictly to adhere to it.
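As a rough illustration of how skip_timesteps relates to the number of iterations, here is a small arithmetic sketch. It rests on my assumption that the skipped steps are simply subtracted from the total set in timestep_respacing; the steps_to_run helper is hypothetical and is not part of the notebook.

```python
def steps_to_run(timestep_respacing, skip_timesteps):
    """Iterations left after skipping, assuming skipped steps are subtracted."""
    total = int(timestep_respacing)  # the notebook expects this value as a string
    if not 0 <= skip_timesteps < total:
        raise ValueError("skip_timesteps must be between 0 and the total number of steps")
    return total - skip_timesteps


for respacing in ("500", "1000"):
    print(respacing, "->", steps_to_run(respacing, 300), "iterations actually computed")
```

Under this assumption, the more steps are skipped, the less work the network does on top of the starting picture.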

Good results are not guaranteed. For example, an attempt to cross the images of the God-Emperor and Vladimir Putin only made the Emperor more displeased.

I promised to tell you how to get pictures larger than 256×256. The first and easiest way is to use another neural network for this. The task of “increasing the resolution while maintaining clarity” is much easier than the task of “drawing a picture from a text description”, so there is plenty to choose from. I suggest reading this review and trying different options. Some services are completely free, others offer a trial period. Some illustrations benefit greatly from upscaling.

Something based on Stalker, I don’t remember the exact request. 4x magnification through deep-image.ai

And, finally, about the additional capabilities that the Tesla T4 accelerator gives. It has 16 GB of memory, which means you can run an advanced version of the same neural network on it, one that immediately produces images of 512 × 512 pixels.
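If you want to check programmatically which accelerator you were given, a sketch like the following may help. It relies on the standard nvidia-smi utility being present on the machine; the SIZE_BY_GPU table only restates what this article says about the K80 and the T4, and both helper functions are hypothetical.

```python
import shutil
import subprocess

# Output sizes of the diffusion notebook per GPU, as described in this article.
SIZE_BY_GPU = {"Tesla K80": 256, "Tesla T4": 512}


def suggested_size(gpu_name, default=256):
    """Return the image size to aim for; unknown GPUs fall back to the default."""
    return SIZE_BY_GPU.get(gpu_name.strip(), default)


def detect_gpu_name():
    """Ask nvidia-smi for the GPU name; return None if it cannot be determined."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    lines = result.stdout.strip().splitlines()
    if result.returncode != 0 or not lines:
        return None
    return lines[0].strip()


gpu = detect_gpu_name()
if gpu is None:
    print("No GPU detected")
else:
    print(f"{gpu}: aim for {suggested_size(gpu)} x {suggested_size(gpu)} pixels")
```

On a machine without an NVIDIA GPU the script simply reports that none was detected.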


As mentioned at the beginning, CLIP is a stand-alone module that can be paired with various generators. For example, interesting results come from combining CLIP with the more classical generative adversarial network VQGAN. This model is not as demanding on resources (even the Tesla K80 can create 512 × 512 images). This is how it interpreted the query about androids and electric sheep:

You can create and look at pictures endlessly, but it’s time to finish the article. I hope that playing with the neural network will not only lift your spirits but also give you new creative ideas. For example, a friend of mine, who writes science fiction in his spare time, is seriously considering illustrating his books with the help of neural networks. Share the interesting pictures you get in the comments!
