Local neural networks (image generation, local chatGPT). Running Stable Diffusion on AMD Graphics Cards

Many have heard of Midjourney, but far fewer people know that Stable Diffusion exists, that it can do even more, and that it runs locally. Those who only tried it online often quickly concluded that it is much worse than Midjourney and not worth further attention. And yes, SD appeared before Midjourney. To run it, a CPU or just 4GB of video memory is enough.

The situation is similar with chatGPT: few people have heard that there are attempts to build a local version that does not require a supercomputer, even though several articles about them have already been published.

on the left is Stable Diffusion, and on the right… also Stable Diffusion

Local image generation (Stable Diffusion on AMD video cards and on CPU)

If you simply install SD and try to generate something, at best you will get something like this:

It doesn’t look anything like Midjourney-level output. The thing is that the default SD model does not understand what exactly you want from it without extra guidance, so achieving acceptable quality requires many keywords (prompts), both positive and negative.

To avoid doing this by hand, there are already fine-tuned models where five fingers, one body and the other things you expect are effectively baked in. Here is an example of the same request, but with the Deliberate model:

Another example: SD 1.5 clearly tries to do something, but it is still far from what was expected.

Local launch

So, running Stable Diffusion locally does not take much: install one of the web interfaces. The most interesting ones are these:

stable-diffusion-webui – https://github.com/AUTOMATIC1111/stable-diffusion-webui

InvokeAI – https://github.com/invoke-ai/InvokeAI

SHARK – https://github.com/nod-ai/SHARK

Each of them has its own distinguishing feature that can make it useful. Installation is similar for all of them and is not complicated at all: just follow the instructions.
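For reference, installing stable-diffusion-webui comes down to roughly the following (a minimal sketch; webui.sh and webui-user.bat are the launch scripts from the repository, and the first run downloads all dependencies automatically):

# clone the repository and launch the web interface
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui
cd stable-diffusion-webui

# on Linux (the first launch installs the dependencies)
./webui.sh

# on Windows, run webui-user.bat from the same folder instead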

A 4GB video card is enough to run it. If there is no GPU, running on the CPU is also possible: generation then takes minutes, and since system RAM is used, much higher resolutions become available, but the time needed becomes unreasonable. Depending on the processor, generating a 512×512 image can take 3-4 minutes on a 6-core CPU.
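For a CPU-only setup, webui can be told to skip the GPU entirely; a sketch, assuming the usual webui launch flags (--skip-torch-cuda-test, --use-cpu, --no-half) behave as documented:

# run webui entirely on the CPU (slow, but works without a video card)
./webui.sh --skip-torch-cuda-test --use-cpu all --precision full --no-half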

After that, find and download a model you like; they are collected on https://civitai.com/

there are plenty to choose from

If you do not want to choose and compare, you can go straight to Deliberate: it is one of the best models and handles both anatomy and fingers well.

All downloaded models must be placed in the models folder: either manually (for webui the path is stable-diffusion-webui\models\Stable-diffusion\) or through the interface for adding models, as in InvokeAI.
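For webui the model can also be fetched straight from the console; a sketch, where the download link and file name are placeholders to be replaced with the actual download link from civitai.com:

# download a checkpoint directly into webui's model folder (the URL is a placeholder)
cd stable-diffusion-webui/models/Stable-diffusion
wget -O deliberate_v2.safetensors "https://civitai.com/api/download/models/<MODEL_ID>"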

The most advanced web interface is arguably stable-diffusion-webui: in addition to its already wide set of built-in features, various useful plugins are available, for example ControlNet, which enables many things that are not possible out of the box.

In combination with the Posex plugin, you can also set character poses.

Webui makes it easy to restore forgotten prompts and settings: there is a panel of buttons under the Generate button for this. Just drag a generated image into the input field (or paste someone else’s prompt, or take the info from the PNG Info tab), click the arrow button, and all the fields are filled in automatically.

SD has a problem with reproducibility of results. Because different video cards use different compute accelerators, an image generated with xformers on nvidia cannot be reproduced exactly on amd, and vice versa. At the same time, if you generate in different web interfaces with the same settings on the same machine, the result is reproduced. Perhaps the cause is something else, but in general the problem exists.

Accelerated generation on AMD graphics cards

Under the hood, SD uses PyTorch, which has the following acceleration options:

On nvidia graphics cards, acceleration is available via CUDA, and on AMD via ROCm (HIP). HIP is a direct analogue of CUDA, but at the moment it only works under Linux. Blender for Windows recently added HIP support, so work on porting HIP to Windows is in progress, but for now Windows users have other, less convenient options: a Vulkan wrapper and DirectML.
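On Linux the ROCm build of PyTorch is installed from a separate package index; a sketch (the rocm version in the URL is an assumption, check pytorch.org for the current one), followed by a quick check that the GPU is visible, since ROCm builds answer through the same torch.cuda API:

# install the ROCm build of PyTorch (verify the current rocm tag on pytorch.org)
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.4.2

# prints True if the GPU is detected
python3 -c "import torch; print(torch.cuda.is_available())"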

Thus, there are the following options:

  1. run on Linux with HIP. For example, stable-diffusion-webui will install everything automatically. Running in docker is supported (it will not work under WSL2).

  2. under Windows there are 2 options:

    • use Vulkan for acceleration – running on Vulkan is currently possible only in SHARK; it works almost at the level of HIP, just a little slower, but SHARK has fewer features and settings.

    • use the fork https://github.com/lshqqytiger/stable-diffusion-webui-directml with acceleration through DirectML. According to reviews, it does not always work stably and sometimes fails with an out-of-memory error where HIP or Vulkan work without problems. This is treated by adding launch flags that slow down generation but reduce memory consumption (a launch sketch follows this list).
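A minimal sketch of trying the DirectML fork (assuming it keeps the same launch scripts as the original webui; the memory-saving flags are the --medvram/--lowvram described below):

# clone the DirectML fork (Windows)
git clone https://github.com/lshqqytiger/stable-diffusion-webui-directml
cd stable-diffusion-webui-directml

# launch; if it fails with out-of-memory, add --medvram or --lowvram
# to COMMANDLINE_ARGS in webui-user.bat first
webui-user.bat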

At the moment, the fastest option for amd video cards is Linux + HIP, and it has some quirks. If things do not start automatically and a message about a missing HIP device is shown, you must specify the HSA version explicitly. Before starting, enter one of these 3 commands in the console (depending on your video card model):

export HSA_OVERRIDE_GFX_VERSION=10.3.0
export HSA_OVERRIDE_GFX_VERSION=9.0.12
export HSA_OVERRIDE_GFX_VERSION=8.3.0

Another way to do the same is to put the variable straight into the launch command, specifying the value you need:

HSA_OVERRIDE_GFX_VERSION=10.3.0 ~/invokeai/invoke.sh

If you run into low-memory problems, or the video card does not support f16, these flags will help (webui only; in other interfaces they look different, if they exist at all):

# If you get an error about half precision:
--precision full --no-half

# If you want to generate higher resolutions with 8GB of memory:
--medvram

# If memory is really scarce, for example 4GB:
--lowvram

Each of these flags noticeably reduces generation speed (--medvram less so than the others). --medvram also helps with training.
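To avoid typing the flags every time, they can be set once in the launch script; a sketch for webui-user.sh on Linux (webui-user.bat has the same COMMANDLINE_ARGS line on Windows; the HSA override is only needed if your card requires it, as described above):

# in webui-user.sh
export HSA_OVERRIDE_GFX_VERSION=10.3.0   # only if your card needs the override
export COMMANDLINE_ARGS="--medvram"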

Local chatGPT (CPU only)

With a local chatGPT, the situation is not as good as with images, but something does work. There are a number of models trained to follow instructions.

To run them locally you will need either llama.cpp (https://github.com/ggerganov/llama.cpp) or alpaca.cpp (https://github.com/antimatter15/alpaca.cpp), plus a suitable model from huggingface. llama.cpp can run models from alpaca, gpt4all and vicuna alike, so you can simply pick it straight away.
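Building llama.cpp takes a couple of commands (a sketch; at the time of writing the project builds with a plain make):

# build llama.cpp from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make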

For example, alpaca 30B: https://huggingface.co/Pi3141/alpaca-lora-30B-ggml/tree/main

Or 13B vicuna: https://huggingface.co/eachadea/ggml-vicuna-13b-4bit/tree/main

13B is the size of the model: 13 billion parameters. Since these are 4-bit quantized versions, they need less memory to run than the parameter count suggests: about 20GB of memory for alpaca 30B and only 4GB for vicuna 13B. A 7B model needs very little memory and can even run on a Raspberry Pi 4.
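Downloading a model means fetching a single ggml file next to the llama.cpp binaries; a sketch using the vicuna file from the link above (the /resolve/main/ path is the usual huggingface direct-download pattern):

# download the 4-bit vicuna 13B weights into the llama.cpp folder
cd llama.cpp
wget https://huggingface.co/eachadea/ggml-vicuna-13b-4bit/resolve/main/ggml-vicuna-13b-4bit.bin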

A series of articles about these models on Habr: https://habr.com/en/users/bugman/posts/

Or, going further, an attempt by enthusiasts from all over the world to build an open analogue of chatGPT: https://habr.com/en/articles/726584/

it understands Russian, but do not expect that always or at a good level; this is still not chatGPT

To run in interactive mode (like chatGPT), use one of these commands:

# for llama.cpp
./main -i --interactive-first -r "### Human:" --temp 0 -c 2048 -n -1 -t 12 --ignore-eos --repeat_penalty 1.2 --instruct -m ggml-vicuna-13b-4bit.bin

# for alpaca.cpp (the number of CPU threads can be set with --threads)
./chat -m ggml-model-q4_0.bin --threads 12

Vicuna 13B also writes text and generates code better than alpaca 30B, thanks to the better approach used to train vicuna. But 30B is 30B: the more parameters, the more “well-read” the bot.

Conclusion

There are more and more neural networks that can be run locally. If you know of other interesting ones, share them in the comments.
