How does attention work in neural networks? An article about self-attention and cross-attention

Picture to explain the process

Modern generative neural networks such as Stable Diffusion or FLUX create images from text descriptions using the attention mechanism. This mechanism helps the model both highlight the important pieces of information and link the prompt to the image, so that in the end we get what we asked for.

This process relies on two types of attention: self-attention (internal attention), which determines the relationships within the image, and cross-attention, which matches the text prompt to its visual embodiment.

For example, for the prompt “horse on a rocket”, cross-attention compares the tokens (“horse” and “rocket”) and decides where to place the horse and the rocket, as well as what environment suits them best, for example, Mars.

Self-attention, in turn, helps the model understand how the elements of the image relate to each other: for example, how the horse and the rocket look together and how they fit into the composition. This ensures the logical coherence and integrity of the final result.

※How it works

Let's look at this process with an example. Suppose your prompt is “a horse on a rocket with the inscription 'fast'”.

▍Step 1: Convert text to embeddings using CLIP

At the very first stage, the CLIP model converts the text prompt into embeddings – mathematical representations used during generation. These are representations of words and phrases such as “horse”, “rocket”, “fast”, which are then matched against parts of the image at every generation step.
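
Below is a minimal sketch of this step, assuming the Hugging Face transformers package and the CLIP text encoder used by Stable Diffusion 1.x (openai/clip-vit-large-patch14); other models, such as FLUX, may use different or additional text encoders.

```python
# A minimal sketch of Step 1: turning a prompt into per-token embeddings with CLIP.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a horse on a rocket with the inscription 'fast'"
tokens = tokenizer(prompt, padding="max_length",
                   max_length=tokenizer.model_max_length,
                   return_tensors="pt")

with torch.no_grad():
    # One embedding vector per token; these later become the Keys and Values of cross-attention.
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(text_embeddings.shape)  # (1, 77, 768) for this encoder
```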

▍Step 2: Cross-attention – the text directs the generation

Cross-attention is needed to connect the text prompt with objects in the image and uses three main components (a sketch follows this list):

  • Queries (Q): vectors representing the current state of the image at the generation step (mathematical representation of the image).

  • Keys (K): vectors representing the text embeddings, such as “horse”, “rocket”, “fast”.

  • Values (V): information about the text embeddings that tells the model how the prompt should affect the image.
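
To make this concrete, here is a simplified sketch (not the actual Stable Diffusion source) of where Q, K and V might come from in a cross-attention layer: Q is projected from image features, K and V from the text embeddings. All sizes and layer names are illustrative assumptions.

```python
# Where Q, K, V come from in a cross-attention layer (simplified sketch).
import torch
import torch.nn as nn

d_model, d_k = 320, 64          # illustrative sizes
to_q = nn.Linear(d_model, d_k)  # acts on image features
to_k = nn.Linear(768, d_k)      # acts on CLIP text embeddings
to_v = nn.Linear(768, d_k)

image_features = torch.randn(1, 64 * 64, d_model)  # flattened spatial positions of the latent image
text_embeddings = torch.randn(1, 77, 768)          # output of Step 1

Q = to_q(image_features)   # (1, 4096, 64) — one query per image position
K = to_k(text_embeddings)  # (1, 77, 64)   — one key per prompt token
V = to_v(text_embeddings)  # (1, 77, 64)   — one value per prompt token
```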

▍Attention calculation:

The attention function calculates how important each key is for each query. It does this by taking the dot product of the query and key vectors and dividing by the square root of the key dimension (a technical step that keeps the values from growing too large).

\text{Attention}_{ij} = \frac{Q_i \cdot K_j}{\sqrt{d_k}}

Where:

  • Q_i — queries from the current image (pixels or features),

  • K_j — keys from the text embeddings,

  • d_k — dimension of the key vectors.

Queries (Q_i) and keys (K_j) are essentially vectors – numerical descriptions of various elements of the text and the image. The queries come from the image (e.g. the area where the horse will be) and the keys come from the text (e.g. “rocket”, “fast”, “horse”).

That is, the model analyzes the image and asks the question: “Where should I place the horse?”, after which it looks at the keys and decides where it is best to place it and in what position.

If, for example, the image query at the current step represents the “horse” and the key is “rocket”, then the attention score shows how important the “rocket” is for the current image of the “horse”.

Features are abstract representations of an image that contain information about its key elements. The neural network uses features to learn to recognize objects and their relationships in an image, and cross-attention arranges features into a single composition.
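
In code, this score computation might look like the following sketch; the shapes (4096 image positions, 77 prompt tokens, d_k = 64) are illustrative assumptions.

```python
# The score formula: dot products of queries with keys, scaled by sqrt(d_k).
import torch

Q = torch.randn(1, 4096, 64)   # queries from the image positions
K = torch.randn(1, 77, 64)     # keys from the text embeddings
d_k = Q.shape[-1]

scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (1, 4096, 77): Attention_ij for every (position i, token j) pair
```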

▍Normalization of values:

Then softmax takes the results of the previous step (the numbers indicating degree of importance) and converts them into probabilities – the weights (\alpha_{ij}). The higher a key's attention score, the larger its weight:

\alpha_{ij} = \text{softmax}(\text{Attention}_{ij})

If, for example, for “horse” the attention score for “rocket” turned out to be 2 and for “fast” 1.5, then softmax normalizes these values so that they add up to 1 and reflect the importance of each key.
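
A quick numerical check of this example (the scores 2 and 1.5 are just the ones from the text):

```python
# Softmax over the two raw scores from the example.
import numpy as np

scores = np.array([2.0, 1.5])                      # "rocket", "fast"
weights = np.exp(scores) / np.exp(scores).sum()    # softmax
print(weights)                                     # approximately [0.62, 0.38] — they sum to 1
```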

▍Weighing values:

After softmax has calculated the weights, the model uses them to weight the values (V_j), which contain additional information about the keys. For example, for “rocket” this could be its shape and size, for “fast” – the shape of the inscription on the rocket.

If the weight for “rocket” is (\alpha = 0.6) and for “fast” (\alpha = 0.1), then the “rocket” will have a greater impact on the final result.

The total value (Output_i) is the sum of all these values multiplied by the corresponding weights (here a third key, “space”, with weight 0.3, completes the sum so that the weights add up to 1):

Output_i = 0.6 \cdot V_{\text{rocket}} + 0.1 \cdot V_{\text{fast}} + 0.3 \cdot V_{\text{space}}

This means that each element of the prompt (“rocket”, “fast”) contributes to the result according to its importance. If the “rocket” is important for the horse (its weight (\alpha_{ij}) is larger), then the information about the rocket has a greater influence on the current image.

▍Example:

Let's imagine that we are generating the area with the “horse”, and the keys are “rocket”, “fast”, and “space”. We compute the attention scores and then apply softmax to turn them into normalized weights (\alpha_{ij}):

  • “rocket” is important for the current image of “horse”, and its (\alpha_{ij}) = 0.6

  • “fast” is also important, but less so (\alpha_{ij}) = 0.25

  • “space” is the least important – (\alpha_{ij}) = 0.15

Now the final state of this section of the image will be a combination of information about “rocket”, “fast” and “space”, taking into account their weights:

Output_i = 0.6 \cdot V_{\text{rocket}} + 0.25 \cdot V_{\text{fast}} + 0.15 \cdot V_{\text{space}}

As a result, the “rocket” influences this part of the image more than anything else, and this shows up in the final generation.
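
The same weighted combination written as a small sketch, with made-up value vectors for “rocket”, “fast” and “space” (only the weights come from the example above):

```python
# Weighted sum of the value vectors using the alphas from the example.
import numpy as np

V_rocket = np.array([1.0, 0.0, 0.5])   # illustrative value vectors
V_fast   = np.array([0.0, 1.0, 0.2])
V_space  = np.array([0.3, 0.3, 1.0])

output_i = 0.6 * V_rocket + 0.25 * V_fast + 0.15 * V_space
print(output_i)   # [0.645, 0.295, 0.5] — dominated by the "rocket" vector
```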

▍Step 3: Self-attention – parts of the generation interact with each other

After cross-attention has processed and linked the text with the image, the model uses self-attention to improve the consistency between different parts of the image. For example, the “horse” and the “rocket” must be positioned correctly relative to each other, and the inscription “fast” must be on the rocket itself.

Self-attention works on a similar principle and uses the following components:

  • Queries (Q): vectors representing individual positions (pixels or features) of the image being generated.

  • Keys (K) and Values (V): vectors representing other parts of the same image (to provide context).

Formula for self-attention:

\text{Attention}_{ij} = \frac{Q_i \cdot K_j}{\sqrt{d_k}}

Then softmax calculates weights:

\alpha_{ij} = \text{softmax}(\text{Attention}_{ij})

The weighted values are then summed to adjust the image:

Output_i = \sum_j \alpha_{ij} V_j

Imagine that parts of the image consult with each other and help the model adjust the generation process – this makes the image more coherent and correct in composition. In the process, shadows, highlights and so on take shape.
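
Put together, a bare-bones self-attention step might look like the sketch below; in real layers Q, K and V are produced by separate learned projections, which are omitted here for brevity, and the sizes are illustrative assumptions.

```python
# Self-attention sketch: Q, K and V all come from the same set of image features,
# so every position can "look at" every other position.
import torch
import torch.nn.functional as F

features = torch.randn(1, 4096, 64)           # flattened image positions

Q, K, V = features, features, features        # real layers use separate linear projections
d_k = Q.shape[-1]

scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (1, 4096, 4096): every position vs. every position
alphas = F.softmax(scores, dim=-1)              # weights in each row sum to 1
output = alphas @ V                             # Output_i = sum_j alpha_ij * V_j
```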

This is where the process of identifying significance ends, and other mechanisms are activated, which I discussed in detail in my other articles.

▍Step 4: Apply CFG Scale

Cross-attention and self-attention work together at every step, adjusting the image according to the text and the internal relationships within the image.

After cross-attention and self-attention have done their work, the model applies the CFG Scale coefficient, which lets you control how strongly the positive and negative prompts affect the final generation.
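
A common way this is implemented is classifier-free guidance, sketched below; the tensor shapes and the CFG Scale value of 7.5 are illustrative assumptions.

```python
# Classifier-free guidance sketch: CFG Scale controls how far the prediction
# is pushed away from the unconditional result toward the prompted one.
import torch

noise_cond = torch.randn(1, 4, 64, 64)      # prediction with the text prompt
noise_uncond = torch.randn(1, 4, 64, 64)    # prediction with an empty / negative prompt
cfg_scale = 7.5                             # a typical default value

noise_pred = noise_uncond + cfg_scale * (noise_cond - noise_uncond)
```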

▍Step 5: Sampling method

After attention and CFG Scale have adjusted the image vectors, the model applies a sampling method (for example, DDIM or another sampler).

This process determines how the final noise is gradually removed and the image becomes clear.
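
A conceptual version of such a denoising loop, assuming the diffusers DDIMScheduler and a placeholder predict_noise() standing in for the U-Net (which internally runs the attention and CFG steps described above):

```python
# Conceptual denoising loop: the scheduler removes a little noise at each step.
import torch
from diffusers import DDIMScheduler

def predict_noise(latents, t):
    # Placeholder: a real pipeline would run the U-Net here and apply CFG Scale.
    return torch.randn_like(latents)

scheduler = DDIMScheduler()
scheduler.set_timesteps(50)                      # 50 denoising steps

latents = torch.randn(1, 4, 64, 64)              # start from pure noise
for t in scheduler.timesteps:
    noise_pred = predict_noise(latents, t)
    latents = scheduler.step(noise_pred, t, latents).prev_sample   # one denoising step
```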

Bottom line

The attention mechanism allows the model to go from the general to the specific, i.e. to work on the image like an artist: first determine the composition, the basic forms and the idea, and then relate all the objects to each other so that they form a coherent whole.

Only then do CFG Scale and the sampling method come into play – they are the artist's brush and help to paint where needed, gradually producing a detailed image.

At the same time, attention is computed at every generation step, just like CFG Scale and the sampling method.

Check out my Telegram channel, where I post guides on Stable Diffusion and FLUX, as well as announcements of new articles.
