Continuation of the article about CFG Scale

This chapter covers everything needed to understand the mechanics of CFG Scale, along with the pros and cons of the approach, illustrated with mathematical examples. I decided to collect all the most important material in one place, without any fluff.

In the previous part, I showed what CFG Scale is using illustrations, comparisons, examples and metaphors. In this part, I will take a slightly different approach and walk through the internal mathematics, which turns out to be quite simple.

Definition

CFG Scale (Classifier-Free Guidance Scale) is a method developed to improve the quality of images generated by diffusion models such as DALL-E 2 and others. The method was proposed by Jonathan Ho and Tim Salimans of Google Research in the paper “Classifier-Free Diffusion Guidance”.

The method replaced the explicit classifiers used in conditional GAN models.

The method got its name, Classifier-Free Guidance, because it does away with a traditional explicit classifier, integrating the guidance directly into the generation process.

The higher the CFG Scale value, the more strongly the model “extracts” and embeds the classes or features specified in the prompt into the final image.

For example, if a prompt describes “a red car with mountains in the background”, a high CFG Scale will ensure both that the car is bright red and that the background clearly shows the mountains.

Explicit classifiers in GANs

Traditional Generative Adversarial Networks (GAN) models use two main components:

  1. Generator – creates images from random noise.

  2. Discriminator – classifies images as real or generated.

In some GAN variants, such as Conditional GAN (cGAN), an explicit classifier is used to guide the generation process. In cGAN, additional labels or conditions are fed to the generator and discriminator, allowing them to generate images that meet certain criteria or classes.

Pros and cons of explicit classifiers

Pros:

  1. Conditional control:

    • In cGAN, the generator and discriminator work with additional labels or conditions. The generator creates images that match these labels, and the discriminator evaluates how well the generated images match the given conditions, in addition to distinguishing between real and generated images.

  2. Improving accuracy:

    • The discriminator in cGAN acts as a classifier and helps the generator improve its results through feedback. If the discriminator detects that the generated images do not match the given conditions, the generator adjusts its parameters to better match them. This process iteratively improves the accuracy of the generation.

  3. Quality and realism of images:

    • The discriminator in cGAN not only checks for compliance with the conditions, but also evaluates the quality and realism of the images. As a result, the generator creates more detailed and realistic images, which is an important criterion in generation tasks.

Cons:

  1. Complexity: training cGAN can be a complex task requiring balancing between the generator and the discriminator.

  2. Computing power: additional computing resources are required to train and integrate the explicit classifier.

Pros and Cons of the CFG Scale Method

Pros:

  1. Lack of a separate classifier:

    • In traditional GAN models such as cGAN, a separate discriminator (classifier) is used, which requires significant computational resources for training and evaluation. CFG Scale eliminates the need for this additional component by integrating control functions directly into the generative model.

  2. Simplifying the architecture:

    • The simplified architecture of a model using CFG Scale reduces the number of parameters and layers, which lowers the computational load. It also reduces the amount of data processed at each stage of generation.

  3. Generation speed:

    • Eliminating the classification step speeds up the image generation process because the model does not spend additional time evaluating and adjusting based on the output of a separate classifier.

Cons:

  1. Setting: Adjusting settings to achieve optimal quality can be a challenging task.

  2. Flexibility: the model may be less specialized compared to the approach using explicit classifiers.

How CFG Scale Works

  1. Three generation streams:

    • Unconditional flow: The model generates an image without using textual guidance, based only on random noise.

    • Conditional flow: The model generates an image using a text prompt to guide the generation process (positive prompt).

    • In some models, a third stream may also be used – a negative prompt.

  2. Merging Streams:

    • At each step of the diffusion process, both flows (conditional and unconditional) are used to create intermediate images.

    • The results of these two streams are then combined using CFG Scale.
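The steps above can be sketched in a few lines of NumPy. This is a minimal toy illustration, not a real diffusion model: `toy_denoiser` and `cfg_step` are hypothetical names, and the "denoiser" is a stand-in function so the loop is runnable. The combining line follows the common classifier-free guidance form, where the conditional prediction is pushed away from the unconditional one.

```python
import numpy as np

def toy_denoiser(x, prompt_embedding):
    """Hypothetical stand-in for a diffusion model's noise predictor.
    A real model is a neural network; here we just nudge x toward
    the prompt embedding so the example runs."""
    if prompt_embedding is None:                    # unconditional flow
        return x * 0.1
    return x * 0.1 + 0.05 * prompt_embedding       # conditional flow

def cfg_step(x, prompt_embedding, s):
    """One denoising step with classifier-free guidance."""
    eps_uncond = toy_denoiser(x, None)              # unconditional prediction
    eps_cond = toy_denoiser(x, prompt_embedding)    # conditional prediction
    # CFG: move the prediction from unconditional toward conditional,
    # amplified by the scale s
    eps = eps_uncond + s * (eps_cond - eps_uncond)
    return x - eps                                  # crude update step

x = np.random.default_rng(0).normal(size=(2, 2))   # start from random noise
prompt = np.ones((2, 2))                           # toy "prompt embedding"
for _ in range(10):                                # iterative denoising
    x = cfg_step(x, prompt, s=7.5)
```

Note that both predictions come from the *same* model; the only difference is whether the prompt is supplied, which is exactly what makes the method classifier-free.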

CFG Scale Stream Aggregation Mechanism

In this case, the resulting image is formed taking into account all three streams:

  • Unconditional image: x_{t, \text{unconditional}}

  • Conditional image with positive prompt: ( x_{t, \text{positive conditional}} )

  • Conditional image with negative prompt: ( x_{t, \text{negative conditional}} )

The formula for combining might look like this:

x_t = x_{t, \text{unconditional}} + s \cdot (x_{t, \text{positive conditional}} - x_{t, \text{negative conditional}})

where s is the CFG Scale value.
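The formula translates directly into array arithmetic. Here is a minimal sketch (the function name `cfg_combine` is my own, not from any particular library):

```python
import numpy as np

def cfg_combine(uncond, pos, neg, s):
    """x_t = x_uncond + s * (x_pos - x_neg), applied element-wise."""
    return uncond + s * (pos - neg)

# With identical positive and negative streams, guidance cancels out
# and the unconditional image passes through unchanged.
base = np.zeros((2, 2))
cond = np.ones((2, 2))
```

With `pos == neg` the guidance term vanishes, and the larger s is, the further the result moves in the direction pointed to by the positive prompt.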

Example of merging streams

Images in neural networks are represented as multidimensional arrays of numbers that describe pixel intensities or activations at specific layers of the network.

In the CFG method, a linear combination of conditional and unconditional outputs allows textual prompts to be taken into account flexibly and efficiently, improving the quality and accuracy of the generated images. The process requires no multiplication of vectors or pixel-wise products, only simple linear operations such as addition, subtraction, and scaling.

Imagine you have three arrays of numbers representing activations at a particular layer of a neural network:

  • x_{t, \text{unconditional}} = \begin{bmatrix} 0.2 & 0.4 \\ 0.6 & 0.8 \end{bmatrix} – unconditional generation

  • x_{t, \text{positive conditional}} = \begin{bmatrix} 0.3 & 0.5 \\ 0.7 & 0.9 \end{bmatrix} – positive conditional generation

  • x_{t, \text{negative conditional}} = \begin{bmatrix} 0.1 & 0.3 \\ 0.5 & 0.7 \end{bmatrix} – negative conditional generation

If s equals 1, the linear combination is:

x_t = x_{t, \text{unconditional}} + 1 \cdot (x_{t, \text{positive conditional}} - x_{t, \text{negative conditional}})

Let's substitute the values:

x_t = \begin{bmatrix} 0.2 & 0.4 \\ 0.6 & 0.8 \end{bmatrix} + 1 \cdot \left( \begin{bmatrix} 0.3 & 0.5 \\ 0.7 & 0.9 \end{bmatrix} - \begin{bmatrix} 0.1 & 0.3 \\ 0.5 & 0.7 \end{bmatrix} \right)

Let's calculate the difference:

x_t = \begin{bmatrix} 0.2 & 0.4 \\ 0.6 & 0.8 \end{bmatrix} + \begin{bmatrix} 0.2 & 0.2 \\ 0.2 & 0.2 \end{bmatrix}

Let's perform the addition:

x_t = \begin{bmatrix} 0.4 & 0.6 \\ 0.8 & 1.0 \end{bmatrix}

Thus, the resulting image x_t will be:

x_t = \begin{bmatrix} 0.4 & 0.6 \\ 0.8 & 1.0 \end{bmatrix}

In real applications, the matrices representing images or neural network activations are much larger than the tiny 2×2 matrix used in this example.
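The worked example above can be checked with a few lines of NumPy, since the combination is just element-wise arithmetic:

```python
import numpy as np

# The three toy 2x2 activation matrices from the example
uncond = np.array([[0.2, 0.4], [0.6, 0.8]])
pos    = np.array([[0.3, 0.5], [0.7, 0.9]])
neg    = np.array([[0.1, 0.3], [0.5, 0.7]])
s = 1.0

# x_t = x_uncond + s * (x_pos - x_neg)
x_t = uncond + s * (pos - neg)
print(x_t)
# [[0.4 0.6]
#  [0.8 1. ]]
```

The same one-liner works unchanged on real-sized arrays, since NumPy applies the operations element-wise regardless of shape.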

Impact of CFG Scale Values on Quality

Let's say you want to generate an image of a “beautiful landscape”:

  • CFG Scale = 1: The image may be blurry and unclear because the text prompt does not influence the generation process strongly enough.

  • CFG Scale = 15: The image may be too detailed, with artifacts and excessive details that make it less realistic.

  • CFG Scale = 7–10: This is probably the optimal range where the landscape will be detailed and realistic enough without being overly focused on individual elements.
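Reusing the toy matrices from the merging example, a quick sketch shows how a large s drives the combined values far outside the range of the original activations. This is only a simplified illustration of the over-amplification effect, not a faithful model of why real generations degrade:

```python
import numpy as np

uncond = np.array([[0.2, 0.4], [0.6, 0.8]])
pos    = np.array([[0.3, 0.5], [0.7, 0.9]])
neg    = np.array([[0.1, 0.3], [0.5, 0.7]])

# All original activations lie in [0.1, 0.9]; watch the range grow with s
for s in (1, 7, 15):
    x_t = uncond + s * (pos - neg)
    print(f"s={s:2d}  min={x_t.min():.1f}  max={x_t.max():.1f}")
```

At s = 15 every element is pushed above 3.0, far beyond the original range; in a real model, such over-amplified predictions manifest as the oversaturation and artifacts described below.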

Why High CFG Scale Values Ruin Generation

At high CFG Scale values, loss of detail and realism occurs because the model begins to focus too heavily on the text prompt, which can lead to overemphasis of the image characteristics requested in the text. Here are some reasons why this happens:

  1. Over-amplification of signs:

    • High CFG Scale values may cause the model to place too much emphasis on specific aspects mentioned in the text prompt. This can lead to distortions and disproportionate amplification of some characteristics of the image, which reduces its realism.

  2. Noise and artifacts:

    • When a model relies too heavily on the text prompt, it can introduce noise and artifacts as it tries to match every aspect of the prompt, even when those aspects contradict each other or do not fit together naturally.

  3. Loss of common context:

    • Relying too heavily on the text prompt can lead to ignoring the overall structure and context of the image. The model may miss details that make the image coherent and realistic.

  4. Limiting the model's creativity:

    • At high CFG Scale values, the model is limited in its ability to “imagine” and use the internal patterns and regularities it has learned from the data. This can result in less detailed and more formulaic images.

Conclusion

CFG Scale (Classifier‑Free Guidance Scale) improves image generation by diffusion models by integrating text cues without using a separate classifier. This reduces computational costs, simplifies the model architecture, and allows flexible control over the influence of conditions on the final image. The method provides high accuracy and quality, making it useful for a variety of applications that require precise matching of given conditions.

I'll be glad to see you in my Telegram channel, where I write guides on Stable Diffusion and FLUX.

How do you like this dry summary of facts? Or is it better to dilute the information with infographics and pictures?
