How sampling methods work in diffusion models

The sampling method in generative models such as Stable Diffusion or FLUX determines how random noise is transformed into an image during the diffusion process. This method directly affects the quality, style and speed of image generation.

In the previous article I discussed how CFG Scale works and what it is needed for. I recommend reading it first, since classifier-free guidance will be used here.

1. Why do we need the Sampling method?

▍Sampling method is necessary for several key tasks:

  1. Gradual reduction of noise:

    • The main goal of the sampling method is to gradually reduce noise and refine the structure of the image at each step. Each iteration reduces the noise level and adds detail, bringing the image closer to the target (see the sketch after this list).

  2. Controlling the diffusion process:

    • Sampling methods control the diffusion process by determining how the model should update the image at each step. This allows control over the speed and quality of generation.

  3. Optimizing quality and speed:

    • Different sampling methods trade generation speed against output quality, so choosing a method lets you pick the balance that suits your task.

  4. Stabilization of the process:

    • Sampling methods also help stabilize the generation process to avoid artifacts and unwanted distortions. This is especially important when working with complex text cues or when generating high-quality images.
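To make the idea of gradual denoising concrete, here is a minimal, purely illustrative sketch of a sampler loop in Python (the model call, the toy update rule and the tensor shape are assumptions for illustration, not a real library API):

import torch

def sample(model, steps=30, shape=(1, 4, 64, 64)):
    x = torch.randn(shape)            # start from pure Gaussian noise
    for t in reversed(range(steps)):  # walk from the noisiest step down to 0
        eps = model(x, t)             # the model predicts the noise contained in x
        x = x - eps / steps           # remove a small fraction of it (toy update rule)
    return x                          # the (mostly) denoised latent

Every real sampling method follows this same loop; the methods differ only in how exactly they compute the update at each step.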

2. What the choice of sampling method affects

  1. Image quality:

    • Some sampling methods can produce higher image quality by better preserving detail and improving realism.

  2. Image style:

    • Different methods can produce different image styles. For example, some methods may produce smoother images, while others may produce more detailed images.

  3. Generation speed:

    • Sampling methods can vary in speed. Some methods are faster but may have lower quality, while others are slower but provide better quality.

3. Basic types of sampling methods

  1. DDIM (Denoising Diffusion Implicit Models):

    • Advantages: Fast generation, ability to control style.

    • Disadvantages: May be inferior in quality to slower methods.

  2. PLMS (Pseudo Linear Multistep):

    • Advantages: Good combination of speed and quality, improved noise immunity.

    • Disadvantages: May be more difficult to set up.

  3. DPM‑Solver:

    • Advantages: High image quality, more precise control of the diffusion process.

    • Disadvantages: Higher computational complexity and generation time.

  4. Euler and Euler a (ancestral):

    • Advantages: Fast and stable generation, suitable for a wide range of tasks.

    • Disadvantages: May be inferior in quality to more specialized methods.

  5. LMS (Linear Multistep):

    • Advantages: Good preservation of details and textures.

    • Disadvantages: May be slower than other methods.
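In practice, switching between these methods is just a matter of swapping the sampler in your tool of choice. Below is a hedged example using the Hugging Face diffusers library (the model id and prompt are placeholders; it assumes diffusers is installed and the checkpoint is available):

from diffusers import StableDiffusionPipeline, DDIMScheduler, LMSDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# DDIM: fast, deterministic sampling
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# LMS (linear multistep): often more detail, somewhat slower
# pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)

image = pipe("a castle at sunset, high detail",
             num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("castle.png")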

4. Algorithm of the sampling method using DDIM as an example

In this section, we will consider the interaction process between CFG Scale and the sampling method in order to understand the mechanisms of their operation.

▍Step 1: Initialization

At the beginning of the process, we initialize random noise (x_T), which is generated according to the normal distribution:

x_T \sim \mathcal{N}(0, 1)

where T is the initial moment of time.

Noise is needed so that the model has something to denoise – this is called reverse diffusion and is the basis of modern generative neural networks.
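In code, step 1 could look like the following sketch (the latent shape 1×4×64×64 is an assumption matching a 512×512 Stable Diffusion image):

import torch

generator = torch.Generator().manual_seed(42)            # fixed seed -> reproducible noise
x_T = torch.randn((1, 4, 64, 64), generator=generator)   # x_T ~ N(0, 1)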

▍Step 2: Calculate the resulting state based on CFG Scale

Read more about CFG Scale.

At this stage, the model generates three images: one based on the positive prompt, one based on the negative prompt, and one unconditional image generated without any prompt.

Here the model uses CLIP to translate the prompt into vectors (the language of the neural network) and U-Net for the generation itself.

We apply the CFG Scale formula to obtain the resulting state (x_t) from the unconditional and conditional generations:

x_t = x_{t, \text{unconditional}} + s \cdot (x_{t, \text{positive conditional}} - x_{t, \text{negative conditional}})

Where:

  • (x_{t, \text{unconditional}}) — unconditional generation (the base image).

  • (x_{t, \text{positive conditional}}) — conditional positive generation (taking into account the positive condition).

  • (x_{t, \text{negative conditional}}) — conditional negative generation (taking into account the negative condition).

  • (s) — a coefficient that regulates how much the positive and negative conditions affect the final state of the image (we enter it manually in the interface).
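Expressed as code, this combination is a single line; the three tensors are assumed to come from the three generation passes described above:

def combine_cfg(x_uncond, x_pos, x_neg, s):
    # x_t = unconditional + s * (positive conditional - negative conditional)
    return x_uncond + s * (x_pos - x_neg)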

▍Step 3: Calculate the modified noise

Based on the result from step 2, the model calculates a modified noise that takes into account the influence of the conditions:

\epsilon_\text{cfg} = \epsilon_\theta(x_t, t) + s \cdot (\epsilon_\theta(x_t, t, c_\text{pos}) - \epsilon_\theta(x_t, t, c_\text{neg}))

In this step, based on the obtained (x_t), the model calculates several predicted noises (\epsilon_\theta(x_t, t), \epsilon_\theta(x_t, t, c_\text{pos}), \epsilon_\theta(x_t, t, c_\text{neg})) and then obtains the modified noise \epsilon_\text{cfg}, which is used in the DDIM formula to update the image state.

This step is a kind of “adjustment” based on the existing state of the image and specific conditions. Thanks to this process, the final image becomes more adapted to the conditions that were set.

This step also exists for practical reasons – it organizes the calculations efficiently. It is the values from this third step that are substituted into the sampling method's formula; the previous step only prepares the basis for them.

It's like 1 + 1 = 2: without each of the ones, you can't get the two.

Calculations at this step are also carried out using U-Net.
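A sketch of this step in Python, assuming a generic unet(x, t, text_embedding) call rather than any specific library's signature:

def cfg_noise(unet, x_t, t, emb_uncond, emb_pos, emb_neg, s):
    eps_uncond = unet(x_t, t, emb_uncond)        # unconditional (empty prompt) prediction
    eps_pos = unet(x_t, t, emb_pos)              # epsilon_theta(x_t, t, c_pos)
    eps_neg = unet(x_t, t, emb_neg)              # epsilon_theta(x_t, t, c_neg)
    return eps_uncond + s * (eps_pos - eps_neg)  # epsilon_cfg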

▍Step 4: Applying the sampling method (DDIM)

At this step, the modified noise obtained in step 3 is substituted into the sampling method's formula to correct small details and stabilize the overall generation process. It is here that the sampler decides where to add noise and where to remove it.

Why add noise? To increase detail. Think of your smartphone photos at night – if you remove all the noise, they will be blurry.

The DDIM sampling formula (deterministic variant):

x_{t-1} = \sqrt{\alpha_{t-1}} \cdot \frac{x_t - \sqrt{1 - \alpha_t} \cdot \epsilon_\text{cfg}}{\sqrt{\alpha_t}} + \sqrt{1 - \alpha_{t-1}} \cdot \epsilon_\text{cfg}

Where (x_t) is the current state of the image at step t, (\alpha_t) is the cumulative noise-schedule coefficient at that step, and (\epsilon_\text{cfg}) is the modified noise from step 3.
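Put together, one DDIM update step could be sketched as follows (alpha_t and alpha_prev are assumed to come from the cumulative noise schedule, as values in (0, 1]):

def ddim_step(x_t, eps_cfg, alpha_t, alpha_prev):
    # predict the clean image x_0 from the current noisy state
    x0_pred = (x_t - (1 - alpha_t) ** 0.5 * eps_cfg) / alpha_t ** 0.5
    # move to the previous timestep: rescale x_0 and re-add the matching amount of noise
    return alpha_prev ** 0.5 * x0_pred + (1 - alpha_prev) ** 0.5 * eps_cfg

Repeating this update from t = T down to t = 0 turns the initial noise from step 1 into the final image.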
