Speeding up diffusion several times over? On Imagine Flash, the new model from Meta

Note: in translating the article we omitted some of the mathematical details and paraphrased the mathematical expressions in plain text. Our comments are set in italics so that the text is easier for beginners to read; with the math paraphrased this way, there is no need to put a “difficult” tag on the article.

To keep the article from becoming too long, we include only its main part, the method. What follows is a shortened translation of the research from Meta.

The results of the work, including the specific metrics in the tables, can be found in the original paper.

As in the original, the article is written in the first person.

Summary of the article:

Diffusion models are powerful generative neural networks, but sampling from them is computationally expensive. Modern acceleration methods degrade generation quality, and for complex generations with a very small number of iterations they fail altogether.

In this paper, we (Meta) propose a new distillation approach tailored to diverse, high-quality generation using only one to three “steps”, i.e. iterations.

Our approach includes three key components: backward distillation, which mitigates training-inference divergence by calibrating the student on its own backward trajectory; shifted reconstruction loss, which dynamically adapts knowledge transfer to the current time step; and noise correction, which improves sample quality by removing singularities in noise prediction.

Through experiments, we show that our method outperforms existing approaches in both quantitative metrics and human evaluations.

Remarkably, it achieves performance comparable to the teacher model using only three noise reduction steps.

Introduction

Generative modeling has undergone a significant shift with the advent of diffusion models (DMs).

These models have set new standards in a variety of domains, offering an unprecedented combination of realism and diversity while remaining stable to train.

However, the sequential nature of the noise reduction process is a serious problem.

Sampling from DMs is a time-consuming and expensive process, the execution time of which largely depends on two factors: (i) the neural network evaluation delay at each step and (ii) the total number of denoising steps.

Significant research efforts have been directed towards speeding up the sampling process.

For text-to-image synthesis, the proposed methods cover a wide range of techniques, including higher-order “solvers”, reformulations of the diffusion process that reduce trajectory curvature, and guidance, step, and consistency distillation.

Guidance distillation uses a trained model (the teacher) to transfer knowledge to a student model; here the effect of classifier-free guidance is distilled so that the student reproduces the guided output in a single forward pass, preserving the key characteristics of the original while reducing per-step cost.

The student imitates the teacher's behavior and can reach higher quality or efficiency than with conventional training; unlike the classical setup, additional techniques such as data augmentation can also be used.

Step distillation is an iterative process in which the student learns to reproduce two or more of the teacher's denoising steps in a single step, progressively reducing the number of steps needed at inference.

Consistency distillation trains the student so that its predictions from any point along a diffusion trajectory map to the same final result, which lets a few large steps replace the long iterative process.

These methods have achieved impressive results, reaching very high quality with only about 10 steps. More recently, hybrid methods that combine distillation with an adversarial loss have pushed the frontier to five steps or fewer.

Adversarial loss is a function that measures how well generated data can be distinguished from real data.

While these methods achieve impressive quality for simple queries and uncomplicated styles such as animation, they suffer from quality degradation for photorealistic images, especially on long prompts and descriptions.

A common theme among these methods is the attempt to match the low-step student model to the teacher's complex sampling paths.

Recognizing this as a limitation, we invert the process, proposing a new distillation framework designed to improve the student along its own diffusion paths.

In short, our (Meta) contribution is threefold:

– First, our approach introduces backward distillation, a distillation process designed to calibrate the student model on its own backward trajectory, which reduces the gap between the training and inference distributions. We obtain zero data leakage during training at all time steps.

– Second, we propose a shifted reconstruction loss that dynamically adapts knowledge transfer from the teacher model.

Specifically, the loss is designed to distill global, structural information from the teacher at high time steps, while focusing on fine-grained details and high-frequency information at low time steps; this allows efficient transfer of both global structure and fine detail from the teacher model to the student model.

– Third, we propose noise correction, an inference-time modification that improves sample quality by addressing the singularity present in noise-prediction models at the initial sampling step.

This training-free technique eliminates the reduced contrast and color saturation that typically occur when using a very small number of denoising steps.

By combining these three components, we apply our distillation framework to the base diffusion model, Emu [4], and obtain Imagine Flash, which achieves high-quality generation under extremely low step counts without compromising sample quality or prompt fidelity (see Fig. 2).

Diffusion models, unlike previous generative models (e.g. GANs), approach density estimation and data sampling in an iterative manner, gradually reversing the noise imposing process.

This iterative nature results in multiple queries to the neural network core, resulting in high inference costs.

As a result, a significant amount of work has focused on developing faster and more efficient methods for sampling from diffusion models.

However, improving output speed without compromising image quality and text matching accuracy remains a significant challenge.

Other studies

Early approaches focused on developing better solvers for the underlying dynamics of the diffusion process. In this direction, several works propose exponential integrators, higher-order solvers, and specialized methods for specific models.

Other studies consider reformulations of the diffusion process to minimize curvature in both forward (noise) and backward (noise-cancelling) trajectories.

In short, these approaches aim to linearize the inference path, allowing larger steps to be used and hence reducing the number of steps in the inference stage.

Curvature correction in diffusion models is about improving the quality of data generation by taking into account the geometry of the data space.

For diffusions that gradually transform a simple distribution (e.g. normal) into a complex target distribution, curvature reflects deviations of trajectories from a linear or flat path in a high-dimensional space.

If curvature is not taken into account, models may distort the data, resulting in poor quality of the generated samples.

Curvature correction allows diffusion models to take into account the true geometry of the data space, resulting in a more accurate reconstruction of the target distribution.

This improves the realism and quality of the generated data, making it more believable and closer to the original.

Despite the significant reduction in steps achieved by these “curve straightening” techniques, there is a limit to how large an inference step can be taken without degrading image quality.

Here the authors emphasize that these researchers focused on modifying the algorithms and mechanics of classical diffusion rather than replacing the standard diffusion approach altogether.

Model size reduction: A number of works aim to reduce the cost of a single step. In this vein, several studies focus on smaller architectures and on “mobile” network variants.

Reducing the cost per step is also addressed by minimizing the cost of conditional generation at each iteration or by caching intermediate activations in the network backbone.

In this context, guidance distillation has been proposed, along with a training-free alternative consisting of guidance truncation. Reducing the per-step latency leads to a significant increase in inference speed.
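To make the per-step cost of conditional generation concrete, here is a minimal sketch of classifier-free guidance, the mechanism these methods target; `model` is a hypothetical noise-prediction network and the guidance scale is an arbitrary example value. Each guided step needs two network evaluations, which is exactly the overhead that guidance distillation (or truncation) tries to remove.

```python
def cfg_denoise(model, x_t, t, cond, uncond, guidance_scale=7.5):
    """One classifier-free-guidance step: two forward passes per iteration."""
    eps_cond = model(x_t, t, cond)      # conditional noise prediction
    eps_uncond = model(x_t, t, uncond)  # unconditional noise prediction
    # Push the estimate away from the unconditional prediction.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```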

However, to truly scale inference for real-time applications, these advances must be coupled with further reductions in the number of steps to a small number of single digits.

Here the authors emphasize that shrinking the model, down to very small “mobile” versions, also ends up increasing the network's inference speed.

Reducing the number of sampling steps: Another way to further reduce inference latency is step distillation.

In these works, the authors propose a progressive approach to distilling two or more steps into one.

Although these approaches achieve a substantial step reduction, significant quality degradation is observed at small step counts.

To compensate for the loss of quality, another line of research proposes additional training improvements during distillation.

Namely, ADD, Lightning and UFOGen add adversarial losses to improve sample quality.

While the above distillation methods certainly produce impressive results using just one generation step, they remain insufficient for many practical applications, such as photorealistic image generation and generation from long prompts.

The smart approach is to manage the trade-off between quality and speed.

In practice, this translates into methods that allow a small increase in steps (from 2 to 4 steps) with a significant improvement in quality.

We apply this approach in our method: to achieve better quality, we propose distilling along the student's reverse (denoising) path rather than along the forward path.

In other words, instead of the student imitating the teacher, we use the teacher to improve the student based on its current knowledge.

We found that this approach produces competitive results with one inference step and significantly improves quality and accuracy with only a small increase to three steps.

Methodology and Imagine Flash

We present Imagine Flash, a novel distillation technique designed for fast text-to-image generation, built on, but not limited to, the Emu model.

Unlike the original Emu model, which requires at least 50 neural function evaluations (NFEs) to produce high-quality images, Imagine Flash achieves comparable results in just a few such evaluations.

Emu is Meta's base text-to-image model: a latent diffusion model pre-trained on a large image-text corpus and then quality-tuned on a small, carefully curated set of highly aesthetic images, which is what makes its generations so photorealistic. Imagine Flash uses it as the teacher to distill from.

The proposed distillation method includes three key components: backward distillation, a distillation procedure that ensures zero data leakage during training at all time steps t.

Shifted reconstruction loss (SRL), an adaptive loss function designed to maximize knowledge transfer from the teacher.

Noise correction, a training-free inference-time modification that improves the quality of samples generated by few-step methods trained in noise-prediction mode.

In the following, we assume access to a pre-trained diffusion model that predicts noise estimates.

This teacher model can work in both image space and latent space.

Background: diffusion models and backward distillation

Here the authors describe how diffusion and distillation work. We will not set this part in italics, although we have again replaced the mathematical formulas with natural language to simplify the presentation.

Diffusion models consist of two interrelated processes: forward and backward. The forward diffusion process gradually corrupts the data by interpolating between data samples and random noise.

Diffusion models in machine learning operate by gradually transforming data through the process of adding noise and then restoring the original data.

Mathematically, this is represented as a sequence of steps in which noise is added to the original data, which gradually increases its entropy, bringing the distribution of the data closer to a simple distribution, such as a normal distribution.
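As a rough illustration of this forward noising step, here is a minimal sketch in PyTorch; the linear schedule and tensor shapes are illustrative examples, not those used by Emu.

```python
import torch

def forward_noise(x0, t, alphas_cumprod):
    """DDPM-style forward noising: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = torch.randn_like(x0)                   # eps ~ N(0, I)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps, eps

# Illustrative linear schedule: the larger t is, the lower the signal-to-noise ratio.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
x0 = torch.randn(1, 4, 64, 64)                   # stand-in for a latent image
x_t, eps = forward_noise(x0, 500, alphas_cumprod)
```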

The model is then trained to solve the inverse problem: starting with highly noisy data, step by step reducing the noise, restoring the original data.

This process is modeled using a stochastic differential equation that describes how the noise evolves over time and how this evolution can be reversed to obtain the original data.

SDE is the stochastic differential equation that describes the evolution of noise over time.

As a result, the model learns to efficiently transition from a simple distribution to a complex target distribution, allowing it to generate new data that matches the distribution of the original data.

The distillation process in machine learning, on the other hand, is a method of compressing knowledge from a more complex model, called the teacher (or mentor), into a simpler model, called the student.

Mathematically, this is accomplished by minimizing the difference between the outputs of the mentor and the student.

The mentor generates predictions or probability distributions that the student attempts to reproduce.

During the learning process, the student uses these predictions as additional information, which helps it capture the structure of the data better than if it were learning only from the original data labels.

As a result, the student, being a simpler model, can achieve performance close to that of the teacher, but at a lower computational cost.

Distillation thus allows complex knowledge and ideas to be transferred from a large model to a smaller one while preserving the important characteristics and performance of the model.
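A minimal sketch of this classical (forward) distillation setup might look as follows; `student` and `teacher` are hypothetical noise-prediction networks, and this is the baseline that Imagine Flash later inverts, not the paper's own objective.

```python
import torch
import torch.nn.functional as F

def forward_distillation_step(student, teacher, x_t, t, cond, optimizer):
    """Classical distillation: the student regresses the frozen teacher's output
    on the same noisy input x_t."""
    with torch.no_grad():
        target = teacher(x_t, t, cond)     # teacher's prediction, no gradients
    loss = F.mse_loss(student(x_t, t, cond), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```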

Conversely, the backward diffusion process is designed to undo the noising process and generate samples.

According to Anderson's theorem, the forward SDE introduced earlier admits a corresponding reverse-time diffusion equation, which can be derived using the Fokker-Planck equation.

Anderson's theorem says that if we have a forward SDE describing this process, then there is a corresponding reverse-time diffusion equation.

This means that one can formulate a process that reverses the forward diffusion – that is, given how the data is noisy, one can describe how that noisy data can be reconstructed in the opposite direction, removing the noise and getting closer to the original state.

This inverse process is described by another SDE, which, under certain conditions, can be found and used to train a model that reconstructs the data.
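For reference, the reverse-time SDE that this passage paraphrases can be written out explicitly; this is the standard form of Anderson's result, where f and g are the drift and diffusion coefficients of the forward SDE:

```latex
% Forward SDE:       dx = f(x, t)\,dt + g(t)\,dw
% Reverse-time SDE:  runs from t = T back to t = 0
dx = \bigl[\, f(x, t) - g(t)^{2}\, \nabla_x \log p_t(x) \,\bigr]\,dt + g(t)\, d\bar{w}
```

Here the score term (the gradient of the log-density of the noisy data) is exactly what the denoising network learns to approximate, and the Wiener process runs backwards in time.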

Now, as for the Fokker-Planck equation: this equation describes the evolution of the probability density of the system's state over time.

In the context of diffusion processes, it defines how the probability distribution of data changes over time under the influence of noise (in the forward process) or how this distribution must change if we aim to reconstruct the data (in the backward process).

The Fokker-Planck equation is related to the SDE through the relationship between the trajectory of an individual state (described by the SDE) and the general behavior of the entire probability distribution (described by the Fokker-Planck equation).

So, when it is said that “the forward SDE satisfies the diffusion equation in reverse time”, it means that there is an inverse process that can be described by the SDE, and which formally corresponds to the backward Fokker-Planck equation.

That is, knowing how the distribution changes in the forward process (adding noise), we can describe and model how this distribution should change in the reverse process (removing noise).

As an alternative, the authors propose backward distillation.

In terms of sample quality, Imagine Flash achieves comparable results with just a few NFEs.

The distillation method includes three new key components: backward distillation, a process that guarantees no data leakage at any time step t, and shifted reconstruction loss (SRL), an adaptive loss function designed to maximize knowledge transfer from the teacher model; the third component, noise correction, is discussed below.

Shifted Reconstruction Loss (SRL) is a loss formulation designed to optimize the process of knowledge distillation between models.

In the context of generative models such as diffusion models, SRL tunes the loss function to better suit the goal of transferring knowledge from a more complex model (the mentor) to a less complex model (the student).

This shift allows the output of the student model to be adaptively adjusted so that it more accurately reflects the predictions of the teacher model. As a result, SRL makes training the student more efficient, improving reconstruction quality and speeding up generation.

The SRL process involves shifting the standard loss function to account for the specifics of the problem and the characteristics of the data generation.

Here, however, SRL is applied within the backward distillation setup, which is what makes the overall process a reverse one.

It is widely recognized that traditional noise schedules often fail to reach a zero signal-to-noise ratio (SNR) at the final step, which creates a mismatch between training and inference.

In particular, the noise schedule is usually chosen such that the final state is not pure noise but still contains low-frequency information leaked from the original data.

This mismatch results in poor performance during inference, especially if only a few steps are used.

To address this issue, some researchers propose redesigning existing noise schedules to enforce a zero terminal SNR.

However, we argue that this solution is insufficient, since information leakage occurs not only at the end of the process, but also at all time steps during the forward diffusion process.
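As a quick numeric illustration of this point, the snippet below computes the terminal SNR of a common linear noise schedule (an illustrative choice, not necessarily the schedule Emu uses): it comes out small but clearly non-zero, so the final latent still carries a trace of the original image.

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)            # common linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

a_bar_T = alphas_cumprod[-1].item()
snr_T = a_bar_T / (1.0 - a_bar_T)                    # SNR(t) = a_bar_t / (1 - a_bar_t)
print(f"alpha_bar_T = {a_bar_T:.2e}, terminal SNR = {snr_T:.2e}")
# Small but non-zero: x_T is not pure noise and leaks low-frequency signal.
```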

During training, the model learns based on information from the true signal, which results in errors being preserved in subsequent steps.

The closer to the beginning of the process, the more information about the signal is stored and the more difficult it is to correct errors.

To address this issue, we propose backward distillation, which ensures signal consistency between training and inference at all time steps.

Instead of starting training with a forward noisy latent code, we first iterate backwards on the student model to obtain an intermediate state, and then use this state as input to train both the student and teacher models.

This avoids dependence on the true signal during training, improving the agreement between training and inference.
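A simplified sketch of this idea might look as follows; the DDIM-style update, the step schedule, and the `student(x, t, cond)` noise-prediction callable are all illustrative assumptions, not the paper's exact procedure.

```python
import torch

def ddim_step(model, x, t_cur, t_next, cond, alphas_cumprod):
    """One deterministic DDIM-style update with a noise-prediction model."""
    a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
    eps = model(x, t_cur, cond)
    x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()   # implied clean latent
    return a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps

@torch.no_grad()
def backward_to(student, x_T, t_target, schedule, cond, alphas_cumprod):
    """Run the student's own few-step reverse process from pure noise x_T down to
    t_target.  The returned latent is then used as the training input for both the
    student and the teacher, so no ground-truth signal ever leaks into training."""
    x = x_T
    for t_cur, t_next in schedule:        # e.g. [(999, 666), (666, 333), (333, 0)]
        if t_cur <= t_target:
            break
        x = ddim_step(student, x, t_cur, t_next, cond, alphas_cumprod)
    return x
```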

What's going on with SRL?

When generating images via backward diffusion, the early steps of generation (large t, closest to pure noise) play a key role in forming the overall structure and composition of the image. In contrast, the late steps (small t, closest to the clean image) are important for adding fine details.

Based on this observation, we developed an improvement to the standard knowledge-distillation loss that helps the student model learn both the structural composition and the fine details of the image, just as the teacher model does. We call this method the shifted reconstruction loss.

In this approach, instead of starting from the current noisy image, we build the target from the student model's own prediction, which is then re-noised to a shifted noise level.

As a result, the gradients are updated in such a way that in the early stages of training the student model learns to preserve the overall structure of the image, and in the later stages it focuses on improving fine details.

Unlike traditional methods, where both the teacher and the student start from the same noisy input, in our shifted method the teacher's starting point for denoising differs from the student's.

The shift function is designed such that for larger values of time (closer to the beginning of generation), the target given by the teacher model is globally similar to the student's output but with improved textual and semantic consistency.

For smaller values of time (closer to the end of generation), the target image contains more fine-grained detail while preserving the overall structure predicted by the student model.

This approach helps the student focus on forming the structure of the image in the early stages and creating finer details in the later stages.
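Putting the description above into code, a sketch of such a shifted reconstruction loss could look like this; the shift function, the x0-conversion helper, and the plain MSE are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def predict_x0(model, x, t, cond, alphas_cumprod):
    """Convert a noise prediction into a clean-latent (x0) estimate."""
    a = alphas_cumprod[t]
    return (x - (1 - a).sqrt() * model(x, t, cond)) / a.sqrt()

def shifted_reconstruction_loss(student, teacher, x_t, t, cond, shift_fn, alphas_cumprod):
    """The student predicts x0 from its own state x_t; that prediction is re-noised
    to a shifted time step s = shift_fn(t) and denoised by the frozen teacher to
    build the target.  At large t the teacher mainly corrects global structure,
    at small t it mainly adds fine detail."""
    x0_student = predict_x0(student, x_t, t, cond, alphas_cumprod)
    with torch.no_grad():
        s = shift_fn(t)                                  # shifted time step
        a_s = alphas_cumprod[s]
        noise = torch.randn_like(x0_student)
        x_s = a_s.sqrt() * x0_student + (1 - a_s).sqrt() * noise   # re-noise the prediction
        x0_target = predict_x0(teacher, x_s, s, cond, alphas_cumprod)
    return F.mse_loss(x0_student, x0_target)
```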

Noise correction

Most modern diffusion models are trained in noise-prediction mode: the model's task is to separate the noise from the signal in a randomly corrupted image. However, image generation starts from a point where the image is pure noise.

At this point there is no signal, so noise prediction becomes trivial and uninformative for image generation. To work around this, some existing methods modify the noise schedule so that the first update is more informative, or switch to velocity prediction.

However, switching to velocity prediction requires additional training effort. For this reason, modern methods that target a minimal number of steps keep the noise-prediction mode but compute losses on the clean-image estimates derived from the model's output.

This circumvents the problem of triviality of noise prediction at the initial stage, but may introduce biases into the first step of the update.

When a model is trained to predict noise, its signal estimate at this first step is derived from its own noise prediction, which distorts the result. As a consequence, the denoising process can accumulate errors, since the model is not merely asked to predict noise but must account for its current estimates.

We propose a simple solution that does not require additional training, which allows noise prediction models to be used without this bias.

We treat the initial step as a special case and replace the model's predicted noise with the true noise when forming the update. This small change noticeably improves color rendition, making images brighter and more saturated.

This effect is especially noticeable when the number of steps in the generation process is small. We explore the impact of this noise correction in more detail in the following sections and provide examples of improvements in the application.
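A training-free sketch of this correction, under the same illustrative DDIM-style update as before, might look like this; `model_x0` is a hypothetical network returning a clean-latent estimate, and this is one interpretation of the described trick rather than the paper's exact formula.

```python
def first_step_with_noise_correction(model_x0, x_T, t_first, t_next, cond, alphas_cumprod):
    """At the very first sampling step x_T is pure noise, so instead of deriving a
    noise term from the model's own prediction, the update reuses the known initial
    noise (x_T itself).  Later steps proceed as usual."""
    a_next = alphas_cumprod[t_next]
    x0_pred = model_x0(x_T, t_first, cond)   # model's clean-image estimate
    eps_true = x_T                           # at t = T the latent *is* the noise
    # A standard step would back-compute eps from x0_pred; here we use eps_true.
    return a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps_true
```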

Conclusion

We introduced Imagine Flash, a novel distillation framework that enables high-quality image generation in just a few steps using diffusion models.

Our approach includes three key components: Backward Distillation, which reduces the discrepancy between training and inference, Shifted Reconstruction Loss (SRL), which dynamically adapts knowledge transfer at each time step, and Noise Correction, which improves the quality of the initial sample.

Through extensive experimentation, Imagine Flash demonstrates outstanding results comparable to the performance of a pre-trained teacher model using only three denoising steps, and consistently outperforms existing methods.

The original includes several tables:

Comparison of methods on key benchmarks.

Ablation of the individual components of Imagine Flash (most of the gain comes from backward distillation and SRL).

Comparison of Imagine Flash with SOTA methods in this area.

This unprecedented sampling efficiency, combined with the high quality and diversity of samples, makes our model ideally suited for real-time generation applications. Thus, Meta has moved ahead of other corporations, at least in terms of speed.
