High-Performance Image Generation with Stable Diffusion in KerasCV

Today we will show how to generate new images from text descriptions using KerasCV and stability.ai's Stable Diffusion model.

Stable Diffusion is a powerful, open-source model for generating images from text descriptions. There are many open-source text-to-image implementations, but KerasCV stands out with several advantages, including XLA (Accelerated Linear Algebra) compilation and mixed-precision support. Together they make very fast generation possible.

Today we’ll walk through KerasCV’s implementation of Stable Diffusion, show you how to use these powerful performance tools, and explore the benefits they provide.

First, let’s install the dependency packages and deal with the modules:

!pip install --upgrade keras-cv
import time
import keras_cv
from tensorflow import keras
import matplotlib.pyplot as plt

Introduction

Unlike most tutorials, where we first explain a topic and then show how to implement it, with text-to-image generation it is easier to show than to tell.

Let’s see just how powerful keras_cv.models.StableDiffusion() is.

Let’s build the model first:

model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)

Then we give it a text prompt, for example “photograph of an astronaut riding a horse”:

images = model.text_to_image("photograph of an astronaut riding a horse", batch_size=3)


def plot_images(images):
    plt.figure(figsize=(20, 20))
    for i in range(len(images)):
        ax = plt.subplot(1, len(images), i + 1)
        plt.imshow(images[i])
        plt.axis("off")


plot_images(images)
25/25 [==============================] - 19s 317ms/step

Just unbelievable!

But that is far from everything this model can do. Let’s try a more complex prompt: “cute magical flying dog, fantasy art, golden color, high quality, highly detailed, elegant, sharp focus, concept art, character concepts, digital painting, mystery, adventure”:

images = model.text_to_image(
    "cute magical flying dog, fantasy art, "
    "golden color, high quality, highly detailed, elegant, sharp focus, "
    "concept art, character concepts, digital painting, mystery, adventure",
    batch_size=3,
)
plot_images(images)
25/25 [==============================] - 8s 316ms/step

The possibilities are seemingly endless (or at least they extend to the boundaries of Stable Diffusion’s latent space).

Wait, how does this even work?

Whatever you might imagine right now, there is nothing magical about Stable Diffusion. It is a kind of “latent diffusion model”. Let’s unpack what that term means. You may already be familiar with super-resolution: it is possible to train a deep learning model to denoise a low-resolution input image and thereby turn it into a high-resolution version. The model cannot magically recover information that is missing from the noisy, low-resolution input; instead, it uses what it learned from its training data to hallucinate the visual details that are most likely given the input. Learn more about super-resolution in the Keras.io tutorials on the topic.
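To make the denoising idea concrete, here is a minimal, hypothetical sketch (not part of the Stable Diffusion code): a tiny convolutional network trained to map noisy images back to their clean originals. All layer choices, shapes, and data here are illustrative assumptions.

import tensorflow as tf
from tensorflow import keras

def build_denoiser():
    # A deliberately tiny convnet that maps a noisy RGB image to a cleaned-up version.
    inputs = keras.Input(shape=(None, None, 3))
    x = keras.layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    x = keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    outputs = keras.layers.Conv2D(3, 3, padding="same")(x)
    return keras.Model(inputs, outputs)

denoiser = build_denoiser()
denoiser.compile(optimizer="adam", loss="mse")

# Training pairs are (noisy image, clean image); random data is used here just to show the flow.
clean = tf.random.uniform((8, 64, 64, 3))
noisy = clean + tf.random.normal(tf.shape(clean), stddev=0.1)
denoiser.fit(noisy, clean, epochs=1, verbose=0)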

Now, if you push this idea to its limit, you may start asking: what if we ran such a model on pure noise? The model would then have to “denoise the noise” and hallucinate a completely new image. Repeat the process many times, and a small patch of noise turns into an increasingly sharp, high-resolution artificial image.

The key idea of latent diffusion was proposed in High-Resolution Image Synthesis with Latent Diffusion Models in 2021. To better understand how diffusion works, you can read the Keras.io tutorial Denoising Diffusion Implicit Models.

To go from latent diffusion to a text-to-image system, you need to add one key feature: the ability to control the generated image via text prompt keywords. This is done through “conditioning”, a classic technique that consists of concatenating a vector representing a piece of text to the noise patch, and then training the model on a dataset of {image: caption} pairs.

This gives rise to the Stable Diffusion architecture, which consists of three parts:

  • a text encoder, which turns the text prompt into a latent vector;
  • a diffusion model, which repeatedly “denoises” a 64×64 latent image patch;
  • a decoder, which turns the final 64×64 latent patch into a 512×512 image.

First, the text prompt is projected into a latent vector space by a language model with frozen weights. That vector is then concatenated with a randomly generated noise patch, which is repeatedly “denoised” by the diffusion model over a series of iterations: 50 by default, and the more iterations, the sharper and nicer the final image.

Finally, the 64×64 latent image is passed through the decoder to render it properly at high resolution.
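To tie the three pieces together, here is a conceptual, self-contained sketch of that pipeline. The three components below are trivial stand-ins, not KerasCV’s actual internals; they exist only so the control flow (encode, iteratively denoise, decode) runs end to end.

import tensorflow as tf

def encode_text(prompt):
    # Stand-in for the text encoder: prompt -> latent "context" vector.
    return tf.random.normal((1, 77, 768))

def denoise_step(latent, context, step):
    # Stand-in for one diffusion (denoising) step, conditioned on the text context.
    return latent - 0.01 * tf.random.normal(tf.shape(latent))

def decode_latent(latent):
    # Stand-in for the decoder: 64x64 latent patch -> 512x512 image.
    return tf.image.resize(latent[..., :3], (512, 512))

def generate(prompt, num_steps=50):
    context = encode_text(prompt)              # 1. text encoder
    latent = tf.random.normal((1, 64, 64, 4))  # start from pure latent noise
    for step in reversed(range(num_steps)):    # 2. repeated denoising iterations
        latent = denoise_step(latent, context, step)
    return decode_latent(latent)               # 3. decoder

print(generate("a hypothetical prompt").shape)  # (1, 512, 512, 3)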

Stable Diffusion Architecture

It is a surprisingly simple system: the Keras implementation fits in four files totaling less than 500 lines of code.

But once trained on billions of images and their captions, this simple system starts to look like magic. As Feynman said about the universe: “It’s not complicated, it’s just a lot of it!”

The perks of KerasCV

With several publicly available implementations of Stable Diffusion, why should you use keras_cv.models.StableDiffusion?

In addition to the simple API, KerasCV’s Stable Diffusion model provides important benefits:

  • execution in graph mode;
  • XLA compilation via jit_compile=True;
  • support for mixed-precision computation.

When these advantages are combined, KerasCV’s Stable Diffusion model runs orders of magnitude faster than naive implementations. This section shows how to enable all of these features and the efficiency gains they yield.

Out of curiosity, we compared the generation time of HuggingFace’s diffusers implementation of Stable Diffusion with KerasCV’s. In both cases the task was to generate 3 images with 50 denoising steps each. Testing was carried out on a Tesla T4 GPU.

The benchmark source code is publicly available on GitHub, and you can rerun it on Colab to reproduce the results.

On a Tesla T4, generation is about 30% faster! Although the improvement on the V100 is less impressive, we generally expect such benchmarks to consistently favor KerasCV across all NVIDIA GPUs.

For completeness, we measured both cold-start and warm-start generation times. Cold-start time includes the one-time cost of creating and compiling the model, so in a production environment, where you reuse the same model instance many times, it is negligible. Nevertheless, we report the cold-start numbers as well.

Although your results from running this tutorial may vary, in our tests the KerasCV implementation of Stable Diffusion was considerably faster than the PyTorch one, largely thanks to XLA compilation.

The gain from each optimization can vary significantly from one hardware configuration to another.

First, let’s benchmark our unoptimized model with the prompt “A cute otter in a rainbow whirlpool holding shells, watercolor”:

benchmark_result = []
start = time.time()
images = model.text_to_image(
    "A cute otter in a rainbow whirlpool holding shells, watercolor",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["Standard", end - start])
plot_images(images)

print(f"Standard model: {(end - start):.2f} seconds")
keras.backend.clear_session()  # Clear session to preserve memory.
25/25 [==============================] - 8s 316ms/step
Standard model: 8.17 seconds

Mixed Precision

“Mixed precision” means performing computation in float16 while storing weights in float32. On modern NVIDIA GPUs, float16 operations are backed by significantly faster kernels than their float32 counterparts.
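As a minimal standalone illustration (separate from the Stable Diffusion model), here is what such a policy means for an ordinary Keras layer: computation runs in float16 while the layer’s weights stay in float32. The layer type and shapes here are arbitrary examples.

import tensorflow as tf
from tensorflow import keras

layer = keras.layers.Dense(8, dtype="mixed_float16")  # per-layer mixed-precision policy
_ = layer(tf.zeros((1, 4)))        # call once so the weights get built
print(layer.compute_dtype)         # float16
print(layer.kernel.dtype)          # float32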

Enabling mixed precision in Keras (and therefore for keras_cv.models.StableDiffusion) is as simple as:

keras.mixed_precision.set_global_policy("mixed_float16")
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA A100-SXM4-40GB, compute capability 8.0

It just works.

model = keras_cv.models.StableDiffusion()

print("Compute dtype:", model.diffusion_model.compute_dtype)
print(
    "Variable dtype:",
    model.diffusion_model.variable_dtype,
)
Compute dtype: float16
Variable dtype: float32

As you can see, the model built above uses mixed-precision computation: we get the speed of float16 operations while keeping the variables in float32 precision.

# Warm up model to run graph tracing before benchmarking.
model.text_to_image("warming up the model", batch_size=3)

start = time.time()
images = model.text_to_image(
    "a cute magical flying dog, fantasy art, "
    "golden color, high quality, highly detailed, elegant, sharp focus, "
    "concept art, character concepts, digital painting, mystery, adventure",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["Mixed Precision", end - start])
plot_images(images)

print(f"Mixed precision model: {(end - start):.2f} seconds")
keras.backend.clear_session()
25/25 [==============================] - 15s 226ms/step
25/25 [==============================] - 6s 226ms/step
Mixed precision model: 6.02 seconds

XLA compilation

TensorFlow ships with the XLA (Accelerated Linear Algebra) compiler built in. keras_cv.models.StableDiffusion supports a jit_compile argument out of the box; setting it to True enables XLA compilation, which brings a significant speedup.
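As a quick, standalone illustration of what XLA compilation does (unrelated to KerasCV’s internals), here is an ordinary TensorFlow function compiled with jit_compile=True; the function and shapes are arbitrary examples.

import tensorflow as tf

@tf.function(jit_compile=True)   # ask TensorFlow to compile the traced graph with XLA
def scaled_matmul(a, b):
    return tf.matmul(a, b) * 0.5

a = tf.random.normal((256, 256))
b = tf.random.normal((256, 256))
_ = scaled_matmul(a, b)  # the first call traces and XLA-compiles; later calls reuse the compiled graph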

Let’s use it on the “avocado armchair” example:

# Set back to the default for benchmarking purposes.
keras.mixed_precision.set_global_policy("float32")

model = keras_cv.models.StableDiffusion(jit_compile=True)
# Before we benchmark the model, we run inference once to make sure the TensorFlow
# graph has already been traced.
images = model.text_to_image("An avocado armchair", batch_size=3)
plot_images(images)
25/25 [==============================] - 36s 245ms/step


Now let’s benchmark our XLA-compiled model:

start = time.time()
images = model.text_to_image(
    "A cute otter in a rainbow whirlpool holding shells, watercolor",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["XLA", end - start])
plot_images(images)

print(f"With XLA: {(end - start):.2f} seconds")
keras.backend.clear_session()
25/25 [==============================] - 6s 245ms/step
With XLA: 6.27 seconds

On an A100 GPU, we get roughly a 2x speedup. Fantastic!

Let’s put it all together

So how do you assemble the most performant (as of September 2022) Stable Diffusion inference pipeline in the world?

With two lines of code:

keras.mixed_precision.set_global_policy("mixed_float16")
model = keras_cv.models.StableDiffusion(jit_compile=True)

and warm it up with the prompt “Teddy bears conducting machine learning research”:

# Let's make sure to warm up the model
images = model.text_to_image(
    "Teddy bears conducting machine learning research",
    batch_size=3,
)
plot_images(images)
25/25 [==============================] - 39s 157ms/step

How fast is it? Let’s find out, using the prompt “A mysterious dark stranger visits the great pyramids of Egypt, high quality, highly detailed, elegant, sharp focus, concept art, character concepts, digital painting”:

start = time.time()
images = model.text_to_image(
    "A mysterious dark stranger visits the great pyramids of egypt, "
    "high quality, highly detailed, elegant, sharp focus, "
    "concept art, character concepts, digital painting",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["XLA + Mixed Precision", end - start])
plot_images(images)

print(f"XLA + mixed precision: {(end - start):.2f} seconds")
25/25 [==============================] - 4s 158ms/step
XLA + mixed precision: 4.25 seconds

Let’s evaluate the results:

print("{:<20} {:<20}".format("Model", "Runtime"))
for result in benchmark_result:
    name, runtime = result
    print("{:<20} {:<20}".format(name, runtime))
Model                 Runtime             
Standard              8.17177152633667    
Mixed Precision       6.022329568862915   
XLA                   6.265935659408569   
XLA + Mixed Precision 4.252242088317871   
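If you want a quick visual comparison, here is an optional snippet that plots the timings collected in benchmark_result as a bar chart (assuming the cells above have been run in order):

# Optional: visualize the benchmark results collected above as a bar chart.
names = [name for name, _ in benchmark_result]
runtimes = [runtime for _, runtime in benchmark_result]

plt.figure(figsize=(8, 4))
plt.bar(names, runtimes)
plt.ylabel("Runtime (seconds)")
plt.title("Stable Diffusion generation time per configuration")
plt.show()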

It took our fully optimized model just over four seconds to generate three novel images from a text prompt on an A100 GPU.

Conclusion

KerasCV offers a state-of-the-art implementation of Stable Diffusion, and thanks to XLA and mixed precision, it delivered the fastest Stable Diffusion pipeline available as of September 2022.

Normally, at the end of a keras.io tutorial we suggest a few topics for further study. This time we will leave you with a single call to action:

Run your own prompts through the model! It is an absolute blast!

If you have an NVIDIA GPU or an M1 MacBook Pro, you can also run the model locally. (Note that when running on an M1 MacBook Pro, you should not enable mixed precision, as it is not yet well supported by Apple’s Metal runtime.)
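As a minimal sketch for such a local run (the prompt below is just an illustrative example), keep the default float32 policy and build the model as usual:

keras.mixed_precision.set_global_policy("float32")  # mixed precision is not yet well supported on Metal
model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)
images = model.text_to_image("a watercolor painting of a lighthouse at dusk", batch_size=1)
plot_images(images)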
