Accelerate Stable Diffusion by 2-3X

tl;dr

A practical, step-by-step guide to speeding up Stable Diffusion inference by 2 to 3 times with Speedster.

Introduction

A year after Stable Diffusion's release, diffusion models remain one of the hottest topics in AI, with recent advances such as ControlNet, MidJourney V5, and ever faster and better versions of Stable Diffusion. AI art has made great strides and is now within everyone's reach.

However, there is still plenty of room for improvement. Models such as Stable Diffusion are still far from producing images as instantaneously as everyone would like. Today's high latency translates into a worse user experience for services built on diffusion models, as well as much higher computing costs.

In this blog post, we will see how Stable Diffusion inference can be sped up by 2 to 3 times. As an overview of the findings, below are the results of a test conducted on NVIDIA A10 and 3090 Ti GPUs for the 4 most widely used versions of Stable Diffusion. We compared the performance of the base version in fp16 with that of the version using xformers and with the Speedster-optimized model.

Benchmark of Stable Diffusion response time: the vanilla model compared to the versions optimized with xformers and Speedster.

As the graph above shows, the optimized attention algorithm implemented in xformers delivers considerably better performance than the base model, particularly for the most complex model (version 2.1). Speedster enhances the model's speed even further: building on the latest version of TensorRT, Speedster outperforms both the base version and the version using xformers in all the cases we tested.

Speed up Stable Diffusion with Speedster

Let's see step by step how to speed up Stable Diffusion.

First, we need to install Speedster.

 
!pip install speedster

We also have to install the deep learning compilers required for the model optimizations.

 
!python -m nebullvm.installers.auto_installer --frameworks diffusers --compilers all

Environment check (GPU only)

Please skip this section if you don't have a GPU.

If you want to optimize Stable Diffusion on an NVIDIA GPU, the following requirements must be installed on your machine:

  • CUDA>=12.0
  • tensorrt>=8.6.0

From TensorRT 8.6 onwards, all the tensorrt pre-built wheels released by NVIDIA support only CUDA>=12.0. Speedster's auto-installer will install tensorrt>=8.6.0 automatically only if it detects CUDA>=12.0; otherwise it will install tensorrt==8.5.3.1. In that case, you will have to upgrade your CUDA version and then upgrade tensorrt to 8.6.0 or above to run this notebook.

There should be a way to run TensorRT 8.6 with CUDA 11 as well, but it requires installing TensorRT in a different way (you can check this issue). Otherwise, we strongly suggest simply upgrading to CUDA 12.

First of all, let's check the CUDA version installed on the machine.

 
import torch
import subprocess

if torch.cuda.is_available():
    # Parse the CUDA version from the header line of the nvidia-smi output
    cuda_version = subprocess.check_output(["nvidia-smi"])
    cuda_version = int(cuda_version.decode("utf-8").split("\n")[2].split("|")[-2].split(":")[-1].strip().split(".")[0])
    assert cuda_version >= 12, ("This notebook requires CUDA>=12.0 to be executed, please upgrade your CUDA version.")

If you have CUDA<12.0, you can upgrade it at this link.

Now, let's check the tensorrt version installed on the platform. Stable Diffusion optimization is supported starting from tensorrt==8.6.0.

 
import tensorrt
from nebullvm.tools.utils import check_module_version

if torch.cuda.is_available():
    assert check_module_version(tensorrt, "8.6.0"), (
        "This notebook can be run only with tensorrt>=8.6.0; "
        "if you are using an older version, please upgrade it."
    )

If you have an older version, after ensuring you have CUDA>=12.0 installed, you can upgrade your TensorRT version by running:

 
pip install -U tensorrt

Model and Dataset setup

Once we have ensured that the required libraries are installed, we have to choose the version of Stable Diffusion we want to optimize. Speedster officially supports the most widely used versions (the same ones benchmarked above).

Other Stable Diffusion versions from the Diffusers library should work but have never been tested.

If you try a version not included among these and it works, please feel free to report it to us on Discord so we can add it to the list of supported versions. If you try a version that does not work, you can open an issue and possibly a PR on GitHub.

For this notebook, we are going to select Stable Diffusion 1.4. Let's download and load it using the diffusers API:

 
import torch
from diffusers import StableDiffusionPipeline

# Select Stable Diffusion version
model_id = "CompVis/stable-diffusion-v1-4"

device = "cuda" if torch.cuda.is_available() else "cpu"

if device == "cuda":
    # On GPU, we load the model in half precision by default, because it's faster and lighter.
    pipe = StableDiffusionPipeline.from_pretrained(model_id, revision='fp16', torch_dtype=torch.float16)
    # pipe.enable_attention_slicing() # Uncomment for stable-diffusion-2.1 on gpus with 16GB of memory like V100-16GB and T4
else:
    pipe = StableDiffusionPipeline.from_pretrained(model_id)

Let's now create an example dataset with some random sentences that will be used later in the optimization process.

 
input_data = [
    "a photo of an astronaut riding a horse on mars",
    "a monkey eating a banana in a forest",
    "white car on a road surrounded by palm trees",
    "a fridge full of bottles of beer",
    "madara uchiha throwing asteroids against people"
]
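
Optionally, before optimizing, we can measure the latency of the vanilla pipeline to have a baseline for comparison. The snippet below is just a minimal sketch: the prompt and timing approach are illustrative, and the numbers will vary with your hardware.

import time

# Move the vanilla pipeline to the selected device for the baseline run
pipe = pipe.to(device)

test_prompt = input_data[0]

# Warm-up run: the first call includes CUDA and model initialization overhead
_ = pipe(test_prompt).images[0]

# Timed run with the vanilla pipeline
start = time.time()
image = pipe(test_prompt).images[0]
print(f"Vanilla pipeline: {time.time() - start:.2f} s per image")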

Optimize inference with Speedster

It's now time to improve the model's speed. Let's use Speedster.


import gc

# Move the pipe back to cpu
pipe.to("cpu")

# Clean memory
torch.cuda.empty_cache()
gc.collect()

Using Speedster is very simple and straightforward! Just use the optimize_model function and provide as input the model, some sample input data, and the optimization time mode. Optionally, a dynamic_info dictionary can also be provided to support inputs with dynamic shapes.

Optimizing Stable Diffusion requires a lot of RAM. If you are running this notebook on Google Colab, make sure to use the high-RAM option, otherwise the kernel may crash. If the kernel crashes even with the high-RAM option, try also adding "torchscript" to the ignore_compilers list (a sketch of this variant is shown right after the optimization call below). If running on GPU, the optimization requires at least 16GB of GPU memory to exploit the best optimization techniques; otherwise it may fail with a memory error.

from speedster import optimize_model

optimized_model = optimize_model(
    model=pipe,
    input_data=input_data,
    optimization_time="unconstrained",
    ignore_compilers=["torch_tensor_rt", "tvm"],  # Some compilers have issues with Stable Diffusion, so it's better to skip them.
    metric_drop_ths=0.2,
)
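
As mentioned above, if the optimization crashes because of limited RAM you can also exclude torchscript. The snippet below is just a sketch of the same call with one more compiler skipped; all other arguments are unchanged:

# Variant for memory-constrained environments (e.g. a standard Colab runtime):
# skipping torchscript reduces RAM usage during the optimization.
optimized_model = optimize_model(
    model=pipe,
    input_data=input_data,
    optimization_time="unconstrained",
    ignore_compilers=["torch_tensor_rt", "tvm", "torchscript"],
    metric_drop_ths=0.2,
)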

If you run the optimization on a GPU, you should obtain a speedup of about 124% on the UNet. We ran the optimization on a 3090 Ti, and here are our results:

  • Original Model (PyTorch, fp16): 51.557 ms/batch
  • Optimized Model (TensorRT, fp16): 23.055 ms/batch

If the optimized model you obtain is not a TensorRT one, there was probably an error during the optimization. If running on Colab, it can happen that the standard GPU is not enough to run the optimization, so we suggest selecting a premium GPU with more memory.

As can be seen, the compiler that delivers the greatest acceleration of the model is TensorRT. Speedster integrates the latest release of TensorRT, v8.6.0, which NVIDIA recently made available.

In essence, TensorRT optimizes a model's mathematical operations to strike a balance between the smallest possible size and the highest achievable accuracy for the target system. One of the key updates in the latest release is that demoDiffusion acceleration is now supported out of the box in TensorRT, without requiring the installation of additional plugins.

Using Speedster for diffusion models has enormous advantages over using TensorRT:

  • Speedster takes the diffusers pipeline directly as input and automatically optimizes and replaces the diffusion model components. There is no need for the user to manually scan the model implementation looking for the UNet component, intercept the UNet inputs to check the loss in precision after the optimization, compile it with TensorRT, and wrap the output (a TensorRT engine) into a diffusers-compatible format (a PyTorch-like model). All this complexity is abstracted away by Speedster; you just have to enjoy the speedup (see the short example after this list).
  • Support for multiple hardware devices: TensorRT is NVIDIA's proprietary compiler and runs only on NVIDIA GPUs. With Speedster, you can take advantage of faster optimizations on a wide variety of hardware devices, from CPUs to GPUs.
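
Since the optimized model is returned in a diffusers-compatible format, you can call it exactly as you would call the original pipeline. Below is a minimal sketch that generates an image and times the call; the prompt and file name are just examples, and absolute timings depend on your GPU:

import time

test_prompt = "a photo of an astronaut riding a horse on mars"

# Warm-up run (the first call includes engine initialization overhead)
_ = optimized_model(test_prompt).images[0]

# Timed run with the optimized pipeline
start = time.time()
image = optimized_model(test_prompt).images[0]
print(f"Optimized pipeline: {time.time() - start:.2f} s per image")

image.save("astronaut.png")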

Save and reload the optimized model

You can easily save the optimized model to disk with the following lines:


from speedster import save_model

save_model(optimized_model, "model_save_path")

You can then load the model again:


from speedster import load_model

optimized_model = load_model("model_save_path", pipe=pipe)
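
As a quick sanity check that saving and loading worked, you can generate an image with the reloaded pipeline (again, the prompt and file name are just examples):

# The reloaded model behaves like the optimized pipeline
image = optimized_model("a monkey eating a banana in a forest").images[0]
image.save("monkey.png")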

Great! Was it easy? How did your results turn out? Do you have any comments?

Share your optimization results and thoughts with our community on Discord, where we chat about Speedster and AI acceleration.