Stable Diffusion is a powerful image generation tool that can create realistic and detailed images. However, its outputs can sometimes be noisy or blurry. A variational autoencoder (VAE) is a technique you can use to improve the quality of the images you generate with Stable Diffusion.

Stable Diffusion Variational Autoencoder (VAE) Explained

A variational autoencoder (VAE) is a technique used to improve the quality of AI-generated images you create with the text-to-image model Stable Diffusion. The VAE encodes the image into a latent space, and that latent space is then decoded into a new, higher-quality image.


What Is a VAE?

A VAE encodes an image into a latent space, a lower-dimensional representation of the image. The latent space is then decoded back into an image, which is typically of higher quality than the original.
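The encode/decode round trip can be illustrated with a toy linear "autoencoder" in plain NumPy. This is a sketch of the dimensionality-reduction idea only; Stable Diffusion's actual VAE is a deep convolutional network with learned weights, not a random projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a 1D signal of 64 values standing in for pixel data.
image = rng.normal(size=64)

# Stand-in for a learned projection: encode 64 dims down to an 8-dim latent.
# (A real VAE learns these weights; here we use a random orthonormal basis.)
basis, _ = np.linalg.qr(rng.normal(size=(64, 8)))

def encode(x):
    """Project the image into the lower-dimensional latent space."""
    return basis.T @ x

def decode(z):
    """Map a latent vector back up to image space."""
    return basis @ z

latent = encode(image)
reconstruction = decode(latent)

print(latent.shape)          # (8,) -- much smaller than the 64-value image
print(reconstruction.shape)  # (64,)
```

Stable Diffusion works the same way at a larger scale: its VAE compresses a 512×512×3 image into a 64×64×4 latent, and the diffusion model does all of its denoising in that compact latent space before the VAE decodes the result back into pixels.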

Two main VAEs are commonly used with Stable Diffusion: the exponential moving average (EMA) and mean squared error (MSE) fine-tuned versions. EMA is generally considered the better VAE for most applications because it produces sharper, more realistic images. MSE produces smoother, less noisy images, but they may not look as realistic as those generated with EMA.

Three parking sign images generated with Stable Diffusion using ft-EMA (left), ft-MSE (middle) and the original VAE (right). | Image: Edmond Yip
Three images of a woman created using ft-EMA (left), ft-MSE (middle) and the original VAE (right). | Image: Edmond Yip



When Do You Use a VAE?

Sometimes when you download a checkpoint from Civitai, the page will say "Baked VAE," which means the VAE is already included in the checkpoint. Other checkpoints, such as Dark Sushi Mix Colorful, require a separate VAE; without it, the images come out foggy and desaturated. Applying the VAE results in a crisper, more colorful image.

What Is Stable Diffusion?

Stable Diffusion is a text-to-image model that uses deep learning and diffusion methods to generate realistic images from text prompts.

In general, a VAE is needed for checkpoints that were trained using one. The VAE encodes images into a latent space that the model uses during training. At generation time, the model decodes points from the latent space back into images. Without the matching VAE, the model can’t properly reconstruct the images.

If a checkpoint says it requires a certain VAE, you need to use that VAE to get proper image generations. The model relies on the VAE to translate the latent space vectors into realistic images. Leaving it out results in foggy or distorted outputs.

A tutorial on how to use VAE in Stable Diffusion. | Video: Robert Jene



How to Use a VAE

To use a VAE with Stable Diffusion, download a VAE model and place it in the stable-diffusion-webui/models/VAE directory. You can then select the VAE model you want to use under Settings > Stable Diffusion > SD VAE.
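As a sketch, the file placement for the AUTOMATIC1111 web UI looks like this. The download URL is illustrative only; check the VAE's own model card for the actual link.

```shell
# Run from the directory containing your stable-diffusion-webui checkout.
mkdir -p stable-diffusion-webui/models/VAE

# Download a VAE file into that folder, e.g. with wget (example URL --
# confirm the real one on the model card before downloading):
# wget -P stable-diffusion-webui/models/VAE \
#   https://huggingface.co/stabilityai/sd-vae-ft-mse-original/resolve/main/vae-ft-mse-840000-ema-pruned.safetensors

# Confirm the file is in place, then pick it in Settings > Stable Diffusion > SD VAE.
ls stable-diffusion-webui/models/VAE
```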

How to apply a VAE in Stable Diffusion. | Screenshot: Edmond Yip

You can also add the SD VAE selector to the quick settings list so it appears at the top of the web UI.

Images generated with a VAE are of higher quality than those generated without one. If you are interested in using a VAE with Stable Diffusion, I encourage you to try it out. You may be surprised by the results.
