On Tuesday, Stability AI launched Stable Diffusion XL Turbo, an AI image-synthesis model that can rapidly generate imagery based on a written prompt. So rapidly, in fact, that the company is billing it as “real-time” image generation, since it can also transform images from a live source, such as a webcam, on the fly.
SDXL Turbo’s primary innovation lies in its ability to produce image outputs in a single step, a significant reduction from the 20–50 steps required by its predecessor. Stability attributes this leap in efficiency to a technique it calls Adversarial Diffusion Distillation (ADD). ADD uses score distillation, where the model learns from existing image-synthesis models, and adversarial loss, which enhances the model’s ability to differentiate between real and generated images, improving the realism of the output.
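The real ADD training setup involves a pretrained SDXL teacher, a one-step student, and a learned discriminator, but the general shape of the objective can be sketched with toy stand-ins. Everything below (the tiny "networks," the 0.5 weighting) is a simplified assumption for illustration, not Stability's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_denoise(x_noisy):
    # Stand-in for the multi-step teacher model's output.
    return x_noisy * 0.5

def student_denoise(x_noisy, w):
    # Stand-in for the one-step student: a single linear map here.
    return x_noisy * w

def discriminator(x):
    # Stand-in discriminator: probability the input is a "real" image.
    return 1.0 / (1.0 + np.exp(-x.mean()))

x_noisy = rng.normal(size=(8, 8))
student_out = student_denoise(x_noisy, w=0.9)
teacher_out = teacher_denoise(x_noisy)

# Score-distillation term: push the student toward the teacher's output.
distill_loss = np.mean((student_out - teacher_out) ** 2)

# Adversarial term: the student wants the discriminator to judge its
# output as real (non-saturating GAN loss).
adv_loss = -np.log(discriminator(student_out) + 1e-8)

lambda_adv = 0.5  # relative weighting of the two terms (assumed value)
total_loss = distill_loss + lambda_adv * adv_loss
print(total_loss)
```

The key idea is simply that both terms are minimized jointly: distillation keeps the one-step output close to what the slower teacher would produce, while the adversarial term sharpens realism.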
Stability detailed the model’s inner workings in a research paper released Tuesday that focuses on the ADD technique. One of the claimed advantages of SDXL Turbo is its similarity to Generative Adversarial Networks (GANs), especially in producing single-step image outputs.
SDXL Turbo images aren’t as detailed as SDXL images produced at higher step counts, so it’s not considered a replacement for the previous model. But given the speed savings involved, the results are eye-popping.
To try it out, we ran SDXL Turbo locally on an Nvidia RTX 3060 using Automatic1111 (the weights drop in just like SDXL weights), and it can generate a 3-step 1024×1024 image in about 4 seconds, versus 26.4 seconds for a 20-step SDXL image with similar detail. Smaller images generate much faster (under one second for 512×768), and of course, a beefier graphics card such as an RTX 3090 or 4090 will allow much quicker generation times as well. Contrary to Stability’s marketing, we’ve found that SDXL Turbo images have the best detail at around 3–5 steps per image.
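If you’d rather script it than use a web UI, the checkpoint also works with Hugging Face’s diffusers library. A minimal sketch, assuming the `stabilityai/sdxl-turbo` weights from the Hub, a CUDA-capable GPU, and diffusers installed; SDXL Turbo is trained to run without classifier-free guidance, hence `guidance_scale=0.0`:

```python
import torch
from diffusers import AutoPipelineForText2Image

# Load SDXL Turbo in half precision (assumes a CUDA GPU with enough VRAM).
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
)
pipe.to("cuda")

# A single denoising step; Turbo skips classifier-free guidance,
# so guidance_scale is set to 0.0.
image = pipe(
    prompt="a photo of a corgi wearing sunglasses",
    num_inference_steps=1,
    guidance_scale=0.0,
).images[0]
image.save("corgi.png")
```

Bumping `num_inference_steps` to the 3–5 range matches the sweet spot we found for detail in our own tests.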
SDXL Turbo’s generation speed is where the “real-time” claim comes in. Stability AI says that on an Nvidia A100 (a powerful AI-tuned GPU), the model can generate a 512×512 image in 207 ms, including encoding, a single denoising step, and decoding. Speeds like that could lead to real-time generative AI video filters or experimental video game graphics generation, if coherency issues can be solved. In this context, coherency means maintaining the same subject between multiple frames or generations.
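To put that quoted latency in perspective, it works out to roughly five images per second, fast enough to feel interactive, though still short of the 24–30 fps of standard video:

```python
# Stability's quoted end-to-end latency for a 512x512 image on an A100.
latency_seconds = 0.207

throughput = 1.0 / latency_seconds  # images per second
print(f"{throughput:.1f} images/sec")  # prints "4.8 images/sec"
```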
Currently, SDXL Turbo is available under a non-commercial research license, limiting its use to personal, non-commercial purposes. This move has already been met with some criticism in the Stable Diffusion community, but Stability AI has expressed openness to commercial applications and invites interested parties to get in touch for more information.
Meanwhile, Stability AI itself has faced internal management issues, with an investor recently urging CEO Emad Mostaque to resign. Stability management has reportedly been exploring a potential company sale to a larger entity, but that hasn’t slowed down Stability’s cadence of releases. Just last week, the firm announced Stable Video Diffusion, which can turn still images into short video clips.
Stability AI offers a beta demonstration of SDXL Turbo’s capabilities on its image-editing platform, Clipdrop. You can also experiment with an unofficial live demo on Hugging Face for free. Obviously all the usual caveats apply, including the lack of provenance for training data and the potential for misuse. Even with those unresolved issues, technological progress in AI image synthesis is certainly not slowing down.