
Pyramid Attention Broadcast: The breakthrough that enables real-time AI video

In the field of video generation, remarkable progress has been made with the introduction of diffusion transformer (DiT) models, which produce better quality than traditional convolutional neural network approaches. However, this improved quality comes at a significant cost in computational resources and inference time, limiting the practical applications of these models. In response to this challenge, researchers have developed Pyramid Attention Broadcast (PAB), a novel method that achieves real-time, high-quality video generation without compromising output quality.

Current acceleration methods for diffusion models typically focus on reducing the number of sampling steps or optimizing network architectures, but these approaches often require additional training or compromise output quality. Some recent techniques revisit the concept of caching to accelerate diffusion models; however, they are designed primarily for image generation or convolutional architectures and are therefore less suitable for video DiTs. The unique challenges of video generation, including the need for temporal coherence and the interplay of multiple attention mechanisms, call for a new approach.

PAB addresses these challenges by targeting redundancy in attention computations during diffusion. The method is based on a key observation: differences between attention outputs at adjacent diffusion steps follow a U-shaped pattern, remaining largely stable across the middle 70% of steps. This stability points to substantial redundancy in attention computations, which PAB exploits to improve efficiency.
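To make this observation concrete, here is a minimal, hypothetical Python sketch of how one could measure it: hook an attention module of a video DiT, record its output at every sampling step, and compute the relative change between adjacent steps. All names and tensor shapes are illustrative placeholders, and the random tensors merely stand in for real hooked outputs.

```python
# Hypothetical measurement sketch -- not the PAB implementation.
import torch

def relative_attention_diff(prev_out: torch.Tensor, curr_out: torch.Tensor) -> float:
    """Mean relative L1 difference between attention outputs of adjacent steps."""
    eps = 1e-6
    return ((curr_out - prev_out).abs() / (prev_out.abs() + eps)).mean().item()

# Random stand-ins for attention outputs captured via forward hooks at each
# sampling step; with a real video DiT, plotting `diffs` over the step index
# is what reveals the U-shaped pattern described above.
torch.manual_seed(0)
num_steps, shape = 50, (2, 16, 64)  # (batch, tokens, channels) -- toy sizes
attn_outputs = [torch.randn(shape) for _ in range(num_steps)]

diffs = [
    relative_attention_diff(attn_outputs[t - 1], attn_outputs[t])
    for t in range(1, num_steps)
]
print([round(d, 3) for d in diffs[:5]])
```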

The Pyramid Attention Broadcast method identifies the stable middle segment of the diffusion process, where attention outputs change little between steps. Within this segment, it broadcasts attention outputs from certain steps to the subsequent steps, skipping the redundant computations. PAB assigns different broadcast ranges to the different attention types based on their stability: spatial attention, which varies the most because it captures high-frequency visual detail, receives the smallest range; temporal attention, which shows medium-frequency, motion-related fluctuations, receives an intermediate range; and cross-attention, which is the most stable because it links text to video content, receives the largest range. In addition, the researchers introduce broadcast sequence parallelism for more efficient distributed inference; it further reduces generation time and incurs lower communication costs than existing parallelization methods, enabling scalable, real-time video generation across multiple devices. The sketch below illustrates the broadcast mechanism.
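The following PyTorch snippet is a hypothetical illustration, not the authors' implementation: class names, broadcast ranges, and the bounds of the stable segment are placeholder assumptions. Each wrapped attention module recomputes its output only every few steps inside the stable segment and otherwise reuses (broadcasts) its cached result, with a larger reuse window for the more stable attention types.

```python
# Illustrative sketch of attention broadcasting -- names and numbers are assumptions.
import torch
import torch.nn as nn

class ToySelfAttention(nn.Module):
    """Minimal stand-in for a DiT attention block (spatial, temporal, or cross)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.mha(x, x, x)
        return out

class BroadcastAttention(nn.Module):
    """Reuses cached attention outputs within the stable segment of the sampler."""
    def __init__(self, attn: nn.Module, broadcast_range: int,
                 stable_start: float = 0.3, stable_end: float = 0.85):
        super().__init__()
        self.attn = attn
        self.broadcast_range = broadcast_range  # how many steps a cached result is reused
        self.stable_start, self.stable_end = stable_start, stable_end
        self._cached, self._reused = None, 0

    def forward(self, x: torch.Tensor, step_frac: float) -> torch.Tensor:
        in_stable = self.stable_start <= step_frac <= self.stable_end
        if in_stable and self._cached is not None and self._reused < self.broadcast_range:
            self._reused += 1
            return self._cached               # broadcast: skip the attention computation
        out = self.attn(x)                    # fresh computation
        self._cached, self._reused = out, 0
        return out

# Broadcast ranges follow the stability ordering described above:
# cross-attention (most stable) > temporal > spatial (least stable).
spatial  = BroadcastAttention(ToySelfAttention(64), broadcast_range=2)
temporal = BroadcastAttention(ToySelfAttention(64), broadcast_range=4)
cross    = BroadcastAttention(ToySelfAttention(64), broadcast_range=6)

x = torch.randn(2, 16, 64)
num_steps = 50
for t in range(num_steps):
    y = spatial(x, step_frac=t / num_steps)    # temporal/cross are used the same way
```

A small per-module cache like this is the only extra state such broadcasting needs, which is consistent with the method being training-free and applicable to existing models as-is.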

PAB shows strong results on three state-of-the-art DiT-based video generation models: Open-Sora, Open-Sora-Plan, and Latte. It achieves real-time generation for videos up to 720p resolution, with speedups of up to 10.5x over the baselines, while maintaining output quality. The researchers' experiments show that PAB delivers consistent, stable speedups across these popular open-source video DiTs, reaching generation speeds of up to 20.6 FPS for high-resolution videos by identifying and exploiting redundancy in the attention mechanisms. What sets PAB apart is its training-free nature, making it immediately applicable to existing models without resource-intensive fine-tuning.

The development of PAB addresses a critical bottleneck in DiT-based video generation and could accelerate the adoption of these models in real-world scenarios where speed is of the essence. As demand for high-quality, AI-generated video content continues to grow across industries, techniques like PAB will play a key role in making these technologies more accessible and practical for everyday use. The researchers expect their simple yet effective method to serve as a solid foundation for future research and applications in video generation, paving the way for more efficient and versatile AI-driven video creation tools.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.

Don’t forget to join our 50k+ ML SubReddit

Find upcoming AI webinars here


Shreya Maji is a consulting intern at MarktechPost. She earned her Bachelor of Technology from the Indian Institute of Technology (IIT), Bhubaneswar. As an AI enthusiast, she likes to stay updated with the latest developments. Shreya is particularly interested in the real-world applications of cutting-edge technology, especially in the field of data science.
