Sliding Tile Attention speeds up HunyuanVideo DiT generation 3.5×
Sliding Tile Attention (STA) replaces costly full‑3D attention in video DiT models with a block‑wise sliding‑window scheme, achieving up to 10× attention speedup and a 3.53× end‑to‑end generation boost for HunyuanVideo without quality loss, as demonstrated by extensive benchmarks and kernel analyses.
Background and Problem
State‑of‑the‑art video DiT models rely on full 3D attention to capture spatio‑temporal relationships, causing quadratic compute cost. Generating a 5‑second 720p clip with HunyuanVideo requires 115K tokens, making attention the dominant inference bottleneck.
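To make the quadratic cost concrete, the short sketch below counts tokens and query-key pairs for a latent video grid. The grid shape (30, 48, 80) is an assumption that happens to give about 115K tokens; the article itself only states the token count.

```python
# Assumed latent grid (frames, height, width); not stated in the article.
LATENT_SHAPE = (30, 48, 80)

def num_tokens(shape):
    """Number of tokens when every latent position becomes one token."""
    t, h, w = shape
    return t * h * w

def attention_pairs(n_tokens):
    """Query-key pairs scored by full attention (per head, per layer)."""
    return n_tokens * n_tokens

n = num_tokens(LATENT_SHAPE)
print(n)                   # 115200
print(attention_pairs(n))  # 13271040000
```

Doubling the token count quadruples the number of pairs, which is why attention dominates at this sequence length.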
Redundancy in Full 3D Attention
Visualization of HunyuanVideo attention scores reveals strong 3D locality: queries mainly attend to nearby keys in space and time. Quantifying attention recall shows that a local window covering only 15.52% of the space captures 70% of total attention, indicating massive redundancy.
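Attention recall for a single query can be sketched as the share of its attention mass that falls inside a local 3D window. The function name, the toy grid, and the synthetic distance-decaying attention row below are illustrative assumptions, not the measurement code behind the reported numbers.

```python
import numpy as np

def local_window_recall(attn_row, query_pos, grid, window):
    """Fraction of one query's attention mass inside a window centered on it.

    attn_row:  1D array of attention weights over all T*H*W keys.
    query_pos: (t, h, w) position of the query.
    window:    odd window extent per axis; clipped at the grid borders.
    """
    scores = attn_row.reshape(grid)
    slices = []
    for q, size, w in zip(query_pos, grid, window):
        lo = max(0, q - w // 2)
        hi = min(size, q + w // 2 + 1)
        slices.append(slice(lo, hi))
    return scores[tuple(slices)].sum() / scores.sum()

# Synthetic attention that decays with squared distance from the query.
grid = (4, 8, 8)
query = (2, 4, 4)
t, h, w = np.meshgrid(*[np.arange(n) for n in grid], indexing="ij")
dist2 = (t - query[0]) ** 2 + (h - query[1]) ** 2 + (w - query[2]) ** 2
attn_row = np.exp(-dist2 / 4.0).ravel()
attn_row /= attn_row.sum()

print(local_window_recall(attn_row, query, grid, window=(3, 3, 3)))
```

Averaging this quantity over queries, heads, and prompts gives the kind of recall curve described above.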
Why Conventional Sliding Window Attention (SWA) Fails
Although SWA works well for 1D NLP sequences, extending it to 2D/3D video DiT is ineffective. Existing SWA methods (e.g., CLEAR, NATTEN) reduce FLOPs, but the reduction does not translate into wall‑clock speedups because their mixed‑block patterns are hardware‑inefficient and incompatible with FlashAttention (FA) kernels.
Sliding Tile Attention (STA) Design
STA introduces a block‑wise sliding window that aligns with FA’s tile‑based computation. By defining non‑overlapping spatio‑temporal blocks (the basic compute unit) and moving the attention window block‑by‑block, STA produces only dense and empty blocks, eliminating inefficient mixed blocks.
SWA: token‑wise sliding creates irregular attention maps that GPUs struggle to process.
STA: block‑wise sliding yields only dense or empty blocks, which is fully GPU‑friendly.
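The difference can be checked on a toy 2D grid by building both masks and classifying every FlashAttention-style block as dense, mixed, or empty. This is a simplified sketch: the grid size, block size, window radii, and helper names are illustrative assumptions, and the tile window here is simply clipped at the borders.

```python
import numpy as np

def swa_mask(height, width, radius):
    """Token-wise 2D sliding window; tokens flattened in row-major order."""
    ys, xs = np.divmod(np.arange(height * width), width)
    dy = np.abs(ys[:, None] - ys[None, :])
    dx = np.abs(xs[:, None] - xs[None, :])
    return (dy <= radius) & (dx <= radius)

def sta_mask(height, width, tile, radius_tiles):
    """Tile-wise sliding window; tokens flattened tile by tile."""
    th, tw = tile
    tiles_x = width // tw
    tys, txs = np.divmod(np.arange((height // th) * tiles_x), tiles_x)
    tile_mask = ((np.abs(tys[:, None] - tys[None, :]) <= radius_tiles)
                 & (np.abs(txs[:, None] - txs[None, :]) <= radius_tiles))
    b = th * tw  # tokens per tile, equal to the compute block size
    return np.repeat(np.repeat(tile_mask, b, axis=0), b, axis=1)

def count_block_types(mask, block):
    """Classify each (block x block) piece of the attention mask."""
    n = mask.shape[0] // block
    filled = mask.reshape(n, block, n, block).sum(axis=(1, 3))
    dense = int((filled == block * block).sum())
    empty = int((filled == 0).sum())
    return {"dense": dense, "mixed": n * n - dense - empty, "empty": empty}

BLOCK = 4  # toy block size; real kernels use much larger tiles
swa_counts = count_block_types(swa_mask(8, 8, radius=1), BLOCK)
sta_counts = count_block_types(sta_mask(8, 8, tile=(2, 2), radius_tiles=1), BLOCK)
print("SWA:", swa_counts)
print("STA:", sta_counts)
```

In this toy setting the token-wise mask produces mixed blocks, each of which needs per-element masking inside the kernel, while the tile-wise mask produces none.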
Implementation Details
Divide the video tensor into non‑overlapping blocks of size B, matching FlashAttention’s tile size.
Flatten tokens within each block; the window size must be an integer multiple of B.
Slide the attention window with stride S block‑wise; each central query block attends only to key blocks inside the window.
This results in an attention map composed solely of dense and empty blocks, removing mixed‑block overhead.
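The steps above can be sketched in NumPy as a tile-major token reordering followed by a tile-level window mask. The tile grid (5, 6, 10) and window of (3, 3, 3) tiles are assumptions chosen for illustration, as is the choice to shift the window inward at the borders so every query tile sees the same number of key tiles; the function names are hypothetical.

```python
import numpy as np

def tile_flatten(x, tile):
    """Reorder a (T, H, W, D) tensor so tokens of each tile are contiguous."""
    T, H, W, D = x.shape
    tt, th, tw = tile
    assert T % tt == 0 and H % th == 0 and W % tw == 0
    x = x.reshape(T // tt, tt, H // th, th, W // tw, tw, D)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # tile indices first, in-tile offsets after
    return x.reshape(-1, tt * th * tw, D)  # (num_tiles, tokens_per_tile, D)

def tile_window_mask(grid_tiles, window_tiles):
    """Boolean (num_tiles, num_tiles) mask: True where a query tile attends a key tile."""
    axes = [np.arange(n) for n in grid_tiles]
    coords = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, len(grid_tiles))
    mask = np.ones((len(coords), len(coords)), dtype=bool)
    for ax, (n, w) in enumerate(zip(grid_tiles, window_tiles)):
        assert w % 2 == 1 and w <= n
        # Assumption: shift the window inward at borders instead of clipping it.
        center = np.clip(coords[:, ax], w // 2, n - 1 - w // 2)
        mask &= np.abs(center[:, None] - coords[None, :, ax]) <= w // 2
    return mask

# Token reordering on a tiny tensor whose values are the original token indices.
x = np.arange(2 * 4 * 4).reshape(2, 4, 4, 1)
tiles = tile_flatten(x, tile=(2, 2, 2))
print(tiles.shape)           # (4, 8, 1)
print(tiles[0, :, 0])        # [ 0  1  4  5 16 17 20 21]

# Tile-level mask for an assumed 5 x 6 x 10 tile grid with a 3 x 3 x 3 tile window.
mask = tile_window_mask((5, 6, 10), (3, 3, 3))
print(f"{1.0 - mask.mean():.2%}")  # 91.00%
```

Because the mask is defined per tile pair, a kernel can skip an empty tile pair outright and process a dense one with no per-element masking.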
STA is implemented on top of FlexAttention, leveraging ThunderKittens and FlashAttention‑3 kernels. Threadblocks are split into compute warpgroups (handling query blocks) and data warpgroups (asynchronously loading KV blocks), allowing mask computation to be overlapped with data movement.
Kernel‑Level Optimizations
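By decoupling block‑level mask logic from the compute path, STA reaches a model FLOPs utilization (MFU) of 41.03%, versus 8.20% for Tiled NATTEN. Overall, STA delivers up to a 10.45× speedup over full attention kernels. A speedup of that size is only possible at high sparsity such as the 91% setting; at 58.33% sparsity, skipping empty blocks bounds the kernel speedup at roughly 2.4×.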
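A quick way to relate sparsity to kernel speedup is the ideal bound 1 / (1 - sparsity), which assumes skipped blocks cost nothing and the remaining blocks run as efficiently as full attention. The helper below is a back-of-the-envelope sketch, not a model of the actual kernel.

```python
def max_speedup_from_sparsity(sparsity):
    """Ideal attention speedup when a `sparsity` fraction of blocks is skipped for free."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return 1.0 / (1.0 - sparsity)

for s in (0.5833, 0.91):
    print(f"sparsity {s:.2%}: at most {max_speedup_from_sparsity(s):.2f}x")
# sparsity 58.33%: at most 2.40x
# sparsity 91.00%: at most 11.11x
```

Measured speedups fall below this bound whenever the sparse kernel runs its dense blocks less efficiently than the full-attention baseline.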
Window‑Size Calibration for Training‑Free Speedup
Because attention heads exhibit consistent locality across prompts, a small set of prompts can be used to search for the best window size for each head. The calibration process measures the L2 distance between the sparse and full‑attention outputs and selects, for each head, the mask that minimizes this distance. Using the calibrated STA yields 58% sparsity and a 1.8× end‑to‑end speedup (945 s → 520 s) without quality loss.
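The per-head search can be sketched as follows: for each candidate window, compute masked attention, measure the L2 distance to full attention, and keep the best candidate. Everything here is a toy assumption made for illustration: the 2D (time, space) grid, the candidate list (chosen so each window covers the same number of keys), the synthetic head, and the function names.

```python
import numpy as np

def window_mask(grid, window):
    """Token-level window mask on a small grid; windows shift inward at borders."""
    axes = [np.arange(n) for n in grid]
    coords = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, len(grid))
    mask = np.ones((len(coords), len(coords)), dtype=bool)
    for ax, (n, w) in enumerate(zip(grid, window)):
        center = np.clip(coords[:, ax], w // 2, n - 1 - w // 2)
        mask &= np.abs(center[:, None] - coords[None, :, ax]) <= w // 2
    return mask

def attention(q, k, v, mask=None):
    """Plain softmax attention for one head, optionally masked."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def calibrate_head(q, k, v, grid, candidates):
    """Return the candidate window with the smallest L2 distance to full attention."""
    full = attention(q, k, v)
    errors = {}
    for window in candidates:
        sparse = attention(q, k, v, window_mask(grid, window))
        errors[window] = float(np.linalg.norm(sparse - full))
    return min(errors, key=errors.get), errors

# Synthetic head on a 9 x 9 (time, space) grid that attends across time
# to tokens at the same spatial position.
grid = (9, 9)
space_index = np.tile(np.arange(9), 9)          # spatial index of each token
q = k = np.sqrt(90.0) * np.eye(9)[space_index]  # strong same-position affinity
v = np.random.default_rng(0).normal(size=(81, 16))

candidates = [(9, 1), (3, 3), (1, 9)]           # each covers 9 keys per query
best, errors = calibrate_head(q, k, v, grid, candidates)
print(best)  # (9, 1) for this synthetic head
```

In practice the distances would be averaged over the calibration prompts and denoising steps before picking a window for each head.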
Fine‑Tuning STA for Additional Gains
Beyond mask calibration, fixing the window and fine‑tuning the model with STA on 8 H100 GPUs for 8 hours further improves performance. At 91% sparsity, STA achieves a 5.76× FLOP reduction and a 3.53× latency reduction, and fine‑tuning raises the VBench score from 80.58% to 82.62%.
Compatibility with Other Acceleration Techniques
STA is orthogonal to cache‑based methods such as TeaCache. Combining STA with TeaCache yields a 3× overall speedup (945 s → 317 s) with no perceptible quality degradation.
Evaluation
On the MovieGen benchmark (200 random prompts), STA‑enabled HunyuanVideo shows comparable visual quality to the original model while delivering the reported speedups.
Conclusion
The paper introduces Sliding Tile Attention, a hardware‑efficient 3D sparse attention mechanism that dramatically accelerates video diffusion models without sacrificing quality. Experiments confirm up to 10.45× kernel speedup and 3.53× end‑to‑end generation acceleration, and the method is compatible with other acceleration strategies, suggesting broad applicability to other modalities.
References
[1] Fast Video Generation with Sliding Tile Attention
[2] https://hao-ai-lab.github.io/blogs/sta/
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.