Sliding Tile Attention speeds up HunyuanVideo DiT generation 3.5×

Sliding Tile Attention (STA) replaces costly full‑3D attention in video DiT models with a block‑wise sliding‑window scheme, achieving up to 10× attention speedup and a 3.53× end‑to‑end generation boost for HunyuanVideo without quality loss, as demonstrated by extensive benchmarks and kernel analyses.

AIWalker

Background and Problem

State‑of‑the‑art video DiT models rely on full 3D attention to capture spatio‑temporal relationships, so compute cost grows quadratically with the number of tokens. Generating a 5‑second 720p clip with HunyuanVideo involves about 115K tokens, making attention the dominant inference bottleneck.
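As a rough check on the token count, the sketch below multiplies out a latent grid of 30 × 48 × 80 (frames × height × width after VAE compression and patchification). That grid shape is an assumption drawn from the STA paper's description of this setting and is not stated in this article; the function names are illustrative.

```python
# Assumed latent grid for a 5-second 720p HunyuanVideo clip:
# (frames, height, width) after VAE compression and patchification.
LATENT_SHAPE = (30, 48, 80)


def token_count(shape):
    """Number of tokens in a (T, H, W) latent grid."""
    t, h, w = shape
    return t * h * w


def attention_pairs(n_tokens):
    """Query-key pairs scored by full attention (quadratic in tokens)."""
    return n_tokens * n_tokens


n = token_count(LATENT_SHAPE)
print(n)                   # 115200
print(attention_pairs(n))  # 13271040000
```

Under this assumption, every attention layer scores roughly 13 billion query-key pairs per head, which is why halving the token count would cut attention work to a quarter.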

Redundancy in Full 3D Attention

Visualization of HunyuanVideo attention scores reveals strong 3D locality: queries mainly attend to nearby keys in space and time. Quantifying attention recall shows that a local window covering only 15.52% of the space captures 70% of total attention, indicating massive redundancy.
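Attention recall can be illustrated on a toy case. The sketch below uses a hypothetical 1D head whose logits decay linearly with distance from the query (an assumption made for illustration, not HunyuanVideo's real scores) and measures how much attention mass a small local window captures.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_recall(weights, query_pos, radius):
    """Fraction of one query's attention mass on keys within `radius`."""
    return sum(w for k, w in enumerate(weights) if abs(k - query_pos) <= radius)


# Toy head: logits fall off linearly with distance (illustrative assumption).
SEQ_LEN, QUERY, DECAY = 64, 32, 0.25
weights = softmax([-DECAY * abs(k - QUERY) for k in range(SEQ_LEN)])

for radius in (2, 5, 10):
    covered = (2 * radius + 1) / SEQ_LEN
    recall = attention_recall(weights, QUERY, radius)
    print(f"window covers {covered:.1%} of keys, recall {recall:.1%}")
```

Whenever attention weights fall off with distance, recall exceeds the fraction of keys covered, which is the kind of gap the 15.52% versus 70% figure describes.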

Why Conventional Sliding Window Attention (SWA) Fails

Although SWA works for 1D NLP sequences, extending it to 2D/3D video DiT is ineffective. Existing SWA methods (e.g., CLEAR, NATTEN) reduce FLOPs but do not translate to wall‑clock speedups because their mixed‑block patterns are hardware‑inefficient and incompatible with FlashAttention (FA) kernels.

Sliding Tile Attention (STA) Design

STA introduces a block‑wise sliding window that aligns with FA’s tile‑based computation. By defining non‑overlapping spatio‑temporal blocks (the basic compute unit) and moving the attention window block‑by‑block, STA produces only dense and empty blocks, eliminating inefficient mixed blocks.

SWA: token‑wise sliding creates irregular attention maps that GPUs struggle to process.

STA: block‑wise sliding yields only dense or empty blocks, which is fully GPU‑friendly.
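The difference between the two schemes can be checked by brute force on a small 2D grid. The sketch below is a simplified illustration, not the paper's kernel: the grid size, tile size, and radius are arbitrary choices, and the window is not adjusted at the borders. It classifies every (query tile, key tile) block of the attention map as dense, mixed, or empty.

```python
from itertools import product

GRID, TILE, RADIUS = 8, 2, 2  # toy sizes chosen for illustration


def classify_blocks(grid, tile, attends):
    """Count dense, mixed, and empty (query tile, key tile) blocks."""
    n = grid // tile
    tiles = list(product(range(n), repeat=2))
    offsets = list(product(range(tile), repeat=2))
    counts = {"dense": 0, "mixed": 0, "empty": 0}
    for (qy, qx), (ky, kx) in product(tiles, tiles):
        # Number of attended (query token, key token) pairs inside this block.
        hits = sum(
            attends((qy * tile + a, qx * tile + b), (ky * tile + c, kx * tile + d))
            for (a, b), (c, d) in product(offsets, offsets)
        )
        if hits == 0:
            counts["empty"] += 1
        elif hits == len(offsets) ** 2:
            counts["dense"] += 1
        else:
            counts["mixed"] += 1
    return counts


def swa(q, k):
    """Token-wise window: the window is centred on each query token."""
    return abs(q[0] - k[0]) <= RADIUS and abs(q[1] - k[1]) <= RADIUS


def sta(q, k):
    """Tile-wise window: the window is centred on the query's tile."""
    r = RADIUS // TILE  # radius measured in tiles
    return (abs(q[0] // TILE - k[0] // TILE) <= r
            and abs(q[1] // TILE - k[1] // TILE) <= r)


swa_counts = classify_blocks(GRID, TILE, swa)
sta_counts = classify_blocks(GRID, TILE, sta)
print("SWA:", swa_counts)
print("STA:", sta_counts)
```

Because the tile-wise rule depends only on tile indices, every token pair inside a block gets the same answer, so the mixed count for STA is zero by construction.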

Implementation Details

Divide the video tensor into non‑overlapping blocks of size B, matching FlashAttention’s tile size.

Flatten tokens within each block; the window size must be an integer multiple of B.

Slide the attention window with stride S block‑wise; each central query block attends only to key blocks inside the window.

This results in an attention map composed solely of dense and empty blocks, removing mixed‑block overhead.
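The steps above can be sketched in plain Python. Several details here are assumptions rather than facts from this article: the latent shape (30, 48, 80), the tile size (6, 8, 8), window sizes given as an odd number of tiles per axis, and the choice to clamp the window centre so the window stays inside the video. All function names are hypothetical.

```python
from itertools import product


def tile_major_index(coord, shape, tile):
    """Position of token (t, h, w) after flattening tokens tile by tile."""
    n_tiles = [s // b for s, b in zip(shape, tile)]
    tile_id = [c // b for c, b in zip(coord, tile)]
    inner = [c % b for c, b in zip(coord, tile)]
    tile_flat = (tile_id[0] * n_tiles[1] + tile_id[1]) * n_tiles[2] + tile_id[2]
    inner_flat = (inner[0] * tile[1] + inner[1]) * tile[2] + inner[2]
    return tile_flat * (tile[0] * tile[1] * tile[2]) + inner_flat


def key_tiles(query_tile, n_tiles, window_tiles):
    """Key tiles seen by one query tile (window clamped to stay in bounds)."""
    ranges = []
    for q, n, w in zip(query_tile, n_tiles, window_tiles):
        half = w // 2  # assumes an odd window size in tiles
        center = min(max(q, half), n - 1 - half)
        ranges.append(range(center - half, center + half + 1))
    return list(product(*ranges))


def sparsity(n_tiles, window_tiles):
    """Fraction of (query tile, key tile) blocks that are skipped."""
    all_tiles = list(product(*(range(n) for n in n_tiles)))
    dense = sum(len(key_tiles(q, n_tiles, window_tiles)) for q in all_tiles)
    return 1 - dense / (len(all_tiles) ** 2)


SHAPE, TILE = (30, 48, 80), (6, 8, 8)               # assumed sizes
N_TILES = tuple(s // b for s, b in zip(SHAPE, TILE))  # (5, 6, 10)

print(f"{sparsity(N_TILES, (3, 3, 3)):.2%}")  # 91.00%
print(f"{sparsity(N_TILES, (5, 5, 5)):.2%}")  # 58.33%
```

Under these assumptions, a 3 × 3 × 3 tile window and a 5 × 5 × 5 tile window give the 91% and 58.33% sparsity levels quoted elsewhere in this article.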

STA can be implemented with FlexAttention, and its optimized kernel builds on ThunderKittens and FlashAttention‑3. Threadblocks are split into compute warpgroups (handling query blocks) and data warpgroups (asynchronously loading KV blocks), allowing mask computation to be overlapped with data movement.

Kernel‑Level Optimizations

By decoupling block‑level masks from the compute path, STA achieves a model FLOPs utilization (MFU) of 41.03% versus 8.20% for Tiled NATTEN. Overall, STA delivers a 10.45× speedup over full attention kernels while maintaining 58.33% sparsity.

Forward speed of sparse attention kernels

Window‑Size Calibration for Training‑Free Speedup

Because attention heads exhibit consistent locality across prompts, a small set of prompts can be used to search for the optimal window size per head. The calibration process evaluates L2 distance between sparse and full‑attention outputs, selecting masks that minimize this distance. Using this calibrated STA yields 58% sparsity and a 1.8× end‑to‑end speedup (945 s → 520 s) without quality loss.
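A toy version of this search is sketched below. It is an illustration under stated assumptions, not the authors' calibration code: the grid is 2D (time × space), each head is a hypothetical pair of decay rates, values are scalars, and the two candidate windows have the same key budget but different shapes.

```python
import math
from itertools import product

SIDE = 6  # toy grid: SIDE time steps x SIDE spatial positions
TOKENS = list(product(range(SIDE), repeat=2))
VALUES = {tok: tok[0] + 10 * tok[1] for tok in TOKENS}  # scalar value per token


def attend(query, decay, radii=None):
    """Attention output for one query; `radii` = (r_time, r_space) masks far keys."""
    num = den = 0.0
    for key in TOKENS:
        dt, ds = abs(query[0] - key[0]), abs(query[1] - key[1])
        if radii is not None and (dt > radii[0] or ds > radii[1]):
            continue  # key lies outside the local window
        weight = math.exp(-decay[0] * dt - decay[1] * ds)
        num += weight * VALUES[key]
        den += weight
    return num / den


def l2_error(decay, radii):
    """L2 distance between full and windowed outputs over all queries."""
    return math.sqrt(sum(
        (attend(q, decay) - attend(q, decay, radii)) ** 2 for q in TOKENS
    ))


def calibrate(head_decays, candidates):
    """Pick, per head, the candidate window with the smallest L2 error."""
    return {
        name: min(candidates, key=lambda r: l2_error(decay, r))
        for name, decay in head_decays.items()
    }


# Hypothetical heads: one spreads attention over time, one over space.
HEADS = {"temporal": (0.0, 50.0), "spatial": (50.0, 0.0)}
CANDIDATES = [(2, 1), (1, 2)]  # same key budget, different shape

chosen = calibrate(HEADS, CANDIDATES)
print(chosen)
```

In this toy setup the head that spreads over time picks the window that is wider in time, and the other head picks the window that is wider in space, mirroring the idea of choosing a window per head from a small calibration set.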

Fine‑Tuning STA for Additional Gains

Beyond mask calibration, fixing the window and fine‑tuning the model with STA on 8 H100 GPUs for 8 hours further improves performance. At 91% sparsity, STA achieves a 5.76× FLOP reduction and a 3.53× latency reduction, and fine‑tuning raises the VBench score from 80.58% to 82.62%.
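These figures can be related through Amdahl's law. The sketch below is a back-of-envelope check, not a result from the paper: it assumes only attention is accelerated, that 91% sparsity removes 91% of attention FLOPs, and that the 10.45× kernel speedup applies to the same configuration as the 3.53× end-to-end figure.

```python
def overall_speedup(fraction, part_speedup):
    """Amdahl's law: whole-job speedup when `fraction` of it gets `part_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / part_speedup)


def implied_fraction(overall, part_speedup):
    """Invert Amdahl's law to recover the accelerated fraction."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / part_speedup)


# FLOPs: 91% sparsity keeps 9% of attention FLOPs, an ~11.1x reduction.
attn_flop_cut = 1.0 / (1.0 - 0.91)
flop_share = implied_fraction(5.76, attn_flop_cut)

# Latency: kernel speedup 10.45x, end-to-end speedup 3.53x.
time_share = implied_fraction(3.53, 10.45)

print(f"implied attention share of FLOPs:   {flop_share:.1%}")
print(f"implied attention share of latency: {time_share:.1%}")
```

Under these assumptions attention accounts for roughly 91% of the FLOPs and about 79% of the baseline latency, which is consistent with attention being the dominant bottleneck and explains why the end-to-end gain is smaller than the kernel-level gain.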

Compatibility with Other Acceleration Techniques

STA is orthogonal to cache‑based methods such as TeaCache. Combining STA with TeaCache yields a 3× overall speedup (945 s → 317 s) with no perceptible quality degradation.

Evaluation

On the MovieGen benchmark (200 random prompts), STA‑enabled HunyuanVideo shows comparable visual quality to the original model while delivering the reported speedups.

Conclusion

The paper introduces Sliding Tile Attention, a hardware‑efficient 3D sparse attention mechanism that dramatically accelerates video diffusion models without sacrificing quality. Experiments confirm up to 10.45× kernel speedup and 3.53× end‑to‑end generation acceleration, and the method is compatible with other acceleration strategies, suggesting broad applicability to other modalities.

References

[1] Fast Video Generation with Sliding Tile Attention

[2] https://hao-ai-lab.github.io/blogs/sta/
