The 9 Key Ideas Behind FlashAttention

FlashAttention accelerates the attention computation in Transformers through nine techniques: loss‑less reorganisation of attention, exploitation of the GPU memory hierarchy, SRAM‑reusing tiling, safe softmax, an online algorithm, a tile‑based memory layout, parallel multiplication with deferred division, reduced K/V slicing, and an integrated backward pass. Together, these deliver exact, high‑throughput attention on modern GPUs.


Idea 1 – Loss‑less acceleration for Transformers

FlashAttention targets the Q‑K‑V attention in a Transformer decoder. Q, K, and V are produced by learned linear projections of the input; the only non‑linear step is the softmax applied to the Q·Kᵀ matrix. By reorganising the computation rather than approximating it, FlashAttention achieves a loss‑less speedup for the entire attention operation.
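
For reference, a minimal NumPy sketch of the exact computation being accelerated (the 1/√d scaling is the standard formulation; FlashAttention leaves this result unchanged):

```python
import numpy as np

def attention(Q, K, V):
    """Exact scaled dot-product attention: O = softmax(Q K^T / sqrt(d)) V.
    This is the computation FlashAttention reorganises without changing its result."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                        # logits, shape (N, N)
    P = np.exp(S - S.max(axis=-1, keepdims=True))   # shift-then-exp (see Idea 4)
    P /= P.sum(axis=-1, keepdims=True)              # row-wise softmax
    return P @ V                                    # output, shape (N, d)
```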

Idea 2 – GPU hardware‑pyramid speed structure

The method exploits the hierarchical memory of modern GPUs:

SMs (streaming multiprocessors), each with on‑chip shared SRAM/L1 cache

L2 cache shared across SMs

High‑bandwidth memory (HBM/VRAM) as GPU DRAM

Data flow proceeds from CPU main memory → PCIe → GPU DRAM → GPU caches → Tensor‑Core scheduler. Images illustrate the memory hierarchy for NVIDIA A100 and H100 and the corresponding Tensor‑Core scheduler.

Idea 3 – Tiling to reuse SRAM

Tiling partitions the attention computation so that each tile fits within the shared SRAM of a single SM. The tile size is chosen from the maximum shared‑memory capacity M, and the same tile layout is kept across devices so the algorithmic structure is preserved. Images show tile‑size selection and cross‑device consistency.
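
A generic sketch of the tiling idea, using plain matrix multiplication for clarity (the tile size here is a stand‑in for whatever fits in an SM's shared memory):

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Block-wise A @ B. Each small sub-product touches only tile-sized
    operands, which on a GPU would be staged into an SM's shared SRAM and
    reused there instead of being re-fetched from DRAM."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C
```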

Idea 4 – Safe Softmax

Because exp(x) can overflow for large logits, FlashAttention first shifts the logits by subtracting the row maximum before applying softmax. This prevents numerical overflow and is required for the tiled implementation. The shift is the standard trick used in stable log‑sum‑exp computation (see arXiv:1805.02867).
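
A minimal sketch of the shift: subtracting the maximum keeps every exponent non‑positive, while the subtraction cancels in the ratio, so the output is mathematically unchanged:

```python
import numpy as np

def safe_softmax(z):
    """Subtract the row maximum before exponentiating so exp never overflows."""
    m = np.max(z, axis=-1, keepdims=True)
    e = np.exp(z - m)
    return e / np.sum(e, axis=-1, keepdims=True)

# naive softmax overflows for these logits; the shifted version does not
z = np.array([1000.0, 1001.0, 1002.0])
print(safe_softmax(z))  # [0.09003057 0.24472847 0.66524096]
```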

Idea 5 – Online algorithm

The online algorithm keeps a small buffer of globally useful intermediate values: the running maximum and the running softmax denominator. As the data stream advances, these statistics are updated incrementally, so the full set of logits never has to be held at once; memory usage drops while correctness is preserved. The transformed online loop is shown in the accompanying diagram. Reference: https://arxiv.org/pdf/1805.02867
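
A one‑pass sketch of the online softmax from the cited paper: only the running maximum and the running denominator are buffered, and the denominator is rescaled whenever a new maximum appears:

```python
import numpy as np

def online_softmax(z):
    """Single pass over the stream: maintain running max m and running
    denominator s = sum of exp(z_i - m), rescaling s when m increases."""
    m, s = -np.inf, 0.0
    for x in z:
        m_new = max(m, x)
        s = s * np.exp(m - m_new) + np.exp(x - m_new)  # correct old terms, add new one
        m = m_new
    return np.exp(np.asarray(z, dtype=float) - m) / s
```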

Idea 6 – Tile‑based memory layout

Q, K, V, and O all have shape N×d. The SRAM capacity M satisfies d ≤ M ≤ N·d, so the full matrices cannot fit on‑chip but individual tiles can. Four tiles (one each from Q, K, V, and O), each of shape Bc×d, must reside simultaneously: 4·Bc·d ≤ M, giving Bc = ⌈M/(4d)⌉ rows per tile. This layout enables the entire attention computation using only four SRAM blocks. Diagrams illustrate the placement of the four tiles and the resulting data flow.
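
A small numerical sketch of the constraint, with hypothetical sizes (100 KB of shared memory per SM and head dimension 64; both numbers are illustrative, not from the source):

```python
M = 100 * 1024 // 4      # SRAM capacity in float32 elements (hypothetical 100 KB)
d = 64                   # head dimension (hypothetical)

# Four tiles (from Q, K, V, O), each of shape (Bc, d), must fit at once:
# 4 * Bc * d <= M. Floor division keeps the four tiles within capacity.
Bc = M // (4 * d)
print(Bc, 4 * Bc * d <= M)   # -> 100 True
```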

Idea 7 – Parallel multiplication in the online loop

During each iteration, multiplication and division are separated: two unnormalised accumulators are updated multiplicatively in parallel, and the division by the softmax denominator is applied only once, in the final update step. This eliminates redundant arithmetic and speeds up the forward pass. The revised iteration diagram is included.
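
A sketch of the idea for a single query row: the denominator s and the weighted sum acc are both kept unnormalised and updated multiplicatively, with exactly one division at the end:

```python
import numpy as np

def online_weighted_sum(scores, values):
    """Softmax-weighted sum of `values`, accumulated online.
    s and acc stay unnormalised; the lone division happens after the loop."""
    m, s = -np.inf, 0.0
    acc = np.zeros_like(values[0], dtype=float)
    for z, v in zip(scores, values):
        m_new = max(m, z)
        scale = np.exp(m - m_new)           # rescales both running quantities
        s = s * scale + np.exp(z - m_new)
        acc = acc * scale + np.exp(z - m_new) * v
        m = m_new
    return acc / s                          # one division, in the final step
```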

Idea 8 – Reduce KV slicing

In the tiled schedule, K and V are each read and written twice while Q is accessed only once. Minimizing the number of K/V slices, while keeping K/V block sizes consistent, preserves the correctness of the tile‑based partitioning and reduces memory traffic. The effect of larger K/V blocks on the read/write count is illustrated.
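
Combining the previous ideas, a NumPy sketch of one possible block schedule (the block sizes Bc and Br are illustrative): each K/V block enters the outer loop once per pass, so fewer, larger K/V blocks directly cut the number of block loads:

```python
import numpy as np

def blockwise_attention(Q, K, V, Bc=64, Br=64):
    """Block-wise exact attention. The outer loop streams K/V blocks; running
    max m, denominator s, and unnormalised output O are updated online."""
    N, d = Q.shape
    O = np.zeros((N, d))
    s = np.zeros(N)                  # running softmax denominators
    m = np.full(N, -np.inf)          # running row maxima
    for j in range(0, N, Bc):        # each K/V block is loaded once per pass
        Kj, Vj = K[j:j+Bc], V[j:j+Bc]
        for i in range(0, N, Br):
            S = Q[i:i+Br] @ Kj.T / np.sqrt(d)
            m_new = np.maximum(m[i:i+Br], S.max(axis=1))
            scale = np.exp(m[i:i+Br] - m_new)
            P = np.exp(S - m_new[:, None])
            s[i:i+Br] = s[i:i+Br] * scale + P.sum(axis=1)
            O[i:i+Br] = O[i:i+Br] * scale[:, None] + P @ Vj
            m[i:i+Br] = m_new
    return O / s[:, None]
```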

Idea 9 – Integrated backward pass

Standard back‑propagation does not exploit FlashAttention's structure. By applying the same tiling to the backward pass and caching the per‑row log‑sum‑exp (LSE) from the forward pass, the softmax probabilities can be recomputed as exp(zᵢ − LSE) instead of being stored. Only the LSE vector needs to be kept, which reduces cache usage and avoids redundant calculation. Diagrams show the forward‑backward integration.
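
A sketch of the cached quantity: the forward pass keeps only O and the per‑row LSE, and the backward pass rebuilds the softmax probabilities from it (gradient propagation itself is omitted here):

```python
import numpy as np

def forward_with_lse(Q, K, V):
    """Forward pass that caches only the output O and the row-wise
    log-sum-exp, not the N x N probability matrix."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    m = S.max(axis=1, keepdims=True)
    lse = m + np.log(np.exp(S - m).sum(axis=1, keepdims=True))
    P = np.exp(S - lse)              # softmax via exp(z_i - LSE)
    return P @ V, lse

def recompute_probs(Q, K, lse):
    """Backward pass: rebuild P = exp(S - LSE) from the cached LSE instead
    of having stored it, then feed P into the usual softmax gradient."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    return np.exp(S - lse)
```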

Overall, these nine ideas combine to deliver high‑performance, memory‑efficient attention for large models, and they continue to evolve with newer GPU hardware.

Written by AI2ML AI to Machine Learning

Original articles on artificial intelligence and machine learning, deep optimization. Less is more, life is simple! Shi Chunqi