The 9 Key Ideas Behind FlashAttention
FlashAttention accelerates transformer attention by combining nine ideas—loss-less (exact) computation, exploitation of the GPU memory pyramid, SRAM-reusing tiling, safe softmax scaling, an online streaming algorithm, a tile-based memory layout, parallel multiplication in the online loop, reduced KV slicing, and an integrated backward pass—to achieve efficient, high-throughput computation on modern GPUs.
Idea 1 – Loss‑less acceleration for Transformers
FlashAttention targets the Q-K-V attention of a Transformer. Q, K, and V are linear projections of the input computed with learned weight matrices; the only non-linearity is the softmax applied to the Q·Kᵀ score matrix. By reorganising the computation rather than approximating it, FlashAttention achieves a loss-less (exact) speedup for the entire attention operation.
Idea 2 – GPU hardware‑pyramid speed structure
The method exploits the hierarchical memory of modern GPUs:
SMs (streaming multiprocessors), each with fast on-chip SRAM (shared memory / L1 cache)
L2 cache shared across SMs
High-bandwidth memory (HBM), the GPU's DRAM (VRAM)
Data flows from CPU main memory → PCIe → GPU DRAM (HBM) → GPU caches → the Tensor Cores inside each SM. Images illustrate the memory hierarchy of the NVIDIA A100 and H100 and the corresponding Tensor-Core scheduling.
Idea 3 – Tiling to reuse SRAM
Tiling partitions the attention computation so that each tile fits within an SM's shared SRAM. The tile size is derived from the shared-memory capacity (M), and the same tile layout is kept across devices so the algorithm's structure is hardware-independent. Images show tile-size selection and cross-device consistency.
Idea 4 – Safe Softmax
Because exp(x) overflows for large logits, FlashAttention first subtracts the row maximum from the logits before applying softmax. This prevents numerical overflow and is required for the tiled implementation. The rescaling follows the standard "log-sum-exp" trick (see arXiv:1805.02867).
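A minimal sketch of the max-subtraction trick described above (function name is mine, not from the paper):

```python
import numpy as np

def safe_softmax(z):
    # Subtract the row maximum before exponentiating so exp() never overflows:
    # the largest exponent becomes exp(0) = 1.
    m = np.max(z, axis=-1, keepdims=True)
    e = np.exp(z - m)
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([1000.0, 1001.0, 1002.0])  # naive np.exp(logits) would overflow
print(safe_softmax(logits))                   # ≈ [0.090, 0.245, 0.665]
```

Subtracting a constant from every logit leaves the softmax unchanged, since the factor exp(−m) cancels between numerator and denominator.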
Idea 5 – Online algorithm
The online algorithm keeps only a small buffer of running statistics (the running maximum and the running sum of exponentials). As each new block of the stream arrives, these statistics are updated and previously accumulated results are rescaled, so the final result is exact while memory use stays constant. The transformed online loop is shown in the accompanying diagram. Reference: https://arxiv.org/pdf/1805.02867
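The single-pass normalizer update from the referenced paper can be sketched as follows (a simplified scalar version; variable names are mine):

```python
import numpy as np

def online_softmax_normalizer(stream):
    # One pass over the logits, keeping only two scalars:
    # m = running maximum, s = running sum of exp(x - m).
    m, s = float("-inf"), 0.0
    for x in stream:
        m_new = max(m, x)
        # When the running max changes, rescale the old sum to the new scale.
        s = s * np.exp(m - m_new) + np.exp(x - m_new)
        m = m_new
    return m, s  # softmax denominator at scale exp(-m)
```

A second pass (or a fused one, as in FlashAttention) then yields each probability as exp(xᵢ − m) / s; the point is that the full vector of exponentials never needs to be materialized at once.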
Idea 6 – Tile‑based memory layout
Q, K, V, and O all have shape N×d. The SRAM capacity M satisfies d ≤ M ≪ N·d, so the full matrices cannot fit but individual tiles can. With a block of Bc rows, each tile occupies Bc·d entries, and four tiles (one each from Q, K, V, O) reside simultaneously when 4·Bc·d ≤ M, i.e. Bc = ⌊M/(4d)⌋. This layout enables the entire attention computation using only four SRAM-resident blocks. Diagrams illustrate the placement of the four tiles and the resulting data flow.
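As a hypothetical illustration (assuming SRAM capacity is measured in scalars and all four tiles share the same row count, which the actual kernel refines further):

```python
def block_rows(M, d):
    """Rows per tile so that four Bc-by-d tiles fit in M scalars of SRAM."""
    Bc = M // (4 * d)
    assert 4 * Bc * d <= M  # the four tiles (Q, K, V, O) fit simultaneously
    return Bc

# Example: 100 KB of shared memory holding fp32 (4 bytes), head dimension d = 64.
M = (100 * 1024) // 4      # capacity expressed in scalars
print(block_rows(M, 64))   # -> 100
```

Larger head dimensions force smaller blocks, which is why the block size must be recomputed per model rather than fixed once.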
Idea 7 – Parallel multiplication in the online loop
Within each iteration, multiplication and division are decoupled: the numerator (the weighted-value accumulator) and the denominator (the softmax normalizer) are maintained as two parallel multiplication streams, and the division is applied only once, in the final update step. This eliminates per-iteration divisions and speeds up the forward pass. The revised iteration diagram is included.
Idea 8 – Reduce KV slicing
In the tiled loop, K and V blocks are read and written twice while each Q block is accessed only once. Minimizing the number of KV slices—that is, using larger, consistently sized KV blocks—keeps the tile-based partitioning correct while reducing memory traffic between HBM and SRAM. The effect of larger KV blocks on the read/write count is illustrated.
Idea 9 – Integrated backward pass
Standard back-propagation does not exploit FlashAttention's structure. Applying the same tiling to the backward pass lets the forward pass cache only the per-row log-sum-exp (LSE); the softmax probabilities needed for the gradient are then recomputed as exp(zᵢ − LSE). Storing a single LSE value per row instead of the full probability matrix reduces cache usage and avoids redundant calculation. Diagrams show the forward-backward integration.
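The recomputation identity above can be checked directly (a toy demonstration with my own variable names, not the paper's kernel):

```python
import numpy as np

# Forward pass stores one LSE scalar per row instead of the full
# probability matrix; backward rebuilds the probabilities on the fly.
z = np.random.default_rng(1).standard_normal(6)          # one row of logits
lse = z.max() + np.log(np.sum(np.exp(z - z.max())))      # numerically stable LSE
p_recomputed = np.exp(z - lse)                           # softmax row, rebuilt
assert np.allclose(p_recomputed, np.exp(z) / np.exp(z).sum())
```

For a sequence of length N this replaces an N×N cached matrix with an N-element vector, which is where the backward-pass memory saving comes from.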
Overall, these nine ideas combine to deliver high‑performance, memory‑efficient attention for large models, and they continue to evolve with newer GPU hardware.
AI2ML (AI to Machine Learning) — original articles on artificial intelligence and machine learning. Author: Shi Chunqi.
