How AFD Splits Attention and FFN to Boost DeepSeek‑V3 Inference by Up to 19%
The article details the Attention‑FFN Disaggregation (AFD) technique used by Baidu Baige to separate self‑attention and feed‑forward network stages in DeepSeek‑V3 models, describing multi‑stage scheduling, three‑batch overlap, communication optimizations, and performance results that achieve up to 19% throughput improvement under a 100 ms SLO.
Background and Motivation
DeepSeek‑V3 and its variants (R1, V3.1, V3.2) suffer from low FFN compute density and poor GPU utilization when deployed with the traditional DP‑Attention + EP (expert‑parallel) layout. Decode steps process only a small token batch per request, and the sparse MoE architecture spreads those tokens across many experts, further shrinking per‑expert workload; raising the batch size to compensate quickly violates service‑level objectives (SLOs).
AFD Concept (Attention‑FFN Disaggregation)
AFD decouples the self‑attention (A) and feed‑forward network (F) stages, deploying them on separate instances: the A instance runs DP‑parallel attention, while the F instance runs EP‑parallel FFN. Because the FFN stage can now run at larger batch sizes, its compute density improves. Two‑stage cross‑machine communication reduces RDMA traffic by 4‑8×, eliminating bandwidth bottlenecks and delivering up to 19% throughput gain under a 100 ms SLO.
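To see why larger FFN batches matter, consider per‑expert workload. A back‑of‑the‑envelope calculation in Python (the MoE shape is DeepSeek‑V3's published configuration of 256 routed experts with top‑8 routing; the batch sizes are illustrative, not from the article):

```python
# Average tokens each routed expert processes per layer per step.
ROUTED_EXPERTS = 256   # DeepSeek-V3's routed-expert count
TOP_K = 8              # experts activated per token

def tokens_per_expert(batch_tokens: int) -> float:
    return batch_tokens * TOP_K / ROUTED_EXPERTS

# Small decode batch on a unified instance: tiny per-expert GEMMs.
print(tokens_per_expert(64))        # 2.0
# F instance serving a batch aggregated from several A micro-batches
# (the 8x factor is illustrative): denser GEMMs, better utilization.
print(tokens_per_expert(64 * 8))    # 16.0
```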
Baseline Deployment vs. AFD Deployment
In Baidu Intelligent Cloud, the baseline uses DP‑Attention + EP for the whole model. AFD introduces separate A and F instances, enabling larger micro‑batch sizes for FFN and employing a three‑batch overlap (3BO) scheduler to keep the A instance busy while the F instance processes previous batches.
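A minimal sketch of the two deployment shapes, using the parallelism degrees reported in the evaluation below; the descriptor format and field names are hypothetical, not Baidu's actual configuration:

```python
# Baseline: attention and FFN colocated on the same ranks.
baseline = {
    "role": "unified",
    "attention_dp": 32,   # DP-parallel attention
    "ffn_ep": 32,         # EP-parallel experts on the same GPUs
}

# AFD: attention and FFN on separate instances.
afd = [
    {"role": "attention", "attention_dp": 32},  # A instance, owns the KV cache
    {"role": "ffn", "ffn_ep": 16},              # F instance, hosts the experts
]
```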
Scheduling and Overlap Strategies
Two‑batch overlap (2BO) cannot fully hide both FFN compute and communication latency. AFD therefore adopts three‑batch overlap (3BO), keeping an additional micro‑batch in flight so that the combined latency of dispatch + combine is covered by another micro‑batch's attention computation. The scheduler arranges operator stages horizontally within each micro‑batch and vertically across micro‑batches, so that one batch's dispatch and combine transfers run concurrently with another batch's computation. Full overlap requires two conditions, checked numerically in the sketch below:
Condition 1: self‑attention latency must be ≥ FFN compute latency.
Condition 2: self‑attention latency must be ≥ total communication latency (dispatch + combine).
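A quick numeric check of the two conditions (the thresholds are the conditions above; all latencies are made up for illustration):

```python
# Illustrative per-micro-batch latencies in microseconds (made up).
t_attn     = 900   # self-attention compute on the A instance
t_ffn      = 700   # MoE FFN compute on the F instance
t_dispatch = 350   # A -> F token transfer
t_combine  = 350   # F -> A result transfer

hides_ffn  = t_attn >= t_ffn                    # Condition 1
hides_comm = t_attn >= t_dispatch + t_combine   # Condition 2

if hides_ffn and hides_comm:
    print("3BO fully hides FFN compute and both transfers")
else:
    print("overlap is incomplete; the A instance will stall")
```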
Two‑Stage Communication Design
The communication library is split into two phases:
Phase 1: Tokens are sent from each A instance to the same‑rank FFN GPU (e.g., GPU 0 on A to GPU 0 on F) via RDMA.
Phase 2: Within the F node, tokens are routed to the selected expert GPUs using the internal scale‑up network.
This cuts cross‑machine RDMA traffic by a factor of 4‑8, eliminating the bandwidth bottleneck observed in the baseline. The arithmetic behind the reduction is sketched below.
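Our reading of where the reduction comes from, as illustrative arithmetic: in the baseline, a token's hidden state crosses RDMA once per selected expert, while in the two‑stage scheme it crosses RDMA exactly once and fans out over the intra‑node fabric. Hidden size 7168 and top‑8 routing are DeepSeek‑V3's published values; the 8× result is the worst case where all selected experts are remote, consistent with the article's 4‑8× range:

```python
HIDDEN_BYTES = 7168 * 2   # DeepSeek-V3 hidden size, bf16 activations
TOP_K = 8                 # routed experts selected per token

def rdma_bytes_per_token(scheme: str) -> int:
    if scheme == "baseline":
        # Direct EP dispatch: worst case, every selected expert is on a
        # remote node, so the token crosses RDMA top-k times.
        return TOP_K * HIDDEN_BYTES
    if scheme == "two_stage":
        # AFD phase 1: one RDMA hop to the same-rank F GPU; phase 2
        # fan-out uses the intra-node scale-up network, not RDMA.
        return HIDDEN_BYTES
    raise ValueError(scheme)

print(rdma_bytes_per_token("baseline") / rdma_bytes_per_token("two_stage"))  # 8.0
```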
Implementation Details
AFD extends the SGLang framework with new 3BO operators and scheduling logic. Operator stages are declared with explicit YieldOperation markers to define boundaries. Empty (nop) stages are inserted to align timing and avoid GPU idle windows. The dispatch operator transfers hidden states and token metadata, while the combine operator aggregates expert results and returns them to the A instance.
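A minimal generator‑style sketch of stage boundaries and round‑robin overlap in the spirit described above. The function names and the scheduling loop are ours, not SGLang's actual 3BO operators, and the arithmetic stand‑ins replace real kernels:

```python
def attention(x):   # stand-in for the A-instance attention kernel
    return x + 1

def moe_ffn(x):     # stand-in for the F-instance expert computation
    return x * 2

def decode_layer(x):
    """One layer split into stages; each yield marks a boundary where
    the scheduler may switch to another in-flight micro-batch."""
    h = attention(x)
    yield "dispatch"        # A -> F transfer happens across this boundary
    h = moe_ffn(h)
    yield "combine"         # F -> A transfer happens across this boundary
    yield ("done", h)

def three_batch_overlap(inputs):
    """Round-robin three micro-batches stage by stage, so one batch
    computes while the others are in dispatch/combine; finished batches
    are skipped, playing the role of the nop padding stages."""
    gens = [decode_layer(x) for x in inputs]
    done = {}
    while len(done) < len(gens):
        for i, g in enumerate(gens):
            if i in done:
                continue
            step = next(g)
            if isinstance(step, tuple):   # ("done", result)
                done[i] = step[1]
    return [done[i] for i in range(len(gens))]

print(three_batch_overlap([1, 2, 3]))     # [4, 6, 8]
```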
Evaluation
Experiments were conducted on 4 × 8‑GPU nodes, comparing a DP‑32 + EP‑32 baseline against a DP‑32 + EP‑16 AFD configuration. Three datasets (low‑latency online, near‑line, and synthetic benchmarks) with varying input lengths and SLO constraints were used. Results show that AFD matches baseline performance in low‑latency scenarios and delivers up to 19% throughput improvement under a 100 ms SLO.
Challenges and Future Work
Remaining challenges include increased CUDA graph launch latency on the A instance, higher scheduling complexity for multi‑micro‑batch overlap, and expert redundancy/conflicts in the FFN stage. Ongoing work focuses on MTP + asynchronous scheduling, enhancing the scheduler’s micro‑batch awareness, and extending AFD to models with lower MoE sparsity.
Overall, AFD demonstrates that decoupling attention and FFN, combined with careful scheduling and communication redesign, can significantly improve inference throughput and GPU utilization for large MoE models.