Fine‑Grained Activation Offloading: Cutting Memory Use While Preserving LLM Throughput
The article introduces a fine‑grained activation offloading technique implemented in Megatron‑Core that offloads module‑level activations to CPU, overlaps transfer with computation, and remains compatible with pipeline and virtual pipeline parallelism, dramatically reducing peak GPU memory for large language models while incurring minimal throughput loss.
As large language models (LLMs) scale beyond a hundred billion parameters and context lengths reach 32K tokens, activation memory becomes a critical bottleneck: it grows rapidly with sequence length (the attention-score activations scale quadratically with it) and often exceeds GPU capacity, especially in multimodal and reinforcement-learning training.
To balance memory consumption and training efficiency, the authors propose Fine-grained Activation Offloading, a module/operator-level offloading scheme integrated into the Megatron-Core framework. It works alongside pipeline parallelism (PP), virtual pipeline parallelism (VPP), and fine-grained recomputation, striking a favorable trade-off between memory savings and throughput loss.
Core Design Principles
Granular Offloading: Activations are offloaded at the level of individual modules (e.g., qkv_linear, core_attn, moe_act) rather than whole layers, as illustrated in the sketch after this list.
Compute‑Transfer Overlap: Forward offload and backward reload are overlapped with subsequent computations using separate CUDA streams (D2H/H2D), hiding data‑transfer latency.
Full‑Scenario Compatibility: Supports PP=1/PP>1/VPP>1, mixed‑precision formats (BF16, Blockwise‑FP8, MXFP8, NVFP4), 1F1B A2A overlap, CUDA Graph, and complex MoE/MLA models.
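The following minimal sketch (hypothetical helper names, not the Megatron-Core API) illustrates what module-level granularity means in practice: the offload decision is keyed on a submodule's name rather than applied to a whole transformer layer, using the same example selection (core_attn, expert_fc1) that appears in the guidelines later in the article.

```python
# Hypothetical sketch of module-level selection (not the Megatron-Core API): the
# offload decision is made per named submodule instead of per transformer layer.
OFFLOAD_MODULES = {"core_attn", "expert_fc1"}   # example selection from the guidelines below

def maybe_offload(module_name: str, activation, offload_fn):
    """Offload this module's saved activation only if it is in the selected set."""
    if module_name in OFFLOAD_MODULES:
        return offload_fn(activation)   # async D2H copy, sketched in the next sections
    return activation                   # keep on GPU (or hand off to recomputation)
```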
Forward Offload Logic
When a module finishes its forward computation, its input and intermediate activations are immediately offloaded to CPU memory if they are needed for the backward pass.
Offload operations run in parallel with the next module’s computation via an independent CUDA stream (a PyTorch-level sketch follows this list).
Special rule: for chunks whose forward and backward passes are adjacent in the schedule, the last layer’s activation is not offloaded, since there would be no computation to hide the reload behind.
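A minimal PyTorch-level sketch of this forward-side behavior, assuming pinned CPU buffers and a dedicated D2H stream. It uses PyTorch's saved-tensors hooks as a stand-in for Megatron-Core's module instrumentation; all names are illustrative, not the article's code.

```python
import torch

d2h_stream = torch.cuda.Stream()   # dedicated copy stream so D2H transfers overlap compute

def pack_to_cpu(act: torch.Tensor):
    """Called when autograd saves a tensor for backward: start an async copy to pinned CPU."""
    if not act.is_cuda:
        return act
    cpu_buf = torch.empty(act.shape, dtype=act.dtype, device="cpu", pin_memory=True)
    d2h_stream.wait_stream(torch.cuda.current_stream())   # offload starts after forward finishes
    with torch.cuda.stream(d2h_stream):
        cpu_buf.copy_(act, non_blocking=True)
        act.record_stream(d2h_stream)   # keep the GPU buffer alive until the copy completes
    done = torch.cuda.Event()
    done.record(d2h_stream)
    return cpu_buf, done, act.device

def unpack_from_cpu(packed):
    """Called when backward needs the tensor: reload it once the offload has completed."""
    if isinstance(packed, torch.Tensor):
        return packed
    cpu_buf, done, device = packed
    done.synchronize()                              # the corresponding offload is finished
    return cpu_buf.to(device, non_blocking=True)

def forward_with_offload(module, *inputs):
    """Run one module's forward so that its activations saved for backward live on the CPU."""
    with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
        return module(*inputs)
```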
Backward Reload Logic
After a module completes its backward computation, the activations needed by the module whose backward runs next are reloaded.
This staggered reload prevents memory usage from doubling at any moment.
Reload also overlaps with the next backward computation using a separate CUDA stream (see the sketch below).
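One possible way to stagger the reloads, consistent with the description above (illustrative scheduling and helper names only, not the article's code): the reload for the layer that runs backward next is issued on a separate H2D stream while the current layer's backward executes, and the compute stream waits on an event only when it actually needs those activations.

```python
import torch

h2d_stream = torch.cuda.Stream()   # dedicated reload stream

def prefetch(cpu_acts, device):
    """Enqueue async host-to-device copies of one layer's activations; return tensors + event."""
    ready = torch.cuda.Event()
    with torch.cuda.stream(h2d_stream):
        gpu_acts = [t.to(device, non_blocking=True) for t in cpu_acts]  # pinned CPU tensors
        ready.record(h2d_stream)
    return gpu_acts, ready

def backward_over_layers(per_layer_cpu_acts, run_layer_backward, device="cuda"):
    """Run backward layer by layer, overlapping each reload with the previous layer's backward."""
    n = len(per_layer_cpu_acts)
    pending = prefetch(per_layer_cpu_acts[n - 1], device)          # last layer runs backward first
    for i in range(n - 1, -1, -1):
        gpu_acts, ready = pending
        if i > 0:
            pending = prefetch(per_layer_cpu_acts[i - 1], device)  # overlaps with this backward
        torch.cuda.current_stream().wait_event(ready)  # activations ready before backward starts
        run_layer_backward(i, gpu_acts)
```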
Synchronization Rules
Offload starts only after the current module’s forward pass finishes.
Reload starts only after the current module’s backward pass finishes.
Reload begins only after the corresponding offload has completed (CPU tensor ready).
All required activations must be reloaded before the next backward computation begins. The sketch below expresses these four rules as CUDA event dependencies.
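These rules map naturally onto CUDA events between the compute stream and the two copy streams. The snippet below only illustrates that dependency structure (it is not the actual implementation); no rule requires a full device synchronization.

```python
import torch

compute = torch.cuda.current_stream()
d2h, h2d = torch.cuda.Stream(), torch.cuda.Stream()
fwd_done, offload_done, bwd_done, reload_done = (torch.cuda.Event() for _ in range(4))

fwd_done.record(compute)        # forward kernels of the module end here
d2h.wait_event(fwd_done)        # rule 1: offload only after the forward pass finishes
# ... D2H copy kernels would be enqueued on `d2h` here ...
offload_done.record(d2h)

bwd_done.record(compute)        # backward kernels of the current module end here
h2d.wait_event(bwd_done)        # rule 2: reload only after the current backward finishes
h2d.wait_event(offload_done)    # rule 3: reload only after the matching offload has completed
# ... H2D copy kernels would be enqueued on `h2d` here ...
reload_done.record(h2d)

compute.wait_event(reload_done) # rule 4: the next backward waits for its reloaded activations
```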
Compatibility with Pipeline Parallelism
The scheme handles model chunks transparently; a singleton ChunkOffloadHandler queue manages offload/reload for each micro‑batch. VPP stages are queued in reverse order, ensuring correct execution order during backward passes.
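A simplified stand-in for such a handler, showing only the queuing idea from the paragraph above (this is not the Megatron-Core source, and real 1F1B schedules add more bookkeeping): offload records are pushed per model chunk and micro-batch during forward and popped from the back during backward, matching the reversed order in which virtual-pipeline chunks run their backward passes.

```python
from collections import deque

class ChunkOffloadHandler:
    """Singleton that tracks offloaded activations per (VPP chunk, micro-batch)."""
    _instance = None

    @classmethod
    def get(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def __init__(self):
        self._queue = deque()   # entries appended in forward execution order

    def push_forward(self, chunk_id: int, microbatch_id: int, offloaded_acts):
        """Record the offloaded activations produced by one chunk's forward pass."""
        self._queue.append((chunk_id, microbatch_id, offloaded_acts))

    def pop_backward(self):
        """Backward visits chunks in the reverse of forward order, so pop from the back."""
        return self._queue.pop()
```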
Integration with Fine‑grained Recomputing
Offloading is combined with recomputation: lightweight operators (e.g., layernorm, moe_act) are recomputed, while heavy operators (e.g., core_attn, expert_fc1) are offloaded. This hybrid approach can free all activations of a transformer layer, achieving a "memory‑free" state.
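A hedged sketch of this hybrid policy on a simplified attention sublayer: the lightweight layernorm is recomputed via activation checkpointing, while the heavy core-attention activations are handed to an offload wrapper like the one sketched earlier. The split below only illustrates the lightweight-vs-heavy distinction; it is not the article's implementation.

```python
import torch
from torch.utils.checkpoint import checkpoint

def hybrid_sublayer(x, norm, core_attn, forward_with_offload):
    """Illustrative hybrid: recompute the cheap op, offload the expensive one.

    `forward_with_offload` is assumed to behave like the saved-tensors-hook wrapper
    sketched in the Forward Offload Logic section."""
    # Lightweight op: drop its activations and recompute them during backward.
    normed = checkpoint(norm, x, use_reentrant=False)
    # Heavy op: keep its saved activations, but park them in CPU memory until backward.
    return forward_with_offload(core_attn, normed)
```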
Experimental Evaluation
Experiments on Megatron‑LM using models such as DeepSeek‑V3, Qwen3‑235B, and Dots2.llm evaluate two metrics: throughput and peak memory. Key findings include:
Enabling offload on selected modules yields 10‑35% peak memory reduction with only 1‑2% throughput loss.
Combining offload with fine‑grained recompute can further improve throughput (7‑10%) while preserving memory savings.
Four representative experiments (DeepSeek‑V3‑proxy on 64 GPUs, DeepSeek‑V3 on 256 GPUs, Qwen3‑235B long‑sequence training, Dots2.llm on 512 GPUs) demonstrate consistent benefits across different scales and configurations.
Practical Guidelines
Diagnose memory bottlenecks by profiling activation sizes (e.g., using PyTorch memory snapshot).
Compute the maximum offloadable tensor size per overlap window: max_bytes = compute_time × bandwidth (a worked example follows this list).
Configure offload modules via --offload-modules=core_attn,expert_fc1 and verify overlap with NSys timelines and memory snapshots.
Iteratively adjust offload selections and parallelism (TP/PP/CP) based on observed memory savings and performance impact.
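A small worked sketch of the first two steps. The memory-snapshot calls are the ones PyTorch documents for its memory visualizer; the numbers in the bandwidth estimate are assumptions for illustration, not measurements from the article.

```python
import torch

# Step 1: capture where activation memory goes during one training iteration.
torch.cuda.memory._record_memory_history(max_entries=100_000)
# ... run one training step here ...
torch.cuda.memory._dump_snapshot("activation_profile.pickle")   # inspect at pytorch.org/memory_viz
torch.cuda.memory._record_memory_history(enabled=None)          # stop recording

# Step 2: bound how much can be offloaded without stalling, max_bytes = compute_time x bandwidth.
compute_time_s = 12e-3     # assumed overlappable compute window per layer (12 ms)
d2h_bandwidth = 25e9       # assumed effective PCIe D2H bandwidth, bytes/s
max_bytes = compute_time_s * d2h_bandwidth
print(f"~{max_bytes / 1e9:.1f} GB of activations per window can be hidden behind compute")
```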
Future Optimizations
Capture offload operations with CUDA Graph to further reduce CPU overhead.
Optimize FP8 tensor offloading to cut conversion and packing costs.
Integrate with Megatron‑FSDP for even larger models.
Support per‑layer offload policies and cross‑PP‑rank memory balancing.
The fine‑grained activation offloading solution in Megatron‑Core jointly optimizes peak memory and throughput, enabling efficient training of trillion‑parameter LLMs and long‑context workloads.
Xiaohongshu Tech REDtech