How AMem NCCL‑Plugin Cuts GPU Memory Overhead for Trillion‑Parameter RL Models
This article explains the design, implementation, and performance of the AMem NCCL‑Plugin, a lightweight extension to NVIDIA's NCCL that enables transparent offloading and rapid recovery of GPU memory during reinforcement‑learning training of trillion‑parameter models. It covers the plugin's architecture, APIs, benchmarks, installation steps, and integration guidelines.
Technical Overview
AMem NCCL‑Plugin is a custom extension to NVIDIA's Collective Communications Library (NCCL) that adds two memory‑management APIs, ncclPause() and ncclResume(), allowing transparent offloading of NCCL‑allocated GPU memory and fast restoration without tearing down communication groups. It targets reinforcement‑learning (RL) workloads where training and inference share the same GPU (co‑card deployment) and memory must be released quickly between tasks.
Background Problem
In RL pipelines, GPU memory is stateful and must be explicitly managed when switching from training to inference. Existing solutions either keep the memory allocated (wasting >10 GB per GPU) or destroy and rebuild NCCL groups, incurring seconds of latency. The challenge is to achieve both low memory usage and low latency.
Design and Architecture
The plugin follows a two‑layer decoupling design:
Interface Coupling Layer: Minimal patches to the NCCL source modify only allocation, release, and mapping operations, preserving core NCCL logic.
Functionality Decoupling Layer: All new logic resides in a separate libamem_nccl.so library, which handles metadata management, distributed reference tracking, and redo‑based recovery; a hedged sketch of this delegation follows below.
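As a loose illustration of this split, the patched allocation path can be thought of as looking up an optional hook in libamem_nccl.so at load time and calling it only when the library is present. The hook name amemOnAlloc and its signature below are assumptions for this sketch, not the plugin's actual symbols.
/* Hypothetical sketch of the interface-coupling layer: the patched NCCL
   allocation path notifies an optional hook in libamem_nccl.so and otherwise
   behaves exactly as stock NCCL. Symbol names are illustrative only. */
#include <dlfcn.h>
#include <stddef.h>

typedef void (*amemOnAllocFn)(void *devPtr, size_t bytes);

static amemOnAllocFn amemOnAlloc = NULL;

/* Called once from the patched NCCL init path (illustrative). */
static void amemLoadPlugin(void) {
  void *lib = dlopen("libamem_nccl.so", RTLD_NOW | RTLD_GLOBAL);
  if (lib != NULL)
    amemOnAlloc = (amemOnAllocFn)dlsym(lib, "amemOnAlloc");
}

/* Invoked from the patched allocation routine after an allocation succeeds;
   core NCCL logic stays untouched when the plugin is absent. */
static void amemNotifyAlloc(void *devPtr, size_t bytes) {
  if (amemOnAlloc != NULL) amemOnAlloc(devPtr, bytes);
}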
Key components include:
Metadata Management: Tracks memory addresses, cross‑rank references, and current state.
Distributed Reference Identification: Uses a Unix Domain Socket (UDS) to exchange file descriptors and group processes for accurate reference counting; a minimal fd‑passing sketch follows this list.
Process Group Communication: Supports multiple communication groups (e.g., 3D/4D parallelism) and distinguishes training from inference groups via ncclSetGroupID().
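The sketch below shows the standard way one file descriptor is passed over a connected Unix Domain Socket using SCM_RIGHTS ancillary data, the mechanism a UDS‑based reference‑tracking channel typically relies on. The helper name sendFdOverUds is illustrative, not part of the plugin's API.
/* Minimal sketch: send one file descriptor over a connected UDS socket. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int sendFdOverUds(int sock, int fd) {
  char payload = 'F';                          /* one byte of ordinary data */
  struct iovec iov = { &payload, 1 };
  char cbuf[CMSG_SPACE(sizeof(int))];
  memset(cbuf, 0, sizeof(cbuf));

  struct msghdr msg;
  memset(&msg, 0, sizeof(msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = cbuf;
  msg.msg_controllen = sizeof(cbuf);

  struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;                /* marks the payload as an fd */
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

  return sendmsg(sock, &msg, 0) == 1 ? 0 : -1; /* expect 1 byte of data sent */
}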
Performance Guarantees
The plugin provides three guarantees:
Source Tracing: Records which processes reference each NCCL buffer, ensuring complete release and accurate redo.
State Management: Continuously updates per‑process and per‑buffer states to guarantee consistency; the metadata sketch after this list illustrates the kind of state involved.
Distributed Flow: Guarantees correct offload and restore across ranks using the UDS channel.
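For illustration only, the per‑buffer bookkeeping these guarantees imply might look like the following; the field and enum names are assumptions for this sketch, not the plugin's actual data structures.
#include <stddef.h>

/* Illustrative per-buffer metadata: address, size, owning group, cross-rank
   reference count, and offload state. Names are assumptions for this sketch. */
typedef enum { AMEM_RESIDENT, AMEM_PAUSED } amemBufState_t;

typedef struct {
  void          *devPtr;    /* GPU virtual address of the NCCL buffer */
  size_t         bytes;     /* allocation size */
  int            groupId;   /* training vs. inference group (ncclSetGroupID) */
  int            refCount;  /* ranks/processes still referencing the buffer */
  amemBufState_t state;     /* resident on the GPU or offloaded */
} amemBufMeta_t;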
Benchmarks on the Ring‑1T trillion‑parameter model show memory savings of >10 GB during training and >2 GB during inference, with offload‑restore latency typically under 1 second.
Installation & Build
Supported on NVIDIA GPUs with compute capability SM80 or newer and CUDA ≥ 12.2. Build steps (recommended Docker image: nvcr.io/nvidia/pytorch:25.08-py3):
# Clone repository
git clone https://github.com/inclusionAI/asystem-amem.git
cd asystem-amem
git submodule init && git submodule update
./build.sh
Key compile‑time flags:
NCCL_ENABLE_CUMEM=1 – enable NCCL CUMEM support.
AMEM_ENABLE=1 – activate AMem memory offload.
AMEM_GROUPID=xxx – assign distinct group IDs to training and inference processes.
API Reference
New NCCL APIs added in nccl.h:
// Synchronous pause – releases local NCCL memory and decrements cross‑rank counters.
ncclResult_t ncclPause(ncclComm_t *comm = NULL);
// Synchronous resume – restores previously paused memory.
ncclResult_t ncclResume(ncclComm_t *comm = NULL);
// Statistics of NCCL memory usage.
ncclResult_t ncclMemStats();
// Set/Get process group identifier for correct reference tracking.
ncclResult_t ncclSetGroupID(int id);
ncclResult_t ncclGetGroupID(int *id);
All calls are idempotent and must be paired across all ranks.
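A hedged usage sketch of the pause/resume pair at the C level, assuming only the declarations above; the helper functions stand in for application code, the meaning of a NULL argument is assumed, and error handling is elided.
#include <nccl.h>

void runTrainingStep(void);    /* placeholder: one RL training step */
void runInferencePhase(void);  /* placeholder: rollout/inference on the same GPU */

void trainThenInfer(void) {
  runTrainingStep();

  /* Every rank calls ncclPause(); NULL follows the default argument above
     and is assumed here to cover all NCCL memory owned by this process. */
  ncclPause(NULL);

  runInferencePhase();         /* inference now uses the released GPU memory */

  /* The paired ncclResume() on every rank restores the offloaded buffers
     without rebuilding communication groups. */
  ncclResume(NULL);
}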
Framework Integration
Python wrappers (e.g., pynccl) expose the new APIs to libraries such as SGLang, vLLM, and Megatron. Example usage:
from sglang.srt.distributed.parallel_state import get_tensor_model_parallel_group
tp_group = get_tensor_model_parallel_group().pynccl_comm
if tp_group.nccl.enable_amem_nccl:
    tp_group.nccl_pause()
    # ... run inference ...
    tp_group.nccl_resume()
For co‑card deployments, set different group IDs for training and inference processes before any NCCL allocation.
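At the NCCL level, that ordering can be sketched as below, assuming the ncclSetGroupID declaration from the API Reference; the ID values and the isInference flag are illustrative.
#include <nccl.h>

/* Hedged sketch: pick a distinct group ID before creating any communicator or
   allocating NCCL memory, so training and inference buffers can be told apart.
   The ID values are arbitrary examples. */
void initProcessGroupId(int isInference) {
  ncclSetGroupID(isInference ? 1 : 0);
  /* ... ncclCommInitRank(...) and all subsequent NCCL allocations follow ... */
}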
Future Roadmap
Short‑term plans include supporting NCCL 2.28, deeper collaboration with the NCCL community, and targeted tests for symmetric memory. Mid‑ and long‑term goals involve extending AMem to new hardware, optimizing for agentic RL scenarios, and exploring tighter integration of communication and memory management.
References
Every Step Evolves: Scaling Reinforcement Learning for Trillion‑Scale Thinking Model, arXiv 2510.18855.
GLake: https://github.com/antgroup/glake.
Demystifying NCCL: An In‑depth Analysis of GPU Communication Protocols and Algorithms, arXiv 2507.04786.
NVIDIA NCCL 2.27 blog post.