How MiniMax’s Forge Architecture Achieves 40× Faster Agent RL Training
This article details MiniMax's Forge system, an asynchronous, Agent-native RL architecture. By standardizing Agent-LLM interaction and combining engineering optimizations (novel scheduling, prefix-tree merging) with new reward designs, Forge enables million-sample daily throughput, stable reward growth, and up to 40-fold training acceleration for the MiniMax M2.5 model.
Problem Modeling
The training objective is to maximize the effective training return J. Throughput (raw tokens per second) is determined by four components: Rollout, Training, Data Processing, and I/O. Sample efficiency depends on data distribution, data quality, algorithmic efficiency, and the degree of off-policyness. Stability and convergence are monitored via training metrics.
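The article lists these factors without composing them; one plausible reading, an assumption on our part rather than a formula from the source, is multiplicative:

```latex
% Hedged sketch, not from the source: effective training return as
% system throughput times per-token sample efficiency, with stability
% acting as a hard constraint (a diverged run contributes nothing).
J_{\text{eff}} \;\approx\;
  \underbrace{T}_{\substack{\text{tokens/sec:}\\ \text{rollout, training, data, I/O}}}
  \times
  \underbrace{\eta}_{\text{sample efficiency per token}}
```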
Challenges
Agent scalability and framework flexibility
Agent freedom is limited when the Agent is treated as a white‑box because state must be shared with the RL framework, making it difficult to model complex architectures such as dynamic context management or Multi‑Agent RL.
Token‑in‑Token‑out (TITO) couples the Agent tightly to the tokenizer, increasing engineering cost for strict token consistency in complex context‑management scenarios.
System efficiency and computational redundancy
Rollout completion time varies from seconds to hours, creating an asynchronous scheduling dilemma: strict FIFO blocks on long-tail samples, while greedy first-finish-first-out scheduling maximizes throughput but causes uncontrolled distribution shift and training crashes.
Multi‑turn Agent requests share large context prefixes; encoding‑decoding each request independently wastes compute on repeated prefixes.
Credit assignment and optimization stability
Sparse rewards over thousands of steps lead to low signal‑to‑noise ratio, high gradient variance, and unstable large‑scale training.
Long Chain-of-Thought (CoT) reasoning inflates response length and thereby hurts user-perceived latency, risking strong benchmark scores but a poor real-world experience.
System Architecture and Agent RL Paradigm
Forge decouples Agent execution from the training/inference engine through three core modules.
Agent: abstracts a generic Agent (white-box or black-box) and its environment, acting as a pure trajectory producer. By separating environment interaction from LLM generation, the Agent focuses on core logic such as context management without dealing with training or inference details.
Middleware abstraction layer
Gateway Server – a standardized communication gateway that isolates the Agent from underlying model complexity.
Data Pool – a distributed buffer that asynchronously collects trajectories and signals, enabling flexible data processing and batching.
Training and Inference Engine
Rollout Engine – dedicated to high‑throughput token generation for Agent requests.
Train Engine – fetches data from the Data Pool via a scheduler, updates the Agent model, and keeps the sampling engine synchronized so the Agent always explores the latest policy distribution. A minimal sketch of this decoupled loop follows.
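The sketch below shows the shape of the decoupling; GatewayClient, DataPool, and the engine and trainer objects are hypothetical stand-ins, not MiniMax's published API.

```python
# Illustrative sketch only: all names here are hypothetical stand-ins.
import queue

class GatewayClient:
    """Stand-in for the Gateway Server: the only surface an Agent sees."""
    def __init__(self, rollout_engine):
        self._engine = rollout_engine

    def chat(self, messages: list) -> str:
        # The Agent sends plain chat requests; tokenization, batching,
        # and engine placement stay hidden behind the gateway.
        return self._engine.generate(messages)

class DataPool:
    """Stand-in for the distributed buffer: trajectories arrive asynchronously."""
    def __init__(self):
        self._buffer = queue.Queue()

    def put(self, trajectory: dict) -> None:
        self._buffer.put(trajectory)

    def get_batch(self, n: int) -> list:
        return [self._buffer.get() for _ in range(n)]

def rollout_worker(agent, gateway: GatewayClient, pool: DataPool, task) -> None:
    # The Agent is a pure trajectory producer: it runs its own loop
    # (tool calls, context management) and only reports the result.
    pool.put(agent.run(task, llm=gateway.chat))

def train_loop(trainer, pool: DataPool, rollout_engine, batch_size: int = 32) -> None:
    while True:
        batch = pool.get_batch(batch_size)              # fetch from Data Pool
        trainer.update(batch)                           # policy gradient step
        rollout_engine.load_weights(trainer.weights())  # keep sampler on latest policy
```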
Offline evaluation shows that different Agent scaffolds cause significant performance variance; the modular design allowed training with hundreds of Agent frameworks without modifying their code, demonstrating strong generalization.
White‑box Agent RL: Context Management Example
Context-scene performance degradation: as interaction rounds increase, redundant reasoning and observations cause "attention dilution," weakening focus on key information within the context window.
Train-inference mismatch: using context management only at inference time shifts the data distribution, forcing the model to handle long-context inputs it rarely saw in training and hurting performance.
CM-driven state transition: model Context Management as an Agent action; the transition from state s_t to s_{t+1} implicitly includes context changes, integrating context adaptation into the training objective.
Adaptive inference mode: optimize the policy π_θ so the model internalizes the distribution shift and prioritizes state-critical tokens during inference.
Perceptive CM strategy: the model learns to anticipate context changes, retain task-relevant information, and discard irrelevant context, markedly improving performance for Context-Management Agents. A toy sketch of CM-as-action follows this list.
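A toy sketch of context management modeled as an action; the action schema and field names are hypothetical, not MiniMax's code:

```python
# Toy sketch: context management (CM) is an action, so the transition
# s_t -> s_{t+1} carries context edits into the RL objective.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    messages: list = field(default_factory=list)   # the live context window

def step(state: AgentState, action: dict) -> AgentState:
    if action["type"] == "respond":
        state.messages.append({"role": "assistant", "content": action["text"]})
    elif action["type"] == "compress_context":
        # Hypothetical CM action: keep only task-critical messages plus a
        # model-written summary of what was dropped.
        state.messages = action["keep"] + [{"role": "system", "content": action["summary"]}]
    return state   # context changes are now part of the trained dynamics
```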
Black‑box Agent RL: Cross‑framework Robustness
Non-intrusive integration: Forge does not require knowledge of the Agent's internal logic; it only sends requests to the Gateway, allowing data collection and training for any black-box Agent, including those with memory compression or history rewriting.
Multi-framework generalization: by decoupling the training loop from the Agent's internal state, MiniMax M2.5 adapts to numerous black-box Agents (e.g., Opencode Agent, Truncate BC). Experiments show stable gains even on fully opaque systems.
Engineering Optimizations
Mixed Scheduling: Windowed FIFO
To balance throughput and data-distribution consistency, a Windowed FIFO strategy is introduced, positioned between strict FIFO and greedy scheduling. Assume a maximum generation concurrency N = 8192, a queue Q with head index H, and a visible window of size W = 4096. The rules, sketched in code after this list:
Limited visibility: the scheduler can only fetch completed trajectories within [H, H+W].
Local greedy (inside window): any completed trajectory inside the window can be fetched immediately, avoiding head-of-line blocking.
Global strict blocking (outside window): tasks beyond the window are not fetched, even if completed.
Constrained advancement: the window slides forward only when the head task is consumed (H → H+1), forcing the scheduler to wait for long-tail tasks and preventing the training distribution from drifting toward easy samples.
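A minimal Python sketch of these four rules; the data structures are illustrative, not Forge's implementation:

```python
# Windowed FIFO sketch: greedy inside the visible window, strictly blocked
# outside it, and the window advances only past consumed head tasks.
class WindowedFifo:
    def __init__(self, window: int = 4096):
        self.window = window
        self.tasks: list = []   # submission order; entries become None when consumed
        self.head = 0           # H: index of the oldest unconsumed task

    def submit(self, task) -> None:
        self.tasks.append(task)

    def fetch(self):
        """Return one finished trajectory, honoring the visibility window."""
        end = min(self.head + self.window, len(self.tasks))
        for i in range(self.head, end):          # local greedy inside [H, H+W)
            t = self.tasks[i]
            if t is not None and t.done:
                self.tasks[i] = None             # consume it
                # H advances only when the head itself has been consumed,
                # so a long-tail head task still gates global progress.
                while self.head < len(self.tasks) and self.tasks[self.head] is None:
                    self.head += 1
                return t
        return None   # nothing visible is done; tasks beyond W stay blocked
```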
Prefix Tree Merging
Multi‑turn Agent requests share large context prefixes, leading to redundant computation when each request is treated as an independent sample.
Shared base prefixes allow completions to be merged into a single prefix tree, even if subsequent responses diverge.
Attention-mask primitives (e.g., MagiAttention) represent dependencies between branches, ensuring the forward computation is mathematically identical to the naive approach; loss calculation remains unchanged after un-merging the tree back to sequence form.
The method eliminates redundant prefix computation, delivering roughly 40× training acceleration and substantially reducing memory consumption. A toy illustration follows this list.
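A toy illustration of the merging idea, assuming token-id lists and a boolean mask rather than Forge's actual kernels:

```python
# Two rollouts sharing a prefix are packed once; a branch-aware attention
# mask keeps the math identical to running them separately.
import torch

prefix   = [101, 7, 8, 9]    # shared multi-turn context (token ids)
branch_a = [21, 22]          # continuation of rollout A
branch_b = [31, 32, 33]      # continuation of rollout B

packed = prefix + branch_a + branch_b
P, A, n = len(prefix), len(branch_a), len(packed)

# allowed[i, j] == True  <=>  packed token i may attend to packed token j
allowed = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal baseline
allowed[P + A:, P:P + A] = False   # branch B must never see branch A

# The prefix is encoded once instead of twice; with mask-flexible attention
# primitives (MagiAttention-style) the forward pass matches the naive
# per-sequence computation, and the loss is computed after un-merging.
```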
Inference Acceleration
Dynamic MTP: a Multi-Token Prediction (MTP) head is introduced for inference acceleration; a Top-K KL loss keeps the detached MTP head aligned with the RL policy during training (a sketch of such a loss follows this list).
Rollout-side PD separation: disaggregates prefill and decode (PD) on the rollout side with independent parallelism and MoE scheduling, reducing latency for long-tail samples while keeping the off-policy degree in check.
Global L3 KV Cache Pool: a shared L3 key-value cache across instances mitigates cache eviction in multi-round, long-context Agent scenarios; a cost-aware scheduler balances queue delay against cache-transfer time to maximize locality without overloading instances.
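A minimal sketch of what such a Top-K KL alignment loss could look like in PyTorch; the function name, shapes, and choice of K are assumptions, not MiniMax's implementation:

```python
# Top-K KL sketch: align the MTP draft head with the (detached) RL policy
# on the policy's K most likely tokens only.
import torch
import torch.nn.functional as F

def topk_kl_loss(policy_logits, mtp_logits, k: int = 32):
    """KL(policy || mtp) restricted to the policy's top-k tokens."""
    with torch.no_grad():                      # the policy side is detached
        p_log = F.log_softmax(policy_logits, dim=-1)
        topv, topi = p_log.topk(k, dim=-1)     # [batch, seq, k]
    q_log = F.log_softmax(mtp_logits, dim=-1).gather(-1, topi)
    p = topv.exp()
    # KL over the restricted support: sum p * (log p - log q)
    return (p * (topv - q_log)).sum(-1).mean()
```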
Scalable Agent RL Algorithm
The system continues to use the CISPO algorithm introduced in the M1 series, scaling it from tens of thousands of Long-CoT tokens to 200K-token contexts for Agent tasks. A Multi-Domain mixed-training strategy combines Reasoning, General QA, Code Agent, and General Agent tasks, alleviating forgetting and enhancing generalization.
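For readers unfamiliar with CISPO: per the MiniMax-M1 report, it clips the importance-sampling weight itself (under a stop-gradient) rather than dropping clipped tokens from the update. A heavily hedged sketch, with illustrative clip bounds:

```python
# CISPO-style loss sketch; bounds and shapes are illustrative assumptions.
import torch

def cispo_loss(logp_new, logp_old, advantages, eps_low=1.0, eps_high=2.0):
    # logp_new, logp_old: [batch, seq] per-token log-probs under the current
    # and behavior policies; advantages: [batch] sequence-level advantages.
    ratio = (logp_new - logp_old).exp()
    weight = ratio.clamp(1.0 - eps_low, 1.0 + eps_high).detach()  # sg(clip(r))
    per_token = weight * advantages.unsqueeze(-1) * logp_new      # every token contributes
    return -per_token.mean()
```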
Dense & Process Reward
Process Reward: dense supervision of intermediate behaviors (e.g., penalizing language mixing or incorrect tool calls), providing feedback beyond the final outcome.
Task Completion Time Reward: incorporates relative completion time as a signal, encouraging the Agent to exploit parallelism and choose the fastest execution path.
Reward-to-Go: standardizes returns for long-horizon tasks, reducing gradient variance and stabilizing optimization (see the sketch after this list).
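A minimal sketch of reward-to-go with return standardization; the discounting and normalization details are our assumptions, not the article's:

```python
# Reward-to-go: credit each step with the sum of rewards that follow it;
# standardizing the result shrinks gradient variance on long horizons.
import numpy as np

def reward_to_go(rewards: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """G_t = r_t + gamma * G_{t+1}, computed by a backward pass."""
    out = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def standardize(returns: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Zero-mean, unit-variance returns stabilize the policy update.
    return (returns - returns.mean()) / (returns.std() + eps)
```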
Repository links:
Hugging Face: https://huggingface.co/MiniMaxAI/MiniMax-M2.5
GitHub: https://github.com/MiniMax-AI/MiniMax-M2.5