Nemotron 3 Super: How Nvidia’s Hybrid Mamba‑Transformer Beats Multi‑Agent Bottlenecks
Nvidia’s newly released Nemotron 3 Super pairs a 120 billion‑parameter hybrid Mamba‑Transformer architecture with latent MoE routing, multi‑token prediction, and native 4‑bit quantization on Blackwell GPUs. The result: up to a five‑fold throughput gain, 85.6% accuracy on the PinchBench benchmark, and fully open‑source weights, datasets, and training recipes for large‑scale multi‑agent AI workloads.
Nvidia announced Nemotron 3 Super, an open‑source large language model designed specifically for multi‑agent reasoning. The model has a total of 120 billion parameters but activates only 12 billion during inference, striking a balance between capacity and efficiency.
The architecture combines three kinds of layers: a Mamba state‑space module handles the bulk of sequence processing, conventional Transformer attention layers are inserted at critical depths, and a mixture‑of‑experts (MoE) component expands the effective parameter count without dense compute overhead. This design reduces token‑level latency while supporting a 1 million‑token context window.
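As a rough sketch of how such a stack might be laid out (the interleave ratio and MoE placement below are illustrative assumptions, not figures from the release):

```python
# Illustrative only: the real depth, attention spacing, and MoE placement
# are not disclosed in this article.
def layer_schedule(n_layers: int = 16, attn_every: int = 8) -> list[str]:
    """Per-depth layer plan: mostly Mamba (state-space) mixers, with full
    attention inserted periodically; every block ends in an MoE feed-forward."""
    plan = []
    for depth in range(n_layers):
        mixer = "attention" if (depth + 1) % attn_every == 0 else "mamba"
        plan.append(f"{mixer} + moe_ffn")
    return plan

print(layer_schedule())  # mostly 'mamba + moe_ffn', with periodic 'attention + moe_ffn'
```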
To overcome the scaling bottleneck of conventional MoE routing, Nemotron 3 Super introduces a latent MoE mechanism. Tokens are first projected into a highly compressed low‑rank space, routed to a small subset of expert sub‑networks, processed, and then re‑projected back to the full model dimension. This yields a four‑fold increase in the number of usable experts at the same compute budget.
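A minimal PyTorch sketch of the compress‑route‑expand idea follows; the dimensions, expert count, and top‑k value are placeholders, and the per‑expert loop is written for readability rather than speed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMoE(nn.Module):
    """Sketch of latent MoE routing: compress to a low-rank space, route,
    run cheap experts there, then project back to the model dimension."""

    def __init__(self, d_model=1024, d_latent=128, n_experts=64, top_k=2):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)   # low-rank compression
        self.up = nn.Linear(d_latent, d_model, bias=False)     # re-projection
        self.router = nn.Linear(d_latent, n_experts, bias=False)
        # Experts act on the compressed representation, so each one is small,
        # which is what lets more experts fit in the same compute budget.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_latent, 4 * d_latent), nn.GELU(),
                          nn.Linear(4 * d_latent, d_latent))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                  # x: (tokens, d_model)
        z = self.down(x)                                    # (tokens, d_latent)
        gates = F.softmax(self.router(z), dim=-1)
        topw, topi = gates.topk(self.top_k, dim=-1)
        out = torch.zeros_like(z)
        for k in range(self.top_k):                         # loop form for clarity
            for e, expert in enumerate(self.experts):
                mask = topi[:, k] == e
                if mask.any():
                    out[mask] += topw[mask, k:k + 1] * expert(z[mask])
        return self.up(out)                                 # back to d_model

moe = LatentMoE()
print(moe(torch.randn(8, 1024)).shape)                      # torch.Size([8, 1024])
```

Because the experts live in the latent dimension rather than the full model dimension, many more of them can be kept active within the same compute budget, which is the effect the latent routing is after.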
The model also adopts multi‑token prediction (MTP), where a dedicated head simultaneously forecasts several future tokens at each position. This forces the network to internalize long‑range dependencies and improves performance on chain‑of‑thought tasks, delivering up to a three‑fold speedup for structured generation such as code and tool calls.
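A hedged sketch of what an MTP training head can look like is below; the four‑token horizon, hidden size, and vocabulary size are illustrative assumptions rather than the model’s real values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictionHead(nn.Module):
    """At each position, predict the next `horizon` tokens rather than one.
    Dimensions and horizon are placeholders, not the released model's values."""

    def __init__(self, d_model=1024, vocab_size=32000, horizon=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size)
                                   for _ in range(horizon))

    def forward(self, hidden, targets):
        # hidden: (batch, seq, d_model) backbone states; targets: (batch, seq) ids
        loss = 0.0
        for k, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-k])        # positions with a token k steps ahead
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                targets[:, k:].reshape(-1),
            )
        return loss / len(self.heads)

head = MultiTokenPredictionHead()
h = torch.randn(2, 32, 1024)
t = torch.randint(0, 32000, (2, 32))
print(head(h, t))                                # scalar multi-token training loss
```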
Quantization is handled natively on Nvidia’s Blackwell architecture using a 4‑bit format with micro‑block scaling: each block of 16 four‑bit values shares an 8‑bit scaling factor, complemented by a global full‑precision scale. This approach dramatically cuts memory usage while preserving numerical stability, enabling a 4× inference speed increase on B200 GPUs compared with FP8 on H100.
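To make the scheme concrete, here is a small emulation of micro‑block quantization in Python; it uses signed 4‑bit integer levels and float scales to illustrate the two‑level scaling, whereas the actual Blackwell format stores FP4 values with FP8 block scales:

```python
import torch

def quantize_microblocks(x, block=16):
    """Emulate micro-block scaling: every block of 16 values shares one scale,
    and one full-precision global scale covers the whole tensor."""
    blocks = x.reshape(-1, block)
    global_scale = blocks.abs().max()                         # full-precision scale
    normed = blocks / global_scale
    block_scale = normed.abs().max(dim=1, keepdim=True).values.clamp(min=1e-8)
    q = torch.round(normed / block_scale * 7).clamp(-7, 7)    # 4-bit signed levels
    return q.to(torch.int8), block_scale, global_scale

def dequantize_microblocks(q, block_scale, global_scale, shape):
    return (q.float() / 7 * block_scale * global_scale).reshape(shape)

w = torch.randn(4, 64)
q, bs, gs = quantize_microblocks(w)
w_hat = dequantize_microblocks(q, bs, gs, w.shape)
print((w - w_hat).abs().max())    # small per-block reconstruction error
```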
Training proceeds in three stages: (1) massive pre‑training on a curated 10 trillion‑token corpus (25 trillion tokens total across all phases), emphasizing reasoning, instruction following, coding, security, and multi‑step agent tasks; (2) supervised fine‑tuning on 700,000 samples drawn from a 40 million‑sample post‑training corpus; and (3) reinforcement learning across 21 complex environments built with Nvidia’s open‑source NeMo Gym, totaling over 1.2 million environment roll‑outs. The resulting model achieves 85.6% success on the PinchBench benchmark, ranking fourth globally in success rate and first among open‑source agents.
All model weights, the core dataset, and the complete training recipe are released under an open license on Hugging Face and Nvidia’s NIM platform. Developers can fine‑tune the model using LoRA on the NeMo Megatron‑Bridge, deploy with vLLM, SGLang, or TensorRT‑LLM (which includes the latent MoE kernel), and benefit from detailed deployment guides covering major inference engines.
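As a minimal serving sketch, vLLM’s offline API is enough to get first tokens out of the checkpoint; the Hugging Face model id below is a placeholder to be replaced with the actual repository name from the release:

```python
from vllm import LLM, SamplingParams

# Placeholder model id: substitute the released checkpoint's repository name.
llm = LLM(model="nvidia/<nemotron-3-super-checkpoint>")
params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(
    ["List the tool calls an agent needs to book a flight and a hotel."],
    params,
)
print(outputs[0].outputs[0].text)
```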
The release aims to foster an open ecosystem for high‑capacity, low‑cost multi‑agent AI, allowing enterprises to run the model on private infrastructure with full data control while leveraging the model’s superior throughput, long‑context memory, and accurate tool‑use capabilities.