NVIDIA Nemotron 3 Super: 7× Faster Than Qwen3.5 – Inside Hybrid Mamba‑Attention, LatentMoE, and MTP

NVIDIA’s Nemotron 3 Super, a 120.6 B‑parameter flagship model supporting 1 M‑token context, combines Hybrid Mamba‑Attention, LatentMoE, and Multi‑Token Prediction to achieve up to 7.5× higher inference throughput than Qwen3.5 while matching or surpassing its accuracy across a range of benchmarks.

Overview

Nemotron 3 Super is the flagship of NVIDIA’s Nemotron 3 family, with 120.6 B total parameters and 12.7 B active parameters per forward pass (121 B without embeddings). It supports up to 1 M token context and was pretrained on 25 trillion tokens.

Core innovations

Hybrid Mamba‑Attention: most layers are Mamba‑2 blocks, eliminating KV‑cache growth and keeping state size constant; a few attention layers act as global anchors, using GQA (32 query heads, 2 KV heads).
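To make the layer layout concrete, here is a minimal PyTorch sketch (not NVIDIA's implementation). The recurrent block is a toy stand‑in for a real Mamba‑2 block, and the hidden size, layer count, and attention spacing are assumptions for illustration; only the GQA head configuration (32 query heads, 2 KV heads) comes from the description above.

```python
# Toy sketch of a hybrid stack: most blocks keep a constant-size recurrent
# state (stand-in for Mamba-2), a few are grouped-query attention blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedSSMBlock(nn.Module):
    """Toy gated recurrence with O(1) state per sequence (no KV cache growth)."""
    def __init__(self, d_model):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        state = torch.zeros_like(u[:, 0])       # constant-size state
        outs = []
        for t in range(u.shape[1]):             # naive recurrent scan, for clarity
            state = 0.9 * state + 0.1 * u[:, t]
            outs.append(state * torch.sigmoid(gate[:, t]))
        return x + self.out_proj(torch.stack(outs, dim=1))

class GQABlock(nn.Module):
    """Grouped-query attention: 32 query heads sharing 2 KV heads."""
    def __init__(self, d_model, n_q_heads=32, n_kv_heads=2):
        super().__init__()
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.head_dim = d_model // n_q_heads
        self.q = nn.Linear(d_model, n_q_heads * self.head_dim)
        self.kv = nn.Linear(d_model, 2 * n_kv_heads * self.head_dim)
        self.o = nn.Linear(n_q_heads * self.head_dim, d_model)

    def forward(self, x):
        b, s, _ = x.shape
        q = self.q(x).view(b, s, self.n_q, self.head_dim).transpose(1, 2)
        k, v = self.kv(x).chunk(2, dim=-1)
        k = k.view(b, s, self.n_kv, self.head_dim).transpose(1, 2)
        v = v.view(b, s, self.n_kv, self.head_dim).transpose(1, 2)
        # Each KV head serves n_q / n_kv query heads.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return x + self.o(out.transpose(1, 2).reshape(b, s, -1))

# Hypothetical layout: attention only every 6th layer, the rest SSM blocks.
d_model, n_layers = 4096, 12
layers = nn.ModuleList([
    GQABlock(d_model) if (i + 1) % 6 == 0 else SimplifiedSSMBlock(d_model)
    for i in range(n_layers)
])
```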

LatentMoE: projects the hidden dimension d down to a smaller latent dimension ℓ before expert routing, cutting memory bandwidth and communication by a factor of d/ℓ, then expands the number of experts (512 total, 22 active) to keep compute roughly constant while improving accuracy.
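The routing idea can be sketched in a few lines of PyTorch. The hidden and latent sizes below are tiny toy values chosen so the example runs anywhere; the 512‑expert / 22‑active configuration comes from the description above, while the expert MLP shape is an assumption.

```python
# Toy LatentMoE: project d -> l, route and run experts in the latent space,
# then project back to d. Not NVIDIA's code; sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMoE(nn.Module):
    def __init__(self, d_model=256, d_latent=64, n_experts=512, top_k=22):
        super().__init__()
        self.top_k = top_k
        self.down = nn.Linear(d_model, d_latent, bias=False)   # d -> l
        self.up = nn.Linear(d_latent, d_model, bias=False)     # l -> d
        self.router = nn.Linear(d_latent, n_experts, bias=False)
        # Each expert is a small MLP living entirely in the latent space.
        self.w1 = nn.Parameter(torch.randn(n_experts, d_latent, 2 * d_latent) * 0.02)
        self.w2 = nn.Parameter(torch.randn(n_experts, 2 * d_latent, d_latent) * 0.02)

    def forward(self, x):                        # x: (tokens, d_model)
        z = self.down(x)                         # dispatch in the latent space
        scores = self.router(z)
        top_w, top_i = scores.topk(self.top_k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)
        out = torch.zeros_like(z)
        for k in range(self.top_k):              # loop for clarity; real kernels batch this
            idx = top_i[:, k]
            h = torch.einsum('td,tdh->th', z, self.w1[idx]).relu()
            h = torch.einsum('th,thd->td', h, self.w2[idx])
            out += top_w[:, k:k+1] * h
        return self.up(out)                      # expand back to d_model

tokens = torch.randn(8, 256)
print(LatentMoE()(tokens).shape)                 # torch.Size([8, 256])
```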

MTP (Multi‑Token Prediction): shares parameters across draft positions, so the draft head can be applied recursively to generate longer speculative drafts with more stable acceptance rates.
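The drafting loop below is a generic speculative‑decoding sketch, not Nemotron's exact MTP design: one shared draft head is applied recursively to propose several tokens, and the target model accepts the longest prefix it agrees with. `draft_step` and `target_step` are hypothetical stand‑ins that map a token prefix to next‑token logits.

```python
# Toy recursive multi-token drafting with a single shared draft head.
import torch

def recursive_draft(draft_step, prefix, n_draft):
    """Apply the same draft head repeatedly to propose n_draft tokens."""
    draft, tokens = [], prefix.clone()
    for _ in range(n_draft):
        logits = draft_step(tokens)              # shared parameters at every position
        nxt = logits.argmax(dim=-1, keepdim=True)
        draft.append(nxt)
        tokens = torch.cat([tokens, nxt], dim=-1)
    return torch.cat(draft, dim=-1)

def verify_greedy(target_step, prefix, draft):
    """Accept the longest draft prefix the target model agrees with.
    (In practice the target scores all draft positions in one forward pass;
    this loop is sequential only for clarity.)"""
    accepted, tokens = [], prefix.clone()
    for i in range(draft.shape[-1]):
        expected = target_step(tokens).argmax(dim=-1, keepdim=True)
        if expected.item() != draft[..., i].item():
            break
        accepted.append(expected)
        tokens = torch.cat([tokens, expected], dim=-1)
    return accepted

# Toy usage with random "models" just to exercise the control flow.
vocab = 100
draft_step = lambda t: torch.randn(1, vocab)
target_step = lambda t: torch.randn(1, vocab)
prefix = torch.randint(0, vocab, (1, 4))
draft = recursive_draft(draft_step, prefix, n_draft=4)
print("accepted", len(verify_greedy(target_step, prefix, draft)), "of", draft.shape[-1])
```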

Training precision

The model was pretrained on 25 trillion tokens almost entirely in NVIDIA's 4‑bit floating‑point format (NVFP4); the last 15 % of layers and the attention projections were kept in BF16 for stability. Gradient underflow drove roughly 7 % of weights to exactly zero, which did not hurt final accuracy.
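A toy FP4 (E2M1) quantize/dequantize round‑trip shows why very small values can underflow to exactly zero. This is a simplification for illustration, not NVIDIA's NVFP4 recipe; the block size and per‑block max scaling below are assumptions.

```python
# Toy 4-bit floating-point (E2M1) round-trip with per-block scaling.
import torch

# All magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_quant_dequant(x, block_size=16):
    flat = x.reshape(-1, block_size)
    scale = flat.abs().amax(dim=-1, keepdim=True) / FP4_GRID.max()  # per-block scale
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    scaled = (flat / scale).abs()
    # Snap each magnitude to the nearest FP4 grid point (round-to-nearest).
    idx = (scaled.unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    return (FP4_GRID[idx] * scale * flat.sign()).reshape(x.shape)

w = torch.randn(4, 64) * 0.01
wq = fp4_quant_dequant(w)
# Values much smaller than their block's max underflow to exactly zero.
print("zeros after FP4 round-trip:", (wq == 0).float().mean().item())
```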

Post‑training pipeline

Four stages: SFT (≈7 M samples, including agentic CLI tasks and tool‑call trajectories), RL on 21 environments and 37 datasets using a new PivotRL method, SWE‑RL with containerized GitHub repositories, and RLHF with MTP healing.

Quantized inference

Two inference variants are released: FP8 for Hopper GPUs and NVFP4 for Blackwell GPUs; the NVFP4 variant reaches 99.8 % of BF16 baseline accuracy after a two‑hour quantization run on an 8‑GPU B200 node. Stochastic rounding is used to quantize the Mamba state caches.
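Stochastic rounding itself is simple to illustrate: instead of always rounding to the nearest quantization step, a value is rounded up or down with probability proportional to its fractional distance, so the quantization error is zero‑mean in expectation. The int8 target and scaling in this sketch are illustrative assumptions, not the actual cache format.

```python
# Toy stochastic rounding to int8, the unbiased-rounding idea mentioned above.
import torch

def stochastic_round_int8(x, scale):
    y = x / scale
    low = y.floor()
    prob_up = y - low                      # fractional part = chance of rounding up
    rounded = low + (torch.rand_like(y) < prob_up).float()
    return rounded.clamp(-128, 127) * scale

state = torch.randn(2, 128) * 0.05         # stand-in for a small state cache
scale = state.abs().max() / 127
print(stochastic_round_int8(state, scale).shape)   # same shape, quantized values
```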

Benchmark results

On standard benchmarks Nemotron 3 Super matches or exceeds GPT‑OSS‑120B and Qwen3.5‑122B in accuracy (e.g., MMLU 86.0 vs 81.0, HumanEval 79.4 vs 70.1) while delivering 2.2× higher throughput than GPT‑OSS‑120B and 7.5× higher than Qwen3.5‑122B in an 8 k‑input / 64 k‑output setting. It also leads in long‑context tasks (RULER 1M score 91.64 vs 22.30 for GPT‑OSS‑120B).

Strengths and weaknesses

Strengths:

Exceptional inference throughput (up to 7.5× faster than competitors).

1 M token context with stable performance.

Fully open‑source model weights, data, and training recipe.

NVFP4 pretraining validates low‑precision training at scale.

Strong agent capabilities from extensive RL training.

Weaknesses:

Slightly lower accuracy on pure reasoning tasks than Qwen3.5.

High memory demand from the 512‑expert MoE, and inference is optimized for NVIDIA GPUs only.

Target users

Ideal for large‑scale AI inference services that run on NVIDIA Hopper/Blackwell hardware and need extreme throughput, ultra‑long context, or advanced tool‑using agents.
