Nvidia Unveils Nemotron‑Nano‑9B‑v2: Tiny Open‑Source LLM with Switchable Reasoning
Nvidia’s newly released Nemotron‑Nano‑9B‑v2, a 9‑billion‑parameter open‑source LLM optimized for a single Nvidia A10 GPU, introduces a toggleable reasoning mode and budget controls, delivering up to six‑fold speed gains, multilingual support, and strong benchmark results across various tasks.
Nvidia today officially launched Nemotron‑Nano‑9B‑v2, its latest small open‑source language model. The 9‑billion‑parameter model claims best‑in‑class performance for its size and introduces a switchable “reasoning” toggle that lets developers trade accuracy against response speed.
Compared with the original 12 B version, Nemotron‑Nano‑9B‑v2 is heavily compressed and specifically optimized to run on a single Nvidia A10 GPU, balancing compute cost against efficiency. According to Nvidia AI model post‑training director Oleksii Kuchiaev, the cut from 12 B to 9 B parameters targets mainstream A10 deployments, enabling larger batch sizes and speeds up to six times those of similarly sized Transformer models.
Pure Transformer models consume large amounts of memory and compute on very long inputs. Nemotron‑Nano‑9B‑v2 instead uses the hybrid Nemotron‑H architecture, which combines Transformer layers with the Mamba state‑space model (SSM), significantly reducing memory and compute for long sequences and achieving 2‑3× higher throughput. Other research groups, such as AI2, are also exploring Mamba‑based models.
As a unified chat‑and‑reasoning model, Nemotron‑Nano‑9B‑v2 supports control commands such as /think and /no_think to switch “reasoning mode” on or off. With reasoning enabled, the model generates a reasoning trace before its answer, and developers can set a “reasoning budget” that caps the number of internal reasoning tokens, giving flexible control over the latency‑accuracy trade‑off in customer support, intelligent agents, and similar scenarios.
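The toggle-and-budget pattern described above might be wired into a chat request along the following lines. This is a minimal Python sketch, not Nvidia's documented API: the /think and /no_think system commands come from the announcement, while the `max_thinking_tokens` field name and the `build_request` helper are assumptions for illustration.

```python
# Sketch: composing a chat payload with Nemotron-Nano-9B-v2's reasoning
# toggle. /think and /no_think are the control commands Nvidia describes;
# "max_thinking_tokens" is a hypothetical budget field, named here only
# for illustration.

def build_request(user_prompt, reasoning=True, thinking_budget=None):
    """Return a chat payload with the reasoning switch in the system turn."""
    messages = [
        # The system turn carries the reasoning toggle.
        {"role": "system", "content": "/think" if reasoning else "/no_think"},
        {"role": "user", "content": user_prompt},
    ]
    payload = {"messages": messages}
    if reasoning and thinking_budget is not None:
        # Hypothetical cap on internal reasoning tokens (the "budget").
        payload["max_thinking_tokens"] = thinking_budget
    return payload

# Low-latency path: reasoning off, e.g. for quick customer-service replies.
quick = build_request("Summarize this support ticket.", reasoning=False)

# High-accuracy path: reasoning on, capped at 512 internal tokens.
careful = build_request("Prove the claim step by step.", thinking_budget=512)
```

The design point is that a single model serves both regimes: the caller flips one system command per request rather than routing between separate fast and slow models.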
In the NeMo‑Skills benchmark suite, the model scores 72.1 % on AIME25, 97.8 % on MATH500, 64.0 % on GPQA, and 71.1 % on LiveCodeBench. It also excels on long‑context and instruction‑following tasks, surpassing common comparators such as Qwen3‑8B.
The model is available on Hugging Face and Nvidia Model Catalog, supporting multiple languages (Chinese, English, German, Spanish, French, Italian, Japanese). Nvidia states it can be used for instruction following, code generation, and more complex reasoning, suitable for fast customer‑service replies as well as high‑accuracy professional domains.
Licensing uses Nvidia’s updated open‑source model license (June 2025), which explicitly permits commercial use. Companies may deploy the model in production without additional fees or scale limits, provided they retain the license and attribution notices and follow Nvidia’s trust‑and‑security guidelines.
Nemotron‑Nano‑9B‑v2 targets developers who need both deployment efficiency and reasoning capability. Its reasoning toggle and budget controls offer greater flexibility for building intelligent customer service, code generation, and multilingual interaction scenarios.