NVIDIA’s Open‑Source Multimodal Nemotron 3 Nano Omni: Run Locally on Consumer GPUs (English‑Only)

NVIDIA’s Nemotron 3 Nano Omni 30B‑A3B‑Reasoning, an open‑source multimodal LLM with roughly 30 B parameters, a 256K context window, and video, audio, image, and text support, outperforms comparable models by up to 9.2× in video throughput and runs on consumer GPUs via 4‑bit GGUF quantization, but it currently accepts English input only.


NVIDIA released Nemotron 3 Nano Omni 30B‑A3B‑Reasoning, a 30 B‑parameter multimodal large language model that activates only 3 B parameters per inference via a Mixture‑of‑Experts (MoE) architecture. It processes video, audio, images and text, offers a 256K token context window, and includes a built‑in reasoning chain. The model is fully open‑source under the NVIDIA Open Model Agreement, which permits commercial use and redistribution.

Key specifications:

Total / active parameters: 31 B / 3 B

Architecture: Mamba2‑Transformer hybrid with MoE

Visual encoder: C‑RADIO v4‑H

Audio encoder: Parakeet

LLM backbone: Nemotron‑3‑Nano‑30B‑A3B

Context length: up to 256 K tokens

Supported inputs: video (mp4 ≤ 2 min), audio (wav/mp3 ≤ 1 h), image, text

Outputs: text with JSON, chain‑of‑thought, tool calling, word‑level timestamps

Quantization options: BF16, FP8, NVFP4 (three tiers)

License: NVIDIA Open Model Agreement (commercial‑friendly)

Performance advantage: The combination of Mamba2 + Transformer + MoE activates only 3 B parameters, allowing the same GPU to handle higher concurrency. NVIDIA reports 9.2× higher video‑task throughput and 7.4× higher multi‑document throughput compared with the Qwen3‑Omni‑30B‑A3B model. The gain stems from Efficient Video Sampling (EVS) – a 3D‑convolutional spatio‑temporal perception module plus video‑frame pruning (e.g., --video-pruning-rate 0.5) that reduces a 1080p video to 1 FPS/128 frames (720p to 2 FPS/256 frames).
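
For intuition, the quoted 1080p sampling budget can be reproduced offline with ffmpeg. This is only an illustration of the frame budget, not NVIDIA's EVS pipeline, which runs inside the serving stack via --video-pruning-rate:

mkdir -p sampled
# Sample at 1 FPS and stop after 128 frames, mirroring the 1080p budget quoted above.
ffmpeg -i input_1080p.mp4 -vf "fps=1" -frames:v 128 sampled/frame_%03d.png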

Pareto curve: multi‑document vs video system throughput comparison

Benchmark rankings: The model tops six public multimodal leaderboards – MMLongBench‑Doc (long‑document understanding), OCRBenchV2 (OCR), WorldSense (video commonsense), DailyOmni (everyday multimodal), VoiceBench (speech understanding), and MediaPerf (throughput + cost). Compared with the previous Nemotron Nano VL V2, it shows improvements across vision, video, OCR, and audio metrics.

Accuracy improvement over Nemotron Nano VL V2

License details (NVIDIA Open Model Agreement):

✅ Permanent, worldwide, royalty‑free, irrevocable commercial use.

✅ Allows modification and distribution of derived models in source or binary form.

✅ Output ownership remains with the user.

⚠️ Must include a copy of the license when redistributing and retain copyright notices.

⚠️ License terminates if NVIDIA is sued for infringement.

⚠️ NVIDIA trademarks cannot be used for branding (except for attribution).

For small teams and individual developers this is effectively a “take‑it‑as‑is” license.

Deployment options:

Unsloth Studio (quickest)

Unsloth provides a web UI called Unsloth Studio that can run GGUF models, compare models, chat, and handle image/audio inputs.

curl -fsSL https://unsloth.ai/main/install.sh | sh
source unsloth_studio/bin/activate
unsloth studio -H 0.0.0.0 -p 8888

On Windows PowerShell:

irm https://unsloth.ai/install.ps1 | iex
& .\unsloth_studio\Scripts\unsloth.exe studio -H 0.0.0.0 -p 8888

Then open http://localhost:8888 in a browser, search for “Nemotron‑3‑Nano‑Omni”, and download the desired quantized version.

llama.cpp (more control)

Compile the CUDA version of llama.cpp:

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

Apple Silicon users should set -DGGML_CUDA=OFF and rely on Metal.
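
For reference, a minimal sketch of the corresponding build invocation on Apple Silicon (Metal is enabled by default on macOS; -DGGML_METAL=ON is shown only for clarity):

# Build without CUDA and rely on the Metal backend
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF -DGGML_METAL=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j \
    --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp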

Examples:

Pure text chat (NVIDIA‑recommended temp=1.0, top‑p=1.0):

./llama.cpp/llama-cli \
    -hf unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF:UD-Q4_K_XL \
    --temp 1.0 --top-p 1.0

Mixed image + audio input (requires llama-mtmd-cli):

./llama.cpp/llama-mtmd-cli \
    -hf unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF:UD-Q4_K_XL \
    --image screenshot.png \
    --audio meeting.wav \
    -p "Summarize what is shown and said. Return key actions as bullet points." \
    --temp 1.0 --top-p 1.0

Video frame sampling (llama.cpp does not ingest video directly):

mkdir -p frames
ffmpeg -i demo.mp4 -vf "fps=1/2,scale=1280:-1" frames/frame_%04d.png
# llama-mtmd-cli expects one --image flag per file, so build the flag list
# from the first 16 sampled frames.
FRAME_ARGS=""
for f in $(ls frames/*.png | head -16); do
    FRAME_ARGS="$FRAME_ARGS --image $f"
done
./llama.cpp/llama-mtmd-cli \
    -hf unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF:UD-Q4_K_XL \
    $FRAME_ARGS \
    -p "Analyze these sampled video frames. Summarize the sequence of events." \
    --temp 1.0 --top-p 1.0

OpenAI‑compatible server (useful for downstream services):

./llama.cpp/llama-server \
    -hf unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF:UD-Q4_K_XL \
    --alias "unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning" \
    --prio 3 --temp 1.0 --top-p 1.0 --port 8001

Python client example:

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-required")
completion = client.chat.completions.create(
    model="unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
print(completion.choices[0].message.content)
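
The same endpoint can also take image input over the OpenAI‑compatible API. A minimal curl sketch, assuming the -hf download included the mmproj projector file (without it the server is text‑only) and that your llama.cpp build accepts base64 data URLs in image_url content parts:

IMG_B64=$(base64 -w0 screenshot.png)   # on macOS use: base64 -i screenshot.png
curl http://127.0.0.1:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "model": "unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
  "temperature": 1.0,
  "top_p": 1.0,
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Describe this screenshot in two sentences."},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,${IMG_B64}"}}
    ]
  }]
}
EOF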
⚠️ Ollama currently cannot run the multimodal part because the vision projector file (mmproj) is not yet supported.

Official production deployment with vLLM 0.20.0

The recommended production stack is vLLM 0.20.0 (exact version required). Choose one of the Docker images:

CUDA 13.0: vllm/vllm-openai:v0.20.0
CUDA 12.9: vllm/vllm-openai:v0.20.0-cu129

pip install vllm[audio]==0.20.0
# or
docker pull vllm/vllm-openai:v0.20.0

Start the service (single‑GPU B200/H200/H100 recommended):

vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 \
  --served-model-name nemotron \
  --host 0.0.0.0 --port 5000 \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --trust-remote-code \
  --video-pruning-rate 0.5 \
  --media-io-kwargs '{"video": {"num_frames": 512, "fps": 1}}' \
  --reasoning-parser nemotron_v3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

When using NVFP4 or FP8 quantization, add --kv-cache-dtype fp8 to save VRAM.
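
A matching request sketch against this endpoint, assuming vLLM's OpenAI‑compatible video_url content type (exposed for video‑capable models); the clip URL below is a placeholder:

curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Summarize this clip and list key events with timestamps."},
        {"type": "video_url", "video_url": {"url": "https://example.com/demo.mp4"}}
      ]
    }]
  }'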

Platform‑specific extra arguments (as documented by NVIDIA):

RTX Pro 6000: --moe-backend triton (FlashInfer bug on this GPU)

NVFP4 with tensor‑parallel > 1: --moe-backend flashinfer_cutlass (TRTLLM_GEN MoE kernel bug)

DGX Spark (ARM64): --gpu-memory-utilization 0.70, --max-model-len 32768, --max-num-seqs 8 (shared LPDDR5X memory)

Other runtimes that already support the model: SGLang (BF16 variant), TensorRT‑LLM, and TensorRT Edge‑LLM (Jetson Thor) with accompanying cookbooks.

Personal observations

What I like:

Truly open license – commercial‑grade, zero‑friction for small teams and individuals.

3 B‑activation MoE combined with video‑frame pruning delivers up to 9× throughput, directly addressing the “always‑on agent” bottleneck.

256K context, word‑level timestamps, and tool calling enable a single model to act as a meeting assistant, video retriever, and screen‑monitor simultaneously.

Unsloth’s Day‑Zero GGUF quantization runs a 30 B model in 4‑bit using only ~25 GB RAM, making it feasible on a typical gaming laptop.

What concerns me:

English‑only support – Chinese capability is not guaranteed, so Chinese‑language use cases need separate evaluation.

vLLM deployment is locked to version 0.20.0; newer images cannot be used without rebuilding.

CUDA 13.2 produces garbled output (bug acknowledged by NVIDIA); use 12.9 or 13.0 instead.

Ollama does not yet support multimodal inference.

Video inputs are limited to ≤ 2 minutes; longer videos must be chunked.
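
A minimal chunking sketch with ffmpeg's segment muxer (110 s leaves headroom under the 2‑minute limit; with -c copy the cuts land on keyframes, so chunk lengths are approximate):

ffmpeg -i long_meeting.mp4 -c copy -map 0 \
  -f segment -segment_time 110 -reset_timestamps 1 chunk_%03d.mp4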

Who should consider this model:

Developers building GUI/Browser/Screen‑monitor agents.

Document‑intelligence pipelines (contracts, OCR, research papers) – top‑ranked on MMLongBench‑Doc and OCRBenchV2.

Short‑video, meeting‑note, and speech‑to‑text workflows.

Chinese‑centric consumer applications – wait for further fine‑tuning or use as a base model.

I plan to use the model for two projects: (1) analyzing locally recorded screen sessions for operation‑replay insights, and (2) feeding meeting videos into an automated workflow that extracts TODO items with timestamps.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
