Old Zhang's AI Learning
Apr 18, 2026 · Artificial Intelligence

NVIDIA Nemotron 3 Super: 7× Faster Than Qwen3.5 – Inside Hybrid Mamba‑Attention, LatentMoE, and MTP

NVIDIA’s Nemotron 3 Super, a 120.6 B‑parameter flagship model supporting 1 M‑token context, combines Hybrid Mamba‑Attention, LatentMoE, and Multi‑Token Prediction to achieve up to 7.5× higher inference throughput than Qwen3.5 while matching or surpassing its accuracy across a range of benchmarks.

Hybrid Mamba-Attention · Large Language Model · LatentMoE
0 likes · 11 min read
Old Zhang's AI Learning
Mar 13, 2026 · Artificial Intelligence

Nvidia’s New OpenClaw‑Optimized Model Cracks Top‑5 on PinchBench – Free to Use

Nvidia’s open‑source Nemotron‑3‑Super model achieves an 85.6% success rate on the PinchBench OpenClaw benchmark, ranking in the top five as the only open‑source entry. The article covers its architecture, quantization, training pipeline, performance numbers, usage options, and practical limitations.

AI coding agent · MoE · NVFP4
0 likes · 10 min read
AI Cyberspace
Jan 26, 2026 · Artificial Intelligence

How NVFP4 Quantization Supercharges LLM Inference on NVIDIA DGX

This article explains the NVFP4 4‑bit floating‑point quantization technique, shows how to deploy Qwen3‑30B‑A3B models with TensorRT‑LLM and vLLM, compares performance across NVFP4, AWQ, and INT8 quantization, and provides practical profiling commands for NVIDIA DGX systems.

Inference · LLM · NVFP4
0 likes · 23 min read
Design Hub
Jan 9, 2026 · Artificial Intelligence

LTX‑2 Acceleration Secrets: Boost Speed, Stability, and Visual Quality

This article walks through practical steps to speed up LTX‑2 AI video generation: enabling the NVFP4 model, updating NVIDIA drivers and CUDA, using FP8 text encoders, and applying a custom prompt‑optimizing assistant. It demonstrates memory savings, sub‑minute rendering at 1280×720, and noticeable quality gains.

AI video generation · FP8 · LTX-2
0 likes · 11 min read