Old Zhang's AI Learning
Feb 3, 2026 · Artificial Intelligence
Step‑3.5‑Flash: Lightning‑Fast Inference with 196B Params, Only 11B Active (vLLM)
Step‑3.5‑Flash is a 196‑billion‑parameter open‑source LLM whose Mixture‑of‑Experts design activates only 11B parameters per token, delivering over 3× faster inference. It matches top‑tier closed‑source models on SWE‑bench and other benchmarks, supports a 256K context window, runs on consumer‑grade hardware, and is already integrated into vLLM, SGLang, and Claude Code (see the vLLM sketch below), though it has known limitations in token efficiency and domain stability.
LLM benchmark · MoE · Multi‑Token Prediction
11 min read
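
Since the summary notes the model already ships with vLLM integration, here is a minimal sketch of loading it through vLLM's offline Python API. The repo id `stepfun-ai/Step-3.5-Flash`, the tensor-parallel degree, and the context length below are illustrative assumptions, not values confirmed by this post.

```python
# Minimal sketch: serving Step-3.5-Flash with vLLM's offline API.
# The model id, parallelism, and context length are assumptions for
# illustration; substitute the actual released checkpoint name.
from vllm import LLM, SamplingParams

llm = LLM(
    model="stepfun-ai/Step-3.5-Flash",  # hypothetical repo id
    tensor_parallel_size=4,             # shard the 196B weights across GPUs
    max_model_len=32768,                # raise toward 256K if memory allows
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(
    ["Explain Mixture-of-Experts routing in one paragraph."], params
)
print(outputs[0].outputs[0].text)
```

The same model can also be exposed as an OpenAI-compatible endpoint with `vllm serve <model>`. Note that even though all 196B weights must fit in GPU memory, the MoE router sends each token through only ~11B active parameters, which is where the claimed inference speedup comes from.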
