Step‑3.5‑Flash: Lightning‑Fast Inference with 196B Params, Only 11B Active (vLLM)
Step‑3.5‑Flash is a 196‑billion‑parameter open‑source LLM that activates only 11 B parameters per token via a Mixture‑of‑Experts design. It delivers 3‑plus‑times faster inference, matches top‑tier closed‑source models on SWE‑bench and other benchmarks, supports 256 K context, runs on consumer‑grade hardware, and is already integrated into vLLM, SGLang, and Claude Code, though it has known token‑efficiency and domain‑stability limitations.
Model Overview
Step‑3.5‑Flash is an open‑source Mixture‑of‑Experts (MoE) large language model with 196 B total parameters and 11 B active parameters per token. The MoE architecture decouples knowledge capacity from inference cost.
Key Technical Characteristics
Inference speed: 3‑way Multi‑Token Prediction (MTP‑3) predicts four tokens per step. Typical throughput is 100‑300 tok/s, with peaks of 350 tok/s on programming tasks, roughly 5‑6× faster than Ollama running Qwen‑3‑8B (40‑60 tok/s).
Programming and agent performance: 74.4 % on SWE‑bench (comparable to DeepSeek‑V3.2 at ~75 % and Gemini 3.0 Pro at ~73 %); 39.58 % on the Claude Code data‑analysis test.
Context length: 256 K tokens, using a 3:1 mix of Sliding Window Attention and Full Attention (three SWA layers for every Full Attention layer).
Hardware requirements: Runs on consumer‑grade hardware such as a Mac Studio M4 Max, an NVIDIA DGX Spark with 128 GB (20 tok/s at 256 K context), or an AMD AI Max+ 395. A GGUF INT4‑quantized version is available for llama.cpp users.
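As a rough, unofficial sanity check on those hardware claims, the weight footprint alone (before KV cache and runtime overhead) explains why 128 GB‑class machines are enough once the model is quantized; the figures below are illustrative estimates, not official numbers:
# Back-of-the-envelope weight-memory estimate (illustrative, not official figures).
# All 196 B parameters must be resident even though only 11 B are active per token.
TOTAL_PARAMS = 196e9

def weight_gb(bytes_per_param):
    # Approximate weight memory in GiB for a given storage precision.
    return TOTAL_PARAMS * bytes_per_param / 1024**3

print(f"FP8 : ~{weight_gb(1.0):.0f} GB")   # ~183 GB, needs a multi-GPU server
print(f"INT4: ~{weight_gb(0.5):.0f} GB")   # ~91 GB, fits a 128 GB unified-memory machine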
Architecture Details
Total parameters: 196 B
Active parameters per token: 11 B
Experts: 288 routing experts + 1 always‑active shared expert
Top‑8 experts activated per token (fine‑grained routing)
MTP head predicts 4 tokens simultaneously
Context length: 256 K
Attention: 3:1 Sliding Window Attention + Full Attention
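To make the 3:1 attention layout concrete, here is a minimal sketch of how SWA and Full Attention layers could interleave; the layer count used here is an illustrative assumption, not a published value:
# Illustrative 3:1 interleaving of Sliding Window Attention (SWA) and Full Attention.
# The depth (32 layers) is an assumption for the sketch only.
NUM_LAYERS = 32
PATTERN = ["SWA", "SWA", "SWA", "FULL"]   # three SWA layers per Full Attention layer
layer_types = [PATTERN[i % len(PATTERN)] for i in range(NUM_LAYERS)]
print(layer_types[:8])   # ['SWA', 'SWA', 'SWA', 'FULL', 'SWA', 'SWA', 'SWA', 'FULL']
# SWA layers attend only to a recent window, keeping their KV cache small, while the
# periodic Full Attention layers retain access to the entire 256 K context.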
Why the speed increase? Multi‑Token Prediction predicts four tokens per step; combined with speculative sampling it reduces per‑token compute.
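A minimal sketch of the draft‑and‑verify loop behind that speedup (generic speculative decoding with placeholder functions, not StepFun's actual implementation):
# Simplified MTP-style speculative decoding loop. draft_step() stands in for the MTP
# heads producing k cheap candidate tokens in one pass; verify() stands in for the main
# model checking them in a single forward pass. Both are placeholders, not real APIs.
def generate(prompt_tokens, draft_step, verify, max_new_tokens=64, k=3):
    tokens = list(prompt_tokens)
    target = len(tokens) + max_new_tokens
    while len(tokens) < target:
        drafts = draft_step(tokens, k)                 # k drafted tokens (MTP-3 -> k = 3)
        accepted, next_token = verify(tokens, drafts)  # longest agreeing prefix + 1 real token
        tokens.extend(accepted)
        tokens.append(next_token)
    return tokens[:target]
# Each full-model forward pass emits between 1 and k + 1 tokens, so the average compute
# per generated token drops whenever drafts are accepted.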
Why the performance level? Fine‑grained routing selects the top‑8 of 288 experts per token, the always‑active shared expert prevents cold‑start issues, and training remains stable (the MIS‑PO algorithm shows lower gradient‑norm variance than PPO).
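A minimal sketch of fine‑grained top‑8 routing with an always‑active shared expert (dimensions, expert type, and gating details are illustrative, not the published architecture):
import torch
import torch.nn as nn

# Toy MoE layer: 288 routed experts with top-8 selection per token, plus one shared
# expert that is always applied. Sizes and gating are assumptions for illustration.
class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=288, top_k=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.shared_expert = nn.Linear(d_model, d_model)    # always active, no cold start
        self.top_k = top_k

    def forward(self, x):                                   # x: (num_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        outs = []
        for t in range(x.size(0)):                          # naive per-token loop for clarity
            y = self.shared_expert(x[t])
            for w, e in zip(weights[t], idx[t]):
                y = y + w * self.experts[int(e)](x[t])      # only 8 of 288 experts run
            outs.append(y)
        return torch.stack(outs)

print(TinyMoE()(torch.randn(4, 64)).shape)                  # torch.Size([4, 64])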
Deployment
vLLM (official support)
vLLM merged support in PR #33523. Install the latest nightly build and launch the service:
# Install latest nightly version
pip install -U vllm --pre \
--index-url https://pypi.org/simple \
--extra-index-url https://wheels.vllm.ai/nightly
# Or use Docker (recommended)
docker pull vllm/vllm-openai:nightly
# Serve the model (FP8 quantized)
vllm serve stepfun-ai/Step-3.5-Flash \
--served-model-name step3p5-flash \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--disable-cascade-attn \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--hf-overrides '{"num_nextn_predict_layers": 1}' \
--speculative_config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}' \
--trust-remote-code \
--quantization fp8
vLLM currently supports a single MTP layer; full MTP‑3 is under development, so real‑world speed may be lower than the reported 350 tok/s but remains significantly faster than standard models.
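Once the server is up it exposes an OpenAI‑compatible API; a minimal client call (assuming the default port 8000 and the served model name from the command above) looks like this:
# Minimal chat request against the vLLM OpenAI-compatible endpoint started above
# (assumes localhost and the default port 8000; the API key is unused locally).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="step3p5-flash",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)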
SGLang (alternative)
Install and serve with SGLang:
# Install
pip install "sglang[all] @ git+https://github.com/sgl-project/sglang.git"
# Serve (bf16)
sglang serve --model-path stepfun-ai/Step-3.5-Flash \
--served-model-name step3p5-flash \
--tp-size 8 \
--tool-call-parser step3p5 \
--reasoning-parser step3p5 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--enable-multi-layer-eagle \
--host 0.0.0.0 \
--port 8000
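As with vLLM, SGLang serves an OpenAI‑compatible API; a quick sanity check that the model is up (assuming the host and port from the command above, and that the endpoint exposes the standard model listing) could be:
# Quick check that the SGLang server launched above is serving the model
# (assumes the OpenAI-compatible endpoint on localhost:8000).
import requests

models = requests.get("http://localhost:8000/v1/models").json()
print([m["id"] for m in models.get("data", [])])   # expect ['step3p5-flash']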
Claude Code integration
Replace Claude’s backend by editing ~/.claude/settings.json:
{
"env": {
"ANTHROPIC_API_KEY": "YOUR_STEPFUN_API_KEY",
"ANTHROPIC_BASE_URL": "https://api.stepfun.ai/"
},
"model": "step-3.5-flash"
}
Benchmark Comparison
Step‑3.5‑Flash: 196 B total, 11 B active, SWE‑bench 74.4 %, inference speed 350 tok/s, deployment difficulty medium (8‑GPU), Apache 2.0 license.
DeepSeek‑V3.2: 671 B total, 37 B active, SWE‑bench ~75 %, speed ~100 tok/s, deployment difficulty high (large cluster), custom license.
Kimi K2.5: parameters undisclosed, SWE‑bench ~72 %, no local deployment.
Qwen3‑Max: parameters undisclosed, SWE‑bench ~70 %, moderate speed, no local deployment.
Gemini 3.0 Pro: parameters undisclosed, SWE‑bench ~73 %, fast, no local deployment.
Compared with DeepSeek‑V3.2, Step‑3.5‑Flash is roughly three times faster and has a lower deployment barrier, but may need longer generation traces to reach comparable quality. Compared with the Ollama ecosystem, it requires professional inference engines (vLLM/SGLang), which adds deployment complexity but delivers higher performance.
Known Limitations
Token efficiency: achieving Gemini‑level quality may need longer generation traces.
Domain stability: in highly specialized domains the model may produce repeated reasoning or mixed‑language output.
Long‑dialogue consistency: multi‑turn conversations can exhibit temporal or identity inconsistencies.
Use‑case focus: optimized for programming and work‑scene tasks; casual chat is not a strength.
Resources
Official blog: https://static.stepfun.com/blog/step-3.5-flash/
HuggingFace model hub: https://huggingface.co/stepfun-ai/Step-3.5-Flash
ModelScope page: https://modelscope.cn/models/stepfun-ai/Step-3.5-Flash
vLLM PR #33523: https://github.com/vllm-project/vllm/pull/33523