Step‑3.5‑Flash: Lightning‑Fast Inference with 196B Params, Only 11B Active (vLLM)
Step‑3.5‑Flash is a 196‑billion‑parameter open‑source LLM that activates only 11 B parameters per token via a Mixture‑of‑Experts design. It delivers 3‑plus‑times faster inference, matches top‑tier closed‑source models on SWE‑bench and other benchmarks, supports 256 K context, runs on consumer‑grade hardware, and is already integrated into vLLM, SGLang, and Claude Code, though it has known token‑efficiency and domain‑stability limitations.
Model Overview
Step‑3.5‑Flash is an open‑source Mixture‑of‑Experts (MoE) large language model with 196 B total parameters and 11 B active parameters per token. The MoE architecture decouples knowledge capacity from inference cost.
Key Technical Characteristics
Inference speed: 3‑way Multi‑Token Prediction (MTP‑3) predicts four tokens per step. Typical throughput is 100‑300 tok/s, with peaks of 350 tok/s on programming tasks, roughly 5‑6× faster than Ollama running Qwen‑3‑8B (40‑60 tok/s).
Programming and agent performance: 74.4 % on SWE‑bench (comparable to DeepSeek‑V3.2 at ~75 % and Gemini 3.0 Pro at ~73 %); 39.58 % on the Claude Code data‑analysis test.
Context length: 256 K tokens, using a 3:1 mix of Sliding Window Attention and Full Attention (three SWA layers for every Full Attention layer).
Hardware requirements: Runs on consumer‑grade hardware such as a Mac Studio M4 Max, an NVIDIA DGX Spark with 128 GB (20 tok/s at 256 K context), or an AMD AI Max+ 395. A GGUF INT4‑quantized version is available for llama.cpp users.
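As a rough, unofficial sanity check on those hardware claims, the weight footprint alone (before KV cache and runtime overhead) explains why 128 GB‑class machines are enough once the model is quantized; the figures below are illustrative estimates, not official numbers:
# Back-of-the-envelope weight-memory estimate (illustrative, not official figures).
# All 196 B parameters must be resident even though only 11 B are active per token.
TOTAL_PARAMS = 196e9

def weight_gb(bytes_per_param):
    # Approximate weight memory in GiB for a given storage precision.
    return TOTAL_PARAMS * bytes_per_param / 1024**3

print(f"FP8 : ~{weight_gb(1.0):.0f} GB")   # ~183 GB, needs a multi-GPU server
print(f"INT4: ~{weight_gb(0.5):.0f} GB")   # ~91 GB, fits a 128 GB unified-memory machine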
Architecture Details
Total parameters: 196 B
Active parameters per token: 11 B
Experts: 288 routing experts + 1 always‑active shared expert
Top‑8 experts activated per token (fine‑grained routing)
MTP head predicts 4 tokens simultaneously
Context length: 256 K
Attention: 3:1 Sliding Window Attention + Full Attention
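To make the 3:1 attention layout concrete, here is a minimal sketch of how SWA and Full Attention layers could interleave; the layer count used here is an illustrative assumption, not a published value:
# Illustrative 3:1 interleaving of Sliding Window Attention (SWA) and Full Attention.
# The depth (32 layers) is an assumption for the sketch only.
NUM_LAYERS = 32
PATTERN = ["SWA", "SWA", "SWA", "FULL"]   # three SWA layers per Full Attention layer
layer_types = [PATTERN[i % len(PATTERN)] for i in range(NUM_LAYERS)]
print(layer_types[:8])   # ['SWA', 'SWA', 'SWA', 'FULL', 'SWA', 'SWA', 'SWA', 'FULL']
# SWA layers attend only to a recent window, keeping their KV cache small, while the
# periodic Full Attention layers retain access to the entire 256 K context.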
Why the speed increase? Multi‑Token Prediction predicts four tokens per step; combined with speculative sampling it reduces per‑token compute.
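A minimal sketch of the draft‑and‑verify loop behind that speedup (generic speculative decoding with placeholder functions, not StepFun's actual implementation):
# Simplified MTP-style speculative decoding loop. draft_step() stands in for the MTP
# heads producing k cheap candidate tokens in one pass; verify() stands in for the main
# model checking them in a single forward pass. Both are placeholders, not real APIs.
def generate(prompt_tokens, draft_step, verify, max_new_tokens=64, k=3):
    tokens = list(prompt_tokens)
    target = len(tokens) + max_new_tokens
    while len(tokens) < target:
        drafts = draft_step(tokens, k)                 # k drafted tokens (MTP-3 -> k = 3)
        accepted, next_token = verify(tokens, drafts)  # longest agreeing prefix + 1 real token
        tokens.extend(accepted)
        tokens.append(next_token)
    return tokens[:target]
# Each full-model forward pass emits between 1 and k + 1 tokens, so the average compute
# per generated token drops whenever drafts are accepted.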
Why the performance level? Fine‑grained routing selects the top‑8 of 288 experts per token, the always‑active shared expert prevents cold‑start issues, and training remains stable (the MIS‑PO algorithm shows lower gradient‑norm variance than PPO).
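A minimal sketch of fine‑grained top‑8 routing with an always‑active shared expert (dimensions, expert type, and gating details are illustrative, not the published architecture):
import torch
import torch.nn as nn

# Toy MoE layer: 288 routed experts with top-8 selection per token, plus one shared
# expert that is always applied. Sizes and gating are assumptions for illustration.
class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=288, top_k=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.shared_expert = nn.Linear(d_model, d_model)    # always active, no cold start
        self.top_k = top_k

    def forward(self, x):                                   # x: (num_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        outs = []
        for t in range(x.size(0)):                          # naive per-token loop for clarity
            y = self.shared_expert(x[t])
            for w, e in zip(weights[t], idx[t]):
                y = y + w * self.experts[int(e)](x[t])      # only 8 of 288 experts run
            outs.append(y)
        return torch.stack(outs)

print(TinyMoE()(torch.randn(4, 64)).shape)                  # torch.Size([4, 64])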
Deployment
vLLM (official support)
vLLM merged support in PR #33523. Install the latest nightly build and launch the service:
# Install latest nightly version
pip install -U vllm --pre \
--index-url https://pypi.org/simple \
--extra-index-url https://wheels.vllm.ai/nightly
# Or use Docker (recommended)
docker pull vllm/vllm-openai:nightly
# Serve the model (FP8 quantized)
vllm serve stepfun-ai/Step-3.5-Flash \
--served-model-name step3p5-flash \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--disable-cascade-attn \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--hf-overrides '{"num_nextn_predict_layers": 1}' \
--speculative_config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}' \
--trust-remote-code \
--quantization fp8
vLLM currently supports a single MTP layer; full MTP‑3 is under development, so real‑world speed may be lower than the reported 350 tok/s but remains significantly faster than standard models.
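Once the server is up it exposes an OpenAI‑compatible API; a minimal client call (assuming the default port 8000 and the served model name from the command above) looks like this:
# Minimal chat request against the vLLM OpenAI-compatible endpoint started above
# (assumes localhost and the default port 8000; the API key is unused locally).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="step3p5-flash",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)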
SGLang (alternative)
Install and serve with SGLang:
# Install
pip install "sglang[all] @ git+https://github.com/sgl-project/sglang.git"
# Serve (bf16)
sglang serve --model-path stepfun-ai/Step-3.5-Flash \
--served-model-name step3p5-flash \
--tp-size 8 \
--tool-call-parser step3p5 \
--reasoning-parser step3p5 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--enable-multi-layer-eagle \
--host 0.0.0.0 \
--port 8000
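As with vLLM, SGLang serves an OpenAI‑compatible API; a quick sanity check that the model is up (assuming the host and port from the command above, and that the endpoint exposes the standard model listing) could be:
# Quick check that the SGLang server launched above is serving the model
# (assumes the OpenAI-compatible endpoint on localhost:8000).
import requests

models = requests.get("http://localhost:8000/v1/models").json()
print([m["id"] for m in models.get("data", [])])   # expect ['step3p5-flash']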
Claude Code integration
Replace Claude’s backend by editing ~/.claude/settings.json:
{
"env": {
"ANTHROPIC_API_KEY": "YOUR_STEPFUN_API_KEY",
"ANTHROPIC_BASE_URL": "https://api.stepfun.ai/"
},
"model": "step-3.5-flash"
}
Benchmark Comparison
Step‑3.5‑Flash: 196 B total, 11 B active, SWE‑bench 74.4 %, inference speed 350 tok/s, deployment difficulty medium (8‑GPU), Apache 2.0 license.
DeepSeek‑V3.2: 671 B total, 37 B active, SWE‑bench ~75 %, speed ~100 tok/s, deployment difficulty high (large cluster), custom license.
Kimi K2.5: parameters undisclosed, SWE‑bench ~72 %, no local deployment.
Qwen3‑Max: parameters undisclosed, SWE‑bench ~70 %, moderate speed, no local deployment.
Gemini 3.0 Pro: parameters undisclosed, SWE‑bench ~73 %, fast, no local deployment.
Compared with DeepSeek‑V3.2, Step‑3.5‑Flash is roughly three times faster and has a lower deployment barrier, but may need longer generation traces to reach comparable quality. Compared with the Ollama ecosystem, it requires professional inference engines (vLLM/SGLang), which adds deployment complexity but delivers higher performance.
Known Limitations
Token efficiency: achieving Gemini‑level quality may need longer generation traces.
Domain stability: in highly specialized domains the model may produce repeated reasoning or mixed‑language output.
Long‑dialogue consistency: multi‑turn conversations can exhibit temporal or identity inconsistencies.
Use‑case focus: optimized for programming and work‑scene tasks; casual chat is not a strength.
Resources
Official blog: https://static.stepfun.com/blog/step-3.5-flash/
HuggingFace model hub: https://huggingface.co/stepfun-ai/Step-3.5-Flash
ModelScope page: https://modelscope.cn/models/stepfun-ai/Step-3.5-Flash
vLLM PR #33523: https://github.com/vllm-project/vllm/pull/33523