Old Zhang's AI Learning
Apr 28, 2026 · Artificial Intelligence

vLLM 0.20 Arrives with DeepSeek V4 Support – What’s New?

The vLLM 0.20.0 release is a major upgrade to the inference engine: DeepSeek V4 support, CUDA 13 by default, PyTorch 2.11, Transformers v5 compatibility, FlashAttention 4 MLA prefill, a TurboQuant 2‑bit KV cache, an online quantization front‑end, IR enhancements, Model Runner V2 features, and a slew of new models. The post closes with detailed installation and upgrade guidance.

CUDA 13 · DeepSeek V4 · FlashAttention
10 min read
Old Zhang's AI Learning
Apr 17, 2026 · Artificial Intelligence

How DFlash Achieves 8× Lossless Acceleration for Large‑Model Inference (Qwen3.5‑27B Example)

The article explains how DFlash’s block‑diffusion draft model and KV Injection accelerate speculative decoding by 5‑8× without sacrificing output quality, and how DDTree pushes the gain beyond 8×. It backs this up with benchmark results and includes integration guides for the major inference frameworks.

DDTree · DFlash · Large Language Model Inference
7 min read
Baobao Algorithm Notes
Mar 13, 2025 · Artificial Intelligence

Why EP Outperforms TP for DeepSeek V3/R1 Inference: Cost, Performance, and Reliability

This article analyzes DeepSeek's expert‑parallelism (EP) based inference architecture for the V3/R1 models and compares it with tensor parallelism (TP), detailing how EP reduces memory and compute overhead, enables larger batch sizes, and cuts per‑GPU memory usage, while also introducing reliability, scalability, and maintainability challenges for large‑scale deployments.

AI Infrastructure · Expert Parallelism · GPU Memory Optimization
18 min read