Tagged articles
32 articles
Page 1 of 1
Old Zhang's AI Learning
Old Zhang's AI Learning
May 16, 2026 · Artificial Intelligence

vLLM 0.21.0 Arrives: Speculative Decoding Now Supports Reasoning Models

The vLLM 0.21.0 release brings five major updates—including Transformers v4 deprecation, a C++20 build requirement, KV offload with hybrid memory, speculative decoding that respects thinking budgets, and a Blackwell token‑speed backend—while offering detailed upgrade guidance for different user groups.

C++20InferenceKV cache
0 likes · 12 min read
vLLM 0.21.0 Arrives: Speculative Decoding Now Supports Reasoning Models
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
May 14, 2026 · Artificial Intelligence

Elastic Speculative Decoding Breaks Large‑Model Inference Bottlenecks

The paper introduces ECHO, an elastic speculative decoding framework that treats token verification as a global budget‑scheduling problem, uses sparse confidence gating and a two‑level priority scheduler, and demonstrates up to 14.4% throughput gains for high‑concurrency LLM serving.

Inference Optimizationelastic budgetlarge language models
0 likes · 14 min read
Elastic Speculative Decoding Breaks Large‑Model Inference Bottlenecks
Lao Guo's Learning Space
Lao Guo's Learning Space
May 7, 2026 · Artificial Intelligence

Gemma 4 MTP Deep Dive: Speculative Decoding & KV‑Cache Sharing for 3× Faster Inference

The article explains why large‑language‑model inference is bottlenecked by memory‑bandwidth, then details Google’s Gemma 4 MTP technique—using a small draft model with speculative decoding and shared KV‑Cache—to parallelize token prediction, achieving up to three‑fold speed gains without any loss in output quality, and provides step‑by‑step local deployment instructions.

Gemma 4Inference OptimizationKV cache
0 likes · 11 min read
Gemma 4 MTP Deep Dive: Speculative Decoding & KV‑Cache Sharing for 3× Faster Inference
AI Engineer Programming
AI Engineer Programming
May 7, 2026 · Artificial Intelligence

How Cursor Turned Its Coding Agent from Demo to Production

The article examines Cursor's journey of shipping its Composer coding agent, detailing the agentic AI model, system architecture, and the three major production challenges—diff handling, latency accumulation, and sandbox scaling—along with the engineering solutions that enabled reliable, fast, and adoptable AI‑driven code generation.

Agentic AICoding AgentCursor
0 likes · 16 min read
How Cursor Turned Its Coding Agent from Demo to Production
Old Zhang's AI Learning
Old Zhang's AI Learning
May 6, 2026 · Artificial Intelligence

Google Boosts Gemma 4 Inference Speed Up to 3× with MTP Drafter and Day‑0 vLLM Support

Google’s new Multi‑Token Prediction (MTP) drafter for Gemma 4 delivers up to three‑fold inference speedups across hardware and frameworks—validated by official benchmarks and independent DGX Spark tests—while preserving identical output quality, and is immediately usable via Hugging Face, vLLM, MLX, Ollama and edge‑device runtimes.

Apple SiliconGemma 4LLM inference
0 likes · 9 min read
Google Boosts Gemma 4 Inference Speed Up to 3× with MTP Drafter and Day‑0 vLLM Support
Old Zhang's AI Learning
Old Zhang's AI Learning
May 1, 2026 · Artificial Intelligence

DeepSeek‑V4 Local Deployment: How SGLang Overcomes the Architecture Challenges

The article analyzes DeepSeek‑V4's architectural innovations—including mixed sparse attention, mHC, and native FP4 weights—explains SGLang's ShadowRadix, HiSparse, and in‑graph speculative decoding solutions, presents benchmark gains, provides Docker deployment steps, and warns of key pitfalls for long‑context inference.

DeepSeek-V4HiSparseSGLang
0 likes · 15 min read
DeepSeek‑V4 Local Deployment: How SGLang Overcomes the Architecture Challenges
Old Zhang's AI Learning
Old Zhang's AI Learning
Apr 17, 2026 · Artificial Intelligence

How DFlash Achieves 8× Lossless Acceleration for Large‑Model Inference (Qwen3.5‑27B Example)

The article explains how DFlash’s block‑diffusion draft model and KV Injection boost speculative decoding speed by 5‑8× without sacrificing output quality, and how DDTree further raises the gain to over 8×, backed by benchmark results and integration guides for major inference frameworks.

DDTreeDFlashacceleration
0 likes · 7 min read
How DFlash Achieves 8× Lossless Acceleration for Large‑Model Inference (Qwen3.5‑27B Example)
Old Zhang's AI Learning
Old Zhang's AI Learning
Apr 14, 2026 · Artificial Intelligence

Qwen3.5-27B-DFlash Delivers Up to 5× Faster Inference Without Quality Loss

The DFlash approach replaces speculative decoding’s autoregressive drafter with a block diffusion model and injects target‑model hidden features into every KV‑cache layer, achieving up to 5× speed‑up for Qwen3.5‑27B on single‑GPU and 1.5–1.9× on high‑concurrency workloads while preserving output quality.

DFlashInference AccelerationSGLang
0 likes · 12 min read
Qwen3.5-27B-DFlash Delivers Up to 5× Faster Inference Without Quality Loss
Old Zhang's AI Learning
Old Zhang's AI Learning
Apr 7, 2026 · Artificial Intelligence

vLLM 0.19.0: HuggingFace v5 Support, Multimodal Boosts, and CPU KV Cache Offload

The vLLM 0.19.0 release adds first‑day Gemma 4 support, merges zero‑bubble asynchronous scheduling with speculative decoding, matures Model Runner V2, introduces full‑CUDA‑graph acceleration for ViT, generalizes DBO, brings CPU KV cache offload, and expands hardware and Transformers compatibility, offering substantial performance and flexibility gains for production LLM inference.

CPU KV offloadGPUGemma 4
0 likes · 18 min read
vLLM 0.19.0: HuggingFace v5 Support, Multimodal Boosts, and CPU KV Cache Offload
DeepHub IMBA
DeepHub IMBA
Apr 2, 2026 · Artificial Intelligence

Speculative Decoding Explained: Small Draft Model + One‑Shot Verification

The article details how speculative decoding—using a fast small model to draft tokens and a large model to verify them—overcomes the memory‑bandwidth bottleneck of autoregressive inference, introduces SSD’s self‑draft and tree‑verification stages, presents real‑world benchmark gains, and shows how to enable it in vLLM.

GPU memory bandwidthInference OptimizationSSD
0 likes · 14 min read
Speculative Decoding Explained: Small Draft Model + One‑Shot Verification
Machine Heart
Machine Heart
Apr 1, 2026 · Artificial Intelligence

SSD Framework Doubles Inference Speed Over Top Engines, Breaking the Serial Bottleneck

The SSD framework and its SAGUARO optimization, developed by researchers from Stanford, Princeton, and Together AI, parallelize drafting and verification in speculative decoding, eliminating serial dependencies and achieving up to 2× faster inference than the world’s strongest engines and up to 5× speedup over standard autoregressive generation, while addressing challenges such as prediction accuracy, acceptance‑rate trade‑offs, and fallback strategies.

Inference AccelerationSAGUAROSSD
0 likes · 7 min read
SSD Framework Doubles Inference Speed Over Top Engines, Breaking the Serial Bottleneck
Old Zhang's AI Learning
Old Zhang's AI Learning
Mar 27, 2026 · Artificial Intelligence

vLLM’s Four Major 2026 Updates: Semantic Router Athena, Nemotron 3 Super, P‑EAGLE, and Model Runner V2

The March 2026 vLLM release bundle introduces four substantial upgrades—Semantic Router v0.2 Athena, NVIDIA Nemotron 3 Super, the parallel speculative decoding P‑EAGLE, and a completely re‑architected Model Runner V2—each backed by concrete benchmarks, architectural diagrams, and code examples that demonstrate how the engine evolves from a pure inference engine to a full‑stack AI serving platform.

GPU AccelerationModel Runner V2Nemotron-3-Super
0 likes · 17 min read
vLLM’s Four Major 2026 Updates: Semantic Router Athena, Nemotron 3 Super, P‑EAGLE, and Model Runner V2
Old Zhang's AI Learning
Old Zhang's AI Learning
Mar 23, 2026 · Artificial Intelligence

How Large‑Model Research Is Shifting: Insights from 120 Top Papers

The article reveals that large‑model research has moved from sheer scale to deeper capabilities and multimodal integration, highlighting ten hot directions and summarizing 120 recent top‑conference papers—including Spec‑VLA, Mobile‑O, OccTENS, and latent‑CoT studies—while offering free access to the full collection.

3D occupancy modelingMultimodal AIcausal reasoning
0 likes · 7 min read
How Large‑Model Research Is Shifting: Insights from 120 Top Papers
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Mar 12, 2026 · Artificial Intelligence

Nvidia’s Nemotron 3 Super Enters OpenClaw, Rivalling Opus 4.6

Nvidia unveiled the 120‑billion‑parameter Nemotron 3 Super, featuring a Mamba‑MoE hybrid architecture, LatentMoE routing, and Multi‑Token Prediction that together deliver up to 5× higher throughput and 3× faster inference, achieve 85.6% success on OpenClaw—matching Claude Opus 4.6 and GPT‑5.4—and set new records across Pinchbench, MMLU, SWE‑Bench, and other benchmarks, all while being fully open‑sourced with its training data and RL pipelines.

AI agentsLatentMoEMamba-MoE
0 likes · 14 min read
Nvidia’s Nemotron 3 Super Enters OpenClaw, Rivalling Opus 4.6
Old Zhang's AI Learning
Old Zhang's AI Learning
Mar 7, 2026 · Artificial Intelligence

vLLM 0.17.0 Release: Full Qwen 3.5 Support and Anthropic API Compatibility

The vLLM 0.17.0 release brings FlashAttention 4 integration, a mature Model Runner V2, complete Qwen 3.5 series support, a one‑click performance‑mode flag, Anthropic API compatibility, advanced weight‑offloading, broader hardware support beyond NVIDIA, ASR model integration, and detailed upgrade and installation guidance.

ASRAnthropic APIFlashAttention
0 likes · 12 min read
vLLM 0.17.0 Release: Full Qwen 3.5 Support and Anthropic API Compatibility
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Mar 5, 2026 · Artificial Intelligence

Mamba’s SSD Framework Shatters Serial Bottleneck, Outperforms vLLM and SGLang

The new Speculative Speculative Decoding (SSD) framework, built by the Mamba and FlashAttention authors, eliminates the serial draft‑verification bottleneck in LLM inference by running the draft model asynchronously, introducing a speculation cache and the Saguaro algorithm, which together deliver up to 5× speedup over autoregressive baselines and up to 2× over optimized engines on Llama‑3 and Qwen‑3, reshaping the latency‑throughput trade‑off.

Asynchronous ParallelismLLM inferencePerformance Optimization
0 likes · 9 min read
Mamba’s SSD Framework Shatters Serial Bottleneck, Outperforms vLLM and SGLang
Old Zhang's AI Learning
Old Zhang's AI Learning
Feb 16, 2026 · Artificial Intelligence

A New Extreme Quantization Tool for Large Models: AngelSlim’s 2‑Bit Compression

AngelSlim introduces a full‑stack large‑model compression suite that uses quantization‑aware training to shrink a 1.8B LLM to 2‑bit precision, achieving less than 4% accuracy loss, supporting a wide range of models, speculative decoding, and providing end‑to‑end deployment instructions for MacBook M4 and server environments.

AngelSlimGGUFQAT
0 likes · 13 min read
A New Extreme Quantization Tool for Large Models: AngelSlim’s 2‑Bit Compression
AI2ML AI to Machine Learning
AI2ML AI to Machine Learning
Feb 4, 2026 · Artificial Intelligence

Google’s Second Sword: Accelerating LLM Inference with Speculative Decoding and Cascades

The article analyzes Google’s shift from scaling‑law to efficiency‑law, detailing how speculative decoding, language‑model cascades, distillation, CALM, accurate quantized training, and the Mixture‑of‑Recursions architecture together form a multi‑layered strategy to cut inference cost, boost throughput, and sustain the company’s AI moat.

Google TPUInference AccelerationLanguage Model Cascades
0 likes · 8 min read
Google’s Second Sword: Accelerating LLM Inference with Speculative Decoding and Cascades
AI2ML AI to Machine Learning
AI2ML AI to Machine Learning
Dec 27, 2025 · Artificial Intelligence

Why Jeff Dean Champions Speculative Decoding: The Underlying Ideas

Jeff Dean highlighted speculative decoding as a lossless inference acceleration technique that can boost large language model throughput by 2–3×, and the article breaks down its core concepts—including parallel token verification, draft‑target model collaboration, rejection sampling theory, and practical optimizations such as continuous batching and tree‑based verification.

Continuous BatchingDraft-Target ModelInference Acceleration
0 likes · 8 min read
Why Jeff Dean Champions Speculative Decoding: The Underlying Ideas
Tencent Tech
Tencent Tech
Oct 27, 2025 · Artificial Intelligence

How SpecExit Cuts Large Reasoning Model Inference Time by Up to 2.5×

SpecExit combines early‑exit and speculative decoding to let large reasoning models detect when they have almost finished thinking, trimming redundant chain‑of‑thought steps, reducing over‑thinking by 72% and achieving up to 2.5× faster end‑to‑end inference without noticeable accuracy loss.

AIInference Accelerationearly exit
0 likes · 6 min read
How SpecExit Cuts Large Reasoning Model Inference Time by Up to 2.5×
Data Party THU
Data Party THU
Aug 22, 2025 · Artificial Intelligence

TwigVLM: How Tiny Branches Accelerate Large Vision‑Language Models

TwigVLM introduces a lightweight “twig” module that prunes visual tokens early and enables self‑speculative decoding, achieving up to 154% speedup on long‑text generation while preserving 96% of original LVLM accuracy, as demonstrated on LLaVA‑1.5‑7B and other benchmarks.

LVLMMultimodal AIToken Pruning
0 likes · 14 min read
TwigVLM: How Tiny Branches Accelerate Large Vision‑Language Models
Wu Shixiong's Large Model Academy
Wu Shixiong's Large Model Academy
Aug 20, 2025 · Artificial Intelligence

Mastering Large‑Model Interview Questions: MHA, KV‑Cache, Scaled Dot‑Product, and Speculative Decoding

This guide walks through common large‑model interview challenges, including a hands‑on implementation of multi‑head attention with KV‑cache, the mathematical reason for scaling by sqrt(dₖ), a concise speculative decoding algorithm, and systematic debugging steps for NaN loss during training.

KV cacheLarge Model InterviewMulti‑Head Attention
0 likes · 14 min read
Mastering Large‑Model Interview Questions: MHA, KV‑Cache, Scaled Dot‑Product, and Speculative Decoding
Data Party THU
Data Party THU
Aug 10, 2025 · Artificial Intelligence

Can LLMs Predict Multiple Tokens at Once? A Deep Dive into Multi‑Token Generation

This article evaluates whether autoregressive large language models can generate several tokens in a single inference step, describing a mask‑based multi‑token prediction framework, gated LoRA adaptation, experimental results on Tulu‑3‑8B showing up to 5.2× speedup, and discusses implications for future research.

AI efficiencyLLMMulti-token generation
0 likes · 13 min read
Can LLMs Predict Multiple Tokens at Once? A Deep Dive into Multi‑Token Generation
ELab Team
ELab Team
Jul 9, 2025 · Artificial Intelligence

How Fast‑Apply AI Models Revolutionize Code Editing with Speculative Decoding

This article explains the design of the edit_file tool, the fast‑apply model that rewrites whole files instead of diffs, its training and evaluation methodology, speculative decoding speed gains, and future research directions for large‑scale code‑editing AI systems.

AIModel Trainingcode editing
0 likes · 14 min read
How Fast‑Apply AI Models Revolutionize Code Editing with Speculative Decoding
AI Algorithm Path
AI Algorithm Path
May 1, 2025 · Artificial Intelligence

Uncovering the Secrets of LLM Inference Optimization

This article dissects the major bottlenecks of large‑language‑model serving—prefill vs. decode, sparsity, memory bandwidth, KV‑cache growth—and walks through concrete engineering tricks such as paged attention, radix‑tree KV caches, compressed attention, speculative decoding, FlexGen weight scheduling, FastServe queuing, plus a runnable vLLM code snippet.

FastServeFlexGenInference Optimization
0 likes · 18 min read
Uncovering the Secrets of LLM Inference Optimization
Architect
Architect
Mar 1, 2025 · Artificial Intelligence

How to Build a High‑Performance, Scalable LLM Inference Engine: From Paged Attention to Multi‑GPU Parallelism

This article analyzes the challenges of deploying large language models locally and presents a comprehensive set of engineering techniques—including CPU/GPU process separation, Paged Attention, Radix Attention, chunked prefill, output‑length reduction, multi‑GPU tensor parallelism, and speculative decoding—to dramatically boost inference throughput and cut response latency.

LLM inferencePaged AttentionPerformance Optimization
0 likes · 23 min read
How to Build a High‑Performance, Scalable LLM Inference Engine: From Paged Attention to Multi‑GPU Parallelism
DeWu Technology
DeWu Technology
Feb 17, 2025 · Artificial Intelligence

Optimizing Large Model Inference: High‑Performance Frameworks and Techniques

The article reviews high‑performance inference strategies for large language models such as Deepseek‑R1, detailing CPU‑GPU process separation, Paged and Radix Attention, Chunked Prefill, output‑length reduction, tensor‑parallel multi‑GPU scaling, and speculative decoding, each shown to markedly boost throughput and cut latency in real deployments.

AIDistributed inferenceGPU Acceleration
0 likes · 22 min read
Optimizing Large Model Inference: High‑Performance Frameworks and Techniques
Meituan Technology Team
Meituan Technology Team
Aug 8, 2024 · Artificial Intelligence

Highlights of Meituan's ACL 2024 Papers: Speculative Decoding, Graph‑Structured Decoding, DolphCoder, and Instruction Fine‑tuning

This article reviews four ACL 2024 papers authored by Meituan’s research team—covering training cost reduction, speculative decoding, code generation optimization, and instruction fine‑tuning—while also announcing a live sharing session at the conference.

ACL 2024Code GenerationLLM
0 likes · 9 min read
Highlights of Meituan's ACL 2024 Papers: Speculative Decoding, Graph‑Structured Decoding, DolphCoder, and Instruction Fine‑tuning
NewBeeNLP
NewBeeNLP
Feb 8, 2024 · Artificial Intelligence

How Speculative Decoding Supercharges Large Language Model Inference

This survey examines speculative decoding—a draft‑then‑verify technique that parallelizes token generation to cut LLM inference latency, outlines its core components, compares independent and self‑drafting methods, discusses verification strategies, and highlights open research challenges.

LLM inferenceParallelismPerformance Optimization
0 likes · 15 min read
How Speculative Decoding Supercharges Large Language Model Inference