Tagged articles

32 articles

May 16, 2026 · Artificial Intelligence

vLLM 0.21.0 Arrives: Speculative Decoding Now Supports Reasoning Models

The vLLM 0.21.0 release brings five major updates—including Transformers v4 deprecation, a C++20 build requirement, KV offload with hybrid memory, speculative decoding that respects thinking budgets, and a Blackwell token‑speed backend—while offering detailed upgrade guidance for different user groups.

C++20 · Inference · KV cache

12 min read

Machine Learning Algorithms & Natural Language Processing

May 14, 2026 · Artificial Intelligence

Elastic Speculative Decoding Breaks Large‑Model Inference Bottlenecks

The paper introduces ECHO, an elastic speculative decoding framework that treats token verification as a global budget‑scheduling problem, uses sparse confidence gating and a two‑level priority scheduler, and demonstrates up to 14.4% throughput gains for high‑concurrency LLM serving.

Inference Optimization · elastic budget · large language models

14 min read

Old Zhang's AI Learning

May 12, 2026 · Artificial Intelligence

How Unsloth’s MTP Boosts Qwen3.6 Inference Speed on Consumer GPUs

Unsloth adds MTP to Qwen3.6‑27B and 35B‑A3B models, delivering 1.5‑2× decoding speed gains on consumer‑grade GPUs, with ~80% draft acceptance, while providing installation steps, usage parameters, benchmark results, and guidance on suitable scenarios.

GGUF · GPU · MTP

9 min read

Old Zhang's AI Learning

May 10, 2026 · Artificial Intelligence

DFlash Boosts Large Model Inference Up to 6× – Now Supporting DeepSeek-V4

DFlash replaces the speculative draft model with a block‑diffusion drafter, generating 16 tokens per forward pass and achieving up to 6× speedup over baseline (2.5× over EAGLE‑3) without quality loss, while supporting a wide range of open‑source LLMs and multiple back‑ends.

Block Diffusion · DFlash · LLM inference

12 min read

Lao Guo's Learning Space

May 7, 2026 · Artificial Intelligence

Gemma 4 MTP Deep Dive: Speculative Decoding & KV‑Cache Sharing for 3× Faster Inference

The article explains why large language model inference is bottlenecked by memory bandwidth, then details Google’s Gemma 4 MTP technique, which uses a small draft model with speculative decoding and a shared KV cache to parallelize token prediction and achieves up to three‑fold speed gains without any loss in output quality. It also provides step‑by‑step local deployment instructions.

Gemma 4 · Inference Optimization · KV cache

11 min read

AI Engineer Programming

May 7, 2026 · Artificial Intelligence

How Cursor Turned Its Coding Agent from Demo to Production

The article examines Cursor's journey of shipping its Composer coding agent, detailing the agentic AI model, system architecture, and the three major production challenges—diff handling, latency accumulation, and sandbox scaling—along with the engineering solutions that enabled reliable, fast, and adoptable AI‑driven code generation.

Agentic AI · Coding Agent · Cursor

16 min read

Old Zhang's AI Learning

May 6, 2026 · Artificial Intelligence

Google Boosts Gemma 4 Inference Speed Up to 3× with MTP Drafter and Day‑0 vLLM Support

Google’s new Multi‑Token Prediction (MTP) drafter for Gemma 4 delivers up to three‑fold inference speedups across hardware and frameworks—validated by official benchmarks and independent DGX Spark tests—while preserving identical output quality, and is immediately usable via Hugging Face, vLLM, MLX, Ollama and edge‑device runtimes.

Apple Silicon · Gemma 4 · LLM inference

9 min read

Old Zhang's AI Learning

May 1, 2026 · Artificial Intelligence

DeepSeek‑V4 Local Deployment: How SGLang Overcomes the Architecture Challenges

The article analyzes DeepSeek‑V4's architectural innovations—including mixed sparse attention, mHC, and native FP4 weights—explains SGLang's ShadowRadix, HiSparse, and in‑graph speculative decoding solutions, presents benchmark gains, provides Docker deployment steps, and warns of key pitfalls for long‑context inference.

DeepSeek-V4 · HiSparse · SGLang

15 min read

Old Zhang's AI Learning

Apr 17, 2026 · Artificial Intelligence

How DFlash Achieves 8× Lossless Acceleration for Large‑Model Inference (Qwen3.5‑27B Example)

The article explains how DFlash’s block‑diffusion draft model and KV Injection boost speculative decoding speed by 5‑8× without sacrificing output quality, and how DDTree further raises the gain to over 8×, backed by benchmark results and integration guides for major inference frameworks.

DDTree · DFlash · acceleration

7 min read

Old Zhang's AI Learning

Apr 14, 2026 · Artificial Intelligence

Qwen3.5-27B-DFlash Delivers Up to 5× Faster Inference Without Quality Loss

The DFlash approach replaces speculative decoding’s autoregressive drafter with a block diffusion model and injects target‑model hidden features into every KV‑cache layer, achieving up to 5× speed‑up for Qwen3.5‑27B on single‑GPU and 1.5–1.9× on high‑concurrency workloads while preserving output quality.

DFlash · Inference Acceleration · SGLang

12 min read

Old Zhang's AI Learning

Apr 7, 2026 · Artificial Intelligence

vLLM 0.19.0: HuggingFace v5 Support, Multimodal Boosts, and CPU KV Cache Offload

The vLLM 0.19.0 release adds first‑day Gemma 4 support, merges zero‑bubble asynchronous scheduling with speculative decoding, matures Model Runner V2, introduces full‑CUDA‑graph acceleration for ViT, generalizes DBO, brings CPU KV cache offload, and expands hardware and Transformers compatibility, offering substantial performance and flexibility gains for production LLM inference.

CPU KV offload · GPU · Gemma 4

18 min read

DeepHub IMBA

Apr 2, 2026 · Artificial Intelligence

Speculative Decoding Explained: Small Draft Model + One‑Shot Verification

The article details how speculative decoding—using a fast small model to draft tokens and a large model to verify them—overcomes the memory‑bandwidth bottleneck of autoregressive inference, introduces SSD’s self‑draft and tree‑verification stages, presents real‑world benchmark gains, and shows how to enable it in vLLM.

GPU memory bandwidth · Inference Optimization · SSD

14 min read

Machine Heart

Apr 1, 2026 · Artificial Intelligence

SSD Framework Doubles Inference Speed Over Top Engines, Breaking the Serial Bottleneck

The SSD framework and its SAGUARO optimization, developed by researchers from Stanford, Princeton, and Together AI, parallelize drafting and verification in speculative decoding, eliminating serial dependencies and achieving up to 2× faster inference than the world’s strongest engines and up to 5× speedup over standard autoregressive generation, while addressing challenges such as prediction accuracy, acceptance‑rate trade‑offs, and fallback strategies.

Inference Acceleration · SAGUARO · SSD

7 min read

Old Zhang's AI Learning

Mar 27, 2026 · Artificial Intelligence

vLLM’s Four Major 2026 Updates: Semantic Router Athena, Nemotron 3 Super, P‑EAGLE, and Model Runner V2

The March 2026 vLLM release bundle introduces four substantial upgrades—Semantic Router v0.2 Athena, NVIDIA Nemotron 3 Super, the parallel speculative decoding P‑EAGLE, and a completely re‑architected Model Runner V2—each backed by concrete benchmarks, architectural diagrams, and code examples that demonstrate how the engine evolves from a pure inference engine to a full‑stack AI serving platform.

GPU Acceleration · Model Runner V2 · Nemotron-3-Super

17 min read

Old Zhang's AI Learning

Mar 23, 2026 · Artificial Intelligence

How Large‑Model Research Is Shifting: Insights from 120 Top Papers

The article reveals that large‑model research has moved from sheer scale to deeper capabilities and multimodal integration, highlighting ten hot directions and summarizing 120 recent top‑conference papers—including Spec‑VLA, Mobile‑O, OccTENS, and latent‑CoT studies—while offering free access to the full collection.

3D occupancy modeling · Multimodal AI · causal reasoning

7 min read

Machine Learning Algorithms & Natural Language Processing

Mar 12, 2026 · Artificial Intelligence

Nvidia’s Nemotron 3 Super Enters OpenClaw, Rivalling Opus 4.6

Nvidia unveiled the 120‑billion‑parameter Nemotron 3 Super, featuring a Mamba‑MoE hybrid architecture, LatentMoE routing, and Multi‑Token Prediction that together deliver up to 5× higher throughput and 3× faster inference, achieve 85.6% success on OpenClaw—matching Claude Opus 4.6 and GPT‑5.4—and set new records across Pinchbench, MMLU, SWE‑Bench, and other benchmarks, all while being fully open‑sourced with its training data and RL pipelines.

AI agents · LatentMoE · Mamba-MoE

14 min read

Old Zhang's AI Learning

Mar 7, 2026 · Artificial Intelligence

vLLM 0.17.0 Release: Full Qwen 3.5 Support and Anthropic API Compatibility

The vLLM 0.17.0 release brings FlashAttention 4 integration, a mature Model Runner V2, complete Qwen 3.5 series support, a one‑click performance‑mode flag, Anthropic API compatibility, advanced weight‑offloading, broader hardware support beyond NVIDIA, ASR model integration, and detailed upgrade and installation guidance.

ASR · Anthropic API · FlashAttention

12 min read

Machine Learning Algorithms & Natural Language Processing

Mar 5, 2026 · Artificial Intelligence

Mamba’s SSD Framework Shatters Serial Bottleneck, Outperforms vLLM and SGLang

The new Speculative Speculative Decoding (SSD) framework, built by the Mamba and FlashAttention authors, eliminates the serial draft‑verification bottleneck in LLM inference by running the draft model asynchronously, introducing a speculation cache and the Saguaro algorithm, which together deliver up to 5× speedup over autoregressive baselines and up to 2× over optimized engines on Llama‑3 and Qwen‑3, reshaping the latency‑throughput trade‑off.

Asynchronous Parallelism · LLM inference · Performance Optimization

9 min read

Old Zhang's AI Learning

Feb 16, 2026 · Artificial Intelligence

A New Extreme Quantization Tool for Large Models: AngelSlim’s 2‑Bit Compression

AngelSlim introduces a full‑stack large‑model compression suite that uses quantization‑aware training to shrink a 1.8B LLM to 2‑bit precision, achieving less than 4% accuracy loss, supporting a wide range of models, speculative decoding, and providing end‑to‑end deployment instructions for MacBook M4 and server environments.

AngelSlim · GGUF · QAT

13 min read

AI2ML AI to Machine Learning

Feb 4, 2026 · Artificial Intelligence

Google’s Second Sword: Accelerating LLM Inference with Speculative Decoding and Cascades

The article analyzes Google’s shift from scaling‑law to efficiency‑law, detailing how speculative decoding, language‑model cascades, distillation, CALM, accurate quantized training, and the Mixture‑of‑Recursions architecture together form a multi‑layered strategy to cut inference cost, boost throughput, and sustain the company’s AI moat.

Google TPU · Inference Acceleration · Language Model Cascades

8 min read

Tencent Technical Engineering

Jan 13, 2026 · Artificial Intelligence

Boost LLM Inference 1.9× with AngelSlim’s Speculative Decoding (Eagle3)

AngelSlim introduces a system‑wide speculative decoding framework called Eagle3 that combines lightweight draft models with parallel verification by large models, delivering up to 1.9× faster inference across LLM, vision‑language, and speech tasks while remaining open‑source and deployment‑ready.

AngelSlim · Eagle3 · LLM acceleration

9 min read

AI2ML AI to Machine Learning

Dec 27, 2025 · Artificial Intelligence

Why Jeff Dean Champions Speculative Decoding: The Underlying Ideas

Jeff Dean highlighted speculative decoding as a lossless inference acceleration technique that can boost large language model throughput by 2–3×, and the article breaks down its core concepts—including parallel token verification, draft‑target model collaboration, rejection sampling theory, and practical optimizations such as continuous batching and tree‑based verification.

Continuous Batching · Draft-Target Model · Inference Acceleration

8 min read

Tencent Tech

Oct 27, 2025 · Artificial Intelligence

How SpecExit Cuts Large Reasoning Model Inference Time by Up to 2.5×

SpecExit combines early‑exit and speculative decoding to let large reasoning models detect when they have almost finished thinking, trimming redundant chain‑of‑thought steps, reducing over‑thinking by 72% and achieving up to 2.5× faster end‑to‑end inference without noticeable accuracy loss.

AI · Inference Acceleration · early exit

6 min read

Data Party THU

Aug 22, 2025 · Artificial Intelligence

TwigVLM: How Tiny Branches Accelerate Large Vision‑Language Models

TwigVLM introduces a lightweight “twig” module that prunes visual tokens early and enables self‑speculative decoding, achieving up to 154% speedup on long‑text generation while preserving 96% of original LVLM accuracy, as demonstrated on LLaVA‑1.5‑7B and other benchmarks.

LVLM · Multimodal AI · Token Pruning

14 min read

Wu Shixiong's Large Model Academy

Aug 20, 2025 · Artificial Intelligence

Mastering Large‑Model Interview Questions: MHA, KV‑Cache, Scaled Dot‑Product, and Speculative Decoding

This guide walks through common large‑model interview challenges, including a hands‑on implementation of multi‑head attention with KV‑cache, the mathematical reason for scaling by sqrt(dₖ), a concise speculative decoding algorithm, and systematic debugging steps for NaN loss during training.

KV cache · Large Model Interview · Multi‑Head Attention

14 min read

Data Party THU

Aug 10, 2025 · Artificial Intelligence

Can LLMs Predict Multiple Tokens at Once? A Deep Dive into Multi‑Token Generation

This article evaluates whether autoregressive large language models can generate several tokens in a single inference step, describing a mask‑based multi‑token prediction framework, gated LoRA adaptation, experimental results on Tulu‑3‑8B showing up to 5.2× speedup, and discusses implications for future research.

AI efficiency · LLM · Multi-token generation

13 min read

ELab Team

Jul 9, 2025 · Artificial Intelligence

How Fast‑Apply AI Models Revolutionize Code Editing with Speculative Decoding

This article explains the design of the edit_file tool, the fast‑apply model that rewrites whole files instead of diffs, its training and evaluation methodology, speculative decoding speed gains, and future research directions for large‑scale code‑editing AI systems.

AI · Model Training · code editing

14 min read

AI Algorithm Path

May 1, 2025 · Artificial Intelligence

Uncovering the Secrets of LLM Inference Optimization

This article dissects the major bottlenecks of large‑language‑model serving—prefill vs. decode, sparsity, memory bandwidth, KV‑cache growth—and walks through concrete engineering tricks such as paged attention, radix‑tree KV caches, compressed attention, speculative decoding, FlexGen weight scheduling, FastServe queuing, plus a runnable vLLM code snippet.

FastServe · FlexGen · Inference Optimization

18 min read

Architect

Mar 1, 2025 · Artificial Intelligence

How to Build a High‑Performance, Scalable LLM Inference Engine: From Paged Attention to Multi‑GPU Parallelism

This article analyzes the challenges of deploying large language models locally and presents a comprehensive set of engineering techniques—including CPU/GPU process separation, Paged Attention, Radix Attention, chunked prefill, output‑length reduction, multi‑GPU tensor parallelism, and speculative decoding—to dramatically boost inference throughput and cut response latency.

LLM inference · Paged Attention · Performance Optimization

23 min read

DeWu Technology

Feb 17, 2025 · Artificial Intelligence

Optimizing Large Model Inference: High‑Performance Frameworks and Techniques

The article reviews high‑performance inference strategies for large language models such as DeepSeek‑R1, detailing CPU‑GPU process separation, Paged and Radix Attention, Chunked Prefill, output‑length reduction, tensor‑parallel multi‑GPU scaling, and speculative decoding, each shown to markedly boost throughput and cut latency in real deployments.

AI · Distributed inference · GPU Acceleration

22 min read

Meituan Technology Team

Aug 8, 2024 · Artificial Intelligence

Highlights of Meituan's ACL 2024 Papers: Speculative Decoding, Graph‑Structured Decoding, DolphCoder, and Instruction Fine‑tuning

This article reviews four ACL 2024 papers authored by Meituan’s research team—covering training cost reduction, speculative decoding, code generation optimization, and instruction fine‑tuning—while also announcing a live sharing session at the conference.

ACL 2024 · Code Generation · LLM

9 min read

NewBeeNLP

Feb 8, 2024 · Artificial Intelligence

How Speculative Decoding Supercharges Large Language Model Inference

This survey examines speculative decoding—a draft‑then‑verify technique that parallelizes token generation to cut LLM inference latency, outlines its core components, compares independent and self‑drafting methods, discusses verification strategies, and highlights open research challenges.

LLM inference · Parallelism · Performance Optimization

15 min read