Tagged articles

Continuous Batching

7 articles

Jun 7, 2026 · Artificial Intelligence

Hands‑On LLM Local Deployment: vLLM Inference Optimizations Explained

The article explains why LLM inference is memory‑bound, introduces vLLM’s three core optimizations—Continuous Batching, PagedAttention, and Prefix Caching—shows how to launch a vLLM server, run Python code to benchmark performance, and examines KV‑Cache memory usage with concrete numbers.

Continuous Batching · KV cache · LLM Inference

0 likes · 11 min read

Tencent Technical Engineering

May 25, 2026 · Artificial Intelligence

vLLM Deep Dive: Continuous Batching and Paged Attention for Fast LLM Inference

This article walks through a two‑month source‑code study of vLLM, explaining how token‑level scheduling, continuous batching, and the Paged Attention mechanism reshape tensor dimensions to turn large‑model inference into a compute‑bound, high‑throughput process while managing GPU memory efficiently.

Continuous Batching · FlashAttention · GPU Optimization

0 likes · 29 min read

Woodpecker Software Testing

Apr 24, 2026 · Artificial Intelligence

Practical Guide to Optimizing Large Model Performance in Production

This guide details how enterprises can move large language models from lab to production by defining specific SLI/SLO metrics, diagnosing hidden bottlenecks such as tokenizer latency, and applying four quantifiable optimization levers that dramatically improve latency, throughput, and cost efficiency.

Continuous Batching · GPU Optimization · LoRA

0 likes · 6 min read

Ops Community

Jan 18, 2026 · Artificial Intelligence

How to Quadruple LLM Throughput with vLLM’s PagedAttention and Continuous Batching

This guide details how to replace native Transformers inference with the high‑performance vLLM engine, leveraging PagedAttention, continuous batching, tensor parallelism, and OpenAI‑compatible APIs to achieve 3‑4× higher throughput, lower latency, and scalable multi‑GPU deployments for production‑grade large language models.

Continuous Batching · GPU Optimization · OpenAI API Compatibility

0 likes · 61 min read

AI2ML AI to Machine Learning

Dec 27, 2025 · Artificial Intelligence

Why Jeff Dean Champions Speculative Decoding: The Underlying Ideas

Jeff Dean highlighted speculative decoding as a lossless inference acceleration technique that can boost large language model throughput by 2–3×, and the article breaks down its core concepts—including parallel token verification, draft‑target model collaboration, rejection sampling theory, and practical optimizations such as continuous batching and tree‑based verification.

Continuous Batching · Draft-Target Model · KV cache

0 likes · 8 min read

Bilibili Tech

Jan 21, 2025 · Artificial Intelligence

Accelerating Large Model Inference: Challenges and Multi‑Level Optimization Strategies

The article outlines how rapidly growing LLM sizes create compute, memory, and latency bottlenecks and proposes a full‑stack solution (operator fusion, high‑performance libraries, quantization, speculative decoding, sharding, continuous batching, PagedAttention, and specialized frameworks such as MindIE‑LLM) to boost inference throughput and reduce latency, while highlighting future directions in ultra‑low‑bit quantization and heterogeneous hardware.

Continuous Batching · Multi-modal · Operator fusion

0 likes · 21 min read

DataFunSummit

Dec 4, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference with the YiNian LLM Framework

This article presents the YiNian LLM framework, detailing how KV cache, prefill/decode separation, continuous batching, PagedAttention, and multi‑hardware scheduling are used to speed up large language model inference while managing GPU memory and latency.

AI acceleration · Continuous Batching · GPU

0 likes · 20 min read
