Tagged articles
5 articles
Page 1 of 1
Woodpecker Software Testing
Woodpecker Software Testing
Apr 24, 2026 · Artificial Intelligence

Practical Guide to Optimizing Large Model Performance in Production

This guide details how enterprises can move large language models from lab to production by defining specific SLI/SLO metrics, diagnosing hidden bottlenecks such as tokenizer latency, and applying four quantifiable optimization levers that dramatically improve latency, throughput, and cost efficiency.

Continuous BatchingGPU OptimizationLoRA
0 likes · 6 min read
Practical Guide to Optimizing Large Model Performance in Production
Ops Community
Ops Community
Jan 18, 2026 · Artificial Intelligence

How to Quadruple LLM Throughput with vLLM’s PagedAttention and Continuous Batching

This guide details how to replace native Transformers inference with the high‑performance vLLM engine, leveraging PagedAttention, continuous batching, tensor parallelism, and OpenAI‑compatible APIs to achieve 3‑4× higher throughput, lower latency, and scalable multi‑GPU deployments for production‑grade large language models.

Continuous BatchingGPU OptimizationOpenAI API Compatibility
0 likes · 61 min read
How to Quadruple LLM Throughput with vLLM’s PagedAttention and Continuous Batching
AI2ML AI to Machine Learning
AI2ML AI to Machine Learning
Dec 27, 2025 · Artificial Intelligence

Why Jeff Dean Champions Speculative Decoding: The Underlying Ideas

Jeff Dean highlighted speculative decoding as a lossless inference acceleration technique that can boost large language model throughput by 2–3×, and the article breaks down its core concepts—including parallel token verification, draft‑target model collaboration, rejection sampling theory, and practical optimizations such as continuous batching and tree‑based verification.

Continuous BatchingDraft-Target ModelInference Acceleration
0 likes · 8 min read
Why Jeff Dean Champions Speculative Decoding: The Underlying Ideas
Bilibili Tech
Bilibili Tech
Jan 21, 2025 · Artificial Intelligence

Accelerating Large Model Inference: Challenges and Multi‑Level Optimization Strategies

The article outlines how exploding LLM sizes create compute, memory, and latency bottlenecks and proposes a full‑stack solution—operator fusion, high‑performance libraries, quantization, speculative decoding, sharding, contiguous batching, PageAttention, and specialized frameworks like MindIE‑LLM—to dramatically boost inference throughput and reduce latency, while highlighting future ultra‑low‑bit and heterogeneous hardware directions.

Continuous BatchingHardware OptimizationInference Acceleration
0 likes · 21 min read
Accelerating Large Model Inference: Challenges and Multi‑Level Optimization Strategies
DataFunSummit
DataFunSummit
Dec 4, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference with the YiNian LLM Framework

This article presents the YiNian LLM framework, detailing how KVCache, prefill/decoding separation, continuous batching, PageAttention, and multi‑hardware scheduling are used to speed up large language model inference while managing GPU memory and latency.

AI accelerationContinuous BatchingGPU
0 likes · 20 min read
Accelerating Large Language Model Inference with the YiNian LLM Framework