Uncovering the Secrets of LLM Inference Optimization

This article dissects the major bottlenecks of large‑language‑model serving—prefill vs. decode, sparsity, memory bandwidth, KV‑cache growth—and walks through concrete engineering tricks such as paged attention, radix‑tree KV caches, compressed attention, speculative decoding, FlexGen weight scheduling, FastServe queuing, plus a runnable vLLM code snippet.


Introduction

If you are building AI services on top of large language models (LLMs), cost is only one side of the problem: performance bottlenecks can make even a generous budget insufficient. This article explores how to turn LLM inference from a costly operation into a high-throughput engine.

LLM Service Challenges

LLM inference consists of two stages:

Prefilling : the model processes the entire prompt (context, dialogue history, question) in one forward pass.

Decoding : after the prompt, the model generates tokens one by one, each new token depending on all previous ones.

Prefilling is likened to setting up a chess board, while decoding is the sequential placement of pieces.
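
The two stages can be made concrete with a short sketch using Hugging Face Transformers (the GPT-2 checkpoint and the 20-token generation length are illustrative choices, not from the article): prefill is a single forward pass over the whole prompt, and decode is a loop that feeds one token at a time while reusing the cached keys and values.

# Sketch: prefill vs. decode with an explicit KV cache (illustrative, GPT-2).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("The rules of chess are", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one forward pass over the entire prompt ("setting up the board").
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values                       # cached K/V for all prompt tokens
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)

    # Decode: one token per step, each step reusing the cache ("placing pieces").
    generated = [next_id]
    for _ in range(20):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))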

Key Technical Obstacles

Sparsity : Many neurons in feed‑forward networks produce zero activations. Skipping those zero values can dramatically reduce compute (see the sketch after this list).

Memory‑bandwidth limits : Moving data between GPUs often costs more time than the arithmetic itself, and trillion‑parameter models exceed the capacity of a single GPU.

Scheduling inefficiency : First‑come‑first‑serve (FCFS) queues cause short requests (e.g., “what time is it?”) to wait behind long ones, making queueing latency dominate overall response time.

Sequential decoding bottleneck : Because tokens cannot be processed in parallel, long replies appear token‑by‑token, which is why streaming output feels faster than waiting for a full response.
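
To make the sparsity point above concrete, here is a small illustrative NumPy sketch (the layer sizes are made up): after a ReLU, only the columns of the down-projection that correspond to non-zero activations actually need to be multiplied.

# Illustrative only: exploiting activation sparsity in a ReLU feed-forward block.
import numpy as np

d_model, d_ff = 512, 2048
x = np.random.randn(d_model)
W_up = np.random.randn(d_ff, d_model)
W_down = np.random.randn(d_model, d_ff)

h = np.maximum(W_up @ x, 0.0)            # ReLU: many entries end up exactly zero
active = np.nonzero(h)[0]                # indices of the non-zero activations

# Dense path multiplies the full d_ff; sparse path only the active columns.
y_dense = W_down @ h
y_sparse = W_down[:, active] @ h[active]

assert np.allclose(y_dense, y_sparse)
print(f"active neurons: {len(active)}/{d_ff}")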

KV‑Cache Growth

Attention is the most compute‑intensive operation in LLM inference, and each new token repeats the same attention computation over all previous tokens. The KV‑cache stores the key/value pairs from earlier steps instead of recomputing them; on a T4 GPU this can speed up GPT‑2 by 5×. However, the cache consumes substantial memory and is used inefficiently: reported utilization of the reserved KV‑cache memory is only 20.4 %–38.2 %, with the rest lost to fragmentation. In a personal test with Qwen2-VL, enabling the cache gave roughly a 20 % speedup on a 10 k‑image batch.
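
A back-of-the-envelope calculation shows why the cache grows so quickly. The configuration below (32 layers, 32 KV heads, head dimension 128, fp16) is an illustrative 7B-class setup, not a figure from the article:

# Illustrative KV-cache sizing for a 7B-class decoder (numbers are assumptions).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x for keys and values, stored for every layer, head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch=1)
full_ctx = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8)
print(f"{per_token / 2**20:.2f} MB per token, {full_ctx / 2**30:.1f} GB for batch=8 @ 4k tokens")

At roughly 0.5 MB per token, a single 4 k-token sequence already occupies about 2 GB, and a batch of 8 needs about 16 GB just for the cache.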

Intelligent KV‑Cache Mechanisms

Paged Attention

PagedAttention divides KV memory into fixed‑size pages that can be shared across requests, allocated on demand, and released when no longer needed, reducing internal fragmentation.
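
The bookkeeping can be sketched in a few lines; the block size and the reference-counting scheme below are simplified illustrations, not vLLM's actual implementation.

# Simplified sketch of paged KV-cache bookkeeping (not vLLM's real code).
BLOCK_SIZE = 16  # tokens per page

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        blk = self.free.pop()
        self.refcount[blk] = 1
        return blk

    def share(self, blk):                 # e.g. two requests reusing a common prefix
        self.refcount[blk] += 1
        return blk

    def release(self, blk):
        self.refcount[blk] -= 1
        if self.refcount[blk] == 0:
            self.free.append(blk)         # page returned to the pool

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []             # logical page -> physical page
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:   # allocate a new page only on demand
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1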

Radix‑Tree KV Cache

In computer science, a radix tree (compressed trie) stores prefixes efficiently by merging single‑child nodes.

Organizing KV entries as a radix tree enables fast lookup and cross‑request sharing of common prefixes (e.g., the prefix “ABC” shared by three requests). The algorithm’s runtime is O(n log n), lower than the O(n²) of naïve attention.
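
A minimal sketch of prefix sharing over token IDs follows; it uses a plain trie rather than a true compressed radix tree, and the stored KV "handles" are placeholders.

# Simplified prefix-sharing cache over token IDs (a plain trie, not a full radix tree).
class Node:
    def __init__(self):
        self.children = {}     # token id -> Node
        self.kv_handle = None  # placeholder for the cached K/V of this position

class PrefixCache:
    def __init__(self):
        self.root = Node()

    def match(self, tokens):
        """Return how many leading tokens already have cached K/V."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

    def insert(self, tokens, kv_handles):
        node = self.root
        for t, kv in zip(tokens, kv_handles):
            node = node.children.setdefault(t, Node())
            node.kv_handle = kv

cache = PrefixCache()
cache.insert([1, 2, 3, 7], ["kv"] * 4)       # first request caches its prefix
print(cache.match([1, 2, 3, 9]))             # -> 3 tokens of K/V can be reused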

Compressed Attention (Flash MLA)

DeepSeek's Flash Multi‑head Latent Attention (FlashMLA) projects the K and V matrices into low‑rank latent vectors, stores only the compressed form, and reconstructs them on the fly during attention. This reduces cache size while preserving accuracy.
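
The idea can be sketched with two linear maps; the dimensions and single-head setup below are illustrative assumptions, not DeepSeek's actual configuration. Only the small latent vector is cached, and K/V are re-expanded when attention is computed.

# Illustrative low-rank KV compression in the spirit of MLA (not DeepSeek's exact design).
import torch
import torch.nn as nn

d_model, d_latent, d_head = 1024, 64, 128

down_kv = nn.Linear(d_model, d_latent, bias=False)   # compress: only this output is cached
up_k = nn.Linear(d_latent, d_head, bias=False)       # reconstruct K on the fly
up_v = nn.Linear(d_latent, d_head, bias=False)       # reconstruct V on the fly

h = torch.randn(1, 10, d_model)        # hidden states for 10 tokens
latent = down_kv(h)                    # cached: 64 floats/token instead of 2 x 128

k, v = up_k(latent), up_v(latent)      # re-expanded at attention time
q = torch.randn(1, 10, d_head)
attn = torch.softmax(q @ k.transpose(-1, -2) / d_head**0.5, dim=-1) @ v
print(latent.shape, attn.shape)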

Query‑Aware Sparse Attention (Quest)

MIT's Quest paper shows that attention in many transformer layers becomes highly sparse during inference; some layers approach 100 % sparsity. The authors exploit "query‑aware sparsity" by selecting only the K most relevant blocks of cached keys/values for each query. The three‑stage algorithm is:

Extreme‑value extraction: compute per‑channel min/max of each block’s keys.

Smart matching: use the sign of each query element to pick whichever of the per‑channel min or max maximizes its contribution, yielding an upper bound on each block's attention score.

Top‑K filtering: keep only the K blocks with highest scores.

Experiments find a token budget of K = 4096 optimal, achieving near‑100 % accuracy on PG‑19, passkey retrieval, and most LongBench datasets.
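
A sketch of the block-scoring step follows (the block size, head dimension, and K are made-up values): each block keeps per-channel minima and maxima of its keys, the sign of each query element decides which extreme yields the upper bound on that block's attention score, and only the top-K blocks enter full attention.

# Illustrative query-aware block selection in the spirit of Quest (parameters are made up).
import numpy as np

d, block, n_blocks, top_k = 128, 16, 64, 4
keys = np.random.randn(n_blocks, block, d)
q = np.random.randn(d)

k_min = keys.min(axis=1)                        # per-block, per-channel minima
k_max = keys.max(axis=1)                        # per-block, per-channel maxima

# Upper bound on q . k for any key in the block: pick min or max per channel by q's sign.
upper = np.maximum(q * k_min, q * k_max).sum(axis=1)

chosen = np.argsort(upper)[-top_k:]             # only these blocks enter full attention
print("blocks kept:", sorted(chosen.tolist()))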

Speculative Decoding

Formalized by Google researchers in 2022 and popularized by Andrej Karpathy, speculative decoding runs a small, fast draft model (e.g., 1‑3 B parameters) to predict several upcoming tokens, then lets the large target model verify them in one batched forward pass. Draft tokens are accepted up to the first point where the target model disagrees; from there, computation rolls back to the divergence point and continues with the target model's own token. This technique is used by Gemini and can cut the number of expensive target‑model forward passes dramatically.
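
A greedy-verification sketch follows; real systems verify with rejection sampling to preserve the target model's distribution, and the two toy "models" below are placeholder functions, not actual LLMs.

# Simplified greedy speculative decoding (real systems use rejection sampling).
def speculative_step(prefix, draft_next, target_next_all, k=4):
    # 1) Draft model proposes k tokens autoregressively (cheap).
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) Target model scores prefix + draft in ONE forward pass (expensive but batched).
    #    target_next_all(seq) returns the target's greedy next token at every position.
    target = target_next_all(list(prefix) + draft)

    # 3) Accept draft tokens until the first disagreement, then take the target's token.
    accepted = []
    for i, t in enumerate(draft):
        if t == target[len(prefix) + i - 1]:
            accepted.append(t)
        else:
            break
    accepted.append(target[len(prefix) + len(accepted) - 1])  # target's correction/bonus token
    return accepted

# Toy models: draft guesses "previous + 1"; the target resets to 0 after a multiple of 7.
draft_next = lambda seq: seq[-1] + 1
target_next_all = lambda seq: [0 if x % 7 == 0 else x + 1 for x in seq]
print(speculative_step([1, 2, 3], draft_next, target_next_all))   # all drafts accepted + bonus
print(speculative_step([5, 6, 7], draft_next, target_next_all))   # rollback at the first token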

Weight Scheduling – FlexGen

FlexGen (Stanford, UC Berkeley, CMU) treats inference as a scheduling problem: dynamically load/unload weights between GPU, CPU, and disk, and overlap computation with I/O. Constraints include left‑to‑right execution, batch‑level device affinity, and memory‑capacity limits. FlexGen introduces:

Column‑first scanning of the (batch × layer) compute graph instead of row‑wise scanning, so a layer's weights loaded from CPU or disk are reused across many batches before being evicted.

Three‑stage pipeline: preload the next layer's weights, write back the previous batch's activations/KV cache, and compute the current batch, all overlapped.

Benchmarks show FlexGen achieving 7.32 tokens/s on OPT‑30B on a T4, versus 1.57 tokens/s for DeepSpeed and 0.62 tokens/s for HuggingFace Accelerate.
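
The overlap idea can be shown schematically; load_weights, store_cache, and compute below are hypothetical placeholders standing in for real GPU/CPU/disk transfers, not FlexGen's API.

# Schematic of FlexGen-style compute/I-O overlap (placeholders, not real FlexGen code).
from concurrent.futures import ThreadPoolExecutor

def load_weights(layer):   print(f"  prefetch weights for layer {layer}")
def store_cache(batch):    print(f"  write back KV/activations of batch {batch}")
def compute(layer, batch): print(f"compute layer {layer}, batch {batch}")

n_layers, n_batches = 3, 4
io = ThreadPoolExecutor(max_workers=2)

# Column-first order: reuse one layer's weights across all batches before moving on.
for layer in range(n_layers):
    for batch in range(n_batches):
        futures = []
        if batch == n_batches - 1 and layer + 1 < n_layers:
            futures.append(io.submit(load_weights, layer + 1))   # prefetch next layer
        if batch > 0:
            futures.append(io.submit(store_cache, batch - 1))    # write back previous batch
        compute(layer, batch)                                    # overlaps with the I/O above
        for f in futures:
            f.result()                                           # synchronize before next step
io.shutdown()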

System‑Level Optimizations – FastServe

Most LLM serving stacks (vLLM, Orca) use FCFS, causing head‑of‑line (HOL) blocking where long requests dominate latency. FastServe proposes a multi‑level feedback queue (MLFQ) with Skip‑Join:

Smart pre‑prediction of the first token’s latency.

Automatic priority assignment based on the prediction.

Cache pre‑loading across queues to hide data movement.

This reduces queueing delay from roughly 90 % of total response time to a much smaller share, though very long requests may see higher latency because they are eventually demoted to lower‑priority queues.
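
A toy sketch of skip-join admission into a multi-level feedback queue follows; the queue quanta and the length-based latency estimate are illustrative assumptions, not FastServe's actual policy.

# Toy skip-join multi-level feedback queue (constants and estimator are made up).
import heapq
import itertools

QUANTA = [0.1, 0.4, 1.6, 6.4]          # per-level time slices, in arbitrary units
counter = itertools.count()            # FCFS tie-breaker within a level
queues = [[] for _ in QUANTA]

def estimate_first_iteration(prompt_len):
    return prompt_len * 0.001          # crude proxy: prefill time grows with prompt length

def admit(request_id, prompt_len):
    est = estimate_first_iteration(prompt_len)
    # Skip-join: enter directly at the deepest level whose quantum covers the estimate,
    # instead of starting at the top queue and being demoted step by step.
    level = next((i for i, q in enumerate(QUANTA) if est <= q), len(QUANTA) - 1)
    heapq.heappush(queues[level], (next(counter), request_id))
    return level

def demote(level, request_id):
    # A request that exhausts its quantum moves down one priority level.
    new_level = min(level + 1, len(QUANTA) - 1)
    heapq.heappush(queues[new_level], (next(counter), request_id))
    return new_level

print(admit("short-question", prompt_len=20))    # lands in a high-priority queue
print(admit("long-document", prompt_len=4000))   # skips straight to a lower queue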

Putting the Techniques Together

vLLM is an open‑source library from UC Berkeley that integrates PagedAttention, KV‑cache tricks, speculative decoding, and other optimizations. The minimal Python example below shows how to call a vLLM server (via its OpenAI‑compatible endpoint) with the Qwen2.5‑VL‑7B‑Instruct model to generate image captions.

import base64
from openai import OpenAI

def encode_image(image_path):
    # Read the image and encode it as base64 so it can be sent inline in the request.
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode('utf-8')

# Point the OpenAI client at the vLLM server's OpenAI-compatible endpoint.
client = OpenAI(base_url="http://64.247.196.79:8000/v1", api_key="test")

image_path = "./dog.jpg"
base64_image = encode_image(image_path)

# Send a multimodal chat request: one text part plus one inline base64 image.
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
        ]
    }],
    max_tokens=1024,
)
print(response.choices[0].message.content)

By combining sparsity exploitation, smarter KV‑caching, speculative decoding, and weight‑aware scheduling, practitioners can turn LLM inference from a costly, latency‑bound service into a high‑throughput engine suitable for production.

Tags: LLM, inference optimization, speculative decoding, sparse attention, KV cache, FastServe, FlexGen