Tag

continuous batching

2 posts collected around this technical thread.

Bilibili Tech
Jan 21, 2025 · Artificial Intelligence

Accelerating Large Model Inference: Challenges and Multi‑Level Optimization Strategies

The article outlines how exploding LLM sizes create compute, memory, and latency bottlenecks, then proposes a full‑stack response: operator fusion, high‑performance libraries, quantization, speculative decoding, sharding, continuous batching, PagedAttention, and specialized frameworks such as MindIE‑LLM. Together these dramatically raise inference throughput and cut latency; the article closes with future directions in ultra‑low‑bit quantization and heterogeneous hardware.

Hardware Optimization · Inference Acceleration · continuous batching
0 likes · 21 min read
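Both posts under this tag lean on continuous batching, where finished sequences leave the running batch after every decode step and waiting requests join, instead of the whole batch draining before new work is admitted. A minimal sketch of that scheduling loop, with a hypothetical `step_fn` standing in for one model decode step (names and structure are illustrative, not from either framework):

```python
from collections import deque

def continuous_batching(requests, max_batch=4, step_fn=None):
    """Toy continuous (in-flight) batching loop.

    `step_fn(seq)` is a stand-in for one decode step on one sequence;
    it returns True once that sequence has finished (e.g. emitted EOS).
    """
    waiting = deque(requests)
    running, finished = [], []
    while waiting or running:
        # Iteration-level scheduling: admit new requests up to the batch limit.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step for every running sequence.
        done = [seq for seq in running if step_fn(seq)]
        # Retire finished sequences immediately, freeing batch slots.
        for seq in done:
            running.remove(seq)
            finished.append(seq)
    return finished
```

With a fake `step_fn` that counts down a per-request token budget, short requests finish and vacate their slots while long ones are still decoding, which is the throughput win the articles describe.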
DataFunSummit
Dec 4, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference with the YiNian LLM Framework

This article presents the YiNian LLM framework, detailing how KVCache, prefill/decode separation, continuous batching, PagedAttention, and multi‑hardware scheduling speed up large language model inference while managing GPU memory and latency.

AI acceleration · GPU · KVCache
0 likes · 20 min read
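The abstract's pairing of KVCache with PagedAttention refers to block-based cache management: each sequence's key/value cache grows in fixed-size blocks drawn from a shared pool, so memory is not reserved for the maximum sequence length up front. A toy sketch of the bookkeeping (class and method names are illustrative, not the YiNian API):

```python
class PagedKVCache:
    """Toy block-table bookkeeping in the spirit of PagedAttention.

    Physical blocks come from one shared free pool; each sequence owns
    a small block table mapping its logical cache to physical blocks.
    """

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # shared physical pool
        self.block_tables = {}                      # seq_id -> [block ids]
        self.lengths = {}                           # seq_id -> tokens cached

    def append_token(self, seq_id):
        """Reserve cache space for one more token of `seq_id`."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:           # current block full: map a new one
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks return to the pool the moment a sequence finishes, this pairs naturally with continuous batching: freed cache capacity is immediately available to the next admitted request.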