BladeLLM: Ultra‑Long Context LLM Inference via RaggedAttention & AutoTuner

BladeLLM, Alibaba Cloud's large-model inference engine, pushes the limits of LLM serving by supporting ultra-long contexts of up to 70K tokens, leveraging a novel RaggedAttention mechanism and a DNN-based AutoTuner to deliver superior performance, memory efficiency, and low-latency inference across diverse workloads.


Background

Long-context capability is becoming essential for large language models (LLMs), as it enables applications such as personalized chatbots, literary generation, and document summarization. Existing inference engines struggle with the memory and compute demands of very long sequences.

BladeLLM Overview

BladeLLM is Alibaba Cloud's LLM inference platform that aims to provide high-performance, low-cost LLM services. It optimizes the full inference stack and, notably, extends the maximum supported context length far beyond typical limits.

Technical Solutions

RaggedAttention

Inspired by TensorFlow's RaggedTensor, RaggedAttention stores each sequence's key/value cache with variable length while keeping each sequence's cache contiguous in memory. This design improves memory-access efficiency compared with PagedAttention, at the cost of slightly higher fragmentation, offering a practical trade-off for diverse workloads. A sketch of the layout follows.
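To make the layout concrete, here is a minimal, hypothetical sketch of a ragged KV cache in NumPy. The class and method names are illustrative, not BladeLLM's actual implementation; the article does not publish its code. The key idea mirrors a RaggedTensor: one flat buffer plus row offsets, so each sequence's cache is a contiguous slice.

```python
import numpy as np

# Hypothetical sketch of a ragged KV cache (keys only; values are analogous).
# All sequences share one flat buffer; row_splits[i]:row_splits[i+1] delimits
# sequence i, so each sequence's cache stays contiguous in memory.
class RaggedKVCache:
    def __init__(self, capacity_tokens, num_heads, head_dim, dtype=np.float16):
        self.buf = np.zeros((capacity_tokens, num_heads, head_dim), dtype=dtype)
        self.row_splits = [0]  # prefix offsets into buf, one per stored sequence

    def append_sequence(self, keys):
        """Store a new sequence's keys in the next contiguous region."""
        start = self.row_splits[-1]
        end = start + len(keys)
        # A real engine would evict, compact, or reserve headroom for decode steps;
        # this toy version simply fails when the buffer is full.
        assert end <= len(self.buf), "cache exhausted"
        self.buf[start:end] = keys
        self.row_splits.append(end)

    def sequence_view(self, i):
        """Zero-copy, contiguous view of sequence i's cached keys."""
        return self.buf[self.row_splits[i]:self.row_splits[i + 1]]

cache = RaggedKVCache(capacity_tokens=1024, num_heads=8, head_dim=64)
cache.append_sequence(np.random.rand(100, 8, 64).astype(np.float16))  # 100-token prompt
cache.append_sequence(np.random.rand(37, 8, 64).astype(np.float16))   # 37-token prompt
print(cache.sequence_view(0).shape)  # (100, 8, 64), contiguous
```

Because each sequence occupies one unbroken slice, an attention kernel can stream its K/V with unit-stride reads, which is the memory-access advantage over page-granular layouts; the cost is the fragmentation the article notes, since variable-length regions are harder to pack tightly.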

DNN-based AutoTuner

In dynamic-shape inference scenarios, BladeLLM replaces costly runtime tuning with a deep-neural-network predictor that selects the optimal kernel schedule without actual measurement. The predictor achieves 99.39% of the performance of exhaustive tuning while reducing prediction latency to ~2 µs and using only a single CPU core, avoiding interference with GPU inference.
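As an illustration of the idea only (the article does not detail the predictor's architecture or features), the sketch below scores candidate kernel schedules with a tiny MLP and picks the cheapest-predicted one. The weights, feature choices, and schedule parameters are all made up for the example.

```python
import numpy as np

# Toy "pretrained" weights for a 2-layer MLP: features -> hidden -> predicted latency.
# In a real system these would be trained offline on measured kernel timings.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(6, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 1)), np.zeros(1)

def predict_latency(shape_feats, sched_feats):
    x = np.concatenate([shape_feats, sched_feats])  # e.g. log GEMM dims + tiling params
    h = np.maximum(x @ W1 + b1, 0.0)                # ReLU hidden layer
    return (h @ W2 + b2).item()                     # predicted latency, arbitrary units

def select_schedule(shape_feats, candidate_schedules):
    """Pick the schedule with the lowest predicted latency; no GPU measurement needed."""
    return min(candidate_schedules, key=lambda s: predict_latency(shape_feats, s))

# A dynamic-shape GEMM of size (batch*seq, hidden) arriving at runtime:
shape = np.log1p([512, 4096, 4096])  # log features tame the dynamic range
candidates = [np.array([t_m, t_n, stages], dtype=float)
              for t_m in (64, 128) for t_n in (64, 128) for stages in (2, 4)]
print("chosen schedule (tile_m, tile_n, stages):", select_schedule(shape, candidates))
```

The design point this illustrates is the latency budget: one small forward pass on a single CPU core replaces compiling and timing every candidate on the GPU, which is what makes per-request, dynamic-shape tuning affordable.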

Performance Results

Benchmarks show BladeLLM supporting contexts up to 70K tokens (and potentially 280K with KV-cache quantization) while maintaining low latency. Compared with other systems (LMDeploy, vLLM, Hugging Face's Llama implementation, LightLLM), BladeLLM delivers higher throughput and avoids the hangs and out-of-memory failures those systems exhibit at long context lengths.
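The roughly 4x jump from 70K to 280K tokens is consistent with shrinking KV-cache elements from 16-bit to 4-bit. A back-of-the-envelope check, assuming Llama-2-13B's published configuration (40 layers, 40 attention heads, head dimension 128) and ignoring quantization metadata such as scales:

```python
# Rough KV-cache sizing for Llama-2-13B; a sketch, not BladeLLM's actual accounting.
LAYERS, HEADS, HEAD_DIM = 40, 40, 128

def kv_bytes_per_token(bytes_per_elem):
    # 2 tensors (K and V) per layer, each (heads, head_dim) per token.
    return 2 * LAYERS * HEADS * HEAD_DIM * bytes_per_elem

fp16 = kv_bytes_per_token(2)    # 819,200 B, ~0.78 MiB per token
int4 = kv_bytes_per_token(0.5)  # 204,800 B, ~0.20 MiB per token

budget = 70_000 * fp16          # memory for a 70K-token context at FP16, ~53.4 GiB
print(f"FP16 KV per token: {fp16/2**20:.2f} MiB; 70K-token cache: {budget/2**30:.1f} GiB")
print(f"Same budget at INT4 fits ~{budget // int4:,.0f} tokens")  # ~280,000
```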

The accompanying figures show how token-generation time grows with context length for Llama-2-13B, and the AutoTuner's performance advantage over PyTorch, TorchScript, and ONNX Runtime.

Conclusion

Ultra‑long context is a critical trend for LLMs, and BladeLLM’s RaggedAttention and DNN‑based AutoTuner provide a scalable solution that balances memory efficiency and compute speed. Future work will continue to explore quantization, multi‑turn dialogue, and kernel optimizations.

Tags: long context, LLM inference, AI infrastructure, AutoTuner, RaggedAttention
Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
