Tagged articles
13 articles
Machine Learning Algorithms & Natural Language Processing
Apr 21, 2026 · Artificial Intelligence

Can Linear Attention Complete Prefill-as-a-Service for Cross‑Datacenter Heterogeneous PD Separation?

The article analyzes why the massive KVCache bandwidth demand of heterogeneous prefill/decode (PD) separation cannot be solved at the system level, proposes a Prefill‑as‑a‑Service architecture that leverages linear‑attention models to cut KVCache generation, and validates the design with a 1‑trillion‑parameter Kimi Linear deployment that achieves 54% higher throughput and 64% lower P90 TTFT across a 100 Gbps inter‑datacenter link.

Heterogeneous PD · Inference Optimization · KVCache
0 likes · 7 min read
Alibaba Cloud Developer
Jan 15, 2026 · Artificial Intelligence

How Hierarchical Sparse Attention Breaks KVCache Limits for Ultra‑Long Context LLMs

This article explains how a hierarchical sparse‑attention framework redesigns KVCache storage across GPU, CPU, and remote memory, eliminates bandwidth and capacity bottlenecks, and enables efficient inference for 128K‑token and larger contexts with dramatically reduced GPU memory usage and higher throughput.

Dynamic Sparse Attention · GPU memory optimization · Hierarchical Storage
0 likes · 20 min read
Alibaba Cloud Developer
Jan 6, 2026 · Artificial Intelligence

How Tair‑KVCache‑HiSim Simulates LLM Inference 390 000× Faster with <5% Error

This article explains the design, challenges, and high‑fidelity architecture of Tair‑KVCache‑HiSim, a simulation tool that models multi‑level KV‑Cache behavior for large‑language‑model inference, predicts latency, throughput and cost under SLO constraints, and validates its predictions against real GPU deployments with sub‑5% error.

AI Infrastructure · Cost Optimization · KVCache
0 likes · 32 min read
Alibaba Cloud Developer
Dec 29, 2025 · Artificial Intelligence

How Alibaba’s Tair KVCache Manager Revolutionizes Enterprise‑Level LLM Cache Management

This article details the architecture and implementation of Tair KVCache Manager, an enterprise‑grade service that centralises KVCache metadata, decouples inference engines from storage, provides elastic scaling, multi‑tenant isolation, high availability, and performance‑optimised cache management for large‑scale LLM inference workloads.

Cache Management · KVCache · LLM
0 likes · 28 min read
Alibaba Cloud Developer
Dec 24, 2025 · Artificial Intelligence

Boosting LLM Inference: RoleBasedGroup & Mooncake for Stable, High‑Performance Service

Large language model inference faces memory pressure. By externalizing KVCache with Mooncake and orchestrating roles via the Kubernetes‑native RoleBasedGroup (RBG), developers can achieve stable, high‑throughput, cost‑effective serving with seamless in‑place upgrades and topology‑aware performance.

AI Infrastructure · KVCache · Kubernetes
0 likes · 21 min read
Alibaba Cloud Developer
Dec 23, 2025 · Artificial Intelligence

How Hybrid Transformer‑Mamba Architectures Overcome KVCache Challenges in Large‑Model Inference

This article explains how SGLang’s hybrid model design combines Transformer attention with Mamba state‑space layers, introduces a dual‑pool memory architecture and elastic allocation, and presents specialized prefix‑cache and speculative‑decoding techniques that together enable efficient, scalable inference for long‑context large language models.

Inference Optimization · KVCache · SGLang
0 likes · 22 min read
Alibaba Cloud Developer
Dec 17, 2025 · Cloud Native

How 3FS Powers High‑Performance KVCache for AI Inference: Architecture, Optimizations, and Cloud‑Native Deployment

This article details the design and engineering of the 3FS distributed file system as a scalable KVCache backend for large‑language‑model inference, covering its architecture, performance tuning, reliability fixes, integration with SGLang/vLLM, and cloud‑native Kubernetes operator deployment.

3FS · AI inference · Cloud Native
0 likes · 30 min read
Volcano Engine Developer Services
Jul 17, 2025 · Artificial Intelligence

How Distributed KVCache (EIC) Revolutionizes Large‑Model Inference Performance

This article examines how Volcano Engine's Elastic Instant Cache (EIC) tackles the memory bottleneck, high‑concurrency latency, and cross‑node coordination challenges of large language model inference by decoupling storage and computation, pooling resources, and applying layered optimizations, ultimately boosting AI inference efficiency, scalability, and cost‑effectiveness across various deployment scenarios.

AI Infrastructure · KVCache · LLM inference
0 likes · 30 min read
Alibaba Cloud Infrastructure
May 14, 2025 · Artificial Intelligence

How Mooncake’s KVCache Boosts Large‑Model Inference Efficiency and Cost

Mooncake, an open‑source large‑model inference platform, introduces a KVCache‑centric architecture that dramatically improves throughput, reduces latency and cuts inference costs by up to 20%, while integrating with frameworks like SGLang and vLLM and leveraging Alibaba Cloud’s eRDMA and GPUDirect technologies for scalable, high‑performance deployments.

AI Performance · Alibaba Cloud · Distributed Systems
0 likes · 7 min read
DataFunSummit
Dec 4, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference with the YiNian LLM Framework

This article presents the YiNian LLM framework, detailing how KVCache, prefill/decode separation, continuous batching, PagedAttention, and multi‑hardware scheduling are used to speed up large language model inference while managing GPU memory and latency.

AI acceleration · Continuous Batching · GPU
0 likes · 20 min read
Alibaba Cloud Infrastructure
Nov 29, 2024 · Artificial Intelligence

Mooncake: Open-Source KVCache-Centric Large Model Inference Architecture Co-Developed by Alibaba Cloud and Tsinghua University

In June 2024, Alibaba Cloud and Tsinghua University's MADSys Lab announced the open‑source Mooncake architecture, a KVCache‑centered large‑model inference framework that boosts throughput, lowers cost, and standardizes resource‑pooling techniques for high‑performance AI inference across industry and academia.

KVCache · Tsinghua University · large-model inference
0 likes · 4 min read
Alibaba Cloud Developer
Nov 28, 2024 · Artificial Intelligence

Mooncake: Open-Source KVCache-Centric Architecture Boosting Large-Model Inference

Mooncake, an open-source KVCache-centric inference architecture co-developed by Alibaba Cloud and Tsinghua University's MADSys lab, dramatically improves large-model throughput and reduces cost by decoupling resources, standardizing cache pooling, and integrating with frameworks like vLLM, sparking broad industry interest.

AI Infrastructure · KVCache · large language models
0 likes · 4 min read
Architect
Jul 2, 2024 · Artificial Intelligence

Mooncake: A Separated Architecture for Large‑Language‑Model Inference

The article presents Mooncake, a split‑architecture inference platform for the Kimi LLM assistant, detailing its three elastic resource pools, the rationale for using Time‑Between‑Tokens over TPOT, and design choices for Prefill, KVCache, and Decode stages to improve latency and throughput.

AI systems · Decode · KVCache
0 likes · 9 min read