Old Zhang's AI Learning
Apr 26, 2026 · Artificial Intelligence

Why Deploying DeepSeek‑V4 Locally with vLLM Is So Challenging

The article dissects DeepSeek‑V4's local deployment using vLLM, explaining the steep hardware requirements, the complex heterogeneous KV‑cache architecture, and the aggressive kernel‑fusion and multi‑stream optimizations that together make long‑context inference both memory‑intensive and engineering‑heavy.

DeepSeek V4 · GPU memory · KV cache
15 min read
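For a concrete starting point, here is a minimal vLLM offline-inference sketch; the model ID, GPU count, and context cap below are illustrative assumptions, not a configuration tested in the article.

```python
# Minimal vLLM offline-inference sketch (illustrative assumptions only).
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4",   # hypothetical model ID
    tensor_parallel_size=8,            # shard weights across 8 GPUs
    max_model_len=131072,              # cap context to bound KV-cache growth
    gpu_memory_utilization=0.90,       # fraction of VRAM vLLM may reserve
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV-cache paging in one paragraph."], params)
print(outputs[0].outputs[0].text)
```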
Architect
Apr 25, 2026 · Artificial Intelligence

DeepSeek V4: 1M‑Token Context’s Impact on Model, Inference, Cache & Agents

The DeepSeek V4 technical report shows how a 1 million‑token context forces a redesign of attention, KV‑cache, optimizer, quantization and inference budgeting, turning long‑context capability from a costly showcase into a production‑ready feature for agents, search and Chinese professional tasks.

1M context · DeepSeek · KV cache
28 min read
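To see why a 1M-token context reshapes inference budgeting, a back-of-the-envelope KV-cache sizing sketch; the layer, head, and dtype figures are illustrative assumptions, not DeepSeek V4's published configuration.

```python
# Back-of-the-envelope KV-cache sizing for a 1M-token request.
# Layer/head/dtype numbers below are illustrative, not DeepSeek V4's config.
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values, stored per layer, per head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

gib = kv_cache_bytes(seq_len=1_000_000, n_layers=60,
                     n_kv_heads=8, head_dim=128) / 2**30
print(f"~{gib:.0f} GiB of KV cache for one 1M-token request")
# With these numbers: 2*60*8*128*2 bytes ≈ 240 KiB per token, i.e. roughly
# 229 GiB per request -- which is why 1M context forces cache compression,
# offloading, or MLA-style architectural changes.
```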
AI Tech Publishing
Apr 20, 2026 · Artificial Intelligence

How Claude Code Achieves 92% Prompt Cache Hit Rate and Cuts Costs by 81% – A Deep Dive

This article explains the mechanics of prompt caching for large language models, breaks down static versus dynamic context, details KV‑cache operation and its pricing, and shows how a 30‑minute Claude Code programming session reached a 92% cache hit rate that reduced inference costs by 81%, concluding with three production‑grade design rules.

AI agents · Anthropic API · Claude Code
13 min read
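As a hedged sketch of the caching pattern the article describes, here is Anthropic-style prompt caching with the static prefix behind a cache breakpoint; the model ID and prompt text are placeholders.

```python
# Sketch of Anthropic prompt caching: mark the large, stable prefix
# (system prompt, tool specs) as cacheable so only the dynamic tail is
# re-processed each turn. Model ID and text are placeholders; in practice
# the cached prefix must exceed a minimum token length to be cached.
import anthropic

STATIC_SYSTEM_PROMPT = "You are a coding agent. <tool specs, style guide, ...>"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",            # placeholder model ID
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": STATIC_SYSTEM_PROMPT,            # large, stable prefix
        "cache_control": {"type": "ephemeral"},  # cache breakpoint
    }],
    messages=[{"role": "user", "content": "Refactor utils.py for clarity."}],
)
# Comparing usage.cache_read_input_tokens with usage.input_tokens
# reveals the effective hit rate.
print(response.usage)
```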
Geek Labs
Apr 20, 2026 · Artificial Intelligence

A Complete Open‑Source Guide to LLM Internals: From Tokenization to Inference Optimization

This open‑source tutorial breaks down large language model internals into 11 detailed topics—covering BPE tokenization, attention mathematics, backpropagation, transformer architecture, KV‑Cache, Paged and Flash Attention, and frontier techniques—each with numeric derivations and Python code, making it ideal for developers and interview preparation.

Attention · Flash Attention · KV cache
5 min read
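In the spirit of the tutorial's numeric derivations, a self-contained numpy sketch of single-head scaled dot-product attention; the shapes are arbitrary.

```python
# Single-head scaled dot-product attention in plain numpy, mirroring the
# kind of numeric walkthrough the tutorial provides. Shapes are arbitrary.
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity logits
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = scores / scores.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # weighted mix of value rows

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8)
```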
Old Zhang's AI Learning
Apr 11, 2026 · Artificial Intelligence

Mastering SGLang: KV Cache and RadixAttention for Faster LLM Inference

This article reviews the DeepLearning.ai short course on SGLang, explains why large‑language‑model inference is slow, details how the KV Cache cuts per‑token attention computation from O(n²) to O(n), introduces RadixAttention for cross‑request caching, and presents code examples and benchmark results showing up to 10× speedup in real‑world RAG scenarios.

KV cache · LLM inference · Performance optimization
13 min read
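A toy sketch of the cross-request prefix-reuse idea behind RadixAttention; this is a didactic trie, not SGLang's actual radix-tree implementation.

```python
# Toy illustration of the RadixAttention idea: index cached KV blocks by
# token prefix so a new request reuses the longest cached prefix.
class PrefixCache:
    def __init__(self):
        self.root = {}          # trie: token_id -> (child_trie, kv_block)

    def insert(self, tokens, kv_blocks):
        node = self.root
        for tok, kv in zip(tokens, kv_blocks):
            node = node.setdefault(tok, ({}, kv))[0]

    def longest_prefix(self, tokens):
        node, reused = self.root, []
        for tok in tokens:
            if tok not in node:
                break
            child, kv = node[tok]
            reused.append(kv)   # this KV block need not be recomputed
            node = child
        return reused

cache = PrefixCache()
cache.insert([1, 2, 3], ["kv1", "kv2", "kv3"])  # request A's prompt
print(len(cache.longest_prefix([1, 2, 9])))     # request B reuses 2 blocks
```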
AI Programming Lab
Apr 5, 2026 · Artificial Intelligence

Do You Really Understand Tokens? A Deep Dive Starting from a Claude Code Session

The article explains what tokens are, how different models tokenize text, the role of token embeddings, positional encoding, self‑attention, KV cache, and why output tokens cost far more than input tokens, while also covering pricing differences and prompt‑caching savings across major LLM providers.

KV cache · LLM pricing · Large Language Model
13 min read
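A quick way to see tokenization concretely, using OpenAI's tiktoken as one example; other providers' tokenizers split the same text differently.

```python
# Inspect how a BPE tokenizer splits text, using tiktoken as one
# concrete example of the behavior the article discusses.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("KV cache cuts decode-time recomputation.")
pieces = [enc.decode_single_token_bytes(i) for i in ids]
print(ids)      # token IDs the model actually consumes
print(pieces)   # the byte chunks each ID stands for
```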
AI Tech Publishing
Apr 5, 2026 · Artificial Intelligence

Why the First Token Is Slow: A Deep Dive into KV Cache for LLM Inference

The article explains how KV cache eliminates redundant computations in autoregressive LLM generation, detailing the attention mechanism, the O(n²) waste of recomputing K and V, the cache‑based solution, its impact on time‑to‑first‑token, and the memory‑vs‑speed trade‑off.

Attention · KV cache · LLM
7 min read
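A minimal numpy sketch of the cache-based solution the article describes: each decode step projects only the newest token and appends its K/V; weights and dimensions are toy values.

```python
# Decode-time KV caching in miniature: per step, project only the new
# token and append its K/V instead of re-projecting the whole prefix.
import numpy as np

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
K_cache, V_cache = [], []

def decode_step(x_new):
    q = x_new @ Wq
    K_cache.append(x_new @ Wk)   # O(1) new projection work per step...
    V_cache.append(x_new @ Wv)   # ...instead of recomputing all K/V
    K, V = np.stack(K_cache), np.stack(V_cache)
    w = np.exp(q @ K.T / np.sqrt(d))
    return (w / w.sum()) @ V     # attend over everything seen so far

for _ in range(5):
    out = decode_step(rng.normal(size=d))
print(out.shape)  # (8,)
```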
ShiZhen AI
Apr 2, 2026 · Artificial Intelligence

How KV Cache Works and Why Large Model Outputs Cost Five Times More Than Inputs

The article explains the KV Cache mechanism that stores previously computed key/value vectors to avoid redundant Transformer calculations, delivering roughly a 5× speedup, while also detailing why generating output tokens is far more expensive than processing input tokens due to serial generation and memory trade‑offs.

KV cache · LLM inference · Prefill
9 min read
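The arithmetic behind the speedup claim, in a few lines; the token count is illustrative.

```python
# Without a KV cache, step t re-projects K/V for all t previous tokens;
# with the cache, each step projects exactly one token.
n = 1024  # generated tokens (illustrative)

kv_projections_without_cache = sum(t for t in range(1, n + 1))  # n(n+1)/2
kv_projections_with_cache = n

print(kv_projections_without_cache)                              # 524800
print(kv_projections_without_cache / kv_projections_with_cache)  # ~512x fewer
# The end-to-end wall-clock gain is far smaller (the article cites ~5x)
# because attention is only part of each step and decoding remains serial
# and memory-bandwidth-bound.
```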
SuanNi
Mar 26, 2026 · Artificial Intelligence

TurboQuant: Google’s 6× KV Cache Compression With Zero Accuracy Loss

TurboQuant, a new technique from Google Research, dramatically compresses key‑value caches by up to six times without precision loss, using PolarQuant and QJL algorithms to transform vectors into polar coordinates and apply quantized Johnson‑Lindenstrauss transforms, thereby boosting inference speed and enabling longer context handling for large language models.

AI compression · KV cache · Performance
13 min read
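A toy illustration of polar-coordinate quantization in 2-D, the general idea the article attributes to PolarQuant; this is emphatically not TurboQuant's actual algorithm.

```python
# Toy polar-coordinate quantization: store a 2-D slice of a vector as a
# low-bit (radius, angle) pair. Didactic only -- not TurboQuant.
import numpy as np

def quantize_polar(xy, r_bits=4, theta_bits=4, r_max=4.0):
    r = min(np.hypot(*xy), r_max)
    theta = np.arctan2(xy[1], xy[0])             # angle in [-pi, pi]
    r_q = round(r / r_max * (2**r_bits - 1))     # radius -> 4-bit code
    t_q = round((theta + np.pi) / (2 * np.pi) * (2**theta_bits - 1))
    return r_q, t_q                              # 8 bits vs 64 for an fp32 pair

def dequantize_polar(r_q, t_q, r_bits=4, theta_bits=4, r_max=4.0):
    r = r_q / (2**r_bits - 1) * r_max
    theta = t_q / (2**theta_bits - 1) * 2 * np.pi - np.pi
    return np.array([r * np.cos(theta), r * np.sin(theta)])

x = np.array([1.2, -0.7])
print(x, "->", dequantize_polar(*quantize_polar(x)))  # close, at 8x compression
```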
MaGe Linux Operations
Mar 10, 2026 · Artificial Intelligence

Why Your LLM Service Hits CUDA OOM and How to Diagnose GPU Memory Issues

This guide explains the five common sources of GPU memory consumption in large‑model inference services, walks through a step‑by‑step diagnosis workflow, from static usage and KV‑Cache analysis to concurrency and K8s scheduling, and offers concrete command‑line checks, scripts, configuration examples, and actionable remediation and monitoring recommendations.

GPU memory · KV cache · LLM OOM
28 min read
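As one concrete starting point for the static-usage step, a PyTorch sketch that reads the CUDA allocator's counters from inside the serving process:

```python
# Read PyTorch's CUDA allocator counters to separate live-tensor usage
# from allocator reservations -- a first step in OOM diagnosis.
import torch

def report_gpu_memory(device=0):
    alloc = torch.cuda.memory_allocated(device) / 2**30    # live tensors
    reserved = torch.cuda.memory_reserved(device) / 2**30  # allocator pool
    peak = torch.cuda.max_memory_allocated(device) / 2**30
    total = torch.cuda.get_device_properties(device).total_memory / 2**30
    print(f"allocated {alloc:.2f} GiB | reserved {reserved:.2f} GiB | "
          f"peak {peak:.2f} GiB | device total {total:.2f} GiB")
    # A large reserved-vs-allocated gap suggests fragmentation; a peak
    # near device total under load points at KV cache or batch sizing.

if torch.cuda.is_available():
    report_gpu_memory()
```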
AI Explorer
Mar 3, 2026 · Artificial Intelligence

How LMCache’s Lightning‑Fast KV Cache Slashes LLM First‑Token Latency

LMCache separates the KV cache from a vLLM instance into a shared service, dramatically cutting first‑token latency for repeated text, enabling multiple GPU instances to reuse cached vectors, improving hardware utilization, and supporting use cases such as long‑document QA, multi‑GPU load balancing, and prompt‑engineering, with a quick Docker‑based demo.

Docker · KV cache · LLM inference
6 min read
Machine Learning Algorithms & Natural Language Processing
Feb 28, 2026 · Artificial Intelligence

How DualPath Revives Idle Network Cards to Break Long‑Context I/O Bottlenecks in DeepSeek V4

The article analyzes the KV‑Cache storage I/O bottleneck that limits agentic LLM inference, introduces the DualPath architecture with a storage‑to‑decode data path and RDMA‑based scheduling, and shows up to 1.87× offline and 1.96× online throughput gains on large‑scale GPU clusters.

DeepSeek · DualPath · KV cache
13 min read
Machine Learning Algorithms & Natural Language Processing
Feb 27, 2026 · Artificial Intelligence

Can DeepSeek’s DualPath Break GPU Bottlenecks and Ignite an Agentic AI Surge?

DeepSeek’s new DualPath inference framework, co‑developed with leading Chinese universities, decouples compute from KV‑Cache memory access to eliminate I/O stalls in multi‑round agentic workloads, delivering up to nearly 2× higher throughput and dramatically reducing job‑completion time across several large‑scale LLMs.

AI infrastructure · Agentic Inference · DeepSeek
13 min read
Alibaba Cloud Developer
Jan 26, 2026 · Artificial Intelligence

How We Scaled a 3.5B MoE LLM for Real‑Time Search Relevance

This article details the engineering challenges and solutions for deploying a 3.5 billion‑parameter MoE LLM in Taobao's search relevance pipeline, covering large‑batch scheduling, dynamic load balancing, intra‑batch KV‑Cache reuse, and MoE kernel tuning to meet sub‑second latency requirements.

KV cache · LLM · Load Balancing
15 min read
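A didactic sketch of the intra-batch KV-cache reuse idea as the article frames it: (query, item) pairs sharing a query can reuse one prefill. The grouping logic and names are assumptions, not Taobao's code.

```python
# Many (query, item) pairs in one relevance batch share the same query
# text, so the query's KV blocks can be prefilled once per group.
from collections import defaultdict

def group_by_shared_prefix(pairs):
    groups = defaultdict(list)
    for query, item in pairs:
        groups[query].append(item)   # all items scored under one query
    return groups

batch = [("red dress", "item_17"), ("red dress", "item_42"),
         ("usb-c hub", "item_03"), ("red dress", "item_99")]

for query, items in group_by_shared_prefix(batch).items():
    # prefill KV for `query` once, then score each item against it
    print(f"prefill '{query}' once -> reuse for {len(items)} items")
```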
PaperAgent
Jan 21, 2026 · Artificial Intelligence

Inside DeepSeek’s FlashMLA Update: What’s New in the MODEL1 Architecture

DeepSeek’s recent FlashMLA update introduces the new MODEL1, featuring a tighter KV-Cache layout, an extra two-stage cache, and a fixed 512×512 head dimension, with four code changes detailed in a public GitHub commit and illustrated by comparative diagrams.

AI Architecture · DeepSeek · FlashMLA
3 min read