Old Zhang's AI Learning
Apr 26, 2026 · Artificial Intelligence

Why Deploying DeepSeek‑V4 Locally with vLLM Is So Challenging

The article dissects DeepSeek‑V4's local deployment using vLLM, explaining the steep hardware requirements, the complex heterogeneous KV‑cache architecture, and the aggressive kernel‑fusion and multi‑stream optimizations that together make long‑context inference both memory‑intensive and engineering‑heavy.

DeepSeek V4 · GPU memory · KV cache
15 min read
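For a concrete starting point, here is a minimal vLLM offline-inference sketch; the model ID, GPU count, and context cap below are illustrative assumptions, not a configuration tested in the article.

```python
# Minimal vLLM offline-inference sketch (illustrative assumptions only).
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4",   # hypothetical model ID
    tensor_parallel_size=8,            # shard weights across 8 GPUs
    max_model_len=131072,              # cap context to bound KV-cache growth
    gpu_memory_utilization=0.90,       # fraction of VRAM vLLM may reserve
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV-cache paging in one paragraph."], params)
print(outputs[0].outputs[0].text)
```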
Architect
Apr 25, 2026 · Artificial Intelligence

DeepSeek V4: 1M‑Token Context’s Impact on Model, Inference, Cache & Agents

The DeepSeek V4 technical report shows how a 1 million‑token context forces a redesign of attention, KV‑cache, optimizer, quantization and inference budgeting, turning long‑context capability from a costly showcase into a production‑ready feature for agents, search and Chinese professional tasks.

1M context · DeepSeek · KV cache
28 min read
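To see why a 1M-token context reshapes inference budgeting, a back-of-the-envelope KV-cache sizing sketch; the layer, head, and dtype figures are illustrative assumptions, not DeepSeek V4's published configuration.

```python
# Back-of-the-envelope KV-cache sizing for a 1M-token request.
# Layer/head/dtype numbers below are illustrative, not DeepSeek V4's config.
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values, stored per layer, per head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

gib = kv_cache_bytes(seq_len=1_000_000, n_layers=60,
                     n_kv_heads=8, head_dim=128) / 2**30
print(f"~{gib:.0f} GiB of KV cache for one 1M-token request")
# With these numbers: 2*60*8*128*2 bytes ≈ 240 KiB per token, i.e. roughly
# 229 GiB per request -- which is why 1M context forces cache compression,
# offloading, or MLA-style architectural changes.
```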
AI Tech Publishing
Apr 20, 2026 · Artificial Intelligence

How Claude Code Achieves 92% Prompt Cache Hit Rate and Cuts Costs by 81% – A Deep Dive

This article explains the mechanics of prompt caching for large language models, breaks down static versus dynamic context, details KV‑cache operation and its pricing, and shows how a 30‑minute Claude Code programming session reached a 92% cache hit rate that reduced inference costs by 81%, concluding with three production‑grade design rules.

AI agents · Anthropic API · Claude Code
13 min read
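As a hedged sketch of the caching pattern the article describes, here is Anthropic-style prompt caching with the static prefix behind a cache breakpoint; the model ID and prompt text are placeholders.

```python
# Sketch of Anthropic prompt caching: mark the large, stable prefix
# (system prompt, tool specs) as cacheable so only the dynamic tail is
# re-processed each turn. Model ID and text are placeholders; in practice
# the cached prefix must exceed a minimum token length to be cached.
import anthropic

STATIC_SYSTEM_PROMPT = "You are a coding agent. <tool specs, style guide, ...>"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",            # placeholder model ID
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": STATIC_SYSTEM_PROMPT,            # large, stable prefix
        "cache_control": {"type": "ephemeral"},  # cache breakpoint
    }],
    messages=[{"role": "user", "content": "Refactor utils.py for clarity."}],
)
# Comparing usage.cache_read_input_tokens with usage.input_tokens
# reveals the effective hit rate.
print(response.usage)
```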
Geek Labs
Apr 20, 2026 · Artificial Intelligence

A Complete Open‑Source Guide to LLM Internals: From Tokenization to Inference Optimization

This open‑source tutorial breaks down large language model internals into 11 detailed topics—covering BPE tokenization, attention mathematics, backpropagation, transformer architecture, KV‑Cache, Paged and Flash Attention, and frontier techniques—each with numeric derivations and Python code, making it ideal for developers and interview preparation.

Attention · Flash Attention · KV cache
5 min read
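In the spirit of the tutorial's numeric derivations, a self-contained numpy sketch of single-head scaled dot-product attention; the shapes are arbitrary.

```python
# Single-head scaled dot-product attention in plain numpy, mirroring the
# kind of numeric walkthrough the tutorial provides. Shapes are arbitrary.
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity logits
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = scores / scores.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # weighted mix of value rows

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8)
```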
Old Zhang's AI Learning
Apr 11, 2026 · Artificial Intelligence

Mastering SGLang: KV Cache and RadixAttention for Faster LLM Inference

This article reviews the DeepLearning.ai short course on SGLang, explains why large‑language‑model inference is slow, details how the KV Cache cuts per‑token attention computation from O(n²) to O(n), introduces RadixAttention for cross‑request caching, and presents code examples and benchmark results showing up to 10× speedup in real‑world RAG scenarios.

KV cache · LLM inference · Performance optimization
13 min read
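A toy sketch of the cross-request prefix-reuse idea behind RadixAttention; this is a didactic trie, not SGLang's actual radix-tree implementation.

```python
# Toy illustration of the RadixAttention idea: index cached KV blocks by
# token prefix so a new request reuses the longest cached prefix.
class PrefixCache:
    def __init__(self):
        self.root = {}          # trie: token_id -> (child_trie, kv_block)

    def insert(self, tokens, kv_blocks):
        node = self.root
        for tok, kv in zip(tokens, kv_blocks):
            node = node.setdefault(tok, ({}, kv))[0]

    def longest_prefix(self, tokens):
        node, reused = self.root, []
        for tok in tokens:
            if tok not in node:
                break
            child, kv = node[tok]
            reused.append(kv)   # this KV block need not be recomputed
            node = child
        return reused

cache = PrefixCache()
cache.insert([1, 2, 3], ["kv1", "kv2", "kv3"])  # request A's prompt
print(len(cache.longest_prefix([1, 2, 9])))     # request B reuses 2 blocks
```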
AI Programming Lab
Apr 5, 2026 · Artificial Intelligence

Do You Really Understand Tokens? A Deep Dive Starting from a Claude Code Session

The article explains what tokens are, how different models tokenize text, the role of token embeddings, positional encoding, self‑attention, KV cache, and why output tokens cost far more than input tokens, while also covering pricing differences and prompt‑caching savings across major LLM providers.

KV cache · LLM pricing · Large Language Model
13 min read
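A quick way to see tokenization concretely, using OpenAI's tiktoken as one example; other providers' tokenizers split the same text differently.

```python
# Inspect how a BPE tokenizer splits text, using tiktoken as one
# concrete example of the behavior the article discusses.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("KV cache cuts decode-time recomputation.")
pieces = [enc.decode_single_token_bytes(i) for i in ids]
print(ids)      # token IDs the model actually consumes
print(pieces)   # the byte chunks each ID stands for
```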
AI Tech Publishing
Apr 5, 2026 · Artificial Intelligence

Why the First Token Is Slow: A Deep Dive into KV Cache for LLM Inference

The article explains how KV cache eliminates redundant computations in autoregressive LLM generation, detailing the attention mechanism, the O(n²) waste of recomputing K and V, the cache‑based solution, its impact on time‑to‑first‑token, and the memory‑vs‑speed trade‑off.

Attention · KV cache · LLM
7 min read
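A minimal numpy sketch of the cache-based solution the article describes: each decode step projects only the newest token and appends its K/V; weights and dimensions are toy values.

```python
# Decode-time KV caching in miniature: per step, project only the new
# token and append its K/V instead of re-projecting the whole prefix.
import numpy as np

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
K_cache, V_cache = [], []

def decode_step(x_new):
    q = x_new @ Wq
    K_cache.append(x_new @ Wk)   # O(1) new projection work per step...
    V_cache.append(x_new @ Wv)   # ...instead of recomputing all K/V
    K, V = np.stack(K_cache), np.stack(V_cache)
    w = np.exp(q @ K.T / np.sqrt(d))
    return (w / w.sum()) @ V     # attend over everything seen so far

for _ in range(5):
    out = decode_step(rng.normal(size=d))
print(out.shape)  # (8,)
```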
ShiZhen AI
Apr 2, 2026 · Artificial Intelligence

How KV Cache Works and Why Large Model Outputs Cost Five Times More Than Inputs

The article explains the KV Cache mechanism that stores previously computed key/value vectors to avoid redundant Transformer calculations, delivering roughly a 5× speedup, while also detailing why generating output tokens is far more expensive than processing input tokens due to serial generation and memory trade‑offs.

KV cache · LLM inference · Prefill
9 min read
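The arithmetic behind the speedup claim, in a few lines; the token count is illustrative.

```python
# Without a KV cache, step t re-projects K/V for all t previous tokens;
# with the cache, each step projects exactly one token.
n = 1024  # generated tokens (illustrative)

kv_projections_without_cache = sum(t for t in range(1, n + 1))  # n(n+1)/2
kv_projections_with_cache = n

print(kv_projections_without_cache)                              # 524800
print(kv_projections_without_cache / kv_projections_with_cache)  # ~512x fewer
# The end-to-end wall-clock gain is far smaller (the article cites ~5x)
# because attention is only part of each step and decoding remains serial
# and memory-bandwidth-bound.
```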
SuanNi
Mar 26, 2026 · Artificial Intelligence

TurboQuant: Google’s 6× KV Cache Compression With Zero Accuracy Loss

TurboQuant, a new technique from Google Research, dramatically compresses key‑value caches by up to six times without precision loss, using PolarQuant and QJL algorithms to transform vectors into polar coordinates and apply quantized Johnson‑Lindenstrauss transforms, thereby boosting inference speed and enabling longer context handling for large language models.

AI compression · KV cache · Performance
13 min read
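A toy illustration of polar-coordinate quantization in 2-D, the general idea the article attributes to PolarQuant; this is emphatically not TurboQuant's actual algorithm.

```python
# Toy polar-coordinate quantization: store a 2-D slice of a vector as a
# low-bit (radius, angle) pair. Didactic only -- not TurboQuant.
import numpy as np

def quantize_polar(xy, r_bits=4, theta_bits=4, r_max=4.0):
    r = min(np.hypot(*xy), r_max)
    theta = np.arctan2(xy[1], xy[0])             # angle in [-pi, pi]
    r_q = round(r / r_max * (2**r_bits - 1))     # radius -> 4-bit code
    t_q = round((theta + np.pi) / (2 * np.pi) * (2**theta_bits - 1))
    return r_q, t_q                              # 8 bits vs 64 for an fp32 pair

def dequantize_polar(r_q, t_q, r_bits=4, theta_bits=4, r_max=4.0):
    r = r_q / (2**r_bits - 1) * r_max
    theta = t_q / (2**theta_bits - 1) * 2 * np.pi - np.pi
    return np.array([r * np.cos(theta), r * np.sin(theta)])

x = np.array([1.2, -0.7])
print(x, "->", dequantize_polar(*quantize_polar(x)))  # close, at 8x compression
```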
MaGe Linux Operations
Mar 10, 2026 · Artificial Intelligence

Why Your LLM Service Hits CUDA OOM and How to Diagnose GPU Memory Issues

This guide explains the five common sources of GPU memory consumption in large‑model inference services, walks through a step‑by‑step diagnosis workflow, from static usage and KV‑Cache analysis to concurrency and K8s scheduling, and offers concrete command‑line checks, scripts, configuration examples, and actionable remediation and monitoring recommendations.

GPU memory · KV cache · LLM OOM
28 min read
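As one concrete starting point for the static-usage step, a PyTorch sketch that reads the CUDA allocator's counters from inside the serving process:

```python
# Read PyTorch's CUDA allocator counters to separate live-tensor usage
# from allocator reservations -- a first step in OOM diagnosis.
import torch

def report_gpu_memory(device=0):
    alloc = torch.cuda.memory_allocated(device) / 2**30    # live tensors
    reserved = torch.cuda.memory_reserved(device) / 2**30  # allocator pool
    peak = torch.cuda.max_memory_allocated(device) / 2**30
    total = torch.cuda.get_device_properties(device).total_memory / 2**30
    print(f"allocated {alloc:.2f} GiB | reserved {reserved:.2f} GiB | "
          f"peak {peak:.2f} GiB | device total {total:.2f} GiB")
    # A large reserved-vs-allocated gap suggests fragmentation; a peak
    # near device total under load points at KV cache or batch sizing.

if torch.cuda.is_available():
    report_gpu_memory()
```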
AI Explorer
Mar 3, 2026 · Artificial Intelligence

How LMCache’s Lightning‑Fast KV Cache Slashes LLM First‑Token Latency

LMCache separates the KV cache from a vLLM instance into a shared service, dramatically cutting first‑token latency for repeated text, enabling multiple GPU instances to reuse cached vectors, improving hardware utilization, and supporting use cases such as long‑document QA, multi‑GPU load balancing, and prompt‑engineering, with a quick Docker‑based demo.

Docker · KV cache · LLM inference
6 min read
Machine Learning Algorithms & Natural Language Processing
Feb 28, 2026 · Artificial Intelligence

How DualPath Revives Idle Network Cards to Break Long‑Context I/O Bottlenecks in DeepSeek V4

The article analyzes the KV‑Cache storage I/O bottleneck that limits agentic LLM inference, introduces the DualPath architecture with a storage‑to‑decode data path and RDMA‑based scheduling, and shows up to 1.87× offline and 1.96× online throughput gains on large‑scale GPU clusters.

DeepSeek · DualPath · KV cache
13 min read
Machine Learning Algorithms & Natural Language Processing
Feb 27, 2026 · Artificial Intelligence

Can DeepSeek’s DualPath Break GPU Bottlenecks and Ignite an Agentic AI Surge?

DeepSeek’s new DualPath inference framework, co‑developed with leading Chinese universities, decouples compute from KV‑Cache memory access to eliminate I/O stalls in multi‑round agentic workloads, delivering up to nearly 2× higher throughput and dramatically reducing job‑completion time across several large‑scale LLMs.

AI infrastructure · Agentic Inference · DeepSeek
13 min read
Alibaba Cloud Developer
Jan 26, 2026 · Artificial Intelligence

How We Scaled a 3.5B MoE LLM for Real‑Time Search Relevance

This article details the engineering challenges and solutions for deploying a 3.5 billion‑parameter MoE LLM in Taobao's search relevance pipeline, covering large‑batch scheduling, dynamic load balancing, intra‑batch KV‑Cache reuse, and MoE kernel tuning to meet sub‑second latency requirements.

KV cache · LLM · Load Balancing
15 min read
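A didactic sketch of the intra-batch KV-cache reuse idea as the article frames it: (query, item) pairs sharing a query can reuse one prefill. The grouping logic and names are assumptions, not Taobao's code.

```python
# Many (query, item) pairs in one relevance batch share the same query
# text, so the query's KV blocks can be prefilled once per group.
from collections import defaultdict

def group_by_shared_prefix(pairs):
    groups = defaultdict(list)
    for query, item in pairs:
        groups[query].append(item)   # all items scored under one query
    return groups

batch = [("red dress", "item_17"), ("red dress", "item_42"),
         ("usb-c hub", "item_03"), ("red dress", "item_99")]

for query, items in group_by_shared_prefix(batch).items():
    # prefill KV for `query` once, then score each item against it
    print(f"prefill '{query}' once -> reuse for {len(items)} items")
```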
PaperAgent
Jan 21, 2026 · Artificial Intelligence

Inside DeepSeek’s FlashMLA Update: What’s New in the MODEL1 Architecture

DeepSeek’s recent FlashMLA update introduces the new MODEL1, featuring a tighter KV-Cache layout, an extra two-stage cache, and a fixed 512×512 head dimension, with four code changes detailed in a public GitHub commit and illustrated by comparative diagrams.

AI Architecture · DeepSeek · FlashMLA
3 min read