Tagged articles

long-context inference

3 articles · Page 1 of 1
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Jun 2, 2026 · Artificial Intelligence

OSCAR Beats TurboQuant: 2‑Bit KV‑Cache for Fast, Stable Long‑Context Inference

OSCAR presents an attention‑aware rotation scheme that compresses KV caches to true 2‑bit, cutting memory usage by up to 8× and boosting decode throughput by up to 7×, while preserving inference quality within a few points of BF16 across multiple models and long‑context benchmarks, outperforming TurboQuant.

2-bit quantizationKV cacheOSCAR
0 likes · 13 min read
OSCAR Beats TurboQuant: 2‑Bit KV‑Cache for Fast, Stable Long‑Context Inference
Old Zhang's AI Learning
Old Zhang's AI Learning
May 1, 2026 · Artificial Intelligence

DeepSeek‑V4 Local Deployment: How SGLang Overcomes the Architecture Challenges

The article analyzes DeepSeek‑V4's architectural innovations—including mixed sparse attention, mHC, and native FP4 weights—explains SGLang's ShadowRadix, HiSparse, and in‑graph speculative decoding solutions, presents benchmark gains, provides Docker deployment steps, and warns of key pitfalls for long‑context inference.

DeepSeek-V4HiSparseSGLang
0 likes · 15 min read
DeepSeek‑V4 Local Deployment: How SGLang Overcomes the Architecture Challenges