Tagged articles
11 articles
Machine Heart
Apr 29, 2026 · Artificial Intelligence

LCA Boosts Long-Context Inference: 2.5× Speedup and 90% KV Cache Reduction

Latent‑Condensed Attention (LCA) cuts KV‑cache memory by 90%, speeds up prefill by 2.5×, and lowers decode latency by a factor of 1.8 at 128K‑token contexts, while adding no parameters and preserving model performance across diverse LLMs.

Inference Acceleration · KV cache reduction · LCA
0 likes · 10 min read
Baobao Algorithm Notes
Apr 27, 2026 · Artificial Intelligence

Deep Dive into DeepSeek‑V4: Efficient Million‑Token Context, Hybrid Attention, and Muon Optimizer

The article provides an in‑depth technical analysis of DeepSeek‑V4, detailing its novel hybrid attention architecture (CSA and HCA), the manifold‑constrained hyper‑connection (mHC), massive KV‑cache reductions, FLOPs savings across token lengths, and the Muon optimizer with Newton‑Schulz orthogonalization, all backed by concrete benchmark tables and code snippets.

DeepSeek · KV cache reduction · Muon optimizer
0 likes · 61 min read
AI Large-Model Wave and Transformation Guide
Mar 28, 2026 · Artificial Intelligence

From RNNs to Multimodal Agents: A Decade of Transformer Evolution

This article traces the evolution of sequence models from early RNN/LSTM designs through the breakthrough Transformer, its major branches, dense scaling, efficiency‑focused variants, next‑generation linear‑complexity SSMs, and finally multimodal agent architectures, highlighting each stage's strengths, weaknesses, and typical use cases.

AI Architecture · LLM · Multimodal
0 likes · 12 min read
Machine Learning Algorithms & Natural Language Processing
Mar 20, 2026 · Artificial Intelligence

Why Kimi Dropped Residual Connections: A First‑Person Deep Dive into Attention Residuals

This article explains how Attention Residuals (AttnRes) replace traditional residual shortcuts with layer‑wise attention, details the mathematical reformulation, design constraints, static‑Q trick, full and block variants, and presents experimental evidence of significant accuracy gains with modest overhead.

NLP · Neural Networks · RMSNorm
0 likes · 11 min read
AI Frontier Lectures
Mar 19, 2026 · Artificial Intelligence

Can Circulant Attention Reduce Vision Transformer Cost by 7×?

The article reviews the AAAI 2026 paper "Vision Transformers are Circulant Attention Learners", explaining how modeling self‑attention as a block‑circulant matrix enables FFT‑based multiplication that avoids the quadratic complexity of standard attention, achieving up to a sevenfold inference speedup while preserving accuracy on the ImageNet, COCO, and ADE20K benchmarks.

BCCB Matrix · Circulant Attention · Computer Vision
0 likes · 15 min read
Huawei Cloud Developer Alliance
Oct 31, 2025 · Artificial Intelligence

Beyond Transformers: Exploring Post‑Transformer Architectures for Long‑Sequence Modeling

This article reviews the emerging post‑Transformer research landscape, covering linear state‑space models, efficient attention approximations, MLP/conv/RNN hybrids, sparse and causal attention mechanisms, and outlines future trends that may complement or replace the classic Transformer architecture for handling ultra‑long sequences.

AI · Hybrid Architecture · State Space Model
0 likes · 17 min read
Data Party THU
Oct 16, 2025 · Artificial Intelligence

How Tensor Product Attention Redefines Long‑Context Transformers

The article analyzes the Tensor Product Attention (TPA) method presented at NeurIPS 2025, explaining how it factorizes the Q, K, and V tensors to drastically reduce KV cache size and attention complexity, and reports better convergence, lower perplexity, and faster inference on long‑sequence tasks than existing attention variants.

KV cache · RoPE · Tensor Product Attention
0 likes · 11 min read
AIWalker
Jan 17, 2025 · Artificial Intelligence

How CLEAR Cuts Attention Compute by 99.5% and Enables Efficient On‑Device Text‑to‑Image Diffusion

The CLEAR method linearizes pretrained Diffusion Transformers by restricting attention to a local window, which reduces attention FLOPs by 99.5% and accelerates 8K image generation by 6.3× while preserving quality. It also supports multi‑GPU patch‑wise inference for high‑resolution text‑to‑image synthesis.

Diffusion Transformers · High‑Resolution Image Generation · Linear Attention
0 likes · 21 min read
NewBeeNLP
Aug 3, 2024 · Artificial Intelligence

Extending LLM Context to 1M Tokens: SAMBA, CoPE, RoPE, Retrieval Heads & Infini‑Attention

This article reviews recent research on extending large language model context windows to millions of tokens, covering SAMBA's hybrid architecture, Contextual Position Encoding (CoPE), the theory relating the RoPE base to context length, Retrieval Head analysis, and the memory‑efficient Infini‑Attention mechanism.

LLM research · Large Language Models · efficient attention
0 likes · 10 min read
DataFunSummit
Jul 18, 2022 · Artificial Intelligence

Advances in Natural Language Generation: ProphetNet, Knowledge‑Enhanced Generation, Non‑Autoregressive Pre‑training, Long‑Text Modeling, and Efficient Attention

This talk presents the past year's research on natural language generation, covering the ProphetNet pre‑trained generation model, external‑knowledge integration for generation, non‑autoregressive pre‑training (BANG), the Poolingformer long‑text architecture, EL‑attention for faster decoding, and a new multi‑task generation benchmark.

efficient attention · knowledge integration · long‑text modeling
0 likes · 22 min read
Meituan Technology Team
Mar 24, 2022 · Artificial Intelligence

Twins: Efficient Visual Attention Models for Vision Transformers

The Twins series, a collaboration between Meituan and the University of Adelaide, introduces conditional positional encoding and spatially separable self‑attention to improve efficiency and performance of vision transformers, achieving state‑of‑the‑art results on ImageNet, ADE20K, COCO and high‑precision map segmentation.

ADE20K · COCO · Conditional Positional Encoding
0 likes · 20 min read