Tagged articles

11 articles

Apr 29, 2026 · Artificial Intelligence

LCA Boosts Long-Context Inference: 2.5× Speedup and 90% KV Cache Reduction

The Latent‑Condensed Attention (LCA) method cuts KV‑cache memory by 90%, speeds up prefill by 2.5×, and cuts decode latency by a factor of 1.8 for 128K‑token contexts, while adding no extra parameters and preserving model performance across diverse LLMs.

Inference Acceleration · KV cache reduction · LCA

10 min read

Baobao Algorithm Notes

Apr 27, 2026 · Artificial Intelligence

Deep Dive into DeepSeek‑V4: Efficient Million‑Token Context, Hybrid Attention, and Muon Optimizer

The article provides an in‑depth technical analysis of DeepSeek‑V4, detailing its novel hybrid attention architecture (CSA and HCA), the manifold‑constrained hyper‑connection (mHC), massive KV‑cache reductions, FLOPs savings across token lengths, and the Muon optimizer with Newton‑Schulz orthogonalization, all backed by concrete benchmark tables and code snippets.

DeepSeek · KV cache reduction · Muon optimizer

61 min read

AI Large-Model Wave and Transformation Guide

Mar 28, 2026 · Artificial Intelligence

From RNNs to Multimodal Agents: A Decade of Transformer Evolution

This article traces the evolution of sequence models from early RNN/LSTM designs through the breakthrough Transformer, its major branches, dense scaling, efficiency‑focused variants, next‑generation linear‑complexity SSMs, and finally multimodal agent architectures, highlighting each stage's strengths, weaknesses, and typical use cases.

AI Architecture · LLM · Multimodal

12 min read

Machine Learning Algorithms & Natural Language Processing

Mar 20, 2026 · Artificial Intelligence

Why Kimi Dropped Residual Connections: A First‑Person Deep Dive into Attention Residuals

This article explains how Attention Residuals (AttnRes) replace traditional residual shortcuts with layer‑wise attention, details the mathematical reformulation, design constraints, static‑Q trick, full and block variants, and presents experimental evidence of significant accuracy gains with modest overhead.

NLP · Neural Networks · RMSNorm

11 min read

AI Frontier Lectures

Mar 19, 2026 · Artificial Intelligence

Can Circulant Attention Reduce Vision Transformer Cost by 7×?

The article reviews the AAAI 2026 paper "Vision Transformers are Circulant Attention Learners", explaining how modeling self‑attention as a Block‑Circulant matrix enables FFT‑based multiplication that cuts the quadratic complexity of standard attention, achieving up to seven‑fold inference speed‑up while preserving accuracy across ImageNet, COCO and ADE20K benchmarks.

BCCB Matrix · Circulant Attention · Computer Vision

15 min read

Huawei Cloud Developer Alliance

Oct 31, 2025 · Artificial Intelligence

Beyond Transformers: Exploring Post‑Transformer Architectures for Long‑Sequence Modeling

This article reviews the emerging post‑Transformer research landscape, covering linear state‑space models, efficient attention approximations, MLP/conv/RNN hybrids, sparse and causal attention mechanisms, and outlines future trends that may complement or replace the classic Transformer architecture for handling ultra‑long sequences.

AI · Hybrid Architecture · State Space Model

17 min read

Data Party THU

Oct 16, 2025 · Artificial Intelligence

How Tensor Product Attention Redefines Long‑Context Transformers

The article analyzes the Tensor Product Attention (TPA) method presented at NeurIPS 2025, explaining how it factorizes Q, K, V tensors to drastically reduce KV cache size and attention complexity, and demonstrates superior convergence, lower perplexity, and faster inference on long‑sequence tasks compared with existing attention variants.

KV cache · RoPE · Tensor Product Attention

11 min read

AIWalker

Jan 17, 2025 · Artificial Intelligence

How CLEAR Cuts Attention Compute by 99.5% and Enables Efficient On‑Device Text‑to‑Image Diffusion

The CLEAR method linearizes pretrained Diffusion Transformers by restricting attention to a local window, reducing attention FLOPs by 99.5%, accelerating 8K image generation 6.3× while preserving quality, and supporting multi‑GPU patch‑wise inference for high‑resolution text‑to‑image synthesis.

Diffusion Transformers · High‑Resolution Image Generation · Linear Attention

21 min read

NewBeeNLP

Aug 3, 2024 · Artificial Intelligence

Extending LLM Context to 1M Tokens: SAMBA, CoPE, RoPE, Retrieval Heads & Infini‑Attention

This article reviews recent research on extending large language model context windows to millions of tokens, covering SAMBA's hybrid architecture, Contextual Position Encoding (CoPE), RoPE base length theory, Retrieval Head analysis, and the memory‑efficient Infini‑Attention mechanism.

LLM research · Large Language Models · efficient attention

10 min read

DataFunSummit

Jul 18, 2022 · Artificial Intelligence

Advances in Natural Language Generation: ProphetNet, Knowledge‑Enhanced Generation, Non‑Autoregressive Pre‑training, Long‑Text Modeling, and Efficient Attention

This talk presents the past year's research on natural language generation, covering the ProphetNet pre‑trained generation model, external‑knowledge integration for generation, non‑autoregressive pre‑training (BANG), the Poolingformer long‑text architecture, EL‑attention for faster decoding, and a new multi‑task generation benchmark.

efficient attention · knowledge integration · long‑text modeling

22 min read

Meituan Technology Team

Mar 24, 2022 · Artificial Intelligence

Twins: Efficient Visual Attention Models for Vision Transformers

The Twins series, a collaboration between Meituan and the University of Adelaide, introduces conditional positional encoding and spatially separable self‑attention to improve efficiency and performance of vision transformers, achieving state‑of‑the‑art results on ImageNet, ADE20K, COCO and high‑precision map segmentation.

ADE20K · COCO · Conditional Positional Encoding

20 min read
