Tagged articles

Sparse attention

54 articles · Page 1 of 1
IT Services Circle
IT Services Circle
Jun 24, 2026 · Artificial Intelligence

Why 1M Context Length Matters: Inside GLM 5.2’s New Techniques

The article examines how 1‑million‑token context has become a standard feature in modern LLMs, explains the compute and memory challenges it brings, reviews the sparse‑attention and token‑selection tricks (including GLM 5.2’s IndexShare and LayerSplit), and outlines practical evaluation methods for measuring long‑context effectiveness.

1M contextGLM-5.2IndexShare
0 likes · 10 min read
Why 1M Context Length Matters: Inside GLM 5.2’s New Techniques
AI Engineer Programming
AI Engineer Programming
Jun 17, 2026 · Artificial Intelligence

Local LLMs Viable: Sparse Attention, MoE, KV Compression, Multi‑Token Prediction

In early 2026, open‑source local large language models become practical alternatives thanks to sparse attention, MoE routing, latent KV compression, multi‑token prediction, and 4‑bit quantization, while hardware memory shortages and benchmark gaps with closed‑source models shape their deployment choices.

4-bit quantizationKV compressionMixture of Experts
0 likes · 13 min read
Local LLMs Viable: Sparse Attention, MoE, KV Compression, Multi‑Token Prediction
AI Open-Source Efficiency Guide
AI Open-Source Efficiency Guide
Jun 12, 2026 · Artificial Intelligence

MiniMax Open-Source MSA: High‑Performance Attention Kernels Optimized for NVIDIA SM100

MiniMax Sparse Attention (MSA) is an open‑source library that delivers high‑performance dense and block‑sparse attention operators for NVIDIA SM100 GPUs by combining a Jinja‑based csrc JIT stack with a Cutlass Python DSL (CuTe‑DSL), enabling low‑precision quantization, paging support, and seamless migration from dense code.

AI KernelsCuTe-DSLCutlass
0 likes · 5 min read
MiniMax Open-Source MSA: High‑Performance Attention Kernels Optimized for NVIDIA SM100
Kuaishou Tech
Kuaishou Tech
Jun 11, 2026 · Artificial Intelligence

Keye-VL-2.0 Brings DeepSeek Sparse Attention to Multimodal AI – Report Released

Keye‑VL‑2.0, an open‑source MoE multimodal foundation model, tackles hour‑level video understanding and agentic intelligence by embedding DeepSeek Sparse Attention into a GQA‑based architecture, enabling near‑lossless 256 K token context, four‑stage pre‑training, diverse RL distillation techniques, and achieving state‑of‑the‑art results on long‑video benchmarks, with weights publicly released.

MoEMultimodalRL distillation
0 likes · 8 min read
Keye-VL-2.0 Brings DeepSeek Sparse Attention to Multimodal AI – Report Released
Machine Heart
Machine Heart
Jun 5, 2026 · Artificial Intelligence

Stem Sparse Attention Cuts First-Token Latency by 3.6× for Long-Context LLMs

The article introduces Tencent Hunyuan's Stem sparse‑attention algorithm, which reduces first‑token latency by 3.6× on 128K context LLMs by reallocating compute with Token Position Decay and Output‑Aware Metric, and validates the gains with HPC‑optimized operators that outperform existing sparse methods in extensive benchmarks.

HPC OperatorsLLM InferenceOutput-Aware Metric
0 likes · 11 min read
Stem Sparse Attention Cuts First-Token Latency by 3.6× for Long-Context LLMs
Baobao Algorithm Notes
Baobao Algorithm Notes
Jun 2, 2026 · Artificial Intelligence

MiniMax M3: How a 1M‑Token, Multimodal Agent Reproduces ICLR Research and Automates Kaggle Competitions

The MiniMax M3 model combines a 1‑million‑token context window, native multimodal training and a new MiniMax Sparse Attention architecture that cuts token compute to one‑twentieth of its predecessor, achieving up to 15× faster decoding, while its interactive user‑simulator training enables fully autonomous agents that can reproduce ICLR‑2025 research and tackle Auto‑Kaggle competitions at a fraction of the cost of Western models.

Agentic AIAuto KaggleLarge Language Model
0 likes · 9 min read
MiniMax M3: How a 1M‑Token, Multimodal Agent Reproduces ICLR Research and Automates Kaggle Competitions
Machine Heart
Machine Heart
Jun 1, 2026 · Artificial Intelligence

MiniMax M3: First Open‑Source Model to Achieve the Frontier Trio – Our Three‑Task Evaluation

MiniMax M3 claims to be the first open‑source LLM that simultaneously delivers top‑tier coding/agentic ability, a 1‑million‑token context window, and native multimodal understanding, and our benchmarks on coding suites, long‑context efficiency, and multimodal tasks confirm it exceeds expectations.

1M contextMiniMax M3Multimodal AI
0 likes · 15 min read
MiniMax M3: First Open‑Source Model to Achieve the Frontier Trio – Our Three‑Task Evaluation
SuanNi
SuanNi
Jun 1, 2026 · Artificial Intelligence

MiniMax M3 Beats GPT‑5.5 in Programming and Goes Open‑Source

MiniMax M3, a domestically developed LLM, combines a new sparse‑attention MSA architecture, native multimodal support, and million‑token context to match or surpass top closed‑source models in programming and agent benchmarks, while achieving a 9.4× speedup on FP8 GEMM and preparing for open‑source release.

AIBenchmarkingFP8 GEMM
0 likes · 12 min read
MiniMax M3 Beats GPT‑5.5 in Programming and Goes Open‑Source
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
May 29, 2026 · Artificial Intelligence

RTPurbo: >97% Sparsity and 9× Faster Long-Context LLM Inference with Minimal Training

The article presents RTPurbo, a lightweight two‑stage training method that converts full‑attention LLMs into highly sparse models with over 97% sparsity, achieving up to 9.36× prefill and 2.01× decode speedups while preserving near‑lossless accuracy across long‑context benchmarks up to 512K tokens.

Dynamic Token SelectionKernel OptimizationLLM Inference
0 likes · 17 min read
RTPurbo: >97% Sparsity and 9× Faster Long-Context LLM Inference with Minimal Training
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
May 28, 2026 · Artificial Intelligence

Solo Development of GQLA: Challenging DeepSeek’s MLA and DSA

This article presents GQLA, a single‑author variant of MLA that eliminates three hardware‑related drawbacks of MLA, demonstrates how it achieves balanced compute‑memory performance on both high‑end H100 and more modest H20 GPUs, and details conversion methods (TransGQLA) and sparse extensions with concrete benchmark results.

GQLALLMMLA
0 likes · 16 min read
Solo Development of GQLA: Challenging DeepSeek’s MLA and DSA
SuanNi
SuanNi
May 27, 2026 · Artificial Intelligence

Inside Grok-5 and MiniMax-M3: Massive Model Upscale and New Sparse Attention Gains

The article reveals that xAI’s upcoming Grok-5 (Grok V9-Medium) will feature a 1.5-trillion-parameter model trained with extensive Cursor programming data, while MiniMax-M3 introduces a new sparse-attention architecture that boosts pre-fill speed by 9.7× and decode speed by 15.6×, highlighting a strategic partnership between SpaceX, Cursor, and xAI.

AI modelsCursorGrok-5
0 likes · 5 min read
Inside Grok-5 and MiniMax-M3: Massive Model Upscale and New Sparse Attention Gains
Data Party THU
Data Party THU
May 10, 2026 · Artificial Intelligence

SpikingBrain 2.0 Breaks Long‑Sequence and Low‑Power Bottlenecks in Brain‑Inspired LLMs

The Chinese Academy of Sciences unveils SpikingBrain 2.0‑5B, a brain‑inspired large model that uses dual‑space sparse attention and dual activation (FP8 and INT8‑Spiking) to cut training cost by over tenfold, achieve up to 15× speedup on long sequences, and match Qwen‑3 performance while drastically reducing power consumption.

Large Language ModelSparse attentionSpikingBrain2.0
0 likes · 10 min read
SpikingBrain 2.0 Breaks Long‑Sequence and Low‑Power Bottlenecks in Brain‑Inspired LLMs
Lao Guo's Learning Space
Lao Guo's Learning Space
Apr 30, 2026 · Artificial Intelligence

How DeepSeek V4’s CSA + HCA Break the Million‑Token Barrier

Traditional full‑attention cannot handle million‑token contexts due to exponential compute and memory growth, but DeepSeek V4’s Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) compress, sparsely index, and precisely compute tokens, cutting KV cache to 10% and FLOPs to 27% while enabling a 1‑M token window on a single GPU.

Attention MechanismCSAHCA
0 likes · 12 min read
How DeepSeek V4’s CSA + HCA Break the Million‑Token Barrier
Machine Heart
Machine Heart
Apr 30, 2026 · Artificial Intelligence

Beyond DeepSeek V4: A Trillion‑Parameter LLM Trained End‑to‑End on Domestic Chips

The article analyzes how both DeepSeek V4 and Meituan's LongCat‑2.0‑P preview, each with trillion‑scale parameters and 1 M‑token context, were trained and inferred entirely on Chinese‑made accelerators, detailing memory optimizations, deterministic operators, MoE redesigns, and massive multi‑card clusters that prove domestic compute can meet top‑tier AI workloads.

Deterministic OpsDomestic AI ChipLarge Language Model
0 likes · 13 min read
Beyond DeepSeek V4: A Trillion‑Parameter LLM Trained End‑to‑End on Domestic Chips
Architects' Tech Alliance
Architects' Tech Alliance
Apr 29, 2026 · Artificial Intelligence

DeepSeek V4: Open‑Source Bombshell That Shakes Closed‑Source AI Giants

DeepSeek V4’s preview launch unveils two open‑source LLM variants—V4‑Pro with 1.6 T parameters and V4‑Flash with 284 B—both supporting a default 1 M‑token context, and introduces novel mHC residual scheduling, hybrid CSA/HCA sparse attention, and Muon optimizer tricks that together deliver top‑tier performance rivaling closed‑source models across coding, long‑text, and reasoning benchmarks.

DeepSeekLarge Language ModelOpen-source AI
0 likes · 10 min read
DeepSeek V4: Open‑Source Bombshell That Shakes Closed‑Source AI Giants
AI Era Action Guide
AI Era Action Guide
Apr 24, 2026 · Artificial Intelligence

DeepSeek-V4 Launches with 1M Token Context and Leading Open-Source Agent – A Chinese AI Milestone

DeepSeek has unveiled the V4 preview, offering two open‑source large language models—Pro (1.6 T parameters) and Flash (284 B)—both supporting 1 million‑token context, sparse‑attention efficiency gains, top‑ranked Agent capabilities, and competitive reasoning performance, marking a major milestone for Chinese AI.

1M token contextAgentDeepSeek
0 likes · 5 min read
DeepSeek-V4 Launches with 1M Token Context and Leading Open-Source Agent – A Chinese AI Milestone
Tech Musings
Tech Musings
Apr 24, 2026 · Artificial Intelligence

DeepSeek-V4 Unveiled: 1M Context Length and Ascend Compute Power

DeepSeek has launched the open‑source DeepSeek‑V4 series, offering Pro and Flash models with a 1 million token context window, a novel sparse attention mechanism, performance that rivals Opus 4.6 on coding and knowledge benchmarks, tiered pricing, and future cost reductions once Ascend 950 supernodes become widely available.

1M contextAI benchmarkingDeepSeek-V4
0 likes · 5 min read
DeepSeek-V4 Unveiled: 1M Context Length and Ascend Compute Power
AI Engineering
AI Engineering
Apr 24, 2026 · Artificial Intelligence

DeepSeek V4 Unveiled: How Its Million-Token Context Redefines Open-Source LLMs

DeepSeek released the V4 preview, introducing V4‑Pro (1.6 T parameters, 49 B activation neurons, 33 T tokens) and V4‑Flash (284 B parameters, 13 B activation neurons, 32 T tokens) with 1 M token context, a novel DSA sparse attention that reduces compute and memory, and performance that rivals top closed‑source models in agentic coding, world‑knowledge and reasoning benchmarks, while offering an API compatible with OpenAI and Anthropic.

DeepSeekLarge Language ModelOpenAI API Compatibility
0 likes · 5 min read
DeepSeek V4 Unveiled: How Its Million-Token Context Redefines Open-Source LLMs
ArcThink
ArcThink
Apr 11, 2026 · Artificial Intelligence

DeepSeek V4 Preview: A Sovereign Shift Beyond Benchmarks

Developers can sift through official silence and industry leaks—internal statements, Ascend 950PR supply‑chain hints, and sparse‑attention innovations—to assess DeepSeek V4’s likely technical leaps, from million‑token context to native Ascend training, and its strategic impact on the open‑source AI landscape and CUDA independence.

AI model analysisDeepSeekHuawei Ascend
0 likes · 27 min read
DeepSeek V4 Preview: A Sovereign Shift Beyond Benchmarks
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Mar 24, 2026 · Artificial Intelligence

A Comprehensive Guide to Major Attention Mechanisms: From MHA and GQA to MLA, Sparse and Hybrid Architectures

This article reviews and compares the most important attention variants used in modern large language models—including multi‑head attention, grouped‑query attention, multi‑head latent attention, sparse and sliding‑window attention, gated attention, and hybrid designs—detailing their motivations, memory trade‑offs, example architectures, and experimental findings.

Hybrid ArchitectureLLMMHA
0 likes · 29 min read
A Comprehensive Guide to Major Attention Mechanisms: From MHA and GQA to MLA, Sparse and Hybrid Architectures
SuanNi
SuanNi
Feb 23, 2026 · Artificial Intelligence

How GLM‑5 Breaks New Ground with Sparse Attention and Asynchronous RL

GLM‑5, the 744‑billion‑parameter open‑source LLM, introduces DeepSeek Sparse Attention, Multi‑latent Attention, Muon Split optimizer, and a fully asynchronous agentic reinforcement‑learning framework, achieving state‑of‑the‑art performance on long‑context, code, math, and multimodal benchmarks while running efficiently on domestic Chinese chips.

GLM-5Open-source AISparse attention
0 likes · 12 min read
How GLM‑5 Breaks New Ground with Sparse Attention and Asynchronous RL
PaperAgent
PaperAgent
Feb 15, 2026 · Artificial Intelligence

How MiniCPM‑SALA Merges Sparse and Linear Attention to Break Long‑Context Limits

MiniCPM‑SALA introduces a hybrid sparse‑linear attention architecture that reduces quadratic compute and memory costs, achieves state‑of‑the‑art performance on long‑context benchmarks, and delivers up to 3.5× faster inference than full‑attention models on sequences up to 1 million tokens.

LLMLinear AttentionLong Context
0 likes · 17 min read
How MiniCPM‑SALA Merges Sparse and Linear Attention to Break Long‑Context Limits
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Feb 12, 2026 · Artificial Intelligence

Is the Transformer Paradigm Shifting? SALA Handles Million‑Token Context on RTX 5090

The article presents SALA, a sparse‑linear hybrid attention architecture that replaces full attention in 9B‑parameter models, achieving comparable accuracy while cutting compute and memory costs, enabling million‑token inference on a single RTX 5090 and delivering up to 3.5× speed‑up over Qwen3‑8B.

Hybrid Position EncodingLLM efficiencyLinear Attention
0 likes · 18 min read
Is the Transformer Paradigm Shifting? SALA Handles Million‑Token Context on RTX 5090
AI Insight Log
AI Insight Log
Feb 12, 2026 · Artificial Intelligence

GLM-5 Unveiled: 744B Parameters, Claude Opus 4.5‑Level Performance, Epic Agent Upgrade

Z.ai released the open‑source GLM‑5 model with 744 billion parameters, 28.5 T tokens of training data, and new Sparse Attention and Slime RL infrastructure, achieving top open‑source rankings and near‑Claude Opus 4.5 performance on Vending Bench 2 and CC‑Bench‑V2 while adding multi‑scenario agent capabilities.

Agentic EngineeringGLM-5Large Language Model
0 likes · 6 min read
GLM-5 Unveiled: 744B Parameters, Claude Opus 4.5‑Level Performance, Epic Agent Upgrade
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Feb 10, 2026 · Artificial Intelligence

Inside GLM-5: 745B Parameters, DeepSeek‑style Sparse Attention, and a 60% Stock Surge

The GLM-5 architecture, uncovered from a GitHub PR, doubles the previous model to 745 B parameters, adopts DeepSeek‑V3 sparse attention and multi‑token prediction, features a 78‑layer MoE with 256 experts, supports a 202K‑token context window, and its rumored test model "Pony Alpha" sparked a 60% rise in Zhipu AI's stock amid a crowded AI release season.

AI Stock ImpactDeepSeekGLM-5
0 likes · 6 min read
Inside GLM-5: 745B Parameters, DeepSeek‑style Sparse Attention, and a 60% Stock Surge
Baobao Algorithm Notes
Baobao Algorithm Notes
Feb 4, 2026 · Artificial Intelligence

Efficient Long-Sequence Modeling: Linear & Sparse Attention, MegaKernels, RL Tricks

This article reviews recent 2025 advances in long‑sequence LLM inference, covering Kimi Linear attention, DuoAttention and DeepSeek Sparse Attention, MegaKernel and MPK designs for kernel‑level efficiency, reinforcement‑learning rollout optimizations, and the Tawa deep‑learning compiler framework.

Deep Learning CompilerLLM OptimizationLinear Attention
0 likes · 22 min read
Efficient Long-Sequence Modeling: Linear & Sparse Attention, MegaKernels, RL Tricks
Alibaba Cloud Developer
Alibaba Cloud Developer
Jan 15, 2026 · Artificial Intelligence

How Hierarchical Sparse Attention Breaks KVCache Limits for Ultra‑Long Context LLMs

This article explains how a hierarchical sparse‑attention framework redesigns KVCache storage across GPU, CPU, and remote memory, eliminates bandwidth and capacity bottlenecks, and enables efficient inference for 128K‑token and larger contexts with dramatically reduced GPU memory usage and higher throughput.

Dynamic Sparse AttentionGPU memory optimizationHierarchical Storage
0 likes · 20 min read
How Hierarchical Sparse Attention Breaks KVCache Limits for Ultra‑Long Context LLMs
Baidu Geek Talk
Baidu Geek Talk
Dec 24, 2025 · Artificial Intelligence

Context Parallelism Slashes TTFT by 80% for 128K-Token LLMs

The article explains how Baidu’s Baige team integrated a Context Parallelism strategy into DeepSeek V3.2, detailing the DSA architecture, the limitations of traditional tensor and sequence parallelism, and how CP distributes computation and memory across GPUs to achieve up to an 80 % reduction in token‑to‑first‑token latency for ultra‑long 128K‑token contexts.

Context ParallelismDeepSeekLLM
0 likes · 9 min read
Context Parallelism Slashes TTFT by 80% for 128K-Token LLMs
Baidu Intelligent Cloud Tech Hub
Baidu Intelligent Cloud Tech Hub
Dec 24, 2025 · Artificial Intelligence

How Context Parallelism Slashes LLM First‑Token Latency by 80% for 128K Tokens

The article explains how the newly merged Context Parallelism (CP) technique in SGLang, combined with DeepSeek V3.2's Sparse Attention architecture, reduces first‑token latency by up to 80% and alleviates memory pressure for ultra‑long 128K‑token sequences, detailing both algorithmic innovations and engineering solutions.

AI InfrastructureContext ParallelismDistributed Inference
0 likes · 10 min read
How Context Parallelism Slashes LLM First‑Token Latency by 80% for 128K Tokens
Baidu Geek Talk
Baidu Geek Talk
Dec 10, 2025 · Artificial Intelligence

How Offloading Latent Cache Boosts DeepSeek‑V3.2‑Exp Decoding Throughput

This report analyzes the memory bottleneck of DeepSeek‑V3.2‑Exp’s sparse‑attention decoder, proposes the Expanded Sparse Server (ESS) to offload the latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach dramatically improves decode throughput while keeping latency within acceptable limits.

Cache offloadGPU memoryLLM Inference
0 likes · 20 min read
How Offloading Latent Cache Boosts DeepSeek‑V3.2‑Exp Decoding Throughput
Data Party THU
Data Party THU
Dec 10, 2025 · Artificial Intelligence

How DeepSeek‑V3.2 Cuts Inference Cost and Boosts Agent Skills with Sparse Attention

DeepSeek's V3.2 release introduces a dual‑model lineup, a Sparse Attention architecture that halves long‑context inference cost, a post‑training reinforcement‑learning pipeline that exceeds 10% of pre‑training compute, and a revamped agent framework that dramatically improves tool‑use and reasoning performance across benchmarks.

Agentic AIDeepSeekLarge Language Model
0 likes · 11 min read
How DeepSeek‑V3.2 Cuts Inference Cost and Boosts Agent Skills with Sparse Attention
Fun with Large Models
Fun with Large Models
Dec 5, 2025 · Artificial Intelligence

DeepSeek Math V2 & V3.2: A Plain‑Language Deep Dive into Core Innovations

This article provides a detailed, easy‑to‑understand analysis of DeepSeek‑Math‑V2’s self‑verification training method and DeepSeek‑V3.2’s GRPO framework, sparse‑attention DSA mechanism, massive agent data pipeline, and benchmark results that place both models among the world’s top open‑source large language models.

DeepSeekGRPOLLM
0 likes · 19 min read
DeepSeek Math V2 & V3.2: A Plain‑Language Deep Dive into Core Innovations
Baidu Intelligent Cloud Tech Hub
Baidu Intelligent Cloud Tech Hub
Dec 4, 2025 · Artificial Intelligence

How Offloading Latent Cache to CPU Boosts DeepSeek‑V3.2‑Exp Decoding Throughput

This report details the analysis of memory bottlenecks in DeepSeek‑V3.2‑Exp, proposes the Expanded Sparse Server (ESS) that offloads latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach, combined with cache‑warmup and overlap techniques, can double decoding throughput for long‑context inference.

Cache offloadGPU‑CPU optimizationLLM Inference
0 likes · 21 min read
How Offloading Latent Cache to CPU Boosts DeepSeek‑V3.2‑Exp Decoding Throughput
Baidu Intelligent Cloud Tech Hub
Baidu Intelligent Cloud Tech Hub
Nov 25, 2025 · Artificial Intelligence

Why DeepSeek‑V3.2‑Exp Lost Performance and How a Simple RoPE Fix Restored It

The Baidu Baige team discovered that DeepSeek‑V3.2‑Exp’s long‑context performance lagged behind the official report, traced the issue to a subtle RoPE layout mismatch in the open‑source inference demo, collaborated with DeepSeek to fix it, and verified that the model’s speed and accuracy fully recovered across multiple benchmarks.

AI InfrastructureDeepSeekLLM Inference
0 likes · 9 min read
Why DeepSeek‑V3.2‑Exp Lost Performance and How a Simple RoPE Fix Restored It
Huawei Cloud Developer Alliance
Huawei Cloud Developer Alliance
Oct 31, 2025 · Artificial Intelligence

Beyond Transformers: Exploring Post‑Transformer Architectures for Long‑Sequence Modeling

This article reviews the emerging post‑Transformer research landscape, covering linear state‑space models, efficient attention approximations, MLP/conv/RNN hybrids, sparse and causal attention mechanisms, and outlines future trends that may complement or replace the classic Transformer architecture for handling ultra‑long sequences.

AIEfficient AttentionHybrid Architecture
0 likes · 17 min read
Beyond Transformers: Exploring Post‑Transformer Architectures for Long‑Sequence Modeling
Data Party THU
Data Party THU
Oct 25, 2025 · Artificial Intelligence

How InfLLM‑V2 Delivers Fast, Low‑Cost Sparse Attention for Long‑Context LLMs

InfLLM‑V2 introduces a zero‑parameter, train‑efficient sparse‑attention framework that dramatically speeds up long‑sequence processing while requiring only 5 B tokens for training, and the open‑source MiniCPM4.1 model demonstrates comparable performance to dense attention on both long‑text understanding and deep‑thinking benchmarks.

EfficiencyInfLLM-V2MiniCPM4.1
0 likes · 10 min read
How InfLLM‑V2 Delivers Fast, Low‑Cost Sparse Attention for Long‑Context LLMs
Fun with Large Models
Fun with Large Models
Sep 30, 2025 · Artificial Intelligence

DeepSeek-V3.2 Architecture Breakthrough: A 5‑Minute Guide to Its Core Features

The article introduces DeepSeek-V3.2, highlighting its new DeepSeek Sparse Attention (DSA) that boosts training and inference efficiency by up to 50%, cuts model usage costs dramatically, explains the updated API endpoints, and details the four‑stage post‑training pipeline that underpins the model’s performance improvements.

AI ArchitectureDSADeepSeek-V3.2
0 likes · 8 min read
DeepSeek-V3.2 Architecture Breakthrough: A 5‑Minute Guide to Its Core Features
AI Algorithm Path
AI Algorithm Path
May 1, 2025 · Artificial Intelligence

Uncovering the Secrets of LLM Inference Optimization

This article dissects the major bottlenecks of large‑language‑model serving—prefill vs. decode, sparsity, memory bandwidth, KV‑cache growth—and walks through concrete engineering tricks such as paged attention, radix‑tree KV caches, compressed attention, speculative decoding, FlexGen weight scheduling, FastServe queuing, plus a runnable vLLM code snippet.

FastServeFlexGenInference Optimization
0 likes · 18 min read
Uncovering the Secrets of LLM Inference Optimization
AIWalker
AIWalker
Feb 25, 2025 · Artificial Intelligence

Sliding Tile Attention speeds up HunyuanVideo DiT generation 3.5×

Sliding Tile Attention (STA) replaces costly full‑3D attention in video DiT models with a block‑wise sliding‑window scheme, achieving up to 10× attention speedup and a 3.53× end‑to‑end generation boost for HunyuanVideo without quality loss, as demonstrated by extensive benchmarks and kernel analyses.

GPU OptimizationHunyuanVideoSliding Tile Attention
0 likes · 16 min read
Sliding Tile Attention speeds up HunyuanVideo DiT generation 3.5×
Architect
Architect
Feb 24, 2025 · Artificial Intelligence

Inside MoBA: A Sparse Attention Framework for 10‑Million‑Token Contexts

The article details the development, architectural evolution, and practical challenges of MoBA—a sparse attention framework inspired by Mixture‑of‑Experts that scales LLM context length to 10 M tokens, supports seamless switching between full and sparse attention, and is now released as a minimal open‑source solution.

AI ArchitectureContext ParallelLLM training
0 likes · 13 min read
Inside MoBA: A Sparse Attention Framework for 10‑Million‑Token Contexts
Architecture Digest
Architecture Digest
Feb 24, 2025 · Artificial Intelligence

MoBA: Mixture of Block Attention for Long‑Context Large Language Models

The article introduces MoBA, a Mixture‑of‑Block‑Attention mechanism that applies Mixture‑of‑Experts principles to transformer attention, enabling efficient long‑context processing for large language models while maintaining performance comparable to full attention through sparse, trainable block selection and seamless switching.

Attention MechanismLLMLong Context
0 likes · 12 min read
MoBA: Mixture of Block Attention for Long‑Context Large Language Models
Architects' Tech Alliance
Architects' Tech Alliance
Feb 24, 2025 · Artificial Intelligence

NSA: Hardware‑Optimized Sparse Attention Mechanism from DeepSeek, Peking University and University of Washington

The NSA mechanism introduces a three‑branch hardware‑optimized sparse attention architecture—token compression, token selection, and sliding window—combined with learnable gating to balance global and local context, dramatically improving inference speed and efficiency for long‑context large language models.

AI ArchitectureDeepSeekSparse attention
0 likes · 5 min read
NSA: Hardware‑Optimized Sparse Attention Mechanism from DeepSeek, Peking University and University of Washington
ZhongAn Tech Team
ZhongAn Tech Team
Feb 22, 2025 · Artificial Intelligence

How SkyReels, DeepSeek NSA, Grok‑3, and KG²RAG Are Shaping the Next AI Wave

This issue reviews China's first open‑source short‑film model SkyReels‑V1, DeepSeek's Native Sparse Attention breakthrough, xAI's massive Grok‑3 deployment on 200k H100 GPUs, and a knowledge‑graph‑guided RAG framework, highlighting their performance gains, architectural innovations, and industry impact.

AIIndustry TrendsKnowledge Graph
0 likes · 15 min read
How SkyReels, DeepSeek NSA, Grok‑3, and KG²RAG Are Shaping the Next AI Wave
AIWalker
AIWalker
Feb 19, 2025 · Artificial Intelligence

DeepSeek’s NSA Attention Cuts Inference Time 11× – CEO Liang Co‑author

DeepSeek introduces the NSA sparse attention mechanism, combining dynamic hierarchical sparsity, coarse token compression and fine token selection to achieve up to 11.6× faster inference, lower pre‑training cost, and superior benchmark performance across general, long‑context, and chain‑of‑thought tasks.

DeepSeekLLM OptimizationNSA
0 likes · 9 min read
DeepSeek’s NSA Attention Cuts Inference Time 11× – CEO Liang Co‑author
Software Engineering 3.0 Era
Software Engineering 3.0 Era
Feb 18, 2025 · Artificial Intelligence

DeepSeek R1’s Disruptive Breakthrough: Native Sparse Attention Redefines Long‑Context Modeling

The DeepSeek paper on Native Sparse Attention (NSA) presents a hardware‑aligned, trainable sparse‑attention architecture that slashes O(n²) costs, delivers up to 11.6× speedups and 2.3‑point accuracy gains on long‑context benchmarks, and reduces training expense by 47% while scaling to 64k tokens.

DeepSeekGPU OptimizationLong-context modeling
0 likes · 11 min read
DeepSeek R1’s Disruptive Breakthrough: Native Sparse Attention Redefines Long‑Context Modeling
IT Architects Alliance
IT Architects Alliance
Feb 8, 2025 · Artificial Intelligence

Inside DeepSeek: How Its Innovative Architecture Redefines AI Performance

This article examines DeepSeek's advanced Transformer‑based architecture, dynamic routing, MoE system, multi‑stage training, efficient inference, multimodal capabilities, real‑world applications, technical challenges, and future prospects, providing a comprehensive technical analysis of the model's strengths and limitations.

AI ArchitectureDeepSeekLarge Language Model
0 likes · 15 min read
Inside DeepSeek: How Its Innovative Architecture Redefines AI Performance
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Jul 11, 2022 · Artificial Intelligence

How Structure-Aware Sparse Attention Boosts Long-Code Transformers

The SASA model, a structure‑aware sparse‑attention Transformer developed by Alibaba Cloud PAI and Prof. Gao Ming’s team, improves long‑code sequence processing by sparsifying self‑attention using top‑k frequency and AST pattern matrices, achieving higher performance and lower memory/computation costs on CodeXGLUE benchmarks.

ASTCode UnderstandingLong Sequences
0 likes · 8 min read
How Structure-Aware Sparse Attention Boosts Long-Code Transformers
Baobao Algorithm Notes
Baobao Algorithm Notes
Jan 14, 2022 · Artificial Intelligence

BERT Interview Q&A: Decoding CLS, Masks, Complexity, and More

An in‑depth Q&A breaks down core BERT concepts—from the purpose of the [CLS] token and masking strategies to self‑attention complexity, sparse attention tricks, subword handling of OOV words, warm‑up learning rates, GPT’s unidirectional nature, and ALBERT’s parameter sharing—providing concise explanations for each.

BERTMaskingSelf-Attention
0 likes · 7 min read
BERT Interview Q&A: Decoding CLS, Masks, Complexity, and More