
Keye-VL-2.0 Brings DeepSeek Sparse Attention to Multimodal AI – Report Released

Keye‑VL‑2.0 is an open‑source MoE multimodal foundation model aimed at hour‑level video understanding and agentic intelligence. It embeds DeepSeek Sparse Attention into a GQA‑based architecture for near‑lossless 256 K token context, and pairs a four‑stage pre‑training curriculum with RL and distillation in post‑training. The report claims state‑of‑the‑art results on long‑video benchmarks among models of comparable scale, and the weights are publicly released.

Kuaishou Tech

Jun 11, 2026

Overview

Keye‑VL‑2.0 is an open‑source Mixture‑of‑Experts (MoE) multimodal foundation model designed to advance hour‑level video understanding and agentic intelligence. The technical report introduces DeepSeek Sparse Attention (DSA) into a GQA‑based multimodal architecture, achieving near‑lossless 256 K token context processing while capturing key frames and long‑range temporal dependencies.

Challenges of Long‑Video Multimodal Modeling

Moving from short‑video to hour‑level video brings two core challenges:

Context length and compute cost: Longer video inputs increase the visual token count and KV‑cache memory, so dense attention hits compute and memory bottlenecks. Sampling fewer frames lowers cost but breaks temporal continuity, while denser sampling preserves continuity but raises cost.

Multi‑task interference: Joint training on video understanding, tool use, search, coding, and complex reasoning can cause capability conflicts and degrade basic reasoning ability.
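The first challenge can be made concrete with a rough token budget. The sketch below is illustrative only: the 64 tokens per frame and the sampling rates are assumptions chosen for the example, not figures from the report, and the helper names are hypothetical.

```python
def visual_tokens(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Total visual tokens for a clip sampled at a fixed frame rate."""
    return int(duration_s * fps) * tokens_per_frame


def dense_attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by causal dense attention (about L^2 / 2)."""
    return seq_len * (seq_len + 1) // 2


# Assumed value for illustration; the article does not state tokens per frame.
TOKENS_PER_FRAME = 64

for label, seconds in [("15 s", 15), ("15 min", 15 * 60), ("2 h", 2 * 3600)]:
    for fps in (0.25, 1.0):
        n = visual_tokens(seconds, fps, TOKENS_PER_FRAME)
        pairs = dense_attention_pairs(n)
        print(f"{label:>6} @ {fps} fps: {n:>7} tokens, {pairs:.3e} dense pairs")
```

Under these assumptions, quartering the frame rate cuts the token count by four and the dense attention work by roughly sixteen, which is the cost side of the sampling trade‑off described above.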

Technical Solutions

Keye‑VL‑2.0 addresses these challenges through two complementary tracks:

Multimodal DSA: Integrates DeepSeek Sparse Attention into the decoder attention pathway of a GQA‑based backbone. An MQA‑style Lightning Indexer scores and selects the critical tokens, and a GQA Sparse Aggregation step attends only to the selected tokens. With k=2048 selected tokens, attention complexity drops from O(L²) to O(L·k), where L is the sequence length.

Cross‑modal Multi‑Teacher On‑Policy Distillation (MOPD): Employs multiple specialist teacher models to provide token‑level feedback for different modalities and tasks, then distills the abilities back into a unified MoE backbone.
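The select‑then‑aggregate idea behind the Multimodal DSA track can be sketched in a few lines. This is a minimal single‑head toy in pure Python, not the report's implementation: the function names, the dot‑product indexer score, and the toy sizes are all assumptions, and the GQA and MQA head sharing is omitted for brevity.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_attend(query, keys, values, positions):
    """Softmax attention restricted to the given cache positions."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in positions]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [
        sum(w / total * values[i][d] for w, i in zip(weights, positions))
        for d in range(len(values[0]))
    ]


def sparse_attention_step(query, index_query, keys, index_keys, values, k):
    """Score every cached token cheaply, then attend to the top-k only."""
    # 1) Indexer: one low-dimensional score per cached token.
    scores = [dot(index_query, ik) for ik in index_keys]
    # 2) Keep the k best-scoring positions, restored to temporal order.
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    top = sorted(ranked[:k])
    # 3) Main attention touches only those k positions.
    return softmax_attend(query, keys, values, top), top


def main_attention_speedup(seq_len, k):
    """Ratio of dense L*L score count to sparse L*k score count."""
    return (seq_len * seq_len) / (seq_len * min(k, seq_len))


rng = random.Random(0)
L, d, d_index = 64, 8, 4  # toy sizes, chosen only for illustration


def rand_vec(n):
    return [rng.gauss(0, 1) for _ in range(n)]


keys = [rand_vec(d) for _ in range(L)]
values = [rand_vec(d) for _ in range(L)]
index_keys = [rand_vec(d_index) for _ in range(L)]
query, index_query = rand_vec(d), rand_vec(d_index)

out, selected = sparse_attention_step(query, index_query, keys, index_keys, values, k=8)
print(len(selected), "of", L, "cached positions attended")
print("main-attention score reduction at 256 K, k=2048:",
      main_attention_speedup(256 * 1024, 2048))
```

In this sketch the indexer still scores every cached token, so the saving applies to the main attention computation; the indexer is meant to be much cheaper per token than full attention. Setting k to the full cache length reproduces dense attention exactly.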

Four‑Stage Pre‑training Curriculum

The model is trained in four stages, progressively expanding context length and task coverage:

Stage 0: Freeze ViT and LLM, train only the projector to establish initial visual‑language alignment.

Stage 1: General multimodal pre‑training at 32 K context, building vision‑language alignment, image perception, video understanding, and OCR capabilities.

Stage 2: Extend context to 64 K and inject abilities for OCR, VQA, STEM, GUI grounding, counting, coding, tool use, and search. Video tokens grow from 24 K to 64 K.

Stage 3: Further extend to 256 K context, targeting long video, long documents, multi‑document input, multi‑image dialogue, long code context, and extended agent trajectories. Video tokens reach up to 180 K, covering 15 s, 15 min, and 2 h videos.
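The curriculum above can be written down as a small schedule table. The sketch below only restates the figures given in the stage descriptions; the `Stage` class, field names, and the helper are hypothetical, "K" is assumed to mean 1024 tokens, and values the article does not state are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

K = 1024  # assumption: "K" in the article means 1024 tokens


@dataclass(frozen=True)
class Stage:
    name: str
    context_tokens: Optional[int]    # None where the article gives no figure
    max_video_tokens: Optional[int]  # None where the article gives no figure
    focus: str


CURRICULUM = (
    Stage("stage0", None, None, "projector-only visual-language alignment"),
    Stage("stage1", 32 * K, None, "general multimodal pre-training"),
    Stage("stage2", 64 * K, 64 * K, "OCR, VQA, STEM, GUI grounding, coding, tool use, search"),
    Stage("stage3", 256 * K, 180 * K, "long video, long documents, agent trajectories"),
)


def first_stage_fitting(num_tokens: int) -> Optional[Stage]:
    """Earliest stage whose stated context window can hold the sample."""
    for stage in CURRICULUM:
        if stage.context_tokens is not None and num_tokens <= stage.context_tokens:
            return stage
    return None


for stage in CURRICULUM:
    print(stage.name, stage.context_tokens, stage.max_video_tokens, stage.focus)

sample = first_stage_fitting(100 * K)
print("a 100 K-token sample first fits in:", sample.name if sample else "no stage")
```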

Post‑training Enhancements

After multimodal pre‑training, Keye‑VL‑2.0 applies several post‑training methods to convert perception and alignment into stable instruction following, reasoning, and agentic capabilities:

Synthetic CoT: Constructs explicit reasoning chains for STEM, counting, video reasoning, and complex visual‑language QA.

General RL: Improves reasoning, answer reliability, and robustness across multimodal and pure‑text scenarios.

Video RL: Refines temporal alignment, event tracking, and long‑range information aggregation.

Agentic RL: Covers code, tool‑use, and search tasks requiring multi‑step environment interaction.

Cross‑Modal MOPD: Uses multiple teacher models to provide token‑level feedback for different modalities and tasks, distilling the knowledge back into the MoE backbone.
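The token‑level feedback in multi‑teacher on‑policy distillation can be illustrated with a toy example. This is a sketch under stated assumptions, not the report's recipe: it assumes the common on‑policy formulation in which the student samples tokens and each token is scored by log p_teacher minus log p_student, and the two teachers, their task names, and the fixed three‑token distributions are hypothetical.

```python
import math
import random

# Hypothetical specialist teachers, one next-token distribution per task.
TEACHERS = {
    "video": {"a": 0.7, "b": 0.2, "c": 0.1},
    "code": {"a": 0.1, "b": 0.2, "c": 0.7},
}


def sample(dist, rng):
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def reverse_kl(student, teacher):
    """KL(student || teacher), the quantity on-policy distillation reduces."""
    return sum(p * (math.log(p) - math.log(teacher[t])) for t, p in student.items() if p > 0)


def mopd_feedback(task, student, num_tokens, rng):
    """Sample tokens from the student, then score each with the task's teacher."""
    teacher = TEACHERS[task]  # route the sample to the matching specialist
    sampled = [sample(student, rng) for _ in range(num_tokens)]
    # Per-token signal: positive where the teacher likes the token more than the student.
    rewards = [math.log(teacher[x]) - math.log(student[x]) for x in sampled]
    return sampled, rewards


rng = random.Random(0)
student = {"a": 0.4, "b": 0.3, "c": 0.3}

for task in TEACHERS:
    tokens, rewards = mopd_feedback(task, student, 5, rng)
    print(task, tokens, [round(r, 3) for r in rewards])
    print("  reverse KL to teacher:", round(reverse_kl(student, TEACHERS[task]), 4))
```

Averaged over student samples, the per‑token signal estimates the negative reverse KL to the routed teacher, so a student that already matches a teacher receives zero feedback on that task.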

Evaluation and Open Release

The report presents extensive evaluations covering video understanding, temporal localization, reasoning, STEM, agent benchmarks, and general vision‑language abilities. Keye‑VL‑2.0‑30B‑A3B achieves state‑of‑the‑art performance among models of comparable scale, especially on fine‑grained TimeLens temporal localization and long‑video benchmarks such as Video‑MME‑v2 and LongVideoBench.

All model weights and the technical report are publicly released so the community can read, reproduce, and test the work and provide feedback.
