Keye-VL-671B-A37B Leads Vision, Video, and Math Benchmarks

Kwai has open‑sourced its new flagship multimodal model, Keye‑VL‑671B‑A37B, which upgrades visual perception, cross‑modal alignment, and complex reasoning, achieving top scores on image, video, and mathematical‑reasoning benchmarks. This article covers its architecture, three‑stage pre‑training, post‑training strategy, and plans for future multimodal agents.

Kuaishou Tech

Model Overview

Keye‑VL‑671B‑A37B is a multimodal large language model that integrates visual perception and reasoning. It uses DeepSeek‑V3‑Terminus as the language backbone and a KeyeViT visual encoder initialized from Keye‑VL‑1.5, linked via a lightweight MLP projector.
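The projector's role can be sketched in a few lines of PyTorch: it maps patch embeddings from the visual encoder into the LLM's token space so they can be interleaved with text tokens. The layer sizes and two‑layer structure below are illustrative assumptions, not the published Keye‑VL‑671B‑A37B configuration.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps visual-encoder patch embeddings into the LLM token space.

    vision_dim / llm_dim are assumed values for illustration only.
    """
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 7168):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_embeds)

projector = MLPProjector()
visual_tokens = projector(torch.randn(1, 256, 1152))
print(visual_tokens.shape)  # torch.Size([1, 256, 7168])
```

The projected sequence is then concatenated with text embeddings before being fed to the language backbone; because the projector is small, it is cheap to train from scratch in the first pre‑training stage.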

Pre‑training

Stage 1 – Freeze the visual encoder and language model; train only a randomly initialized projector to align visual and textual embeddings.

Stage 2 – Unfreeze all parameters and train on approximately 300 B high‑quality multimodal tokens (including OCR, tables, and charts).

Stage 3 – Perform an annealing stage on higher‑quality data to improve fine‑grained perception.

The pipeline curates a 1 T‑token multimodal dataset with strict filtering to keep computational cost manageable while preserving strong perception capabilities.
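The staged recipe above amounts to toggling which parameter groups receive gradients at each stage. The sketch below is a schematic of that idea, assuming three generic `nn.Module` components; it is not Kwai's training code.

```python
import torch.nn as nn

def set_stage(vision_encoder: nn.Module, projector: nn.Module,
              llm: nn.Module, stage: int) -> None:
    """Toggle requires_grad per pre-training stage.

    Stage 1: only the projector trains (encoder and LLM frozen).
    Stages 2-3: all parameters train.
    Schematic of the staged recipe described above, not Kwai's code.
    """
    train_all = stage >= 2
    for p in vision_encoder.parameters():
        p.requires_grad = train_all
    for p in llm.parameters():
        p.requires_grad = train_all
    for p in projector.parameters():
        p.requires_grad = True  # the projector trains in every stage

# Toy modules standing in for the real components (assumed shapes)
enc, proj, llm = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)
set_stage(enc, proj, llm, stage=1)
print(any(p.requires_grad for p in enc.parameters()))  # False
```

Stage 3 reuses the same all‑parameters‑trainable setup as stage 2; what changes there is the data mix (higher‑quality annealing data), not the freezing pattern.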

Post‑training (Fine‑tuning)

SFT (Supervised Fine‑tuning): Mix instruction data with long chain‑of‑thought (CoT) examples; increasing the share of CoT data reduces loss and improves benchmark performance.

Cold‑start: Filter CoT samples to retain those with a 25–75 % correctness rate, removing redundant reasoning and boosting logical ability.
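The correctness‑band filter can be expressed as a few lines of Python: keep only samples whose pass rate over repeated rollouts falls in the 25–75 % band, so the cold‑start set contains neither trivially easy nor hopelessly hard problems. The data layout (`attempts` as a list of booleans per sample) is a hypothetical representation for illustration.

```python
def filter_cold_start(samples, k_min=0.25, k_max=0.75):
    """Keep CoT samples whose pass rate over repeated rollouts lies in
    [k_min, k_max] -- neither trivially easy nor hopelessly hard.

    `samples` is a hypothetical list of dicts; each has an `attempts`
    field of booleans, one per sampled rollout. Field names are
    assumptions, not Kwai's schema.
    """
    kept = []
    for s in samples:
        attempts = s["attempts"]
        rate = sum(attempts) / len(attempts)
        if k_min <= rate <= k_max:
            kept.append(s)
    return kept

data = [
    {"id": "easy", "attempts": [True] * 8},                 # 100% -> dropped
    {"id": "mid",  "attempts": [True] * 4 + [False] * 4},   # 50%  -> kept
    {"id": "hard", "attempts": [False] * 8},                # 0%   -> dropped
]
print([s["id"] for s in filter_cold_start(data)])  # ['mid']
```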

Reinforcement Learning: Apply sequence‑level GSPO (Group Sequence Policy Optimization) with a verifier model (Keye‑VL‑1.5 8B) that judges logical consistency and answer correctness. The verifier outperforms Qwen‑2.5‑VL 72B in detection accuracy.
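GSPO's distinguishing feature is that the importance ratio and clipping operate on whole sequences rather than individual tokens: each rollout gets a length‑normalized sequence ratio, and advantages are normalized within the group. The NumPy sketch below illustrates that objective under assumed inputs (per‑token log‑probs and scalar rewards); the clip range and normalization details are generic GSPO conventions, not Keye‑VL's exact hyperparameters.

```python
import numpy as np

def gspo_objective(logp_new, logp_old, rewards, eps=0.04):
    """Sequence-level GSPO objective for one group of rollouts.

    logp_new / logp_old: lists of per-token log-prob arrays under the
    current and behavior policies. eps is an assumed clip range.
    """
    # Group-normalized advantages (GRPO-style)
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)

    obj = 0.0
    for lp_new, lp_old, a in zip(logp_new, logp_old, adv):
        # Length-normalized sequence importance ratio:
        # s = (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|)
        s = np.exp(np.mean(lp_new - lp_old))
        # PPO-style clipping, applied once per sequence, not per token
        obj += min(s * a, np.clip(s, 1 - eps, 1 + eps) * a)
    return obj / len(r)

lp_new = [np.array([-1.2, -0.8]), np.array([-0.5, -0.7, -0.6])]
lp_old = [np.array([-1.0, -1.0]), np.array([-0.6, -0.6, -0.6])]
print(gspo_objective(lp_new, lp_old, rewards=[1.0, 0.0]))
```

In a real pipeline the rewards would come from the verifier model's judgments of logical consistency and answer correctness, and the objective would be maximized by gradient ascent on the policy.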

Evaluation

Keye‑VL‑671B‑A37B achieves top scores on major multimodal benchmarks:

General Vision Understanding: MMBench, MMMU, MMStar, RealWorldQA, etc.

Mathematical & Logical Reasoning: MathVista, VisuLogic, OlympiadBench.

Video Understanding: MMVU, LongVideoBench, VideoMME.

Across STEM, OCR, and pure‑text tasks the model consistently outperforms other open‑source multimodal systems.

Resources

GitHub repository: https://github.com/Kwai-Keye/Keye

HuggingFace model hub: https://huggingface.co/Kwai-Keye/Keye-VL-671B-A37B
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: deep learning, Open Source, large language model, multimodal, video understanding, Vision-Language
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
