Keye-VL-671B-A37B Leads Vision, Video, and Math Benchmarks

Kwai has open‑sourced its new flagship multimodal model, Keye‑VL‑671B‑A37B, which upgrades visual perception, cross‑modal alignment, and complex reasoning, achieving top scores on image, video, and mathematical‑reasoning benchmarks. This article covers its architecture, three‑stage pre‑training, post‑training strategy, and plans for future multimodal agents.

Kuaishou Tech

Model Overview

Keye‑VL‑671B‑A37B is a multimodal large language model that integrates visual perception and reasoning. It uses DeepSeek‑V3‑Terminus as the language backbone and a KeyeViT visual encoder initialized from Keye‑VL‑1.5, linked via a lightweight MLP projector.
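The projector's role can be sketched in a few lines of PyTorch: it maps patch embeddings from the visual encoder into the LLM's token space so they can be interleaved with text tokens. The layer sizes and two‑layer structure below are illustrative assumptions, not the published Keye‑VL‑671B‑A37B configuration.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps visual-encoder patch embeddings into the LLM token space.

    vision_dim / llm_dim are assumed values for illustration only.
    """
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 7168):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_embeds)

projector = MLPProjector()
visual_tokens = projector(torch.randn(1, 256, 1152))
print(visual_tokens.shape)  # torch.Size([1, 256, 7168])
```

The projected sequence is then concatenated with text embeddings before being fed to the language backbone; because the projector is small, it is cheap to train from scratch in the first pre‑training stage.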

Pre‑training

Stage 1 – Freeze the visual encoder and language model; train only a randomly initialized projector to align visual and textual embeddings.

Stage 2 – Unfreeze all parameters and train on approximately 300 B high‑quality multimodal tokens (including OCR, tables, and charts).

Stage 3 – Perform an annealing stage on higher‑quality data to improve fine‑grained perception.

The pipeline curates a 1 T‑token multimodal dataset with strict filtering to keep computational cost manageable while preserving strong perception capabilities.
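The staged recipe above amounts to toggling which parameter groups receive gradients at each stage. The sketch below is a schematic of that idea, assuming three generic `nn.Module` components; it is not Kwai's training code.

```python
import torch.nn as nn

def set_stage(vision_encoder: nn.Module, projector: nn.Module,
              llm: nn.Module, stage: int) -> None:
    """Toggle requires_grad per pre-training stage.

    Stage 1: only the projector trains (encoder and LLM frozen).
    Stages 2-3: all parameters train.
    Schematic of the staged recipe described above, not Kwai's code.
    """
    train_all = stage >= 2
    for p in vision_encoder.parameters():
        p.requires_grad = train_all
    for p in llm.parameters():
        p.requires_grad = train_all
    for p in projector.parameters():
        p.requires_grad = True  # the projector trains in every stage

# Toy modules standing in for the real components (assumed shapes)
enc, proj, llm = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)
set_stage(enc, proj, llm, stage=1)
print(any(p.requires_grad for p in enc.parameters()))  # False
```

Stage 3 reuses the same all‑parameters‑trainable setup as stage 2; what changes there is the data mix (higher‑quality annealing data), not the freezing pattern.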

Post‑training (Fine‑tuning)

SFT (Supervised Fine‑tuning): Mix instruction data with long chain‑of‑thought (CoT) examples; increasing the share of CoT data reduces loss and improves benchmark performance.

Cold‑start: Filter CoT samples to retain those with a 25–75 % correctness rate, removing redundant reasoning and boosting logical ability.
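The correctness‑band filter can be expressed as a few lines of Python: keep only samples whose pass rate over repeated rollouts falls in the 25–75 % band, so the cold‑start set contains neither trivially easy nor hopelessly hard problems. The data layout (`attempts` as a list of booleans per sample) is a hypothetical representation for illustration.

```python
def filter_cold_start(samples, k_min=0.25, k_max=0.75):
    """Keep CoT samples whose pass rate over repeated rollouts lies in
    [k_min, k_max] -- neither trivially easy nor hopelessly hard.

    `samples` is a hypothetical list of dicts; each has an `attempts`
    field of booleans, one per sampled rollout. Field names are
    assumptions, not Kwai's schema.
    """
    kept = []
    for s in samples:
        attempts = s["attempts"]
        rate = sum(attempts) / len(attempts)
        if k_min <= rate <= k_max:
            kept.append(s)
    return kept

data = [
    {"id": "easy", "attempts": [True] * 8},                 # 100% -> dropped
    {"id": "mid",  "attempts": [True] * 4 + [False] * 4},   # 50%  -> kept
    {"id": "hard", "attempts": [False] * 8},                # 0%   -> dropped
]
print([s["id"] for s in filter_cold_start(data)])  # ['mid']
```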

Reinforcement Learning: Apply sequence‑level GSPO (Group Sequence Policy Optimization) with a verifier model (Keye‑VL‑1.5 8B) that judges logical consistency and answer correctness. The verifier outperforms Qwen‑2.5‑VL 72B in detection accuracy.
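GSPO's distinguishing feature is that the importance ratio and clipping operate on whole sequences rather than individual tokens: each rollout gets a length‑normalized sequence ratio, and advantages are normalized within the group. The NumPy sketch below illustrates that objective under assumed inputs (per‑token log‑probs and scalar rewards); the clip range and normalization details are generic GSPO conventions, not Keye‑VL's exact hyperparameters.

```python
import numpy as np

def gspo_objective(logp_new, logp_old, rewards, eps=0.04):
    """Sequence-level GSPO objective for one group of rollouts.

    logp_new / logp_old: lists of per-token log-prob arrays under the
    current and behavior policies. eps is an assumed clip range.
    """
    # Group-normalized advantages (GRPO-style)
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)

    obj = 0.0
    for lp_new, lp_old, a in zip(logp_new, logp_old, adv):
        # Length-normalized sequence importance ratio:
        # s = (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|)
        s = np.exp(np.mean(lp_new - lp_old))
        # PPO-style clipping, applied once per sequence, not per token
        obj += min(s * a, np.clip(s, 1 - eps, 1 + eps) * a)
    return obj / len(r)

lp_new = [np.array([-1.2, -0.8]), np.array([-0.5, -0.7, -0.6])]
lp_old = [np.array([-1.0, -1.0]), np.array([-0.6, -0.6, -0.6])]
print(gspo_objective(lp_new, lp_old, rewards=[1.0, 0.0]))
```

In a real pipeline the rewards would come from the verifier model's judgments of logical consistency and answer correctness, and the objective would be maximized by gradient ascent on the policy.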

Evaluation

Keye‑VL‑671B‑A37B achieves top scores on major multimodal benchmarks:

General Vision Understanding: MMBench, MMMU, MMStar, RealWorldQA, etc.

Mathematical & Logical Reasoning: MathVista, VisuLogic, OlympiadBench.

Video Understanding: MMVU, LongVideoBench, VideoMME.

Across STEM, OCR, and pure‑text tasks the model consistently outperforms other open‑source multimodal systems.

Resources

GitHub repository: https://github.com/Kwai-Keye/Keye

HuggingFace model hub: https://huggingface.co/Kwai-Keye/Keye-VL-671B-A37B
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: deep learning, Open Source, large language model, multimodal, video understanding, Vision-Language
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
