Keye-VL-671B-A37B Leads Vision, Video, and Math Benchmarks
Kwai has open-sourced its new flagship multimodal model, Keye-VL-671B-A37B. The release upgrades visual perception, cross-modal alignment, and complex reasoning, achieving top scores on image, video, and mathematical-reasoning benchmarks. This post summarizes the model's architecture, three-stage pre-training, post-training strategies, and plans for future multimodal agents.
Model Overview
Keye‑VL‑671B‑A37B is a multimodal large language model that integrates visual perception and reasoning. It uses DeepSeek‑V3‑Terminus as the language backbone and a KeyeViT visual encoder initialized from KeyeVL‑1.5, linked via a lightweight MLP projector.
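The wiring described above (visual encoder → lightweight MLP projector → language backbone) can be sketched as follows. This is a minimal illustration, not the released code: the dimensions, the two-layer ReLU projector, and the patch/token counts are all assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; the actual encoder/backbone sizes are not given in this post.
D_VIS, D_LLM = 1024, 7168

# Lightweight two-layer MLP projector mapping vision features into the LLM embedding space.
W1 = rng.normal(scale=0.02, size=(D_VIS, D_LLM))
b1 = np.zeros(D_LLM)
W2 = rng.normal(scale=0.02, size=(D_LLM, D_LLM))
b2 = np.zeros(D_LLM)

def project(vision_feats: np.ndarray) -> np.ndarray:
    """Map (n_patches, D_VIS) visual-encoder features to (n_patches, D_LLM) LLM tokens."""
    h = np.maximum(vision_feats @ W1 + b1, 0.0)  # simple nonlinearity for the sketch
    return h @ W2 + b2

patches = rng.normal(size=(256, D_VIS))   # features from the visual encoder
text_emb = rng.normal(size=(32, D_LLM))   # embedded text tokens
# Projected image tokens and text tokens form one sequence for the language backbone.
sequence = np.concatenate([project(patches), text_emb], axis=0)
print(sequence.shape)  # (288, 7168)
```

The projector is the only new component between the two pre-trained models, which is why Stage 1 of pre-training can train it alone.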
Pre‑training
Stage 1 – Freeze the visual encoder and language model, training only a randomly initialized projector to align visual and textual embeddings.
Stage 2 – Unfreeze all parameters and train on approximately 300 B high‑quality multimodal tokens (including OCR, tables, and charts).
Stage 3 – Perform an annealing stage on higher‑quality data to improve fine‑grained perception.
The pipeline curates a 1 T‑token multimodal dataset with strict filtering to keep computational cost manageable while preserving strong perception capabilities.
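The staged freeze/unfreeze schedule above can be expressed as a simple plan over parameter groups. This is a toy sketch of the training schedule, not Kwai's pipeline; the group names and dictionary shape are invented for illustration.

```python
# Toy parameter groups standing in for the three model components.
model = {
    "vision_encoder": {"trainable": True},
    "projector":      {"trainable": True},
    "language_model": {"trainable": True},
}

# Which components receive gradient updates in each pre-training stage.
STAGE_PLAN = {
    1: {"projector"},                                      # align modalities only
    2: {"vision_encoder", "projector", "language_model"},  # full multimodal training
    3: {"vision_encoder", "projector", "language_model"},  # annealing on higher-quality data
}

def set_stage(stage: int) -> None:
    """Toggle trainability so only the stage's components are updated."""
    for name, group in model.items():
        group["trainable"] = name in STAGE_PLAN[stage]

set_stage(1)
print({k: v["trainable"] for k, v in model.items()})
# {'vision_encoder': False, 'projector': True, 'language_model': False}
```

In a real framework the same idea is implemented by setting `requires_grad` per parameter group; Stage 3 differs from Stage 2 only in the data mixture, not in which parameters train.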
Post‑training (Fine‑tuning)
SFT (Supervised Fine-tuning): Mix instruction data with long chain-of-thought (CoT) examples; increasing the CoT share lowers training loss and improves benchmark performance.
Cold-start: Filter CoT samples to retain those answered correctly 25-75% of the time, removing redundant reasoning and boosting logical ability.
Reinforcement Learning : Apply sequence‑level GSPO (Group Sequence Policy Optimization) with a verifier model (Keye‑VL‑1.5 8B) that judges logical consistency and answer correctness. The verifier outperforms Qwen‑2.5‑VL 72B in detection accuracy.
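The cold-start filtering step can be sketched as a one-line selection over per-sample accuracy. The sample format and field names here are hypothetical; the 25-75% band is the figure reported above.

```python
def cold_start_filter(samples, lo=0.25, hi=0.75):
    """Keep CoT samples whose sampled-answer accuracy is neither trivial nor hopeless."""
    return [s for s in samples if lo <= s["accuracy"] <= hi]

samples = [
    {"id": "a", "accuracy": 1.0},   # too easy: the model already solves it
    {"id": "b", "accuracy": 0.5},   # informative: partially solvable
    {"id": "c", "accuracy": 0.0},   # too hard: likely noisy or unlearnable
]
kept = cold_start_filter(samples)
print([s["id"] for s in kept])  # ['b']
```

The intuition: prompts the model always solves teach nothing new, while prompts it never solves often carry label noise; the middle band gives the strongest learning signal.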
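GSPO's defining move is computing the importance ratio at the sequence level (the length-normalized geometric mean of per-token ratios) rather than per token. The sketch below shows that ratio and a clipped surrogate with group-normalized rewards; it is a simplified illustration under the published GSPO formulation, with invented field names, and it omits the verifier-model reward entirely.

```python
import math

def gspo_weight(logp_new, logp_old):
    """Sequence-level importance ratio: geometric mean of per-token probability ratios."""
    n = len(logp_new)
    return math.exp(sum(a - b for a, b in zip(logp_new, logp_old)) / n)

def gspo_objective(group, eps=0.2):
    """Clipped surrogate over a group of responses with group-normalized advantages."""
    rewards = [r["reward"] for r in group]
    mu = sum(rewards) / len(rewards)
    sd = (sum((x - mu) ** 2 for x in rewards) / len(rewards)) ** 0.5 or 1.0
    total = 0.0
    for r in group:
        adv = (r["reward"] - mu) / sd                  # group-normalized advantage
        s = gspo_weight(r["logp_new"], r["logp_old"])  # one ratio per sequence
        s_clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        total += min(s * adv, s_clipped * adv)         # PPO-style pessimistic clip
    return total / len(group)
```

Because the ratio is a single scalar per sequence, clipping either keeps or discards a whole response, which is what distinguishes GSPO from token-level schemes like GRPO.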
Evaluation
Keye‑VL‑671B‑A37B achieves top scores on major multimodal benchmarks:
General Vision Understanding : MMBench, MMMU, MMStar, RealWorldQA, etc.
Mathematical & Logical Reasoning : MathVista, VisuLogic, OlympiadBench.
Video Understanding : MMVU, LongVideoBench, VideoMME.
Across STEM, OCR, and pure‑text tasks the model consistently outperforms other open‑source multimodal systems.
Resources
GitHub repository: https://github.com/Kwai-Keye/Keye
HuggingFace model hub: https://huggingface.co/Kwai-Keye/Keye-VL-671B-A37B
This article has been distilled and summarized from source material and republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
