Keye-VL-1.5-8B: The New Multimodal LLM That Beats GPT-4o on Vision Benchmarks
Kwai's newly released Keye-VL-1.5-8B multimodal large language model markedly improves visual understanding, reasoning, and temporal understanding, achieving top scores on public video benchmarks and surpassing closed-source models such as GPT-4o, with an open-source release and detailed technical documentation.
Keye-VL-1.5-8B, an 8-billion-parameter multimodal large language model released by Kwai, significantly improves visual understanding, reasoning, and temporal information processing, and on several public benchmarks it surpasses closed-source models such as GPT-4o despite its comparatively small size.
Key Innovations
Slow-Fast encoding strategy: automatically separates slow frames (significant visual change) from fast frames (near-static), allocating 30% of the token budget to fast frames and using special tokens and timestamps to balance performance and cost (see the sketch after this list).
Four‑stage progressive pre‑training: from cross‑modal alignment and multi‑task pre‑training to an annealing phase that expands context length from 8K to 128K, enabling long‑video and complex visual content handling.
Fully optimized training pipeline: a five-step automated data-construction process plus GSPO-based iterative reinforcement learning and alignment, improving reasoning ability and alignment with human preferences.
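The exact frame-selection heuristics live in the technical report; as a rough illustration of the Slow-Fast idea, here is a minimal Python sketch. The cosine-similarity detector, the 0.9 threshold, the 4,096-token budget, and the helper names `split_slow_fast` / `allocate_tokens` are all illustrative assumptions; only the 30% fast-frame share comes from the description above.

```python
import numpy as np

def split_slow_fast(frames, sim_threshold=0.9):
    """Label each frame 'slow' (significant visual change) or 'fast'
    (near-duplicate of its predecessor). Toy detector: cosine similarity
    between flattened frames; the real model's heuristic differs."""
    labels = ["slow"]  # the first frame always starts a slow segment
    for prev, cur in zip(frames, frames[1:]):
        a = prev.ravel().astype(np.float32)
        b = cur.ravel().astype(np.float32)
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        labels.append("fast" if sim >= sim_threshold else "slow")
    return labels

def allocate_tokens(labels, total_budget=4096, fast_share=0.30):
    """Split a per-video token budget: ~30% to fast frames (the share
    stated above), the rest to slow frames, spread evenly per frame."""
    n_fast = max(labels.count("fast"), 1)
    n_slow = max(labels.count("slow"), 1)
    fast_per = int(total_budget * fast_share / n_fast)
    slow_per = int(total_budget * (1 - fast_share) / n_slow)
    return [fast_per if lab == "fast" else slow_per for lab in labels]

# A static stretch followed by changing frames: the static frames are
# labeled "fast" and receive the cheap share of the budget.
static = np.random.rand(32, 32, 3)
frames = [static] * 4 + [np.random.rand(32, 32, 3) for _ in range(4)]
labels = split_slow_fast(frames)
print(list(zip(labels, allocate_tokens(labels))))
```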
Architecture
The model follows a classic multimodal LLM architecture: a Vision Transformer (ViT) encoder (SigLIP-400M-384-14), an MLP projector, and a Qwen3-8B language decoder. Images can be encoded with up to 20,480 tokens to preserve fine detail.
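To make the composition concrete, here is a structural sketch of that three-part pipeline. All dimensions, layer counts, and the vocabulary size are placeholders, not the released SigLIP-400M or Qwen3-8B configurations:

```python
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    """Structural sketch of the ViT -> MLP projector -> LLM decoder
    pipeline described above; all sizes are illustrative only."""
    def __init__(self, vit_dim=1152, llm_dim=4096, vocab_size=151936):
        super().__init__()
        # Stand-in for the SigLIP-400M-384-14 image encoder.
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=vit_dim, nhead=8, batch_first=True),
            num_layers=2)
        # MLP projector mapping vision tokens into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim))
        # Stand-in for the Qwen3-8B decoder (reduced to a linear head).
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, patch_embeds):               # (B, N_patches, vit_dim)
        vis = self.vision_encoder(patch_embeds)    # contextualized patches
        tokens = self.projector(vis)               # soft tokens in LLM space
        return self.lm_head(tokens)                # next-token logits

model = ToyMultimodalLM()
print(model(torch.randn(1, 64, 1152)).shape)  # torch.Size([1, 64, 151936])
```

The projector is the usual glue in this model family: the vision encoder and language decoder are pretrained separately, and the small MLP learns to translate between their embedding spaces during the cross-modal alignment stage.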
Benchmark Performance
On public benchmarks such as MMMU-val, AI2D, and Video-MMMU, Keye-VL-1.5-8B achieves top scores among models of the same scale, including industry-best results on MMMU-val (71.4%) and OpenCompass (79.5%). It also reaches 62.7% accuracy on HallusionBench, indicating reduced hallucination.
Internal evaluations report a comprehensive score of 3.53, a 0.51-point improvement over the preview version, with notable gains in correctness (+0.57) and completeness (+0.25). Compared with MiMo-VL-7B-RL-2508, it scores higher overall (3.53 vs. 3.40) and leads in reasoning, temporal understanding, and robustness.
Case Studies
Examples demonstrate precise temporal segment detection, reasoning about animal behavior, and detailed scene description, highlighting the model’s strong temporal, inferential, and descriptive capabilities.
Resources
Project homepage: https://kwai-keye.github.io/
Technical report (arXiv): https://arxiv.org/pdf/2509.01563
GitHub repository: https://github.com/Kwai-Keye/Keye
Model on Hugging Face: https://huggingface.co/Kwai-Keye/Keye-VL-1.5-8B
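For readers who want to try the checkpoint, a minimal loading sketch follows. It assumes the Hugging Face repo follows the common trust_remote_code pattern used by most open multimodal releases; the model card is the authoritative source for the supported recipe.

```python
from transformers import AutoModel, AutoProcessor

model_id = "Kwai-Keye/Keye-VL-1.5-8B"

# Assumption: the repo ships custom modeling/processing code loadable
# via trust_remote_code. Check the model card for the exact snippet.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",  # requires the `accelerate` package
)
```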
