Keye-VL-1.5-8B: The New Multimodal LLM That Beats GPT-4o on Vision Benchmarks
Kwai's newly released Keye-VL-1.5-8B multimodal large language model markedly improves visual understanding, reasoning, and temporal understanding, achieving top scores on public video benchmarks and surpassing closed-source models such as GPT-4o, with an open-source release and detailed technical documentation.
Keye-VL-1.5-8B, an 8-billion-parameter multimodal large language model released by Kwai, significantly improves visual understanding, reasoning, and temporal information processing, and on several public benchmarks it surpasses closed-source models such as GPT-4o despite its comparatively small size.
Key Innovations
Slow-Fast encoding strategy: automatically separates slow frames (significant visual change) from fast frames (near-static), allocating 30% of the token budget to fast frames and using special tokens and timestamps to balance performance and cost (see the sketch after this list).
Four‑stage progressive pre‑training: from cross‑modal alignment and multi‑task pre‑training to an annealing phase that expands context length from 8K to 128K, enabling long‑video and complex visual content handling.
Fully optimized training pipeline: a five-step automated data-construction process plus GSPO-based iterative reinforcement learning and alignment, improving reasoning ability and alignment with human preferences.
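The exact frame-selection heuristics live in the technical report; as a rough illustration of the Slow-Fast idea, here is a minimal Python sketch. The cosine-similarity detector, the 0.9 threshold, the 4,096-token budget, and the helper names `split_slow_fast` / `allocate_tokens` are all illustrative assumptions; only the 30% fast-frame share comes from the description above.

```python
import numpy as np

def split_slow_fast(frames, sim_threshold=0.9):
    """Label each frame 'slow' (significant visual change) or 'fast'
    (near-duplicate of its predecessor). Toy detector: cosine similarity
    between flattened frames; the real model's heuristic differs."""
    labels = ["slow"]  # the first frame always starts a slow segment
    for prev, cur in zip(frames, frames[1:]):
        a = prev.ravel().astype(np.float32)
        b = cur.ravel().astype(np.float32)
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        labels.append("fast" if sim >= sim_threshold else "slow")
    return labels

def allocate_tokens(labels, total_budget=4096, fast_share=0.30):
    """Split a per-video token budget: ~30% to fast frames (the share
    stated above), the rest to slow frames, spread evenly per frame."""
    n_fast = max(labels.count("fast"), 1)
    n_slow = max(labels.count("slow"), 1)
    fast_per = int(total_budget * fast_share / n_fast)
    slow_per = int(total_budget * (1 - fast_share) / n_slow)
    return [fast_per if lab == "fast" else slow_per for lab in labels]

# A static stretch followed by changing frames: the static frames are
# labeled "fast" and receive the cheap share of the budget.
static = np.random.rand(32, 32, 3)
frames = [static] * 4 + [np.random.rand(32, 32, 3) for _ in range(4)]
labels = split_slow_fast(frames)
print(list(zip(labels, allocate_tokens(labels))))
```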
Architecture
The model follows a classic multimodal LLM architecture: a Vision Transformer (ViT) encoder (SigLIP-400M-384-14), an MLP projector, and a Qwen3-8B language decoder. Images can be encoded with up to 20,480 tokens to preserve fine detail.
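To make the composition concrete, here is a structural sketch of that three-part pipeline. All dimensions, layer counts, and the vocabulary size are placeholders, not the released SigLIP-400M or Qwen3-8B configurations:

```python
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    """Structural sketch of the ViT -> MLP projector -> LLM decoder
    pipeline described above; all sizes are illustrative only."""
    def __init__(self, vit_dim=1152, llm_dim=4096, vocab_size=151936):
        super().__init__()
        # Stand-in for the SigLIP-400M-384-14 image encoder.
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=vit_dim, nhead=8, batch_first=True),
            num_layers=2)
        # MLP projector mapping vision tokens into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim))
        # Stand-in for the Qwen3-8B decoder (reduced to a linear head).
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, patch_embeds):               # (B, N_patches, vit_dim)
        vis = self.vision_encoder(patch_embeds)    # contextualized patches
        tokens = self.projector(vis)               # soft tokens in LLM space
        return self.lm_head(tokens)                # next-token logits

model = ToyMultimodalLM()
print(model(torch.randn(1, 64, 1152)).shape)  # torch.Size([1, 64, 151936])
```

The projector is the usual glue in this model family: the vision encoder and language decoder are pretrained separately, and the small MLP learns to translate between their embedding spaces during the cross-modal alignment stage.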
Benchmark Performance
On public benchmarks such as MMMU-val, AI2D, and Video-MMMU, Keye-VL-1.5-8B achieves top scores among models of the same scale, including industry-best results on MMMU-val (71.4%) and OpenCompass (79.5%). It also reaches 62.7% accuracy on HallusionBench, indicating reduced hallucination.
Internal evaluations report a comprehensive score of 3.53, a 0.51-point improvement over the preview version, with notable gains in correctness (+0.57) and completeness (+0.25). Compared with MiMo-VL-7B-RL-2508, it scores higher overall (3.53 vs. 3.40) and leads in reasoning, temporal understanding, and robustness.
Case Studies
Examples demonstrate precise temporal segment detection, reasoning about animal behavior, and detailed scene description, highlighting the model’s strong temporal, inferential, and descriptive capabilities.
Resources
Project homepage: https://kwai-keye.github.io/
Technical report (arXiv): https://arxiv.org/pdf/2509.01563
GitHub repository: https://github.com/Kwai-Keye/Keye
Model on Hugging Face: https://huggingface.co/Kwai-Keye/Keye-VL-1.5-8B
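For readers who want to try the checkpoint, a minimal loading sketch follows. It assumes the Hugging Face repo follows the common trust_remote_code pattern used by most open multimodal releases; the model card is the authoritative source for the supported recipe.

```python
from transformers import AutoModel, AutoProcessor

model_id = "Kwai-Keye/Keye-VL-1.5-8B"

# Assumption: the repo ships custom modeling/processing code loadable
# via trust_remote_code. Check the model card for the exact snippet.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",  # requires the `accelerate` package
)
```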
