How Keye‑VL‑1.5‑8B Sets New Benchmarks in Multimodal AI
Short‑video platform Kuaishou (Kwai) has open‑sourced Keye‑VL‑1.5, an 8‑billion‑parameter multimodal LLM that introduces a Slow‑Fast frame‑encoding strategy, a progressive four‑stage pre‑training pipeline, and an automated data‑construction workflow, achieving state‑of‑the‑art results on video and vision‑language benchmarks and surpassing many closed‑source models.
Keye‑VL‑1.5‑8B Overview
Kuaishou (快手) recently released the multimodal large language model Keye‑VL‑1.5‑8B. Compared with previous versions, Keye‑VL‑1.5 shows a significant boost in overall performance, especially in basic visual understanding such as element recognition, reasoning, and temporal information processing, outperforming many closed‑source models, including GPT‑4o.
Key Innovations
Slow‑Fast encoding strategy: An algorithm automatically distinguishes slow frames (significant visual change) from fast frames (largely static content), allocating only 30% of the normal token budget to fast frames. Special tokens and timestamps mark frame boundaries, balancing performance and computational cost (see the sketch after this list).
Progressive four‑stage pre‑training: Starts with cross‑modal alignment and multi‑task pre‑training, then expands the context length from 8K to 128K for longer‑video handling, and finally merges models trained on different data mixes to improve robustness and reduce bias.
Comprehensive training pipeline: A five‑step automated data‑construction workflow supplies training data, and GSPO‑based reinforcement learning is applied iteratively for general capability and human‑preference alignment, dramatically enhancing reasoning ability.
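To make the Slow‑Fast idea concrete, here is a minimal sketch of how frames might be labeled and budgeted. The cosine‑similarity measure, threshold, and budget values are illustrative assumptions, not the report's actual selection algorithm.

```python
import numpy as np

def allocate_frame_tokens(frames, slow_budget=1024, fast_ratio=0.3, sim_threshold=0.95):
    """Label each sampled video frame as 'slow' (visually changing) or
    'fast' (near-static) and assign its vision-token budget accordingly.

    frames: list of np.ndarray images (H, W, C), all at the same resolution.
    Thresholds and budgets here are illustrative, not values from the report.
    """
    budgets = []
    prev = None
    for frame in frames:
        vec = frame.astype(np.float32).ravel()
        vec /= (np.linalg.norm(vec) + 1e-8)
        if prev is None:
            is_slow = True  # the first frame always gets the full budget
        else:
            # Cosine similarity to the previous frame; near-identical frames are 'fast'.
            is_slow = float(vec @ prev) < sim_threshold
        prev = vec
        budgets.append(slow_budget if is_slow else int(slow_budget * fast_ratio))
    return budgets
```

Near‑duplicate frames then consume far fewer vision tokens, which is what keeps long videos within the context budget.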
Benchmark Performance
On several public video benchmarks, Keye‑VL‑1.5‑8B achieves the best performance among models of comparable size, and it posts industry‑leading scores on broad benchmarks such as MMMU (val) and AI2D. The model also excels in video understanding, scoring 66 on Video‑MMMU.
Model Architecture
Keye‑VL‑1.5 follows a classic multimodal LLM architecture composed of three core components: a Vision Transformer (ViT) encoder (SigLIP‑400M‑384‑14), an MLP projector, and a language decoder (Qwen3‑8B). The vision encoder is a native‑resolution ViT with 2D RoPE for high‑resolution image understanding, pre‑trained on 500B tokens of diverse multimodal data. For image inputs, each image can be represented by up to 20,480 tokens to preserve fine detail.
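The three‑component layout can be summarized in a short sketch. Module sizes and the stand‑in encoder/decoder below are placeholders, not the released SigLIP or Qwen3 weights; only the wiring (encode, project, concatenate with text tokens) reflects the architecture described above.

```python
import torch
import torch.nn as nn

class KeyeStyleVLM(nn.Module):
    """Minimal sketch of the ViT -> MLP projector -> LLM decoder layout.
    All dimensions and module choices are illustrative placeholders."""

    def __init__(self, vit_dim=1152, llm_dim=4096, vocab_size=151_936):
        super().__init__()
        self.vision_encoder = nn.TransformerEncoder(   # stand-in for the SigLIP-400M ViT
            nn.TransformerEncoderLayer(vit_dim, nhead=16, batch_first=True),
            num_layers=2,
        )
        self.projector = nn.Sequential(                # MLP projector into the LLM space
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.decoder = nn.TransformerEncoder(          # stand-in for the Qwen3-8B decoder
            nn.TransformerEncoderLayer(llm_dim, nhead=32, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, patch_embeds, text_embeds):
        # Encode image patches, project them into the LLM embedding space,
        # then prepend them to the text sequence as ordinary tokens.
        vis = self.projector(self.vision_encoder(patch_embeds))
        seq = torch.cat([vis, text_embeds], dim=1)
        return self.lm_head(self.decoder(seq))
```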
Training and Post‑Training Strategies
The pre‑training process consists of four main stages:
Cross‑modal alignment: Optimizes the projection MLP to establish a solid alignment foundation.
Multi‑task pre‑training: Fine‑tunes all model parameters end‑to‑end, greatly enhancing basic visual understanding.
Annealing training: Extends the context length to 128K, adjusts the RoPE frequencies accordingly (a sketch of this adjustment follows the list), and incorporates long‑video, long‑text, and large‑scale image data.
Model merging: Fuses models trained on different data mixtures to improve robustness and reduce bias.
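A common way to realize this context extension is to raise the RoPE base frequency so the rotation periods cover the longer range; the sketch below shows the standard computation. The base value of 1,000,000 is an illustrative assumption, not a figure quoted from the report.

```python
import torch

def rope_inv_freq(head_dim: int, base: float = 1_000_000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies for one attention head.

    Raising `base` (e.g. from the common 10,000 default) stretches the
    rotation periods so positions out to 128K remain distinguishable.
    The base here is illustrative; Keye-VL-1.5's exact value is not
    quoted in this summary.
    """
    return 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))

# Rotation angles at position 131_071, the last slot of a 128K window:
angles = 131_071 * rope_inv_freq(head_dim=128)
```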
After pre‑training, a four‑stage post‑training pipeline further improves the model:
Stage 1 – Supervised fine‑tuning & multi‑preference optimization.
Stage 2 – Long‑chain reasoning cold‑start: Generates multiple reasoning traces per QA pair and evaluates their confidence.
Stage 3 – Iterative general reinforcement learning: Applies GSPO‑based policy optimization with reward models and progressive prompting for difficult samples (a simplified objective sketch follows this list).
Stage 4 – Alignment reinforcement learning: Aligns model responses with human preferences.
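Stage 3's GSPO‑style updates can be illustrated with a simplified objective. GSPO departs from token‑level PPO‑style ratios by using a length‑normalized, sequence‑level importance ratio together with group‑normalized rewards; everything below (shapes, clipping value, reward normalization) is a minimal sketch under those assumptions, not the production recipe.

```python
import torch

def gspo_sequence_objective(logp_new, logp_old, rewards, eps=0.2):
    """Group-relative, sequence-level policy objective in the spirit of GSPO.

    logp_new / logp_old: (G, T) per-token log-probs for G > 1 sampled
    responses to the same prompt under the current and behavior policies.
    rewards: (G,) scalar reward per response from the reward model.
    Returns the objective to be maximized.
    """
    # Group-normalized advantage: how much better each response is than its peers.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # Length-normalized sequence-level importance ratio (GSPO's key departure
    # from per-token PPO-style ratios).
    ratio = torch.exp((logp_new - logp_old).mean(dim=-1))
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return torch.min(ratio * adv, clipped * adv).mean()
```

In practice this objective is maximized over groups of sampled responses scored by the reward models mentioned above, with progressive prompting used to harvest useful traces from difficult samples.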
Experimental Results
Keye‑VL‑1.5 achieves industry‑leading scores on a wide range of multimodal tasks. On vision‑language benchmarks it reaches 71.4% on MMMU (val) and 79.5% on OpenCompass, outperforming same‑scale competitors. It also attains 62.7% accuracy on HallusionBench, indicating reduced hallucination, and scores 66 on Video‑MMMU, demonstrating strong video understanding.
An internal video evaluation covering eight dimensions (visual element recognition, reasoning, temporal understanding, knowledge‑based QA, description, robustness, creativity, and domain expertise) shows a total score of 3.53, surpassing the previous Keye‑VL‑Preview by 0.51 points and beating the MiMo‑VL‑7B‑RL‑2508 baseline.
Resources
Project page: https://kwai-keye.github.io/
Technical report: https://arxiv.org/pdf/2509.01563
GitHub repository: https://github.com/Kwai-Keye/Keye
Model checkpoint (HuggingFace): https://huggingface.co/Kwai-Keye/Keye-VL-1.5-8B
Future Outlook
Leveraging Kwai’s extensive short‑video expertise, Keye‑VL is positioned to continue advancing video understanding, marking a solid step toward the next era of multimodal large language models.