How Kimi K2.5 Achieves Multimodal Mastery with Joint Training and Agent Swarms

The Kimi K2.5 technical report reveals how Moonshot AI combined joint text‑vision training, a novel Zero‑Vision SFT method, and a parallel agent‑swarm architecture to deliver top‑ranked multimodal performance, markedly faster agentic task execution, and open‑source access for broader AI research.

PaperAgent

Moonshot AI’s Kimi K2.5 technical report showcases a series of breakthroughs that propelled the model into the top‑3 usage share on OpenRouter within three days, highlighting its impact on the multimodal AI landscape.

1. Joint Text‑Vision Training from Day One

Instead of the conventional pipeline of training a large language model first and then bolting on a vision module, Kimi K2.5 adopts Joint Training, letting the text and visual modalities learn together from the start. Experiments comparing early, middle, and late fusion strategies found that early fusion with a low visual proportion yields the best results: piling on more visual data does not help, while aligning the modalities early drives their co‑evolution.
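To make this concrete, here is a minimal sketch of what "early fusion with a low visual proportion" could look like at the data‑sampling level. The report does not publish its sampling code; `visual_ratio` and the pool names below are illustrative assumptions, not values from the paper.

```python
import random

def early_fusion_batch(text_pool, vision_pool, batch_size=32, visual_ratio=0.1):
    """Draw one joint-training batch with both modalities present from step 0.

    `visual_ratio` keeps the visual share deliberately low; the exact
    proportion used for Kimi K2.5 is not reproduced here.
    """
    n_vision = int(batch_size * visual_ratio)
    batch = random.sample(vision_pool, n_vision)               # image-text samples
    batch += random.sample(text_pool, batch_size - n_vision)   # text-only samples
    random.shuffle(batch)  # interleave modalities within the batch
    return batch
```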

The report also introduces Zero‑Vision SFT, a clever “visual bootstrapping” technique: after roughly 1.5 trillion tokens of image‑text pre‑training, only textual data is used for supervised fine‑tuning, yet the model retains strong visual reasoning capabilities without collecting additional, expensive visual demonstrations.
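As a rough rendering of that two‑stage recipe (the stage names and the token‑budget field are my own framing; only the 1.5‑trillion figure comes from the report):

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    data: str                    # corpus the stage consumes
    tokens: float | None = None  # training budget in tokens, if stated

# Zero-Vision SFT as a two-stage schedule: joint image-text pre-training
# aligns the modalities, then SFT uses text-only demonstrations and the
# visual reasoning is expected to transfer "for free".
ZERO_VISION_SFT = [
    Stage("joint_pretrain", data="interleaved image-text", tokens=1.5e12),
    Stage("sft", data="text-only demonstrations"),  # no visual SFT data collected
]
```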

On the Design Arena benchmark, Kimi K2.5 achieved first place, demonstrating a rare “taste” for aesthetic and design understanding that stems from its unified multimodal training.

2. Parallel Agent Swarm for Efficient Task Execution

Traditional agent systems execute steps serially, causing linear growth in inference time as task complexity rises. Kimi K2.5 introduces an Agent Swarm that decomposes a task into multiple sub‑problems and runs them in parallel.
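A minimal sketch of that fan‑out, with a toy `run_subagent` standing in for a real model call:

```python
import asyncio

async def run_subagent(subtask: str) -> str:
    # Stand-in for one sub-agent solving one sub-problem; a real
    # sub-agent would run its own model context and tool calls.
    await asyncio.sleep(0.1)
    return f"result for {subtask!r}"

async def agent_swarm(task: str, n_subtasks: int = 4) -> list[str]:
    # Decompose the task (a real orchestrator would let the model do this),
    # then run all sub-agents concurrently: wall-clock time is bounded by
    # the slowest sub-task rather than by the sum of all of them.
    subtasks = [f"{task} / part {i}" for i in range(n_subtasks)]
    return await asyncio.gather(*(run_subagent(s) for s in subtasks))

print(asyncio.run(agent_swarm("collect benchmark numbers")))
```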

The swarm is coordinated by an Orchestrator that dynamically splits tasks, creates sub‑agents, assigns work, and aggregates results. Efficient parallelism is encouraged by a specially designed Parallel Agent Reinforcement Learning (PARL) reward, which combines three components: a core task‑completion quality reward, a reward for initiating parallelism (preventing agents from “slacking”), and a reward for high sub‑task completion rates (avoiding useless sub‑tasks).
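The report names the three components but not their weights; the sketch below combines them with illustrative coefficients:

```python
def parl_reward(task_quality: float,
                initiated_parallelism: bool,
                subtasks_done: int,
                subtasks_total: int,
                w_q: float = 1.0, w_p: float = 0.1, w_c: float = 0.1) -> float:
    # Core reward: quality of the final task completion.
    reward = w_q * task_quality
    # Bonus for actually fanning out (keeps the agent from "slacking"
    # back into serial execution).
    reward += w_p * float(initiated_parallelism)
    # Bonus proportional to the sub-task completion rate (discourages
    # spawning useless sub-tasks that never finish).
    if subtasks_total:
        reward += w_c * (subtasks_done / subtasks_total)
    return reward
```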

Empirical results show the swarm runs 3–4.5× faster than a single‑agent baseline on complex search tasks (WideSearch) while raising accuracy from 72.7% to 79.0%.

3. MoonViT‑3D: A Unified Image‑and‑Video Encoder

Kimi K2.5’s visual backbone, MoonViT‑3D, processes both static images and video streams with a single Transformer. It packs four consecutive frames into a “spatiotemporal block,” preserving temporal information while sharing parameters across frames.
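As a sketch of that packing step, assuming a Conv3d‑style patchifier and made‑up dimensions (the report’s actual patch and embedding sizes are not reproduced here):

```python
import torch
from torch import nn

# Spatiotemporal patchification in the spirit of MoonViT-3D: four
# consecutive frames are embedded together as one block of tokens.
patchify = nn.Conv3d(in_channels=3, out_channels=1024,
                     kernel_size=(4, 14, 14), stride=(4, 14, 14))

video = torch.randn(1, 3, 16, 224, 224)      # (B, C, T, H, W): 16 frames
tokens = patchify(video)                     # (1, 1024, 4, 16, 16)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 4*16*16, 1024) token sequence
print(tokens.shape)                          # torch.Size([1, 1024, 1024])
```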

During pre‑training, images and videos are jointly sampled, enabling the model to inherit video understanding directly from image training. Inference requires no extra modules or fine‑tuning, and a 4× temporal pooling allows the system to handle four times more frames within the same context window, making long‑video tasks (e.g., surveillance replay, live‑stream summarization) feasible.
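And a sketch of the 4× temporal pooling, again with illustrative dimensions: averaging token features over every four frames cuts the sequence length by four, which is what lets four times as many frames fit in the same context window.

```python
import torch

tokens = torch.randn(1, 64, 256, 1024)   # (batch, frames, tokens_per_frame, dim)
B, T, N, D = tokens.shape
pooled = tokens.reshape(B, T // 4, 4, N, D).mean(dim=2)  # average each 4-frame group
print(pooled.shape)                      # torch.Size([1, 16, 256, 1024])
```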

4. 24‑Hour Game Video Analysis with Agent Swarm

The report demonstrates a fully automated analysis of a complete 24‑hour play‑through of "Black Myth: Wukong" (32 videos, 40 GB). The main agent splits the video into segments, each sub‑agent processes a segment in parallel—extracting key frames, detecting events such as boss fights or level‑ups—and the orchestrator compiles an HTML report with timelines, video clips, and interactive charts.

This pipeline transforms video understanding into a three‑step workflow: decompose, parallel extract, and reconstruct knowledge.
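A minimal sketch of that three‑step workflow; `extract_events` is a hypothetical stand‑in for a sub‑agent that pulls key frames and events from one segment, and the HTML assembly is reduced to a bare list:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_events(segment_path: str) -> list[str]:
    # Placeholder for a sub-agent: key-frame extraction, boss-fight /
    # level-up detection, etc. would happen here.
    return [f"event found in {segment_path}"]

def analyze_playthrough(segment_paths: list[str]) -> str:
    # Step 1: decompose -- one segment per sub-agent.
    # Step 2: parallel extract -- segments are processed concurrently.
    with ThreadPoolExecutor() as pool:
        per_segment = pool.map(extract_events, segment_paths)
    # Step 3: reconstruct -- merge per-segment events into one report.
    items = "".join(f"<li>{e}</li>" for seg in per_segment for e in seg)
    return f"<html><body><ul>{items}</ul></body></html>"

print(analyze_playthrough([f"segment_{i:02d}.mp4" for i in range(32)]))
```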

Conclusion

Kimi K2.5 not only pushes multimodal model performance but also illustrates a concrete path toward general‑purpose intelligent agents that can see, think, act, and collaborate. The model and its code are openly released on Hugging Face, inviting the community to build next‑generation assistants, coding partners, research aides, and creative collaborators.

Referenced materials:

Kimi K2.5 technical report PDF
OpenRouter daily rankings
Table 1: Visual‑Text Fusion Strategies Comparison
Table 2: Cross‑Modal Enhancement Evidence
Figure 2: Vision RL Training Curve
Figure 3: Agent Swarm Architecture
Table 6: Agent Swarm vs Single‑Agent Performance
Figure 8: Execution Time Comparison
Figure 9: Black Myth Wukong Analysis Case
Tags: multimodal AI, AI research, joint training, visual language model, Moonshot AI, Agent Swarm, Kimi K2.5
Written by PaperAgent

Daily updates, analyzing cutting-edge AI research papers