Inside Kimi 1.5: Four Innovations That Supercharge Long‑Context Multimodal Reasoning

The article analyzes Kimi 1.5’s technical report, detailing its four core innovations, long‑to‑short inference tricks, reinforcement‑learning infrastructure, and benchmark results that show it out‑performing competing models in long‑context and multimodal tasks.

Baobao Algorithm Notes

Overview

Kimi 1.5, released alongside DeepSeek’s o1‑class model, is described in a technical report (https://github.com/MoonshotAI/Kimi-k1.5/blob/main/Kimi_k1.5.pdf). The report details four major innovations: long‑context expansion, improved policy optimization, a simplified training framework, and multimodal support. It also provides extensive benchmark results.

Benchmark Highlights

Long‑context reasoning: Kimi 1.5 outperforms OpenAI’s o1 on mathematics, pure‑text, and multimodal tasks.

Codeforces: performance on par with DeepSeek‑v3.

Short‑context inference: achieves a 60.8 score on the AIME benchmark, far above the next open model (39.2).

Long2Short Technique

The report introduces a “long‑to‑short” pipeline that merges a long‑CoT (chain‑of‑thought) model with a short‑CoT model, so that inference is faster with minimal loss in accuracy. Two sampling‑based strategies are also used:

Shortest rejection sampling: generate n=8 samples per question and keep the shortest correct one for fine‑tuning.

DPO‑style preference learning: treat the shortest correct output as the positive example and longer or incorrect outputs as negatives.
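
The two strategies above can be sketched in a few lines. This is a minimal illustration, not the report's implementation: the Sample class and both function names are hypothetical, correctness is assumed to come from some external answer checker, and length is measured in characters where a real pipeline would count tokens.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    text: str
    correct: bool  # verdict from a hypothetical answer checker


def shortest_correct(samples):
    """Return the shortest correct sample, or None if no sample is correct."""
    correct = [s for s in samples if s.correct]
    return min(correct, key=lambda s: len(s.text)) if correct else None


def build_dpo_pairs(samples):
    """Pair the shortest correct sample (chosen) with each longer or wrong one (rejected)."""
    chosen = shortest_correct(samples)
    if chosen is None:
        return []
    return [
        (chosen.text, s.text)
        for s in samples
        if s is not chosen and (not s.correct or len(s.text) > len(chosen.text))
    ]
```

The output of shortest_correct would feed supervised fine‑tuning, while build_dpo_pairs yields (chosen, rejected) pairs for preference training. Questions with no correct sample contribute nothing in either case.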

Reinforcement‑Learning Simplifications

After a standard RL phase, a dedicated long‑to‑short RL stage applies a length penalty to encourage brevity while preserving correctness. The authors replace complex methods such as Monte Carlo tree search (MCTS) or process reward models (PRMs) with a simplified PPO variant that omits a value network, following an Occam’s‑razor approach.
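
To make the length penalty concrete, here is a small sketch under stated assumptions. This article does not give the formula, so the linear shaping below (shorter correct answers earn a bonus, longer ones a penalty, and wrong answers never earn a bonus) is an assumed form, and the function names are hypothetical. The second function shows one common way to do without a value network: use the mean reward of the samples drawn for the same prompt as the baseline.

```python
def length_rewards(lengths, correct, weight=1.0):
    """Assumed linear length shaping over the samples drawn for one prompt."""
    min_len, max_len = min(lengths), max(lengths)
    if max_len == min_len:
        return [0.0] * len(lengths)  # no length signal when all samples are equal
    rewards = []
    for n, ok in zip(lengths, correct):
        # lam runs from +0.5 (shortest sample) down to -0.5 (longest sample).
        lam = 0.5 - (n - min_len) / (max_len - min_len)
        # Wrong answers may be penalized for length but never rewarded for brevity.
        rewards.append(weight * (lam if ok else min(0.0, lam)))
    return rewards


def baseline_advantages(rewards):
    """Advantage = reward minus the group mean, so no learned value network is needed."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

In such a setup the length reward would be added to the correctness reward before advantages are computed, with weight controlling how strongly brevity is favored.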

Prompt Design Principles

Diverse coverage: include tasks from STEM, coding, and general reasoning.

Balanced difficulty: distribute easy, medium, and hard problems.

Accurate evaluability: ensure prompts can be objectively assessed without reliance on search‑based methods.
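
One way to apply the coverage and difficulty principles is stratified sampling over (domain, difficulty) buckets. The sketch below is illustrative only: the "domain" and "pass_rate" fields, the 0.7 and 0.3 thresholds, and the function names are all assumptions, since this article does not say how difficulty is measured.

```python
import random
from collections import defaultdict


def difficulty_bucket(pass_rate):
    """Map an estimated pass rate in [0, 1] to a coarse label (thresholds are assumptions)."""
    if pass_rate >= 0.7:
        return "easy"
    if pass_rate >= 0.3:
        return "medium"
    return "hard"


def balanced_sample(prompts, per_bucket, seed=0):
    """Draw up to per_bucket prompts from every (domain, difficulty) bucket."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for p in prompts:
        buckets[(p["domain"], difficulty_bucket(p["pass_rate"]))].append(p)
    picked = []
    for key in sorted(buckets):  # fixed order keeps the result reproducible
        group = buckets[key]
        picked.extend(rng.sample(group, min(per_bucket, len(group))))
    return picked
```

Capping each bucket keeps any single domain or difficulty level from dominating the training set.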

RL Infrastructure

A master‑worker architecture orchestrates training:

Rollout workers generate trajectories by interacting with the model.

Trajectories are stored in a replay buffer to break temporal correlations.

Trainer workers sample from the buffer to update model weights.

A reward model evaluates outputs, and an integrated code‑execution service validates programming‑related responses.
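
The data flow described above can be sketched in a single process. This is a toy illustration, not the report's system: the class and function names are hypothetical, the policy and reward function are plain callables, and the "trainer step" only reports the mean reward of its batch where a real trainer would compute a gradient update.

```python
import random
from collections import deque


class ReplayBuffer:
    """Bounded buffer; random sampling breaks the temporal order of rollouts."""

    def __init__(self, capacity, seed=0):
        self._items = deque(maxlen=capacity)  # oldest trajectories are evicted first
        self._rng = random.Random(seed)

    def add(self, trajectory):
        self._items.append(trajectory)

    def sample(self, batch_size):
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self):
        return len(self._items)


def rollout_worker(policy, prompts, reward_fn, buffer):
    """Generate one response per prompt, score it, and store the trajectory."""
    for prompt in prompts:
        response = policy(prompt)
        buffer.add({"prompt": prompt, "response": response,
                    "reward": reward_fn(prompt, response)})


def trainer_step(buffer, batch_size):
    """Placeholder update: sample a batch and return its mean reward."""
    batch = buffer.sample(batch_size)
    return sum(t["reward"] for t in batch) / len(batch) if batch else 0.0
```

In a distributed deployment the rollout and trainer roles would run as separate workers coordinated by the master, and reward_fn would call the reward model or the code‑execution service.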

Hybrid Deployment Pipeline

Training stage: Megatron and vLLM run in separate containers managed by a checkpoint‑engine wrapper. After Megatron finishes a training step, it releases GPU memory and hands the latest weights to vLLM.

Inference stage: vLLM starts from dummy (placeholder) model weights, receives the latest weights from Megatron via Mooncake, and serves inference until the checkpoint‑engine shuts it down.

Subsequent training stage: vLLM memory is freed, Megatron reloads the weights, and the next training iteration begins.
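
The three stages form a cycle in which only one engine holds the GPU at a time. The following sketch models just that hand‑off. It uses no real Megatron, vLLM, or Mooncake APIs: the Engine class and run_cycle function are hypothetical stand‑ins that track GPU residency and a weight version number.

```python
class Engine:
    """Stand-in for a training or inference engine; tracks GPU residency only."""

    def __init__(self, name):
        self.name = name
        self.on_gpu = False
        self.version = 0


def run_cycle(trainer, server, steps):
    """Alternate training and rollout stages, handing the GPU back and forth."""
    events = []
    trainer.on_gpu = True
    for _ in range(steps):
        # Training stage: update the weights, then release GPU memory.
        trainer.version += 1
        events.append(("train", trainer.version))
        trainer.on_gpu = False
        # Inference stage: the server takes the GPU and the latest weights.
        server.on_gpu = True
        server.version = trainer.version
        assert not trainer.on_gpu  # the two engines never share the GPU
        events.append(("rollout", server.version))
        # Subsequent training stage: free the server, reload the trainer.
        server.on_gpu = False
        trainer.on_gpu = True
    return events
```

Calling run_cycle(Engine("trainer"), Engine("server"), steps=2) yields alternating train and rollout events, with each rollout using the weight version produced by the training step just before it.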

The authors identify three challenges for existing frameworks:

Coordinating differing parallel strategies between Megatron and vLLM.

Minimizing idle GPU resources during online RL.

Scaling inference nodes dynamically without interrupting training.

Conclusion

Kimi 1.5 demonstrates that systematic infrastructure optimization, combined with the long2short pipeline, simplified RL, and a well‑designed prompt suite, can substantially improve training efficiency and stability while lowering cost, positioning it as a strong contender in the current LLM landscape.
