How DeepSeek‑R1 and Kimi‑K1.5 Push the Boundaries of Strong Reasoning Models

This comprehensive analysis by the Peking University AI Alignment team dissects the technical innovations behind DeepSeek‑R1, DeepSeek‑R1 Zero, and Kimi‑K1.5, covering reinforcement‑learning‑based post‑training, rule‑based rewards, GRPO optimization, scaling laws, multimodal extensions, safety challenges, and future research directions.

DeepSeek‑R1 and Kimi‑K1.5 Technical Deep‑Dive

The Peking University AI Alignment team presents a 20,000‑word technical commentary on strong reasoning models such as DeepSeek‑R1, DeepSeek‑R1 Zero, and Kimi‑K1.5, recommending readers watch the accompanying video for the best experience.

Figure: DeepSeek‑R1 overview

Post‑Training Scaling Laws and Reinforcement Learning

The post‑training phase has recently become crucial for enhancing reasoning ability and aligning models with societal values. Following OpenAI’s o1, the community has been exploring inference‑time scaling by extending chain‑of‑thought (CoT) length. DeepSeek‑R1 Zero demonstrates that large‑scale reinforcement learning (RL) without any supervised fine‑tuning (SFT) can dramatically improve long‑text reasoning and self‑reflection.

DeepSeek‑R1 achieves top scores on mathematics and coding benchmarks (e.g., 79.8% on AIME 2024, surpassing o1) and performs strongly on knowledge‑heavy tasks such as MMLU and GPQA.

Technical Pipeline of DeepSeek‑R1 Zero

Two reward signals are used: an accuracy‑based reward that judges the correctness of the final answer, and a format reward that requires the model to enclose its reasoning process within designated thinking tags. Rule‑based rewards avoid the pitfalls of neural reward models, such as reward hacking.
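A minimal sketch of how such rule‑based rewards might be implemented is shown below; the <think>/<answer> tag scheme, the exact‑match answer check, and the reward values are illustrative assumptions, not DeepSeek's published implementation.

```python
import re

# Assumed tag scheme: reasoning in <think>...</think>, final answer in <answer>...</answer>.
FORMAT_RE = re.compile(r"^<think>.+?</think>\s*<answer>.+?</answer>\s*$", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """Rule-based correctness check: extract the <answer> block and compare it
    to the reference answer after whitespace normalization (no reward model)."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return 0.0
    predicted = " ".join(m.group(1).split())
    return 1.0 if predicted == " ".join(reference.split()) else 0.0

def total_reward(completion: str, reference: str) -> float:
    # Simple sum of the two rule-based signals; a real system may weight them.
    return accuracy_reward(completion, reference) + format_reward(completion)
```

Because both signals are deterministic string checks, there is no neural reward model for the policy to exploit, which is the point the commentary makes about avoiding reward hacking.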

Training relies on the Group Relative Policy Optimization (GRPO) algorithm, which removes the need for a separate critic model by comparing multiple sampled outputs within a group to compute relative advantages.
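As a rough illustration of the group‑relative idea, the sketch below normalizes one scalar reward per sampled output within its group; the full GRPO objective additionally includes a clipped importance ratio and a KL term.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Per-sample advantages from within-group reward normalization.

    rewards: shape (num_prompts, group_size), one scalar reward per sampled
    completion. The group mean acts as the baseline in place of a learned
    critic/value model, and the group std rescales the signal.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 4 completions sampled for one prompt, rewarded 0/1 for correctness.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))  # correct samples get positive advantage
```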

Figure: RL scaling law illustration

Comparison with Kimi‑K1.5

Kimi‑K1.5 focuses on extending long‑text CoT via RL, employing a REINFORCE‑style algorithm and length‑penalty mechanisms to avoid over‑thinking. It also incorporates multimodal data (visual question answering, synthetic visual reasoning, and OCR‑style text‑in‑image) to improve generalisation.
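One way a length penalty of this kind could be folded into the reward is sketched below, comparing each response against the shortest and longest responses sampled for the same prompt; the coefficients and exact rule are assumptions for illustration, not Kimi‑K1.5's verbatim formulation.

```python
def length_penalized_reward(correct: bool, length: int,
                            min_len: int, max_len: int) -> float:
    """Illustrative length penalty: among responses sampled for the same
    prompt, shorter ones earn a small bonus and longer ones a penalty,
    which discourages over-thinking. Coefficients are assumptions."""
    if max_len == min_len:
        return 0.0
    lam = 0.5 - (length - min_len) / (max_len - min_len)  # in [-0.5, 0.5]
    # Never let the length bonus reward an incorrect answer.
    return lam if correct else min(0.0, lam)

# Example: a correct 300-token answer among samples ranging 200-1200 tokens.
print(length_penalized_reward(True, 300, 200, 1200))  # ~0.4 bonus for brevity
```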

Both models share insights: large base models, massive RL, and rule‑based rewards are essential; however, Kimi‑K1.5 adds curriculum learning, priority sampling, and a long‑to‑short distillation stage to reduce inference cost.

STaR, PPO, and GRPO

STaR (Self‑Taught Reasoner) bootstraps fine‑tuning data by generating rationales and keeping those that lead to correct answers, while PPO adds a per‑token KL penalty to the reward. GRPO simplifies PPO by integrating the KL term directly into the objective and using group‑wise advantage estimation, leading to more stable and compute‑efficient training.
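In its simplified per‑output form (notation condensed from the DeepSeek papers; the token‑level variant averages the clipped term over each output's tokens), the GRPO objective can be written roughly as

\[
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}\,\hat{A}_i,\;\operatorname{clip}\!\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)},\,1-\varepsilon,\,1+\varepsilon\right)\hat{A}_i\right)\right] - \beta\,\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),
\qquad
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\]

where a group of \(G\) outputs \(o_1,\dots,o_G\) is sampled for each question \(q\) and the group‑normalized advantage \(\hat{A}_i\) stands in for the critic's value estimate.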

Future Directions

The authors outline several research avenues: explainability of long reasoning chains, multimodal extensions, safety‑aligned reinforcement learning, formal verification, and alignment‑audit techniques. They propose the Align‑Anything framework to support preference learning across arbitrary modalities and stress the importance of addressing reward hacking, over‑thinking, and model elasticity.

Figure: Future research roadmap
Tags: large language models, DeepSeek, reinforcement learning, AI alignment, Kimi, strong reasoning
Written by Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.