Kimi K2-Thinking: 1T‑Parameter Agent Model Beats GPT‑5 on Humanity’s Last Exam
Kimi's open‑source K2‑Thinking model is a 1‑trillion‑parameter agent model with native INT4 quantization and a 256k‑token context window. It achieves state‑of‑the‑art results on benchmarks such as Humanity’s Last Exam, BrowseComp, and SEAL‑0, outperforming GPT‑5 and Grok‑4, and demonstrates complex tool‑driven reasoning on real‑world tasks.
Model Overview
K2‑Thinking is an open‑source large language model released by Kimi. It has 1 trillion total parameters with 32 billion active parameters, supports native INT4 quantization, and provides a 256k‑token context window.
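For intuition, INT4 quantization compresses weights into 4‑bit integers with per‑group scales. The article does not describe Kimi's exact scheme, so the sketch below shows only the generic symmetric per‑group technique, with all details (group size, scaling rule) chosen for illustration:

```python
import numpy as np

# Generic symmetric per-group INT4 quantization sketch.
# This is NOT Kimi's published scheme; group size and scaling are illustrative.

def quantize_int4(weights: np.ndarray, group_size: int = 32):
    """Quantize a 1-D float array to signed 4-bit integers with per-group scales."""
    w = weights.reshape(-1, group_size)
    # Signed INT4 range is [-8, 7]; scale each group by its max magnitude.
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_int4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from quantized values and scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
err = float(np.abs(w - w_hat).max())  # worst-case rounding error
```

The appeal at 1T parameters is storage and bandwidth: each weight takes 4 bits plus a small per‑group scale, roughly a 4x reduction versus FP16.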
Model‑as‑Agent Design
The model follows a “model‑as‑Agent” paradigm, meaning it can initiate tool calls while reasoning. It can execute up to 300 tool calls in a single session and maintain stable multi‑turn reasoning without manually crafted control logic such as if/while constructs.
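The control flow can be pictured as a plain loop in which the model itself decides, at each step, whether to call a tool or finish. Everything below is an illustrative mock (the function names, message format, and mock model are invented for this sketch); in K2‑Thinking the decisions come from the model, not hand‑written logic:

```python
# Minimal model-as-agent loop sketch. All names and message formats here are
# hypothetical; only the 300-call budget comes from the article.

MAX_TOOL_CALLS = 300  # per-session tool-call budget reported for K2-Thinking

def run_agent(model_step, tools, task):
    """model_step: callable(history) -> dict containing either a 'tool'
    request or a 'final' answer. tools: dict mapping name -> callable."""
    history = [{"role": "user", "content": task}]
    for _ in range(MAX_TOOL_CALLS):
        action = model_step(history)
        if "final" in action:
            return action["final"]
        # Execute the requested tool and feed the result back to the model.
        result = tools[action["tool"]](**action.get("args", {}))
        history.append({"role": "tool", "name": action["tool"], "content": result})
    return None  # budget exhausted without a final answer

# Tiny mock model to show the control flow: search once, then answer.
def mock_model(history):
    if not any(m["role"] == "tool" for m in history):
        return {"tool": "search", "args": {"query": "Beijing lottery rules"}}
    return {"final": "answer based on " + history[-1]["content"]}

answer = run_agent(
    mock_model,
    {"search": lambda query: f"results for {query}"},
    "compute household score",
)
```

The point of the paradigm is that the branching (which tool, when to stop) lives inside the model's reasoning rather than in scaffolding code like this loop's caller.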
Benchmark Performance
Humanity’s Last Exam (HLE): 44.9% in tool‑enhanced mode, outperforming GPT‑5 (41.7%) and Grok‑4 (41.0%).
State‑of‑the‑art results were also reported on BrowseComp and SEAL‑0, demonstrating strong agentic search, programming, writing, and comprehensive reasoning capabilities.
Demonstration Tasks
Example 1 – Policy‑Rule Calculation
The task required calculating the total energy‑point score for a three‑person Beijing household based on detailed car‑license lottery participation rules. K2‑Thinking performed web browsing to retrieve policy details, interpreted the rules, carried out multi‑step verification, and produced a completely correct answer, whereas GPT‑5 gave an incorrect result.
Example 2 – Nvidia Market‑Cap Retrieval
The task asked the model to collect Nvidia’s month‑end market‑cap data from January to October 2025 and generate an animated line chart viewable in a browser. K2‑Thinking decomposed the problem into 11 sub‑tasks, fetched the required data, and generated the HTML/JavaScript code that renders a correct and visually appealing chart.
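To give a concrete sense of the kind of artifact described, the sketch below generates a self‑contained HTML page that animates a line chart on a `<canvas>`. The month labels and values are placeholders, not real Nvidia market‑cap figures, and the code is a generic illustration rather than K2‑Thinking's actual output:

```python
# Sketch of a self-contained animated line chart as an HTML file.
# Data values are PLACEHOLDERS, not real Nvidia market-cap numbers.

months = ["Jan", "Feb", "Mar"]   # placeholder labels
caps = [1.0, 1.1, 1.2]           # placeholder values

html = f"""<!DOCTYPE html>
<html><body><canvas id="chart" width="600" height="300"></canvas>
<script>
const labels = {months};
const data = {caps};
const ctx = document.getElementById("chart").getContext("2d");
let i = 0;
function step() {{
  // Redraw the line up to point i, then advance: a simple reveal animation.
  ctx.clearRect(0, 0, 600, 300);
  ctx.beginPath();
  for (let j = 0; j <= i; j++) {{
    const x = 50 + j * 150, y = 280 - data[j] * 200;
    if (j === 0) ctx.moveTo(x, y); else ctx.lineTo(x, y);
  }}
  ctx.stroke();
  if (++i < data.length) requestAnimationFrame(step);
}}
step();
</script></body></html>"""

with open("chart.html", "w") as f:
    f.write(html)
```

Opening `chart.html` in any browser plays the animation; the notable part of the demo is that the model both fetched the data and emitted working code like this unaided.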
Pricing and API
Input tokens: 4 CNY per million.
Output tokens: 16 CNY per million.
Cache‑hit input tokens: 1 CNY per million.
Turbo API throughput: up to 100 tokens/s.
Turbo API pricing: 8 CNY per million input tokens, 58 CNY per million output tokens, 1 CNY per million cache‑hit input tokens.
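The prices above make cost estimates simple arithmetic. The workload below (2M input tokens, 0.5M of them cache hits, 1M output tokens) is an arbitrary example, not a benchmark:

```python
# Cost arithmetic using the per-million-token CNY prices quoted above.
# The example token counts are arbitrary, chosen only for illustration.

PRICES = {
    "standard": {"input": 4, "output": 16, "cache_hit": 1},
    "turbo":    {"input": 8, "output": 58, "cache_hit": 1},
}

def cost_cny(tier, input_tok, output_tok, cached_tok=0):
    """Total cost in CNY; cache hits are billed at the cheaper cache rate."""
    p = PRICES[tier]
    fresh = input_tok - cached_tok
    return (fresh * p["input"]
            + cached_tok * p["cache_hit"]
            + output_tok * p["output"]) / 1_000_000

# Example workload: 2M input tokens (0.5M cache hits), 1M output tokens.
std = cost_cny("standard", 2_000_000, 1_000_000, cached_tok=500_000)
turbo = cost_cny("turbo", 2_000_000, 1_000_000, cached_tok=500_000)
# std -> 22.5 CNY, turbo -> 70.5 CNY
```

Note how output tokens dominate the Turbo bill: at 58 CNY per million, the 1M output tokens account for most of the total.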
Resources
Project repository: https://huggingface.co/moonshotai/Kimi-K2-Thinking
Technical blog: https://moonshotai.github.io/Kimi-K2/thinking.html
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
