Putting Kimi K2.5 and Kimi Code to the Test: Real‑World AI Agent Benchmarks

This article presents a hands‑on evaluation of Kimi K2.5 and its open‑source Kimi Code agent across a series of demanding prompts. The tests cover Python API generation, cost‑optimized routing, multimodal ECharts visualisation, massive‑scale SQL optimisation, web‑search‑driven research, MoE explanation, and video‑to‑code workflows.


Overall Impression

Kimi K2.5 finally delivers a model that supports both full multimodal input and reasoning modes, matching or surpassing Gemini Pro in core capabilities. Its native agent functions enable end‑to‑end coding tasks without external tools, positioning Kimi Code as a viable, no‑compromise alternative to Claude Code.

Kimi Code Features

Kimi Code is a terminal‑based coding agent written entirely in Python. It can navigate directories, read and write files, run tests, and modify code autonomously, creating a complete "detect‑to‑fix" loop. It also accepts native video input, allowing visual‑driven code generation without third‑party plugins.

Test Items

Test Item 1: Python API function with retry and logging
Prompt: Please write a Python function that calls an API, includes three exponential‑backoff retries, detailed logging, and handles JSON parsing errors.

The model produced a robust implementation that satisfied both functionality and production‑grade observability.
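As a point of reference, a minimal sketch of what such a function can look like is shown below. This is not the model's actual output; the injectable fetch hook, retry counts, and delay values are illustrative assumptions.

```python
import json
import logging
import time
import urllib.request
from urllib.error import URLError

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("api_client")


def call_api(url, max_retries=3, base_delay=1.0, fetch=None):
    """Call an API with exponential-backoff retries, logging, and
    JSON-parse error handling. `fetch` is injectable for testing and
    defaults to a plain urllib GET."""
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=10) as resp:
                return resp.read().decode("utf-8")

    for attempt in range(1, max_retries + 1):
        try:
            raw = fetch(url)
            logger.info("attempt %d: fetched %d bytes", attempt, len(raw))
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            # Malformed payload: retrying the same endpoint won't help.
            logger.error("invalid JSON on attempt %d: %s", attempt, exc)
            return None
        except (URLError, OSError) as exc:
            logger.warning("attempt %d failed: %s", attempt, exc)
            if attempt == max_retries:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Separating transient network errors (worth retrying) from parse errors (not worth retrying) is the detail that makes an implementation like this production‑grade rather than a bare retry loop.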

Test Item 2: Cost calculation and hybrid routing for a 10k‑DAU API product
Prompt: Compare the API cost of Claude 3.5 Sonnet and Kimi K2 (¥16 per million tokens) for a product with 10,000 daily active users, and propose a hybrid routing scheme that balances quality and cost.

The answer included a precise cost breakdown and a practical hybrid routing recommendation.
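The arithmetic behind such a comparison can be sketched as a back‑of‑envelope model. Only the ¥16 per million tokens Kimi price comes from the prompt; the per‑user token volume, the blended Claude price, and the 20% premium routing share are illustrative assumptions.

```python
def monthly_cost_cny(dau, tokens_per_user_day, price_cny_per_m, days=30):
    """Monthly token spend in CNY for a flat per-million-token price."""
    total_tokens = dau * tokens_per_user_day * days
    return total_tokens / 1_000_000 * price_cny_per_m


def hybrid_cost_cny(dau, tokens_per_user_day, cheap_price, premium_price,
                    premium_share=0.2, days=30):
    """Route a premium_share fraction of traffic to the stronger model
    and the rest to the cheaper one."""
    cheap = monthly_cost_cny(dau, tokens_per_user_day * (1 - premium_share),
                             cheap_price, days)
    premium = monthly_cost_cny(dau, tokens_per_user_day * premium_share,
                               premium_price, days)
    return cheap + premium


KIMI_PRICE = 16.0    # CNY per million tokens, from the prompt
CLAUDE_PRICE = 65.0  # assumed blended CNY price, for comparison only

print(monthly_cost_cny(10_000, 20_000, KIMI_PRICE))   # all-Kimi baseline
print(hybrid_cost_cny(10_000, 20_000, KIMI_PRICE, CLAUDE_PRICE))
```

With these assumed figures, an all‑Kimi deployment costs ¥96,000 per month, and routing the hardest 20% of traffic to the premium model roughly ¥155,000; the routing share is the knob that trades quality against cost.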

Test Item 3: ECharts 5 real‑time K‑line chart with moving average
Prompt: Using ECharts 5, create a dark‑theme real‑time K‑line chart with an overlaid 5‑day moving average, loading all dependencies via CDN.

The generated web page displayed a smoothly updating candlestick chart with the required moving average overlay.
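The 5‑day moving average is the one numeric transform in this task. A minimal sketch of how that series can be computed before it is handed to an ECharts line series (the leading None values correspond to the gap ECharts typically shows with a '-' placeholder; the window size comes from the prompt, everything else is illustrative):

```python
def moving_average(closes, window=5):
    """Rolling mean of closing prices. Positions without a full window
    get None, so the overlay starts only once `window` candles exist."""
    out = []
    for i in range(len(closes)):
        if i < window - 1:
            out.append(None)
        else:
            out.append(sum(closes[i - window + 1 : i + 1]) / window)
    return out


print(moving_average([10, 11, 12, 13, 14, 15]))
# → [None, None, None, None, 12.0, 13.0]
```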

Test Item 4: Index and SQL rewrite for a ten‑million‑row orders table
Prompt: Analyze the performance bottlenecks of SELECT * FROM orders WHERE user_id = 100 ORDER BY create_time DESC on a ten‑million‑row dataset, then provide a CREATE INDEX statement and an optimized SQL query.

The model correctly identified the need for a composite index on (user_id, create_time) and supplied both the index creation command and a rewritten query.
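That recommendation is easy to verify at small scale. The sketch below uses Python's built‑in sqlite3 module (the table layout is an assumed simplification of the orders schema, not the article's actual DDL) to show the query planner switching from a full scan to an index search once the composite index exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    create_time TEXT NOT NULL,
    amount REAL)""")

query = ("SELECT id, user_id, create_time, amount FROM orders "
         "WHERE user_id = 100 ORDER BY create_time DESC")

# Without an index, the planner has no choice but a full table scan.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# Equality column first, ORDER BY column second: the index both narrows
# the rows and returns them already sorted, avoiding a separate sort pass.
conn.execute("CREATE INDEX idx_user_time ON orders (user_id, create_time DESC)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(plan_before[0][3])  # e.g. "SCAN orders"
print(plan_after[0][3])   # a SEARCH using idx_user_time
```

Listing the needed columns explicitly (rather than SELECT *) is part of the rewrite: it keeps the door open to a covering index that avoids table lookups entirely.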

Test Item 5: Online search on 2026 solid‑state battery production progress
Prompt: Use an online search tool to gather all technical breakthroughs and manufacturers for solid‑state batteries slated for mass production in 2026, filter out press releases, present the data in a Markdown table, and write a 500‑word trend forecast.

The response delivered a concise table of manufacturers, highlighted key technical advances, and offered a forward‑looking analysis.

Test Item 6: Explain the MoE (Mixture of Experts) architecture with a restaurant‑chef analogy
Prompt: Explain large‑model MoE (Mixture of Experts) to a non‑technical audience using a restaurant‑chef division metaphor, within 200 words.

The explanation used vivid kitchen imagery and received praise for clarity.
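The analogy maps directly onto how MoE routing works: a gate (the head waiter) scores every expert (chef) for each input, but only the top‑k actually cook, and their dishes are mixed by weight. A toy sketch of that routing, with scalar functions standing in for the expert sub‑networks (all values here are illustrative):

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, experts, gate_scores, top_k=2):
    """Gate (the "head waiter") scores every expert (each "chef"), but
    only the top_k run; outputs are mixed by renormalized gate weight."""
    weights = softmax(gate_scores)
    chosen = sorted(range(len(experts)),
                    key=lambda i: weights[i], reverse=True)[:top_k]
    norm = sum(weights[i] for i in chosen)
    return sum(weights[i] / norm * experts[i](x) for i in chosen)


# Four toy "chefs": scalar functions standing in for expert networks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
result = moe_forward(3, experts, gate_scores=[2.0, 1.0, 0.0, -1.0])
print(result)
```

The key property the analogy conveys survives in the code: total capacity grows with the number of chefs, but per‑input compute only grows with top_k.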

Test Item 7: From video to code – reproduce three games shown in a video
Prompt: Recreate the three games (Fruit Ninja, Plants vs. Zombies, claw machine) demonstrated in the video /Users/abc/Movies/jianyingvideo/12月24日(2).mp4.

The model generated simplified versions of the games, then refined the claw‑machine UI based on additional prompts, ultimately achieving a near‑complete replica.

Conclusion

Across all seven test scenarios, Kimi K2.5 combined with Kimi Code demonstrated strong engineering practicality. Rather than merely topping leaderboards, the model consistently handled complex logic, multimodal inputs, and real‑world development workflows, suggesting that Chinese AI agents are closing the gap between demo‑level tools and production‑ready assistants.

Tags: Large Language Model, AI Agent, Multimodal, Kimi
Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.
