Kimi K2.6 Open-Source Model Achieves 12-Hour Continuous Coding with 300 Parallel Agents
Moonshot's newly released Kimi K2.6 open-source model tops several benchmarks, supports over 4,000 tool calls in a single 12‑hour task, scales to 300 parallel sub‑agents, and introduces new front‑end, proactive agent, and Claw Groups capabilities while still lagging on visual‑reasoning tasks.
Moonshot quietly released the open‑source Kimi K2.6 model, reporting benchmark scores of 58.6 on SWE‑Bench Pro (versus GPT‑5.4 57.7 and Claude 53.4), 54.0 on HLE with tools (leading the field), and an F1 of 92.5 on DeepSearchQA, surpassing GPT‑5.4's 78.6.
On other leaderboards Gemini 3.1 Pro remains top for BrowseComp and Terminal‑Bench 2.0, while GPT‑5.4 still outperforms K2.6 on visual‑reasoning tasks such as MathVision and V*.
Although K2.6 does not win every benchmark, it secures a SOTA position among open‑source models and closes the gap with closed‑source contenders on real‑code evaluation suites.
For long‑horizon coding, K2.6 supports more than 4,000 tool calls per task across Rust, Go, and Python, covering front‑end, DevOps, and performance‑optimization scenarios. The blog cites two examples: Qwen3.5‑0.8B inference throughput improved from 15 tokens/s to 193 tokens/s, and the exchange‑core project gained a 185 % increase in median throughput and a 133 % increase in peak throughput.
The front‑end upgrade adds hero‑section video embedding, WebGL shaders, GSAP + Framer Motion animations, and Three.js 3D scenes, moving beyond the static pages typical of earlier open‑source models.
Agent Swarm scaling jumps from a maximum of 100 sub‑agents and 1,500 steps in K2.5 to 300 sub‑agents and 4,000 steps in K2.6, enabling a single command to coordinate hundreds of agents editing over a hundred files.
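Moonshot has not published the mechanics behind Agent Swarm, but the fan-out pattern it implies can be illustrated with a minimal asyncio sketch. Everything here is hypothetical scaffolding, not Moonshot's API: `run_subagent` is a stand-in worker, and the constants simply mirror the limits reported for K2.6.

```python
import asyncio

MAX_SUBAGENTS = 300   # K2.6's reported sub-agent ceiling
MAX_STEPS = 4000      # K2.6's reported step ceiling (not enforced in this sketch)

async def run_subagent(agent_id: int, task: str) -> str:
    """Stand-in for one sub-agent working on its slice of the task."""
    await asyncio.sleep(0)  # yield to the scheduler, as real tool calls would
    return f"agent-{agent_id}: edited {task}"

async def swarm(files: list[str]) -> list[str]:
    # Bound concurrency at the model's sub-agent limit.
    sem = asyncio.Semaphore(MAX_SUBAGENTS)

    async def bounded(i: int, f: str) -> str:
        async with sem:
            return await run_subagent(i, f)

    return await asyncio.gather(*(bounded(i, f) for i, f in enumerate(files)))

# A single command fanning out over ~120 files, one sub-agent per file.
results = asyncio.run(swarm([f"file_{n}.py" for n in range(120)]))
print(len(results))
```

The semaphore is the key design choice: it lets a coordinator accept an arbitrarily large file list while never exceeding the sub-agent cap, which is presumably the kind of bound a real swarm scheduler has to respect.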
Proactive Agents target infrastructure scenarios: K2.6 serves as the base model for OpenClaw and Hermes Agent, and Moonshot's RL team demonstrated a five‑day continuous autonomous run covering monitoring, incident response, and system operations.
Claw Groups, a research preview, lets users combine their own agents, external bots, and human participants in one workgroup, with adaptive scheduling that assigns tasks to the most suitable executor—described as a hybrid of Feishu and Slack for agents.
Deployment channels include a web UI at kimi.com (both Agent and Chat modes), a synchronized mobile app, an API at platform.moonshot.ai, and open‑source weights and code hosted on Hugging Face under the moonshotai account. Moonshot recommends the Kimi Code CLI for production‑grade coding tasks.
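For API access, Moonshot's platform has historically exposed an OpenAI-compatible chat-completions interface; a minimal request sketch follows. The endpoint path and the model id `kimi-k2.6` are assumptions here, so verify both against the platform.moonshot.ai documentation before use.

```python
import json

# Assumed values -- check platform.moonshot.ai docs before relying on them.
BASE_URL = "https://api.moonshot.ai/v1/chat/completions"
MODEL_ID = "kimi-k2.6"

def build_request(prompt: str, api_key: str) -> tuple[dict, str]:
    """Build headers and an OpenAI-style JSON body for one chat turn."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # low temperature tends to suit coding tasks
    })
    return headers, body

headers, body = build_request("Refactor this Rust module.", api_key="sk-...")
print(json.loads(body)["model"])
```

The sketch only constructs the request; sending it with any HTTP client against `BASE_URL` yields a standard chat-completions response if the endpoint matches the OpenAI wire format.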
Moonshot has not disclosed official pricing; community comparisons suggest coding ability on par with GPT‑5.4 at roughly 76 % lower cost than Claude Opus 4.7, and the model weights are fully open‑source.
Key strengths are its open‑source SOTA standing, sustained 12‑hour continuous coding with 4,000+ tool calls, and the novel Claw Groups collaboration model. Weaknesses include weaker performance on visual‑reasoning benchmarks, lower multilingual SWE‑Bench scores than Claude Opus, and the unproven real‑world stability of 300 parallel sub‑agents.
Practically, a 12‑hour continuous run imposes significant cost, network stability, and orchestration overhead; many teams may achieve better ROI by stabilizing shorter 30‑minute to 2‑hour agents.
The rapid iteration from K2 to K2.5 to K2.6—doubling sub‑agent counts and step limits while narrowing benchmark gaps—combined with upcoming releases from competitors like DeepSeek, suggests the open‑source arena is eroding the assumption that only closed‑source models can handle long‑horizon tasks.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
