How 78 Samples Outperform 10,000: The LIMI Breakthrough in Agent AI
The paper introduces the LIMI framework, which achieves state-of-the-art agent performance on AgencyBench using only 78 carefully crafted samples, outperforming baseline models trained on thousands of examples. Its gains come from high-quality, strategic data construction, and it demonstrates superior generalization across code, research, and tool-use tasks.
Introduction
The paper introduces the LIMI framework, which trains AI agents using only 78 carefully curated samples while achieving superior performance on the AgencyBench benchmark.
Motivation and Problem Definition
Modern AI agents are expected to act as autonomous workers that can plan, execute, and iterate on complex tasks. Conventional approaches assume that larger datasets inevitably improve agent intelligence, leading to high training costs and limited interpretability.
LIMI Framework Overview
LIMI adopts a “less‑is‑more” philosophy inspired by LIMA (few‑shot alignment) and LIMO (few‑shot mathematical reasoning). The central hypothesis is that a small set of high‑quality examples that capture essential agent behaviors can induce strong capabilities.
Strategic Data Construction
Query Pool Construction – Real + Synthetic Dual Track
Real queries: 60 tasks collected from real development and research scenarios (e.g., code repair, dataset search) to ensure ecological validity.
Synthetic queries: 18 tasks generated by a GPT-5-like model that analyzes high-quality GitHub pull requests, synthesizes realistic task descriptions, and filters them by code complexity and domain coverage.
These 78 queries form the entire training set.
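The dual-track construction above can be sketched as a simple merge-and-filter step. This is a minimal illustration, not the paper's implementation: the `Query` fields and the `min_complexity` threshold are hypothetical stand-ins for whatever criteria the authors' filtering model applies.

```python
from dataclasses import dataclass

@dataclass
class Query:
    text: str
    source: str      # "real" or "synthetic"
    domain: str      # e.g. "code", "research"
    complexity: int  # hypothetical 1-5 rating assigned by a filtering model

def build_query_pool(real, synthetic, min_complexity=3):
    """Keep all real queries; keep only synthetic candidates
    that pass the complexity filter."""
    pool = list(real)
    pool += [q for q in synthetic if q.complexity >= min_complexity]
    return pool

# Illustrative numbers: 60 real tasks plus 30 synthetic candidates,
# of which 18 survive the filter, yielding the 78-query training set.
real = [Query(f"real task {i}", "real", "code", 5) for i in range(60)]
synthetic = [Query(f"pr-derived task {i}", "synthetic", "code", (i % 5) + 1)
             for i in range(30)]
pool = build_query_pool(real, synthetic)
print(len(pool))  # 78
```

The point of the sketch is the asymmetry: real queries are trusted as-is, while synthetic ones must earn their place through filtering.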
Trajectory Collection – Full Interaction Sequences
For each query a complete multi‑turn interaction (trajectory) between the AI and a human is recorded. Each trajectory contains three key actions:
Model inference: the AI's reasoning process.
Tool invocation: execution of code, data search, etc.
Environment feedback: tool results or human corrections.
The average trajectory length is 42.4K tokens, providing dense learning signals.
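A trajectory built from these three action types can be represented with a small data structure. This is a hedged sketch of one plausible schema, assuming the step names `inference`, `tool_call`, and `feedback`; the paper's actual serialization format is not specified here.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    kind: str      # one of "inference", "tool_call", "feedback"
    content: str

@dataclass
class Trajectory:
    query: str
    steps: list = field(default_factory=list)

    def add(self, kind, content):
        # Restrict steps to the three action types described above.
        assert kind in ("inference", "tool_call", "feedback")
        self.steps.append(Step(kind, content))

# Hypothetical multi-turn example for one training query.
traj = Trajectory("Fix the failing unit test in the parser module")
traj.add("inference", "The test fails because of an off-by-one error in tokenize().")
traj.add("tool_call", "run: pytest tests/test_parser.py")
traj.add("feedback", "1 passed in 0.12s")
```

Storing the full alternation of reasoning, tool calls, and feedback (rather than only final answers) is what makes each of the 78 samples such a dense learning signal.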
Experimental Design and Evaluation
Benchmarks
AgencyBench: a comprehensive agent benchmark covering 10 tasks across code development, scientific analysis, and tool usage.
Generalization benchmarks: tau2-bench (tool use), EvalPlus (code generation), DS-1000 (data science), SciCode (scientific computing).
Training and Comparison Settings
Baseline models: GLM‑4.5, Kimi‑K2, DeepSeek‑V3.1, Qwen‑3, etc.
LIMI variants: GLM‑4.5 and GLM‑4.5‑Air fine‑tuned on the 78‑sample dataset.
Contrast experiment: the same base models trained on a large agent dataset (AFM-CodeAgent-SFT, 10,000 samples) for a fair data-size comparison.
Key Findings and Result Analysis
Performance on AgencyBench: LIMI achieves a 73.5% average score, far exceeding all baselines (GLM-4.5 45.1%, Kimi-K2 24.1%, DeepSeek-V3.1 11.9%), and improves first-round functional completion by more than 30 percentage points.
Data-efficiency breakthrough: Using only 78 samples, LIMI outperforms a GLM-4.5 model trained on 10,000 samples (73.5% vs. 47.8%), a roughly 128-fold reduction in data with a 25.7-point absolute gain (53.7% relative improvement).
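The efficiency figures quoted above can be verified with a few lines of arithmetic, using only the sample counts and scores reported in the paper:

```python
# Reported sample counts and AgencyBench average scores.
samples_limi, samples_baseline = 78, 10_000
score_limi, score_baseline = 73.5, 47.8

fold_reduction = samples_baseline / samples_limi       # ~128x less data
absolute_gain = score_limi - score_baseline            # 25.7 points
relative_gain = absolute_gain / score_baseline * 100   # ~53.8%, matching the
                                                       # paper's ~53.7% up to rounding
print(round(fold_reduction, 1), round(absolute_gain, 1), round(relative_gain, 1))
```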
Generalization: LIMI leads baselines on code generation (EvalPlus-HumanEval 92.1%), tool use (tau2-bench-retail 45.6%), and maintains an advantage in a pure-inference mode without tool calls (50.0% vs. 48.7%).
Case Studies – Real‑World Agent Tasks
Code Development
Task 1 – Build a C++ chat system: baseline models fail on the sub‑task of chat‑history storage, while LIMI completes the task without errors.
Task 3 – Develop a Gomoku game: baselines stumble on board rendering and win detection; LIMI succeeds on all components except the AI difficulty module.
Research Workflow
Task 7 – Search Hugging Face datasets: LIMI returns more relevant datasets and receives higher expert scores.
Task 8 – Fit a mathematical equation: LIMI achieves a loss of 5.95e‑7 on the first attempt, whereas baselines need multiple interventions to reach 1.14e‑6.
Task 9 – NBA player reasoning: LIMI answers correctly on most sub‑tasks with fewer reasoning steps and faster responses.
Discussion
Agent Efficiency Principle: The paper proposes that agent autonomy stems from strategic selection of high-quality behavior demonstrations rather than sheer data volume, challenging the conventional scaling law in AI development.
Comparison with traditional methods: Conventional approaches rely on massive data and reinforcement learning, incurring high cost and low interpretability. LIMI demonstrates that a minimalist dataset can achieve stronger generalization, offering a viable path for resource-constrained agent development.
Conclusion
LIMI shows that agent intelligence can be cultivated efficiently by focusing on the essence of agent behavior. With only 78 curated samples, it surpasses models trained on orders of magnitude more data, reducing training cost while enhancing controllability and explainability. The approach is extensible to other domains and can be combined with automated data‑generation techniques to further advance few‑shot agent AI.
Published by Data Party THU, the official platform of the Tsinghua Big Data Research Center, which shares the team's latest research, teaching updates, and big data news.