How Sakana’s Unconventional AI Orchestrator Fugu Beats Fable 5 in Code Benchmarks
Japanese startup Sakana’s new multi‑agent orchestration system, Fugu, combines publicly available models to deliver code‑generation performance that surpasses closed‑source rivals like Fable 5, offering two versions, detailed benchmark results, qualitative use‑case demos, pricing options, and an analysis of its engineering trade‑offs.
Core Concept
Fugu is a multi‑agent orchestration system implemented as an LLM. When a request arrives, the model decides whether to answer directly or to assemble a team of specialist models, invoking them (including recursive calls to itself) to solve complex tasks.
Architecture
Technical Foundations
Two ICLR 2026 papers provide the underlying mechanisms:
TRINITY : a lightweight evolutionary coordinator that assigns Thinker, Worker, and Verifier roles to models and dynamically allocates tasks.
Conductor : a reinforcement‑learning trained component that discovers natural‑language collaboration strategies for the agents.
Versions
Fugu : balances performance and latency for everyday coding assistants and chatbots; supports exclusion of specific models for compliance.
Fugu Ultra : coordinates a larger pool of expert models for high‑difficulty workloads such as Kaggle competitions, paper reproduction, and security analysis.
Both versions are accessed through an OpenAI‑compatible API without code changes.
Benchmark Results
Scores (higher is better) on selected benchmarks:
SWE Bench Pro – Fugu 59.0, Fugu Ultra 73.7 , Opus 4.8 69.2, Gemini 3.1 Pro 54.2, GPT 5.5 58.6
TerminalBench 2.1 – Fugu 80.2, Fugu Ultra 82.1 , Opus 4.8 74.6, Gemini 3.1 Pro 70.3, GPT 5.5 78.2
LiveCodeBench – Fugu 92.9, Fugu Ultra 93.2 , Opus 4.8 87.8, Gemini 3.1 Pro 88.5, GPT 5.5 85.3
LiveCodeBench Pro – Fugu 87.8, Fugu Ultra 90.8 , Opus 4.8 84.8, Gemini 3.1 Pro 82.9, GPT 5.5 88.4
Humanity’s Last Exam – Fugu 47.2, Fugu Ultra 50.0 , Opus 4.8 49.8, Gemini 3.1 Pro 44.4, GPT 5.5 41.4
GPQA‑D – Fugu 95.5 , Fugu Ultra 95.5 , Opus 4.8 92.0, Gemini 3.1 Pro 94.3, GPT 5.5 93.6
Qualitative Case Studies
AutoResearch / LLM training optimization : Fugu Ultra executed 123 experiments on a single H100 GPU (≈14 h), achieving an average bits‑per‑byte (BPB) of 0.9774, surpassing all baselines.
Rubik’s Cube Solver : Solved 300 random cubes with an average of 19.72 moves (Model A: 19.76). Results: 7 wins, 293 draws, 0 losses; two other models crashed.
CAD Mechanical Iris : Generated CAD parts with clean rotation and aperture; competing models produced gaps or weak linkages.
Blindfold Chess : Defeated three frontier models and a 2100‑Elo Stockfish engine in a memory‑only match without a visible board.
Stock Trading Simulation : Grew $10 000 to $11 943 (+19.43 %) over 50 weeks, outperforming competing models that stayed below +15 %.
Analysis
Potential advantages of the orchestration approach:
Continuous integration of newly released high‑quality models, allowing system capability to grow with the ecosystem.
Reduced vendor lock‑in, which mitigates risks from geopolitical export‑control restrictions.
Task‑specific “model‑punch‑card” combinations can outperform any single model.
Observed challenges include orchestration latency, increased cost, added system complexity, and the opacity of the model‑selection logic.
Conclusion
Fugu demonstrates that a meta‑model capable of dynamically coordinating specialist models can achieve performance comparable to leading closed‑source frontier models without relying on a single large model.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
AI Engineering
Focused on cutting‑edge product and technology information and practical experience sharing in the AI field (large models, MLOps/LLMOps, AI application development, AI infrastructure).
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
