How Agora Uncovered 15 Zero‑Day Deep Bugs in Consensus Protocols with an Industrial‑Grade Multi‑Agent Framework
The paper presents Agora, a hypothesis‑driven multi‑agent system that combines domain knowledge with LLM agents to automatically detect deep logic bugs in production‑level consensus protocols. It discovered 15 previously unknown vulnerabilities across Raft, EPaxos, HotStuff and BullShark and outperformed GPT‑5.2, Claude Sonnet 4.5 and other baselines at a fraction of the token cost.
Problem
Consensus protocols such as Raft, EPaxos, HotStuff and BullShark have extremely complex state machines. Their deep logic bugs (Deep Bugs) span multiple execution phases and depend on global distributed state, which makes them hard to detect with traditional testing or single‑agent LLM analysis.
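To make the notion of a Deep Bug concrete, here is a minimal sketch of a checker for two protocol-level invariants over a commit trace collected from several nodes: per-node commit-index monotonicity and cross-node agreement on the entry at each log index. The `CommitEvent` type, the `check_trace` function and the trace format are illustrative assumptions; they are not taken from Agora or from any of the evaluated libraries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitEvent:
    """One commit observed on one node (hypothetical trace format)."""
    node: str    # node that committed the entry
    index: int   # log index being committed
    entry: str   # identifier of the committed entry


def check_trace(events):
    """Return a list of invariant violations found in a commit trace."""
    violations = []
    last_index = {}  # node -> highest index committed so far
    chosen = {}      # index -> entry first committed at that index
    for ev in events:
        previous = last_index.get(ev.node, 0)
        # Monotonicity: a node's commit index must never move backwards.
        if ev.index < previous:
            violations.append(
                f"monotonicity: {ev.node} went from {previous} to {ev.index}"
            )
        last_index[ev.node] = max(previous, ev.index)
        # Agreement: every node must commit the same entry at a given index.
        if chosen.setdefault(ev.index, ev.entry) != ev.entry:
            violations.append(
                f"divergence at index {ev.index}: {chosen[ev.index]} vs {ev.entry}"
            )
    return violations


trace = [
    CommitEvent("a", 1, "x"),
    CommitEvent("b", 1, "x"),
    CommitEvent("a", 2, "y"),
    CommitEvent("b", 2, "z"),  # b disagrees with a at index 2
    CommitEvent("a", 1, "x"),  # a's commit index moves backwards
]
for violation in check_trace(trace):
    print(violation)
```

Neither violation is visible from a single node's local state at a single point in time; both only appear when events from several nodes and several phases are compared, which is the property that makes such bugs hard to reach with unit-style tests.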
Limitations of Single‑Agent LLMs
Even state‑of‑the‑art large models (GPT‑5.2, Gemini 3.0 Pro Preview, Claude Sonnet 4.5, Qwen‑3 Coder) can locate only low‑level implementation bugs; they fail to reason about protocol‑level invariants and produce zero detections of Deep Bugs in the evaluated consensus libraries.
Agora Framework
Agora introduces a hypothesis‑driven multi‑agent workflow that tightly couples domain‑specific invariants with large‑model agents. The system consists of three highly specialized agents:
Orchestrator Agent: maintains global state, records known vulnerabilities and extrapolates them into new exploit hypotheses.
Strategy Agent: injects distributed‑system knowledge to synthesize aggressive attack scenarios for both crash‑fault‑tolerant (CFT) and Byzantine‑fault‑tolerant (BFT) protocols.
TestGen Agent: generates concrete test code (supporting Go, Rust, etc.), executes the tests, and evaluates the results.
Agents communicate through a Succinct Memory & Communication layer that minimizes context transfer overhead, enabling rapid, language‑agnostic test generation and a reflection loop that captures execution logs for targeted self‑correction.
The core Harness architecture (shown as a diagram in the original article) integrates the three agents into a closed‑loop testing pipeline.
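The closed loop described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Agora's actual implementation: the names `Hypothesis`, `TestResult`, `Orchestrator` and `run_loop` are hypothetical, and the Strategy and TestGen agents are passed in as plain callables standing in for LLM-backed components.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    description: str


@dataclass
class TestResult:
    status: str  # "violation", "pass", or "error" (test failed to build or run)
    log: str     # execution log captured for the reflection step


@dataclass
class Orchestrator:
    """Holds global state: confirmed bugs and the queue of open hypotheses."""
    known_bugs: list = field(default_factory=list)
    queue: list = field(default_factory=list)

    def record(self, hypothesis):
        self.known_bugs.append(hypothesis.description)
        # Extrapolate: a confirmed bug seeds a related follow-up hypothesis.
        self.queue.append(Hypothesis(f"variant of: {hypothesis.description}"))


def run_loop(seeds, strategy, testgen, max_rounds=10, max_retries=2):
    """Drive hypotheses through strategy -> test generation -> reflection."""
    orch = Orchestrator(queue=[Hypothesis(s) for s in seeds])
    rounds = 0
    while orch.queue and rounds < max_rounds:
        rounds += 1
        hyp = orch.queue.pop(0)
        scenario = strategy(hyp.description)  # domain knowledge -> attack scenario
        feedback = None
        for _ in range(max_retries + 1):
            result = testgen(scenario, feedback)  # generate and execute a test
            if result.status != "error":
                break
            feedback = result.log  # reflection: feed the log into the next attempt
        if result.status == "violation":
            orch.record(hyp)
    return orch.known_bugs
```

Only the short scenario string and the latest log cross agent boundaries in this sketch, which mirrors the idea of keeping inter-agent context small. The `max_rounds` cap is an added safeguard so that extrapolated hypotheses cannot keep the loop running indefinitely.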
Evaluation
Agora was evaluated on four major consensus implementations, including the production‑grade etcd library and the core components of the Sui blockchain. Results:
Discovered 15 previously unknown protocol‑level Deep Bugs covering execution divergence, monotonicity violations, topology defects and signature flaws.
Baseline large models detected 0/15 of these bugs despite consuming large token budgets.
Bug‑report precision: 73.9 % (the remaining 26.1 % of reports were false positives).
Average token consumption per critical bug: 5.32 M tokens (≈ $40).
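As a quick sanity check on the last two figures, the short calculation below derives the share of false reports from the stated precision and the blended price per million tokens implied by the stated cost. It uses only the numbers quoted above; the variable names are illustrative, and the implied price is an inference from rounded figures rather than a number reported by the authors.

```python
precision = 0.739        # reported bug-report precision
tokens_per_bug = 5.32e6  # reported average tokens per critical bug
cost_per_bug_usd = 40.0  # reported approximate cost per critical bug

# Share of reports that were not real bugs (1 - precision).
false_report_share = 1.0 - precision

# Blended price per million tokens implied by the two cost figures.
usd_per_million_tokens = cost_per_bug_usd / (tokens_per_bug / 1e6)

print(f"false reports: {false_report_share:.1%}")                    # 26.1%
print(f"implied price: ${usd_per_million_tokens:.2f} per M tokens")  # $7.52
```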
Extensibility
The hypothesis‑driven multi‑agent paradigm is not limited to consensus protocols. The authors propose applying the same workflow to other hard‑to‑test domains such as database concurrency control, operating‑system kernel synchronization, and Web3 smart‑contract auditing.
Open‑source repository: https://github.com/0gfoundation/agora (provides reusable skills and plug‑in support for additional domains).
