Blockchain 12 min read

How Agora Uncovered 15 Zero‑Day Deep Bugs in Consensus Protocols with an Industrial‑Grade Multi‑Agent Framework

The paper presents Agora, a hypothesis‑driven multi‑agent system that integrates domain knowledge with large‑model agents to automatically detect deep logic bugs in production‑level consensus protocols, discovering 15 previously unknown vulnerabilities across Raft, EPaxos, HotStuff and BullShark while outperforming GPT‑5.2, Claude 4.5 and other baselines at a fraction of the token cost.

Machine Heart

Jun 11, 2026

How Agora Uncovered 15 Zero‑Day Deep Bugs in Consensus Protocols with an Industrial‑Grade Multi‑Agent Framework

Problem

Consensus protocols such as Raft, EPaxos, HotStuff and BullShark have extremely complex state machines, making deep logic bugs (Deep Bugs) that span multiple execution phases and rely on global distributed state hard to detect with traditional testing or single‑agent LLM analysis.

Limitations of Single‑Agent LLMs

Even state‑of‑the‑art large models (GPT‑5.2, Gemini 3.0 Pro Preview, Claude Sonnet 4.5, Qwen‑3 Coder) can locate only low‑level implementation bugs; they fail to reason about protocol‑level invariants and produce zero detections of Deep Bugs in the evaluated consensus libraries.

Agora Framework

Agora introduces a hypothesis‑driven multi‑agent workflow that tightly couples domain‑specific invariants with large‑model agents. The system consists of three highly specialized agents:

Orchestrator Agent : maintains global state, records known vulnerabilities and extrapolates them into new exploit hypotheses.

Strategy Agent : injects distributed‑system knowledge to synthesize aggressive attack scenarios for both crash‑fault‑tolerant (CFT) and Byzantine‑fault‑tolerant (BFT) protocols.

TestGen Agent : generates concrete test code (supporting Go, Rust, etc.), executes the tests, and evaluates the results.

Agents communicate through a Succinct Memory & Communication layer that minimizes context transfer overhead, enabling rapid, language‑agnostic test generation and a reflection loop that captures execution logs for targeted self‑correction.

The core Harness architecture (see diagram) integrates the three agents into a closed‑loop testing pipeline:

Evaluation

Agora was evaluated on four major consensus implementations, including the production‑grade etcd library and the core components of the Sui blockchain. Results:

Discovered 15 previously unknown protocol‑level Deep Bugs covering execution divergence, monotonicity violations, topology defects and signature flaws.

Baseline large models detected 0/15 of these bugs despite consuming large token budgets.

Bug‑report precision: 73.9 % (false‑positive rate 26.1 %).

Average token consumption per critical bug: 5.32 M tokens (≈ $40).

Extensibility

The hypothesis‑driven multi‑agent paradigm is not limited to consensus protocols. The authors propose applying the same workflow to other hard‑to‑test domains such as database concurrency control, operating‑system kernel synchronization, and Web3 smart‑contract auditing.

Open‑source repository: https://github.com/0gfoundation/agora (provides reusable skills and plug‑in support for additional domains).

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

distributed systems Blockchain Security Consensus Protocols LLM testing Multi-Agent AI Deep Bug Detection

Written by

Machine Heart

Professional AI media and industry service platform

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.