How an Open‑Source Orchestration Framework Lets Any LLM Find Zero‑Day Bugs
The article argues that vulnerability discovery depends more on a flexible orchestration harness than on exclusive access to frontier LLMs. Using the open‑source IronCurtain framework with both commercial and open‑weight models, it reproduces a 27‑year‑old OpenBSD bug and uncovers new zero‑day flaws in widely used projects, then analyzes costs, workflow design, and security implications.
Myth of Frontier Models
A common security narrative holds that only the newest frontier LLMs (e.g., Anthropic’s Mythos preview) can uncover decades‑old memory‑safety bugs such as the 1998 OpenBSD TCP SACK flaw. The article argues the opposite: the decisive factor is the orchestration framework that manages model interactions, not the model itself.
IronCurtain Orchestration Framework
IronCurtain is a research prototype that enables structured AI‑driven security investigations. Workflows are defined as finite‑state machines (FSM) in pure YAML. An Orchestrator agent routes tasks based on an append‑only execution log, allowing each specialist agent to start from a clean context restored from disk.
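The routing idea described above can be sketched in a few lines of Python. This is a minimal illustration, not IronCurtain's implementation: the state names, the outcome labels, the JSON-lines log format, and the stub agents are all assumptions made for the example, and the real framework defines its workflows in YAML rather than in a Python dict.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical workflow table: state -> {outcome: next state}.
# IronCurtain expresses this in YAML; a dict only mirrors the FSM idea.
WORKFLOW = {
    "hypothesize": {"candidate": "verify", "nothing": "done"},
    "verify": {"confirmed": "report", "rejected": "hypothesize"},
    "report": {"written": "done"},
}


def append_entry(log_path, entry):
    """Append one JSON line; earlier entries are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def read_log(log_path):
    """Restore the full history from disk (empty if no run has started)."""
    if not Path(log_path).exists():
        return []
    with open(log_path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def next_state(log_path, start="hypothesize"):
    """Route using only the log: look at the last state and its outcome."""
    entries = read_log(log_path)
    if not entries:
        return start
    last = entries[-1]
    return WORKFLOW[last["state"]][last["outcome"]]


def run(log_path, agents):
    """Drive the FSM; each agent sees only the history restored from disk."""
    state = next_state(log_path)
    while state != "done":
        outcome = agents[state](read_log(log_path))
        append_entry(log_path, {"state": state, "outcome": outcome})
        state = next_state(log_path)
    return read_log(log_path)


def demo():
    """Stub agents: the first verification is rejected, the second confirmed."""
    agents = {
        "hypothesize": lambda entries: "candidate",
        "verify": lambda entries: (
            "confirmed"
            if any(e["state"] == "verify" for e in entries)
            else "rejected"
        ),
        "report": lambda entries: "written",
    }
    with tempfile.TemporaryDirectory() as tmp:
        history = run(Path(tmp) / "run.jsonl", agents)
    return [entry["state"] for entry in history]
```

Because the router reads nothing but the log, an interrupted run can resume by calling `run` again on the same file, and every specialist starts without context inherited from a previous agent's session.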
The orchestrator never reads source code directly; it relies entirely on the log to guide the investigation until a final vulnerability report is produced. Running the workflow on a medium‑size codebase consumes roughly 10 million tokens, costing about $150 with Opus 4.6 or $30 with Sonnet 4.6 per run. GLM 5.1 on the Z.AI platform consumes about 27 million tokens per round, with pricing that makes the total cost comparable to Sonnet.
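The totals above imply a blended price per million tokens, which makes the three options easier to compare. The sketch below only divides the article's figures; the results are not provider list prices, and the $30 total for GLM 5.1 is an assumption standing in for "comparable to Sonnet".

```python
def implied_rate_per_million(total_cost_usd, total_tokens):
    """Blended USD per million tokens implied by a run's total cost."""
    return total_cost_usd / (total_tokens / 1_000_000)


# (total cost in USD, total tokens) taken from the figures quoted above.
# The GLM 5.1 cost is assumed equal to Sonnet's because the article only
# says the totals are comparable.
RUNS = {
    "Opus 4.6": (150.0, 10_000_000),
    "Sonnet 4.6": (30.0, 10_000_000),
    "GLM 5.1 (assumed $30 total)": (30.0, 27_000_000),
}

if __name__ == "__main__":
    for name, (cost, tokens) in RUNS.items():
        rate = implied_rate_per_million(cost, tokens)
        print(f"{name}: about ${rate:.2f} per million tokens")
```

Under these assumptions, GLM 5.1 would need a blended rate near one third of Sonnet's to land at the same total, since it consumes roughly 2.7 times as many tokens.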
Reproducing the 1998 OpenBSD TCP SACK Bug
The workflow follows one principle: form a static hypothesis first, then verify it by execution. A proof‑of‑concept (PoC) tool is generated to trigger the bug, providing empirical evidence beyond static analysis.
Initial runs failed because prompt engineering and log maintenance were insufficient; the container could not launch an OpenBSD VM, so the agent fell back to static analysis and produced a report without a PoC.
Iterating on the workflow revealed that early hypothesis exploration does not require a full VM; lightweight testing such as single‑function fuzzing suffices. Using Claude Code with Opus 4.6, the model performed a targeted fuzz test that isolated the bug after examining 4.3 billion sequence numbers, finding only two differing values at a 32‑bit signed‑integer boundary.
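To show why a single‑function test can expose a problem at a signed 32‑bit boundary, here is a small property check in Python. It is an illustration only: the article does not publish the harness the model wrote, and `seq_lt` below is the conventional signed‑difference comparison for wrapping sequence numbers, assumed for the example rather than copied from the OpenBSD source. The property tested is that for two distinct values, exactly one should compare as "less than" the other.

```python
MOD = 1 << 32


def to_int32(x):
    """Reinterpret the low 32 bits of x as a signed 32-bit integer."""
    x &= MOD - 1
    return x - MOD if x >= (1 << 31) else x


def seq_lt(a, b):
    """Wraparound-aware 'a is before b' via the sign of the 32-bit difference."""
    return to_int32(a - b) < 0


def antisymmetry_violations(base, offsets):
    """Offsets d where seq_lt(base, base+d) and seq_lt(base+d, base) agree."""
    bad = []
    for d in offsets:
        other = (base + d) % MOD
        if other != base and seq_lt(base, other) == seq_lt(other, base):
            bad.append(d)
    return bad


def boundary_offsets(width=4):
    """A narrow window of offsets centred on the signed boundary 2**31."""
    mid = 1 << 31
    return range(mid - width, mid + width + 1)
```

Scanning `boundary_offsets()` flags only the offset of exactly 2**31, where each value appears to precede the other. A full sweep of the 32‑bit space would run the same check over every offset, which is cheap enough that no VM is needed at this stage.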
With the triggering condition identified, the model generated a QEMU‑based driver to execute the exploit on a real VM, successfully reproducing the kernel crash. Subsequent refinements added a hierarchical testing strategy (single‑function isolation, multi‑component tools, full VM verification), enabling fully automated runs without manual guidance.
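The hierarchical strategy can be sketched as an escalation ladder that runs the cheapest check first and only pays for a VM when the earlier tiers reproduce the issue. The tier names follow the article; the `Tier` type, the `escalate` function, and the stub checks are hypothetical and stand in for whatever the workflow actually invokes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    name: str
    check: Callable[[str], bool]  # True if the hypothesis reproduces here


def escalate(hypothesis: str, tiers: List[Tier]) -> Tuple[List[str], bool]:
    """Run tiers in order; stop at the first one that fails to reproduce."""
    passed: List[str] = []
    for tier in tiers:
        if not tier.check(hypothesis):
            return passed, False
        passed.append(tier.name)
    return passed, True


def demo_tiers(vm_reproduces: bool) -> List[Tier]:
    """Stub checks standing in for real fuzzers, tools, and VM drivers."""
    return [
        Tier("single-function isolation", lambda h: True),
        Tier("multi-component tools", lambda h: True),
        Tier("full VM verification", lambda h: vm_reproduces),
    ]
```

A hypothesis counts as verified only when `escalate` returns `True`; the list of passed tiers records how far it got, which is the kind of detail an execution log can carry into the next state.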
Autonomous Discovery in Real‑World Projects
Applying the refined workflow to four widely deployed open‑source projects uncovered previously unreported bugs. Details are withheld pending upstream coordination and CVE assignment.
Case Study 1: Media Framework Vulnerability
Using Opus 4.6, the workflow identified a defect, generated multi‑component test tools, and after manual tuning of the environment produced a reliable PoC that was reported to the upstream maintainers.
Case Study 2: 18‑Year‑Old Integer Truncation Bug
Switching the model to GLM 5.1 via a LiteLLM gateway (keeping the IronCurtain FSM unchanged) led to the discovery of an integer‑truncation flaw that had persisted for 18 years in a memory‑allocation path. The model generated a PoC and a sanitizer‑validated test confirming exploitability.
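For readers unfamiliar with the bug class, the sketch below shows how an integer truncation in an allocation path typically arises. It is a generic illustration: the article withholds the details of the actual flaw, so the function names, field width, and values here are invented for the example.

```python
UINT32_MAX = 0xFFFFFFFF


def alloc_size_truncated(count, elem_size):
    """Mimic storing a wider product in a 32-bit unsigned size field."""
    return (count * elem_size) & UINT32_MAX


def alloc_size_checked(count, elem_size):
    """Reject requests whose true size does not fit in 32 bits."""
    total = count * elem_size
    if total > UINT32_MAX:
        raise OverflowError("allocation size exceeds 32 bits")
    return total


if __name__ == "__main__":
    count, elem_size = 0x10000001, 16
    print("bytes the caller intends to use:", count * elem_size)
    print("bytes actually allocated:", alloc_size_truncated(count, elem_size))
```

With these hypothetical values the allocator reserves 16 bytes while the caller goes on to index far beyond that, which is the kind of mismatch a sanitizer‑validated test reports as a heap out‑of‑bounds access.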
To assess severity, Opus 4.7 with Claude Code was used for guided manual analysis, producing a controlled heap‑out‑of‑bounds read/write primitive. The model refused to continue the exploit due to policy guards, but the first two steps demonstrated a bypass of ASLR by reading a base pointer, supporting a high‑severity rating.
Three Key Observations
PoC generation is essential for defenders. Static‑only findings generate many false positives; an executable PoC quickly validates a vulnerability, allowing security teams to focus on real threats.
Orchestration provides the scaffolding, but model quality still matters. The lower bound of what can be extracted is set by the base model, yet open‑weight models now approach the performance of commercial offerings.
Economic factors now favor frequent, broad audits. Commercial API pricing ranges from $30 to $150 per investigation; open‑weight providers charge comparable rates per token, and self‑hosting can further reduce marginal costs after upfront investment.
Orchestration as a Double‑Edged Sword
Well‑funded adversaries already employ large‑scale orchestration workflows to hunt zero‑days without vendor policy constraints or API rate limits. The model's refusal to complete the seven‑step exploit illustrates this asymmetry: defenders encounter friction, while attackers using unrestricted open‑weight models do not.
Locally hosted open‑weight models remove the entry barrier and shift responsibility onto researchers, the same dual‑use position that traditional tools (Metasploit, nmap, Burp Suite, AFL) have long occupied. IronCurtain aims to bridge the gap by combining open‑source scaffolding with either local or publicly available models, enabling defenders to audit codebases before exploits can be automated.
GitHub repository: https://github.com/provos/ironcurtain
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Black & White Path
We are the beacon of the cyber world, a stepping stone on the road to security.
