How an Open‑Source Orchestration Framework Lets Any LLM Find Zero‑Day Bugs

Vulnerability discovery depends on a flexible orchestration harness more than on exclusive access to frontier LLMs. Using the open‑source IronCurtain framework with both commercial and open‑weight models, the article reproduces a 27‑year‑old OpenBSD bug, uncovers new zero‑day flaws in widely used projects, and analyses costs, workflow design, and security implications.

Black & White Path

Myth of Frontier Models

A common security narrative holds that only the newest frontier LLMs (e.g., Anthropic’s Mythos preview) can uncover decades‑old memory‑safety bugs such as the 1998 OpenBSD TCP SACK flaw. The counter‑argument made here is that the decisive factor is the orchestration framework that manages model interactions, not the model itself.

IronCurtain Orchestration Framework

IronCurtain is a research prototype that enables structured AI‑driven security investigations. Workflows are defined as finite‑state machines (FSMs) in pure YAML. An Orchestrator agent routes tasks based on an append‑only execution log, so each specialist agent can start from a clean context restored from disk.
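
The routing idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not IronCurtain's implementation: the state names, the outcome labels, the JSON Lines log format, and every function name below are hypothetical, and a plain dict stands in for the parsed YAML workflow.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical workflow table: state -> {outcome: next state}.
# A dict stands in here for a workflow that would be parsed from YAML.
WORKFLOW = {
    "hypothesize": {"candidate": "verify", "nothing_found": "done"},
    "verify": {"confirmed": "report", "rejected": "hypothesize"},
    "report": {"written": "done"},
}


def append_entry(log_path, entry):
    """Append one JSON line; earlier entries are never rewritten."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def read_log(log_path):
    """Restore the full execution history from disk."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def next_state(log_path, start="hypothesize"):
    """Route using only the log: the last entry decides the next state."""
    entries = read_log(log_path)
    if not entries:
        return start
    last = entries[-1]
    return WORKFLOW[last["state"]][last["outcome"]]


def run(log_path, agents, max_steps=20):
    """Drive the FSM; each agent sees only the log restored from disk."""
    state = next_state(log_path)
    steps = 0
    while state != "done" and steps < max_steps:
        outcome = agents[state](read_log(log_path))
        append_entry(log_path, {"state": state, "outcome": outcome})
        state = next_state(log_path)
        steps += 1
    return state


def demo():
    """Stub agents: the first verification is rejected, the second confirmed."""
    stub_agents = {
        "hypothesize": lambda log: "candidate",
        "verify": lambda log: (
            "confirmed" if any(e["state"] == "verify" for e in log) else "rejected"
        ),
        "report": lambda log: "written",
    }
    with tempfile.TemporaryDirectory() as tmp:
        log_path = Path(tmp) / "run.jsonl"
        final = run(log_path, stub_agents)
        return final, [e["state"] for e in read_log(log_path)]
```

Because the router consults nothing but the log file, an interrupted run could in principle be resumed by calling `run` again with the same path; whether IronCurtain resumes this way is not stated in the article.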

The orchestrator never reads source code directly; it relies entirely on the log to guide the investigation until a final vulnerability report is produced. One run on a medium‑size codebase consumes roughly 10 million tokens, which costs about $150 with Opus 4.6 or $30 with Sonnet 4.6. GLM 5.1 on the Z.AI platform consumes about 27 million tokens per run, but its pricing brings the total cost close to that of Sonnet.
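
The figures above imply a blended per‑token rate for each model. The short calculation below only divides the quoted totals; it does not use any provider's published price list, and the GLM line assumes that "comparable to Sonnet" means roughly $30 per run, which is an assumption rather than a number from the article.

```python
def implied_rate_per_million(total_cost_usd, tokens_millions):
    """Blended cost in USD per million tokens implied by a run's total."""
    return total_cost_usd / tokens_millions


# Totals and token counts as quoted in the text.
opus_rate = implied_rate_per_million(150, 10)
sonnet_rate = implied_rate_per_million(30, 10)

# Assumption: a GLM 5.1 run costs about the same $30 as a Sonnet run.
glm_rate = implied_rate_per_million(30, 27)

print(f"Opus 4.6:   ${opus_rate:.2f} per million tokens")
print(f"Sonnet 4.6: ${sonnet_rate:.2f} per million tokens")
print(f"GLM 5.1:    ${glm_rate:.2f} per million tokens (assumed total)")
```

Under that assumption, GLM 5.1 uses 2.7 times as many tokens as Sonnet but at a little over a third of the blended rate, which is how the totals end up close.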

Reproducing the 1998 OpenBSD TCP SACK Bug

The workflow follows the principle "static hypothesis first, execution verification second". A proof‑of‑concept (PoC) tool is generated to trigger the bug and provide empirical evidence beyond static analysis.

Initial runs failed. Prompt engineering and log maintenance were insufficient, and because the container could not launch an OpenBSD VM, the agent fell back to static analysis and produced a report without a PoC.

Iterating on the workflow revealed that early hypothesis exploration does not require a full VM; lightweight testing such as single‑function fuzzing suffices. Using Claude Code with Opus 4.6, the model performed a targeted fuzz test that isolated the bug after examining 4.3 billion sequence numbers, finding only two differing values at a 32‑bit signed‑integer boundary.
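
To see why a sweep over the sequence space can come back with only a handful of disagreements clustered at the signed boundary, consider the sketch below. It is not the OpenBSD code and not the harness the model wrote. It models the signed‑difference idiom commonly used for wrap‑around sequence comparison and checks one property (that "a is before b" should agree with "b is after a") at a few hand‑picked offsets. All names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def to_int32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= MASK32
    return value - (1 << 32) if value >= (1 << 31) else value


def seq_lt(a, b):
    """Wrap-around 'a is before b' via the sign of the 32-bit difference."""
    return to_int32(a - b) < 0


def seq_gt(a, b):
    """Wrap-around 'a is after b' via the sign of the 32-bit difference."""
    return to_int32(a - b) > 0


def mismatching_deltas(base, deltas):
    """Offsets where seq_lt(a, base) and seq_gt(base, a) disagree."""
    bad = []
    for delta in deltas:
        a = (base + delta) & MASK32
        if seq_lt(a, base) != seq_gt(base, a):
            bad.append(delta)
    return bad


# Offsets on both sides of the signed 32-bit boundary, plus the extremes.
BOUNDARY_DELTAS = [0, 1, 0x7FFFFFFE, 0x7FFFFFFF,
                   0x80000000, 0x80000001, 0xFFFFFFFF]

if __name__ == "__main__":
    for base in (0, 0xFFFFFFF0):
        bad = mismatching_deltas(base, BOUNDARY_DELTAS)
        print(hex(base), [hex(d) for d in bad])
```

In this toy model the two formulations disagree only when the sequence numbers are exactly 2^31 apart, because negating the most negative 32‑bit value yields the same value again. A single‑function test like this runs in well under a second, which is the point the paragraph above makes about not needing a full VM for early hypothesis exploration.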

With the triggering condition identified, the model generated a QEMU‑based driver to execute the exploit on a real VM, successfully reproducing the kernel crash. Subsequent refinements added a hierarchical testing strategy (single‑function isolation, multi‑component tools, full VM verification), enabling fully automated runs without manual guidance.
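
The hierarchical strategy amounts to running the cheapest check first and escalating only while the hypothesis survives. A minimal sketch follows, assuming each tier can be reduced to a callable that returns True when the hypothesis still holds. The `Tier` type, the tier names, and the stub checks are hypothetical placeholders for real fuzzers, test tools, and VM drivers.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    name: str
    check: Callable[[], bool]  # True when the hypothesis survives this tier


def escalate(tiers: List[Tier]) -> Tuple[bool, List[str]]:
    """Run tiers cheapest-first; stop at the first tier that refutes."""
    passed = []
    for tier in tiers:
        if not tier.check():
            return False, passed
        passed.append(tier.name)
    return True, passed


# Stub checks standing in for the three levels named in the text.
tiers = [
    Tier("single-function isolation", lambda: True),
    Tier("multi-component tools", lambda: True),
    Tier("full VM verification", lambda: True),
]

confirmed, passed_tiers = escalate(tiers)
print(confirmed, passed_tiers)
```

Ordering the tiers by cost means an expensive VM boot is spent only on hypotheses that have already survived the cheaper checks.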

Autonomous Discovery in Real‑World Projects

Applying the refined workflow to four widely deployed open‑source projects uncovered previously unreported bugs. Details are withheld pending upstream coordination and CVE assignment.

Case Study 1: Media Framework Vulnerability

Using Opus 4.6, the workflow identified a defect and generated multi‑component test tools. After manual tuning of the environment, it produced a reliable PoC, which was reported to the upstream maintainers.

Case Study 2: 18‑Year‑Old Integer Truncation Bug

Switching the model to GLM 5.1 via a LiteLLM gateway (keeping the IronCurtain FSM unchanged) led to the discovery of an integer‑truncation flaw that had persisted for 18 years in a memory‑allocation path. The model generated a PoC and a sanitizer‑validated test confirming exploitability.
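
Since the details of this bug are withheld, the following is only a generic illustration of the bug class, not the affected project's code. It assumes a hypothetical allocator that stores a requested size in a 32‑bit field: when the true size needs more than 32 bits, the high bits are dropped and the buffer ends up far smaller than the caller believes. Function names and values are invented for the example.

```python
MASK32 = 0xFFFFFFFF


def alloc_size_truncated(count, elem_size):
    """Size as a 32-bit field would store it: high bits silently dropped."""
    return (count * elem_size) & MASK32


def alloc_size_checked(count, elem_size):
    """Same computation, but rejecting sizes that do not fit in 32 bits."""
    total = count * elem_size
    if total > MASK32:
        raise OverflowError("requested size does not fit in 32 bits")
    return total


if __name__ == "__main__":
    count, elem_size = 0x10000001, 16
    print("true size:     ", hex(count * elem_size))
    print("truncated size:", hex(alloc_size_truncated(count, elem_size)))
```

Here the true size is 0x100000010 bytes, but the truncated field records only 0x10, so later code that trusts `count` would read or write far past a 16‑byte buffer. This is the kind of condition a sanitizer‑instrumented test can confirm at run time.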

To assess severity, Opus 4.7 with Claude Code was used for guided manual analysis, producing a controlled heap‑out‑of‑bounds read/write primitive. The model refused to continue the exploit because of policy guards, but the first two steps, which it did complete, demonstrated an ASLR bypass by reading a base pointer, which supports a high‑severity rating.

Three Key Observations

PoC generation is essential for defenders. Static‑only findings generate many false positives; an executable PoC quickly validates a vulnerability, allowing security teams to focus on real threats.

Orchestration provides the scaffolding, but model quality still matters. The base model bounds what the scaffolding can extract, although open‑weight models now approach the performance of commercial offerings.

Economic factors now favor frequent, broad audits. Commercial API pricing ranges from about $30 to $150 per investigation; hosted open‑weight models come out at a comparable cost per run, and self‑hosting can further reduce marginal costs after the upfront investment.

Orchestration as a Double‑Edged Sword

Well‑funded adversaries already employ large‑scale orchestration workflows to hunt zero‑days without vendor policy constraints or API rate limits. The model's refusal partway through the seven‑step exploit in Case Study 2 illustrates this asymmetry: defenders encounter friction, while attackers using unrestricted open‑weight models do not.

Locally hosted open‑weight models remove the entry barrier and shift responsibility onto researchers, as has long been the case with dual‑use tools such as Metasploit, nmap, Burp Suite, and AFL. IronCurtain aims to bridge the gap by combining open‑source scaffolding with either local or publicly available models, enabling defenders to audit codebases before exploits can be automated.

GitHub repository: https://github.com/provos/ironcurtain

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: vulnerability research, LLM security, OpenBSD, AI orchestration, zero‑day discovery, IronCurtain
Written by

Black & White Path
