AI Era Vulnerability Benchmark Revamp: 3,632 CVE Insights & VulnGym Release
Analyzing 3,632 high‑severity GitHub Advisory reports from 2025‑2026, the authors reveal a sharp rise in business‑logic flaws—especially in high‑star projects—prompting a redesign of vulnerability‑detection benchmarks, and introduce VulnGym, a real‑project, white‑box dataset with 400+ paths and detailed entry‑point, trace, and critical‑operation annotations.
Background and Motivation
With the emergence of AI‑driven code analysis tools such as Claude Security and Codex Security, vulnerability detection is shifting from merely finding syntactic defects to understanding project context and business‑risk exposure. Existing benchmarks still focus on traditional CWE categories and cannot assess a tool’s ability to locate cross‑file, cross‑process business‑logic flaws.
Statistical Findings from 3,632 CVEs
We examined 3,632 high‑severity (2,768 HIGH + 864 CRITICAL) GitHub Advisory reports from Jan 2025 to Apr 2026. Overall, business‑logic vulnerabilities account for 31.14 % (1,131/3,632). In 2025 the yearly proportion was 25.67 % (382/1,488); in the first four months of 2026 it rose to 34.93 % (749/2,144). For projects with ≥10 K stars, the share jumped from 26.11 % in 2025 to 39.59 % in early 2026, reaching 47.2 % in April 2026.
Monthly trends show a clear upward curve after October 2025, especially in high‑star repositories, suggesting a structural shift rather than random fluctuation.
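As a quick sanity check, the headline shares can be recomputed from the counts quoted above. The sketch below uses only the numbers given in this section; the dictionary keys and the `share` helper are illustrative names, not part of any released tooling.

```python
# Counts quoted in the text: (business-logic reports, all high-severity reports).
counts = {
    "overall": (1131, 3632),
    "2025": (382, 1488),
    "2026 (Jan-Apr)": (749, 2144),
}


def share(hits: int, total: int) -> float:
    """Return the percentage of `hits` in `total`, rounded to two decimals."""
    return round(100 * hits / total, 2)


shares = {label: share(hits, total) for label, (hits, total) in counts.items()}

# The yearly figures should add up to the overall figures,
# and the severity split (HIGH + CRITICAL) should match the total.
yearly_hits = counts["2025"][0] + counts["2026 (Jan-Apr)"][0]
yearly_total = counts["2025"][1] + counts["2026 (Jan-Apr)"][1]
severity_total = 2768 + 864

for label, pct in shares.items():
    print(f"{label}: {pct:.2f} %")
```

The yearly counts sum to the overall counts (382 + 749 = 1,131 and 1,488 + 2,144 = 3,632), so the three percentages are mutually consistent.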
Why the Shift?
We attribute the change to three supply‑side factors. First, AI coding assistants increasingly generate code that follows secure defaults, reducing classic injection‑type bugs. Second, these tools lack long‑term knowledge of project‑specific business rules, leading to “correct‑looking” code that nevertheless violates authorization, tenancy, or workflow constraints. An analysis of 36 top AI‑agent projects uncovered over 500 vulnerability reports, with business‑logic flaws comprising 41.3 %, about ten percentage points higher than the overall average.
Third, the AI ecosystem introduces novel flaw classes such as prompt injection and tool‑whitelist bypass, which fall outside traditional CWE taxonomies but are effectively business‑logic issues.
Detection‑Side Evolution
Recent security projects (Anthropic Project Glasswing, OpenAI Daybreak, Microsoft MDASH, AWS Security Agent) move beyond rule‑based scanning to understand cross‑file semantics, enabling the discovery of previously hidden business‑logic defects. The rising share of such bugs in recent reports therefore likely reflects improved detection in part, not only an increase in how often they occur.
Benchmark Mismatch
We evaluated mainstream benchmarks (Devign, BigVul, OWASP Benchmark, Juliet, CyberGym) against four dimensions required for project‑level business‑logic assessment: granularity (function/file/project), vulnerability type coverage, evaluation mode (white‑box vs black‑box), and support for cross‑module reasoning. All fall short in at least one dimension, whereas VulnGym satisfies all four.
VulnGym Design
VulnGym is a white‑box benchmark built from reviewed GitHub Advisory entries. After three filters—reviewed status, ≥10 K stars, and recent HIGH/CRITICAL severity—we retained 184 reports, extracting over 400 distinct vulnerability paths. Each sample is annotated with three manually verified elements: entry point, critical operation, and an ordered trace linking them.
entry_point ──→ trace[0], trace[1], …, trace[n] ──→ critical_operation
External entry | Cross‑module data/control flow | Sensitive code trigger
This fine‑grained annotation enables deterministic, explainable evaluation and distinguishes between a tool merely guessing a location and one that truly understands the vulnerability.
Use Cases
VulnGym can serve both as a dataset and as a common yardstick for comparing (1) different LLM back‑ends (Claude, GPT, Gemini, etc.) under a fixed harness, (2) harness engineering choices (code chunk size, architecture hints, skill orchestration), and (3) full agent solutions (Claude Security, Codex Security, etc.). The benchmark thus quantifies current performance, identifies bottlenecks, and measures the impact of iterative improvements.
Community Invitation
The authors invite contributions of real‑world vulnerability samples, annotation corrections, and suggestions for expanding the label taxonomy and harness design, aiming to keep the benchmark aligned with the evolving threat landscape.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.
