How AI Multi‑Agent Systems Are Revolutionizing Code Security Audits
This article explores how Wukong's AI-driven multi-agent architecture dramatically improves code security auditing: it addresses context loss and scheduling imbalance, and integrates a data flywheel that turns bad cases into continuous model improvements, illustrated by a real NVIDIA Megatron-LM vulnerability fix.
Humans Can't Read All the Code Anymore
When code complexity explodes, how can code security audits stay efficient and rigorous? Starting from Wukong Agent receiving NVIDIA's official thanks, this article introduces a multi-agent architecture that closes the audit loop and tackles real-world challenges such as context breakage and scheduling imbalance.
AI's purpose is not to replace humans but to amplify human judgment, returning code security auditing to the genuinely human work of analysis, reasoning, and decision-making.
Initially, code security audits relied on expert intuition; as teams grew, that experience was codified into static application security testing (SAST) tools. Both approaches struggle when projects span hundreds of thousands of files and millions of lines of code, because alert triage cannot keep pace with code complexity.
NVIDIA's Official Thanks
During a routine AI-framework security audit, Wukong Agent detected an unsafe deserialization call in pretrain_gpt.py of NVIDIA's Megatron-LM project: the use of yaml.load() could lead to arbitrary code execution. The fix is simply replacing it with yaml.safe_load(), as sketched below.
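A minimal sketch of the vulnerable pattern and its fix; the function names and config path are illustrative assumptions, not the actual Megatron-LM code.

```python
import yaml

# Vulnerable pattern: an unsafe loader can instantiate arbitrary Python
# objects from attacker-controlled YAML, e.g. a payload like
# "!!python/object/apply:os.system ['id']".
def load_config_unsafe(path):
    with open(path) as f:
        return yaml.load(f, Loader=yaml.UnsafeLoader)  # code-execution risk

# Fixed pattern: yaml.safe_load() only builds plain data types
# (dicts, lists, strings, numbers), neutralizing the injection vector.
def load_config_safe(path):
    with open(path) as f:
        return yaml.safe_load(f)
```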
📄 Security Bulletin: NVIDIA Megatron‑LM – September 2025 CVE‑2025‑23348 — “Malicious data in pretrain_gpt may cause code injection, leading to code execution, privilege escalation, and data tampering.” Severity: High (CVSS 7.8)
The official patch was released in versions 0.13.1 / 0.12.3, completing a full vulnerability‑to‑fix closed loop.
Wukong Agent Architecture and Iteration Efficiency
The core of Wukong Agent is not a "big model" but fast iteration: we aim to turn every false positive, false negative, and re-test into training data for the next version.
When vulnerability discovery depends more on the maturity of a closed-loop system than on any single model's performance, we shift our focus to architecture design and a data flywheel.
1. Architecture: A Collaborative Army of Agents
We decompose code-audit tasks into a multi-agent workflow where experts set the direction and agents execute. Five agents work together (a minimal hand-off sketch follows the list):
Client Agent: the user entry point; receives tasks, generates configs, returns results.
Remote Agent: plans and routes tasks, breaking complex audits into executable subtasks.
Audit Agent: the core vulnerability scanner; performs multi-level analysis on code snippets and whole projects.
Review Agent: re-examines detection results using multiple prompts and scoring to confirm true vulnerabilities.
Fix Agent: generates remediation suggestions, validates patches, and closes the vulnerability-to-fix loop.
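The sketch below shows how such a detect-review-fix hand-off could be wired (the Client Agent entry point is omitted for brevity). Every class and method name, the single regex rule, and the confidence threshold are assumptions for illustration, not Wukong's actual API.

```python
import re
from dataclasses import dataclass

@dataclass
class Finding:
    path: str
    line: int
    rule: str
    confidence: float = 0.5
    patch: str = ""

class AuditAgent:
    """Core scanner: here reduced to one toy rule for unsafe yaml.load()."""
    def scan(self, path: str, code: str) -> list[Finding]:
        return [
            Finding(path, i, "unsafe-yaml-load")
            for i, line in enumerate(code.splitlines(), 1)
            if re.search(r"\byaml\.load\(", line)
        ]

class ReviewAgent:
    """Confirms findings; a real agent would re-prompt and score."""
    def review(self, f: Finding, code: str) -> Finding:
        # Suppress the alert if a safe loader is already in use.
        f.confidence = 0.1 if "SafeLoader" in code else 0.9
        return f

class FixAgent:
    """Proposes remediation, closing the vulnerability-to-fix loop."""
    def fix(self, f: Finding, code: str) -> Finding:
        f.patch = code.replace("yaml.load(", "yaml.safe_load(")
        return f

class RemoteAgent:
    """Plans the audit into per-file subtasks and routes them."""
    def run(self, files: dict[str, str]) -> list[Finding]:
        audit, review, fix = AuditAgent(), ReviewAgent(), FixAgent()
        confirmed = []
        for path, code in files.items():
            for f in audit.scan(path, code):
                f = review.review(f, code)
                if f.confidence >= 0.8:  # threshold is illustrative
                    confirmed.append(fix.fix(f, code))
        return confirmed

if __name__ == "__main__":
    demo = {"pretrain_gpt.py": "cfg = yaml.load(open('c.yaml'))\n"}
    for f in RemoteAgent().run(demo):
        print(f.path, f.line, f.rule, "->", f.patch.strip())
```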
In continuous testing of open‑source projects, this collaboration yielded roughly an 80% reduction in scan time and more than a 60% drop in false‑positive rates.
2. Review: Bottlenecks and Optimizations
(1) Accuracy Bottleneck – Context Breakage
Large models often miss essential context in complex projects, leading to broken analysis chains. Two main causes are:
Framework complexity (deep inheritance, dynamic imports, async logic) truncates semantic chains.
Specialized internal code patterns lack coverage in public corpora.
We introduced systematic "Context Engineering" (a retrieval sketch follows the list):
Enhanced context extraction: custom templates capture inheritance, decorators, and framework hooks.
Knowledge completion: a RAG-like mechanism retrieves internal knowledge when unknown calls are detected, giving the model a local memory.
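A minimal sketch of the knowledge-completion step: when static analysis hits a call it cannot resolve, retrieve matching notes from an internal knowledge base and splice them into the audit prompt. The knowledge-base entries and prompt template here are invented for illustration.

```python
import ast

INTERNAL_KB = {  # hypothetical internal-framework knowledge base
    "rpc.dispatch": "Deserializes the payload with pickle before routing.",
    "orm.raw": "Passes the string to the SQL engine without escaping.",
}

def unresolved_calls(code: str, known: set[str]) -> list[str]:
    """Collect dotted call names the static context does not cover."""
    calls = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if isinstance(node.func.value, ast.Name):
                name = f"{node.func.value.id}.{node.func.attr}"
                if name not in known:
                    calls.append(name)
    return calls

def complete_context(code: str, known: set[str]) -> str:
    """Build an audit prompt whose gaps are filled from the internal KB."""
    notes = [
        f"- {name}: {INTERNAL_KB[name]}"
        for name in unresolved_calls(code, known)
        if name in INTERNAL_KB
    ]
    kb_block = "\n".join(notes) or "- (no internal notes found)"
    return f"Audit this code.\nInternal API notes:\n{kb_block}\n\n{code}"

print(complete_context("rpc.dispatch(data)", known={"os.path"}))
```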
After these improvements, most bad cases are resolved in the second iteration.
(2) Agent Scheduling Imbalance – Serial Degradation
Although the design expects parallel cooperation, in practice the main agent often monopolizes tasks, turning the system into a single‑agent pipeline and degrading performance.
We plan to address this with a reinforcement‑learning scheduler that learns optimal task‑agent assignments from a small set of high‑quality annotations, enabling agents to proactively hand off work.
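Since this scheduler is planned work rather than a shipped component, the following is only a toy illustration of the shape of such learned task-agent assignment, here as an epsilon-greedy bandit; all names and reward signals are assumptions.

```python
import random
from collections import defaultdict

class RLScheduler:
    """Learns which agent handles which task type best from reward signals."""
    def __init__(self, agents, epsilon=0.1):
        self.agents = agents
        self.epsilon = epsilon
        self.q = defaultdict(float)  # average reward per (task_type, agent)
        self.n = defaultdict(int)    # observation count per pair

    def assign(self, task_type: str) -> str:
        if random.random() < self.epsilon:  # explore occasionally
            return random.choice(self.agents)
        # Otherwise exploit the best-known agent for this task type.
        return max(self.agents, key=lambda a: self.q[(task_type, a)])

    def feedback(self, task_type: str, agent: str, reward: float):
        key = (task_type, agent)
        self.n[key] += 1
        self.q[key] += (reward - self.q[key]) / self.n[key]  # running mean

sched = RLScheduler(["audit", "review", "fix"])
agent = sched.assign("taint-analysis")
sched.feedback("taint-analysis", agent, reward=1.0)  # e.g. confirmed finding
```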
3. Data Flywheel: From Bad Cases to Model Iteration
Every day we extract representative bad cases from real scans, route them through model verification and human QA, and feed the curated data back into the training and evaluation pipelines. This three-stage pipeline (data production, fine-tuning, evaluation) has raised the closed-loop resolution rate from roughly 30% to over 97%. A sketch of the pipeline's shape follows.
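A minimal sketch of what the three stages could look like as code; the field names, labels, and data shapes are assumptions, not the production pipeline.

```python
from dataclasses import dataclass

@dataclass
class BadCase:
    code: str
    model_verdict: str   # what the model originally said
    truth: str           # verdict after re-verification and human QA
    reviewed: bool = False

def produce(scan_results) -> list[BadCase]:
    """Stage 1: pull false positives / false negatives from real scans."""
    return [BadCase(r["code"], r["verdict"], r["truth"]) for r in scan_results]

def curate(cases: list[BadCase]) -> list[dict]:
    """Stage 2: human-QA'd disagreements become fine-tuning pairs."""
    return [
        {"input": c.code, "label": c.truth}
        for c in cases
        if c.reviewed and c.model_verdict != c.truth
    ]

def evaluate(model, heldout: list[BadCase]) -> float:
    """Stage 3: closed-loop resolution rate on held-out bad cases."""
    solved = sum(model(c.code) == c.truth for c in heldout)
    return solved / max(len(heldout), 1)
```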
4. AI Infra Reality – Subtle Security Gaps
AI infrastructure is not inherently secure. Common hidden risks include unchecked configuration files, legacy serialization logic, drifting dependencies, and default components in popular frameworks.
Examples of real-world gaps (a generic sketch of two of these patterns follows the list):
LLaMA‑Factory model loading chain.
MS‑SWIFT framework command concatenation points.
LangChain’s default EverNoteLoader component.
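To make the patterns concrete, here is a generic sketch of two of the gap classes above, command concatenation and legacy serialization. This is the vulnerable shape and its fix, not the actual code of any named framework.

```python
import pickle
import subprocess

# Command concatenation: interpolating external input into a shell string
# lets a path like "model.bin; rm -rf /" ride along as extra commands.
def convert_unsafe(model_path: str):
    subprocess.run(f"converter --input {model_path}", shell=True)

# Fix: pass an argument vector; the path is never parsed by a shell.
def convert_safe(model_path: str):
    subprocess.run(["converter", "--input", model_path], check=True)

# Legacy serialization: unpickling an untrusted checkpoint can construct
# arbitrary objects (and run code) at load time.
def load_checkpoint_unsafe(path: str):
    with open(path, "rb") as f:
        return pickle.load(f)
```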
These findings show that the rapid evolution of AI infrastructure outpaces security adaptations, and many vulnerabilities reside outside the model itself.
Conclusion
AI agents relieve security teams from repetitive mechanical work, allowing them to focus on why designs are made rather than where they fail. Security becomes a long‑term competitive advantage, not a cost, and AI can serve both as a protected asset and an active guardian.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.