
China’s Mysterious AI Security Team “MopMonk” Shocks the Industry with a 73% Success Rate

A previously unknown Chinese AI security group called MopMonk, which has no website or corporate backing, published a report on GitHub describing an agent that achieved a 73.1% vulnerability‑exploitation success rate, placing seventh globally on the UC Berkeley‑run CyberGym benchmark. The report describes memory‑based multi‑agent techniques, which the article presents as a sign of China’s rising AI security prowess.

Black & White Path

Jul 2, 2026


CyberGym benchmark

CyberGym is an AI security benchmark created by a UC Berkeley team and accepted as an ICLR 2026 paper. It contains 1,507 real OSS‑Fuzz vulnerabilities. Agents must run offline, reason over millions of lines of code in large open‑source projects, and produce a proof‑of‑concept (PoC) input that triggers the bug on the vulnerable commit but does not trigger it on the patched version.

Why the benchmark is hard

Agents need to locate entry points, generate valid inputs, and verify that the PoC works only on the vulnerable version, all without network access.
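The acceptance rule described above (crash on the vulnerable build, no crash on the patched build) can be sketched in a few lines. This is a minimal illustration, not CyberGym's actual harness: the `Runner` type, the build labels, and the toy `fake_run` function are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical runner type: takes a build label and PoC bytes and returns the
# harness exit code (0 = clean exit, non-zero = crash or sanitizer abort).
Runner = Callable[[str, bytes], int]


def poc_is_valid(run: Runner, poc: bytes) -> bool:
    """Return True only if the PoC crashes the vulnerable build and not the patched one."""
    crashes_vulnerable = run("vulnerable", poc) != 0
    crashes_patched = run("patched", poc) != 0
    return crashes_vulnerable and not crashes_patched


def fake_run(build: str, poc: bytes) -> int:
    # Toy stand-in for a fuzz harness: the "bug" fires on inputs longer than
    # 8 bytes, and only in the vulnerable build.
    if build == "vulnerable" and len(poc) > 8:
        return 1
    return 0


print(poc_is_valid(fake_run, b"A" * 16))  # crashes only the vulnerable build
print(poc_is_valid(fake_run, b"A" * 4))   # crashes neither build
```

An input that crashes both builds is rejected as well, since it does not demonstrate the specific bug that the patch fixed.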

MopMonk architecture

1. Chinese‑developed MiniMax M3 base model

Ultra‑long context window : about one million tokens, enabling ingestion of an entire codebase.

Mixture‑of‑Experts sparse attention : maintains performance while reducing compute cost.

Strong programming ability : scores 59.0 % on SWE‑Bench Pro, 66.0 % on Terminal‑Bench 2.1, and 74.2 % on MCP Atlas.

2. Structured “Vulnerability Memory” system

The system stores key information in seven structured memory types, turning repeated trial‑and‑error into evidence‑driven convergence.

Vulnerability target memory : target vulnerability, success conditions, verification criteria.

Code‑path memory : confirmed entry points, harnesses, parsing chains, suspicious functions.

Input format memory : file format, field relationships, length constraints, boundary conditions.

Candidate PoC memory : candidate inputs, generation rationale, triggered behavior, mutation directions.

Negative evidence memory : non‑trigger attempts, unreachable paths, build failures, format errors.

Verification state memory : whether the PoC triggers a crash and, if not, why it failed.

Next‑step constraint memory : concrete constraints that the next attempt must satisfy.
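One way to picture the seven memory types is as fields of a single record. The sketch below is an assumption about shape only: the class name, field names, use of plain strings, and the `record_failure` helper are hypothetical and are not taken from the MopMonk repository.

```python
from dataclasses import dataclass, field


@dataclass
class VulnerabilityMemory:
    # One field per memory type named in the article. Field names and the
    # use of plain strings are illustrative assumptions.
    target: str = ""
    code_paths: list[str] = field(default_factory=list)
    input_format: list[str] = field(default_factory=list)
    candidate_pocs: list[str] = field(default_factory=list)
    negative_evidence: list[str] = field(default_factory=list)
    verification_state: list[str] = field(default_factory=list)
    next_step_constraints: list[str] = field(default_factory=list)

    def record_failure(self, attempt: str, reason: str, constraint: str) -> None:
        """Turn a failed attempt into negative evidence plus a constraint."""
        self.negative_evidence.append(f"{attempt}: {reason}")
        self.verification_state.append(f"{attempt}: no crash ({reason})")
        self.next_step_constraints.append(constraint)


# Hypothetical usage: a failed attempt narrows what the next attempt must do.
memory = VulnerabilityMemory(target="heap overflow in a chunk parser")
memory.candidate_pocs.append("poc-001")
memory.record_failure(
    attempt="poc-001",
    reason="rejected by header checksum",
    constraint="next input must carry a valid header checksum",
)
print(memory.next_step_constraints)
```

Keeping failures and constraints as first-class fields is what lets a later attempt start from accumulated evidence instead of repeating an earlier mistake.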

3. Multi‑agent collaborative exploration

Multiple agents share the same memory store. Each agent reads current evidence, tests hypotheses, and writes back new constraints or results. This design yields three direct effects:

Reduced duplicate work : failed paths are recorded and not retried.

Preserved negative evidence : non‑trigger attempts become constraints rather than being discarded.

Higher effective experiment density : more directions are explored within a limited budget.
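The effect of a shared store on duplicate work can be shown with a toy simulation. Everything here is hypothetical: the `SharedMemory` class, the hypothesis names, and the rule that decides which hypothesis triggers the bug are invented for illustration and say nothing about how MopMonk coordinates its agents internally.

```python
import threading


class SharedMemory:
    """Thread-safe store shared by all agents (hypothetical design)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._claimed: set[str] = set()
        self.negative_evidence: list[str] = []
        self.working_pocs: list[str] = []

    def claim(self, hypothesis: str) -> bool:
        """Reserve a hypothesis; return False if another agent already took it."""
        with self._lock:
            if hypothesis in self._claimed:
                return False
            self._claimed.add(hypothesis)
            return True

    def record(self, hypothesis: str, triggered: bool) -> None:
        """Write the result back so every agent can see it."""
        with self._lock:
            if triggered:
                self.working_pocs.append(hypothesis)
            else:
                self.negative_evidence.append(hypothesis)


def agent(memory: SharedMemory, hypotheses: list[str]) -> None:
    for hypothesis in hypotheses:
        if memory.claim(hypothesis):  # skip anything already tried
            # Toy oracle: only one made-up hypothesis triggers the bug.
            memory.record(hypothesis, hypothesis == "nested-chunk")


memory = SharedMemory()
plans = [
    ["long-header", "negative-length", "nested-chunk"],
    ["negative-length", "nested-chunk", "truncated-footer"],
]
threads = [threading.Thread(target=agent, args=(memory, plan)) for plan in plans]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(sorted(memory.negative_evidence))
print(memory.working_pocs)
```

The two plans list six hypotheses in total, but only the four distinct ones are ever tested, and the three that fail remain visible to both agents as negative evidence.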

Ranking results (CyberGym Level 1, 4‑hour timeout)

Rank – Agent (model) – Success rate – Date

1 – Crystalline (Claude Opus 4.6) – 89.6 % – 2026‑06‑08

2 – MDASH (Multi‑model) – 88.4 % – 2026‑05‑12

3 – OpenAI Agent (GPT‑5.5‑Cyber) – 85.6 % – 2026‑06‑22

4 – Anthropic Agent (Claude Mythos Preview) – 83.1 % – 2026‑04‑07

5 – OpenAI Agent (GPT‑5.5) – 81.8 % – 2026‑04‑23

6 – OpenAI Agent (GPT‑5.4) – 79.0 % – 2026‑04‑23

7 – MopMonk Agent (MiniMax M3) – 73.1 % – 2026‑06‑29

Task‑time distribution and resource usage

<10 minutes: 39.95 %

10‑30 minutes: 23.95 %

30‑60 minutes: 7.76 %

1‑2 hours: 10.82 %

2‑3 hours: 0.86 %

3‑4 hours: 16.66 %
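As a quick sanity check on the distribution above, the short script below confirms that the six buckets account for all tasks and computes the running share of tasks finished by the end of each bucket. The percentages are copied from the article; the variable names are arbitrary.

```python
# Share of tasks per time bucket, as listed in the article (percent).
buckets = [
    ("<10 minutes", 39.95),
    ("10-30 minutes", 23.95),
    ("30-60 minutes", 7.76),
    ("1-2 hours", 10.82),
    ("2-3 hours", 0.86),
    ("3-4 hours", 16.66),
]

total = sum(share for _, share in buckets)

# Running total: share of tasks finished by the end of each bucket.
cumulative = []
running = 0.0
for label, share in buckets:
    running += share
    cumulative.append((label, round(running, 2)))

print(f"total: {total:.2f} %")
for label, share in cumulative:
    print(f"finished by end of '{label}': {share:.2f} %")
```

The buckets sum to 100 %, about 64 % of tasks finish within 30 minutes, and about 72 % finish within the first hour.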

Total token consumption (including cache) reached 99,944,644,535 tokens, of which 2,091,474,371 were non‑cached. The number of LLM requests was 1,582,007.
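A few derived figures help put these totals in perspective. The three input numbers come from the article; the per-task figure additionally assumes that the run covered all 1,507 benchmark instances, which the article does not state explicitly.

```python
total_tokens = 99_944_644_535      # including cache, from the article
non_cached_tokens = 2_091_474_371  # from the article
llm_requests = 1_582_007           # from the article
benchmark_tasks = 1_507            # CyberGym size; assumes every task was attempted

cached_share = 1 - non_cached_tokens / total_tokens
tokens_per_request = total_tokens / llm_requests
requests_per_task = llm_requests / benchmark_tasks

print(f"cached share of tokens: {cached_share:.1%}")
print(f"tokens per request:     {tokens_per_request:,.0f}")
print(f"requests per task:      {requests_per_task:,.0f}")
```

Roughly 98 % of all tokens were served from cache, and the average request carried about 63,000 tokens. Under the stated assumption, that works out to roughly 1,050 LLM requests per task.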

Technical repository

https://github.com/MopMonkAI/MopMonkAgent


