GLM 5.2 Beats Claude in IDOR Security Benchmark with 39% F1

Semgrep's benchmark shows that the open-source GLM 5.2 model, using only a unified prompt and a lightweight Pydantic AI scheduler, achieves a 39% F1 score on insecure direct object reference (IDOR) vulnerability detection. That beats Claude Code's best result of 37.4%, at a cost of only about $0.17 per discovered flaw.

21CTO

Experimental Design

Dataset – real‑world open‑source code collection previously used in Semgrep security benchmarks.

Metric – F1 score (harmonic mean of precision and recall) computed on true vulnerability instances.

Prompt – a single instruction describing IDOR characteristics and a search strategy, applied uniformly to all models.

Test groups:

Semgrep multimodal pipeline with a proprietary enhancement framework (used with GPT‑5.5 and Claude Opus 4.8).

Claude Code accessed via the official SDK with the same prompt.

Open-weight models (GLM 5.2, MiniMax M3, Kimi K2.7 Code) run with only a lightweight Pydantic AI scheduler, without any interface enumeration or code-filtering enhancements.
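The F1 metric named in this design can be sketched in a few lines. This is a minimal illustration, not the benchmark's scoring code: the function name `f1_score` and the counts passed to it are hypothetical and were chosen only to show how precision and recall combine.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    predicted = true_positives + false_positives   # everything the model flagged
    actual = true_positives + false_negatives      # every real vulnerability
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts (not taken from the benchmark):
# precision 0.39 and recall 0.39 give an F1 of about 0.39.
balanced = f1_score(true_positives=39, false_positives=61, false_negatives=61)

# Precision 0.75 but recall 0.25 gives F1 0.375, well below the
# arithmetic mean of 0.5: the harmonic mean penalizes imbalance.
skewed = f1_score(true_positives=30, false_positives=10, false_negatives=90)

print(f"balanced={balanced:.3f} skewed={skewed:.3f}")
```

Because F1 drops when either precision or recall is low, a model cannot score well by flagging everything or by flagging almost nothing.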

GLM 5.2 Model Details

GLM 5.2 is Z‑AI’s latest open‑source model released under an MIT‑style license. It uses a Mixture‑of‑Experts architecture with ~7.5 trillion total parameters, activating ~400 billion per token, and supports a 1 million‑token context window. The per‑token cost is roughly one‑sixth that of comparable closed‑source models.

Reported scores on standard coding benchmarks: 81.0 on Terminal‑Bench 2.1 and 62.1 on SWE‑bench Pro, indicating code‑reasoning performance close to top closed models.

Results (F1 scores)

Semgrep multimodal pipeline (GPT‑5.5) – 61%.

Semgrep multimodal pipeline (Claude Opus 4.8) – 53%.

GLM 5.2 (no enhancement framework) – 39%.

Claude Code (Opus 4.6) – 37%.

Claude Code (Opus 4.8) – 28%.

MiniMax M3 – 23%.

Kimi K2.7 Code – 22%.

GLM 5.2’s per‑vulnerability inference cost is about $0.17, roughly one‑sixth of the cost for top‑tier closed models.

Key Observations

The dedicated Semgrep enhancement framework provides a decisive performance lift, accounting for the large gap between the top two pipeline results and the bare‑model scores.

Even without any augmentation, GLM 5.2 edges out Claude Code on the IDOR task (39% versus 37%), suggesting that an open‑weight model can match or exceed closed‑source counterparts when given the same prompt, while also costing less to run.
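The gaps behind these observations are simple differences between the reported F1 scores. The sketch below only restates the figures from the Results section and subtracts them; the dictionary keys and the helper name `gap` are illustrative labels, not names from the benchmark.

```python
# F1 scores in percent, as listed in the Results section.
f1 = {
    "Semgrep pipeline (GPT-5.5)": 61,
    "Semgrep pipeline (Claude Opus 4.8)": 53,
    "GLM 5.2": 39,
    "Claude Code (Opus 4.6)": 37,
    "Claude Code (Opus 4.8)": 28,
    "MiniMax M3": 23,
    "Kimi K2.7 Code": 22,
}


def gap(a: str, b: str) -> int:
    """Difference in F1 percentage points between two result rows."""
    return f1[a] - f1[b]


# Same underlying model family, with and without the Semgrep pipeline.
opus_48_gap = gap("Semgrep pipeline (Claude Opus 4.8)", "Claude Code (Opus 4.8)")

# Best pipeline result versus the best result without the pipeline.
pipeline_vs_glm = gap("Semgrep pipeline (GPT-5.5)", "GLM 5.2")

# GLM 5.2 versus the stronger of the two Claude Code runs.
glm_vs_claude_code = gap("GLM 5.2", "Claude Code (Opus 4.6)")

print(opus_48_gap, pipeline_vs_glm, glm_vs_claude_code)
```

Read this way, the pipeline accounts for gaps of 20 points or more, while the difference between GLM 5.2 and the best Claude Code run is 2 points.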

Example IDOR Endpoint Used in Benchmark

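from flask import jsonify  # `app` and the SQLAlchemy `User` model are assumed to be defined elsewhere

@app.route('/user/<int:user_id>')
def get_user(user_id):
    # IDOR: the record is looked up by a caller-supplied ID with no ownership or permission check
    user = User.query.get_or_404(user_id)
    return jsonify(user.to_dict())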
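What makes the endpoint above vulnerable is the missing authorization check, and that check can be shown without Flask. The following is a framework-free sketch under stated assumptions: the in-memory `USERS` store, the `is_admin` flag, the `Forbidden` and `NotFound` exceptions, and both function names are hypothetical stand-ins, not part of the benchmark code.

```python
from dataclasses import dataclass, asdict


@dataclass
class User:
    id: int
    name: str
    is_admin: bool = False


# Hypothetical in-memory store standing in for the database query.
USERS = {
    1: User(1, "alice"),
    2: User(2, "bob"),
    3: User(3, "root", is_admin=True),
}


class Forbidden(Exception):
    """Caller is not allowed to read the requested record."""


class NotFound(Exception):
    """Requested record does not exist."""


def get_user_vulnerable(user_id: int) -> dict:
    # Mirrors the IDOR pattern: the ID alone decides what is returned.
    user = USERS.get(user_id)
    if user is None:
        raise NotFound(user_id)
    return asdict(user)


def get_user_checked(current_user_id: int, user_id: int) -> dict:
    # Authorize first: only the owner or an admin may read the record.
    current = USERS.get(current_user_id)
    if current is None or (current.id != user_id and not current.is_admin):
        raise Forbidden(user_id)
    user = USERS.get(user_id)
    if user is None:
        raise NotFound(user_id)
    return asdict(user)
```

In a real Flask application the caller's identity would come from the session or an authentication token rather than a function argument. The point of the sketch is only that the lookup is gated on who is asking, not just on which ID was requested.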

Limitations

The study focuses exclusively on the IDOR vulnerability type, a single fixed dataset, and a one‑shot evaluation. Results may differ for other vulnerability classes (e.g., SSRF) or larger, more diverse corpora.

Conclusion

The main performance divider in AI code‑security analysis is the presence of a specialized enhancement framework around the model, rather than the raw capability of the underlying language model. Open‑source models like GLM 5.2 can already deliver competitive security‑analysis results at a fraction of the cost, and they can be deployed offline in confidential environments.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
