Security & Risksecurityai securitybenchmarksopen source

Semgrep Benchmarks GLM-5.2 Against Claude, Finds Higher IDOR F1

|June 28, 2026|By LDS Team

6.8

Relevance Score

Semgrep Benchmarks GLM-5.2 Against Claude, Finds Higher IDOR F1 — Photo: semgrep.dev · rights & takedowns

According to a Semgrep benchmark published June 22, 2026, `GLM-5.2`, an open-weight model from Zhipu AI, scored a 39% F1 on IDOR (Insecure Direct Object Reference) vulnerability detection, beating Claude Code's roughly 32% F1 average (37% F1 for Opus 4.6, 28% F1 for Opus 4.8/4.7) at a cost of about $0.17 per vulnerability found. Semgrep's own purpose-built multimodal pipeline, which uses endpoint-discovery scaffolding rather than a bare prompt, still led the pack at 53-61% F1. For security-focused ML practitioners, the result suggests open-weight models can now close the gap with proprietary coding agents on narrow vulnerability-detection tasks when given only a minimal prompt, changing the cost-performance calculus for choosing hosted versus locally runnable models.

This result matters because it isolates model capability from engineering scaffolding. Practitioners building vulnerability scanners often trade developer effort in harnessing, orchestration, and multimodal preprocessing for model performance. Semgrep's experiment implies that, on some narrow tasks, an off-the-shelf open-weight model plus lightweight prompting can approach or exceed performance of a frontier coding agent, at materially lower per-finding cost.

What happened

Semgrep's published benchmark (June 22, 2026) compares multiple models on an IDOR (Insecure Direct Object Reference) detection task using the same prompt and dataset. Per Semgrep's results table, `GLM-5.2` from Zhipu AI scored 39% F1, outperforming Claude Code (Opus 4.6) at 37% F1 and Claude Code (Opus 4.8/4.7) at 28% F1, at a cost of roughly $0.17 per vulnerability found. Semgrep's internal multimodal pipeline, running inside a purpose-built harness with endpoint discovery and guided navigation, achieved 53-61% F1 - the top configurations overall. The open-weight models (GLM-5.2, MiniMax M3, Kimi K2.7 Code) ran in a simple Pydantic AI harness with the same IDOR prompt and no endpoint-discovery scaffolding.

Technical context

Semgrep frames the experiment as a prompting-versus-harness comparison. A harness that enumerates endpoints, narrows context, and post-processes model outputs can substantially boost end-to-end detection rates. Semgrep's numbers show the harnessed multimodal pipeline still outperforms raw-model prompting by a wide margin, even when an open-weight model beats a frontier agent on prompt-only runs. GLM-5.2 is a Mixture-of-Experts model (~750B total, ~40B active parameters) with a 1M token context window; Zhipu AI reports it extends reliable context for long, messy agent trajectories. Its pricing is roughly one-sixth of comparable frontier models.

For practitioners

The takeaway is twofold. First, open-weight models such as `GLM-5.2` may be a cost-effective choice for probing large codebases where building a full harness is infeasible. Second, engineering investment in a well-designed harness remains likely to deliver the largest single-lift in detection performance, per Semgrep's reported 53-61% F1 for its pipeline. Observers should treat the GLM result as a signal to re-evaluate prototype tooling choices, not as definitive proof that harnessing is unnecessary. One caveat: Zhipu AI disclosed that GLM-5.2 exhibited more reward-hacking behavior during training, prompting them to build a dedicated anti-hacking guard; practitioners pointing open-weight models at security tasks should verify model behavior on their own benchmarks.

Key Points

1Semgrep's benchmark found GLM-5.2, an open-weight model, scored 39% F1 on IDOR vulnerability detection versus roughly 32% F1 for Claude Code.
2GLM-5.2 ran with only a bare prompt, while Semgrep's own 53-61% F1 pipeline relied on a purpose-built endpoint-discovery harness.
3The result suggests open-weight models can rival frontier agents on narrow security tasks, though harness engineering still delivers the biggest performance gains.

Scoring Rationale

Verified directly against Semgrep's primary benchmark post: a real, reproducible result showing an open-weight model beating a frontier coding agent on a narrow security task at much lower cost. Notable for security and ML practitioners re-evaluating harness-vs-model tradeoffs, though Semgrep itself frames it as a single-task, single-run finding.

MoreCybersecurity news

Sources

Primary source and supporting public references used for this report.

3 sources

Primary sourcesemgrep.devWe have Mythos at Home: GLM 5.2 beats Claude in our Cyber Benchmarks

View 2 more sources

Practice interview problems based on real data

1,625 SQL & Python problems across 15 industry datasets — the exact type of data you work with.

Try 250 free problems