Semgrep Benchmarks GLM-5.2 Against Claude, Finds Higher IDOR F1


Editorial analysis: For security-focused ML practitioners, the result suggests that recent open-weight models can close the gap with proprietary coding agents on narrow vulnerability-detection tasks, which changes the cost-performance tradeoff between hosted and locally runnable models. According to Semgrep, GLM-5.2 from Zhipu AI scored 39% F1 on its IDOR benchmark, ahead of Claude Code at 37% F1 with Opus 4.6 and 28% F1 with Opus 4.8/4.7. Semgrep reports that the open-weight run cost roughly $0.17 per vulnerability found, and that its own multimodal pipeline, running inside a purpose-built harness, still scored higher at 53-61% F1.

This result matters because it isolates model capability from engineering scaffolding. Practitioners building vulnerability scanners often spend developer effort on harnessing, orchestration, and multimodal preprocessing to make up for limits in model performance. Semgrep's experiment implies that, on some narrow tasks, an off-the-shelf open-weight model with lightweight prompting can approach or exceed the performance of a frontier coding agent at materially lower per-finding cost.

What happened

Semgrep's published benchmark (June 22, 2026) compares multiple models on an IDOR (Insecure Direct Object Reference) detection task using the same prompt and dataset. Per Semgrep's results table, `GLM-5.2` from Zhipu AI scored 39% F1 at a cost of roughly $0.17 per vulnerability found, outperforming Claude Code (Opus 4.6) at 37% F1 and Claude Code (Opus 4.8/4.7) at 28% F1. Semgrep's internal multimodal pipeline, running inside a purpose-built harness with endpoint discovery and guided navigation, achieved 53-61% F1, the top configurations overall. The open-weight models (GLM-5.2, MiniMax M3, Kimi K2.7 Code) ran in a simple Pydantic AI harness with the same IDOR prompt and no endpoint-discovery scaffolding.
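For readers who want to reproduce this kind of comparison on their own data, the two headline metrics can be computed from simple counts. The sketch below is an illustration only: the `RunResult` type and the counts in `example` are invented so that the arithmetic lands on 39% F1 and $0.17 per finding, and they are not Semgrep's underlying data. It also assumes that "cost per vulnerability found" means total run cost divided by true positives, which the article does not spell out.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical container for one benchmark run (names are illustrative)."""
    true_positives: int    # real IDORs the model flagged
    false_positives: int   # flagged findings that were not real IDORs
    false_negatives: int   # real IDORs the model missed
    total_cost_usd: float  # total inference spend for the run


def f1_score(r: RunResult) -> float:
    """F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    denom = 2 * r.true_positives + r.false_positives + r.false_negatives
    return 0.0 if denom == 0 else 2 * r.true_positives / denom


def cost_per_finding(r: RunResult) -> float:
    """Assumed definition: total spend divided by confirmed (true positive) findings."""
    if r.true_positives == 0:
        return float("inf")  # nothing found, so cost per finding is unbounded
    return r.total_cost_usd / r.true_positives


# Made-up counts chosen only to illustrate the arithmetic.
example = RunResult(true_positives=39, false_positives=61,
                    false_negatives=61, total_cost_usd=6.63)

if __name__ == "__main__":
    print(f"F1: {f1_score(example):.2f}")
    print(f"Cost per finding: ${cost_per_finding(example):.2f}")
```

Because F1 penalizes both missed vulnerabilities and false alarms, a model can reach the same score through quite different precision and recall mixes, so it is worth reporting all three when comparing runs.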

Technical context

Semgrep frames the experiment as a prompting-versus-harness comparison. A harness that enumerates endpoints, narrows context, and post-processes model outputs can substantially boost end-to-end detection rates. Semgrep's numbers show the harnessed multimodal pipeline still outperforms raw-model prompting by a wide margin, even when an open-weight model beats a frontier agent on prompt-only runs. GLM-5.2 is a Mixture-of-Experts model (~750B total, ~40B active parameters) with a 1M token context window; Zhipu AI reports it extends reliable context for long, messy agent trajectories. Its pricing is roughly one-sixth that of comparable frontier models.
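The three harness stages named above (endpoint enumeration, context narrowing, output post-processing) can be sketched in a few lines. This is a toy illustration under stated assumptions, not Semgrep's pipeline and not the Pydantic AI harness: every name here (`Endpoint`, `run_harness`, `toy_model`, the prompt text, the sample codebase) is hypothetical, and the "model" is a string-matching stub standing in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Endpoint:
    route: str
    handler_source: str


# Hypothetical prompt; the benchmark's real IDOR prompt is not reproduced here.
IDOR_PROMPT = ("Does this handler load an object by a user-supplied ID "
               "without verifying the caller may access it? Answer yes or no.")


def enumerate_endpoints(codebase: dict[str, str]) -> list[Endpoint]:
    """Stage 1: endpoint discovery. The toy codebase is already a route -> source map."""
    return [Endpoint(route, src) for route, src in sorted(codebase.items())]


def build_prompt(ep: Endpoint) -> str:
    """Stage 2: narrow context to one handler instead of the whole repository."""
    return f"{IDOR_PROMPT}\n\nRoute: {ep.route}\n\n{ep.handler_source}"


def postprocess(raw: list[tuple[Endpoint, str]]) -> list[str]:
    """Stage 3: normalize free-text answers into a deduplicated list of flagged routes."""
    flagged = {ep.route for ep, answer in raw
               if answer.strip().lower().startswith("yes")}
    return sorted(flagged)


def run_harness(codebase: dict[str, str], model: Callable[[str], str]) -> list[str]:
    endpoints = enumerate_endpoints(codebase)
    raw = [(ep, model(build_prompt(ep))) for ep in endpoints]
    return postprocess(raw)


def toy_model(prompt: str) -> str:
    """Stub standing in for an LLM: flags handlers that never mention owner_id."""
    handler = prompt.split("\n\n", 2)[2]  # third section of the prompt is the handler source
    return "No" if "owner_id" in handler else "Yes, no ownership check found"


TOY_CODEBASE = {
    "/invoices/<id>": "def get_invoice(user, id):\n    return db.get(id)",
    "/orders/<id>": ("def get_order(user, id):\n    o = db.get(id)\n"
                     "    assert o.owner_id == user.id\n    return o"),
}

if __name__ == "__main__":
    print(run_harness(TOY_CODEBASE, toy_model))
```

Swapping `toy_model` for a real model client leaves the stages unchanged, which is the point of the comparison: the scaffolding and the model can be varied independently.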

For practitioners: The takeaway is twofold. First, open-weight models such as `GLM-5.2` may be a cost-effective choice for probing large codebases where building a full harness is infeasible. Second, engineering investment in a well-designed harness is still likely to deliver the largest single lift in detection performance, per Semgrep's reported 53-61% F1 for its pipeline. Observers should treat the GLM result as a signal to re-evaluate prototype tooling choices, not as definitive proof that harnessing is unnecessary. One caveat: Zhipu AI disclosed that GLM-5.2 exhibited more reward-hacking behavior during training, prompting the company to build a dedicated anti-hacking guard; practitioners pointing open-weight models at security tasks should verify model behavior on their own benchmarks.

Key Points

  • GLM-5.2 (open-weight, MIT license) scored 39% F1 on IDOR detection with only a minimal prompt harness, beating Claude Code (Opus 4.8/4.7) at 28% F1 and Claude Code (Opus 4.6) at 37% F1 in Semgrep's benchmark, at ~$0.17 per vulnerability found.
  • The harness around a model contributes more to detection performance than model choice alone: Semgrep's purpose-built multimodal pipeline (endpoint discovery + guided navigation) reached 53-61% F1, far above any bare-prompt run including GLM-5.2.
  • For security teams, open-weight models like GLM-5.2 are now a viable, cost-effective option for high-volume scanning where full harness investment is infeasible; invest in harness engineering before model upgrades for maximum detection improvement.

Scoring Rationale

This is notable for security and ML practitioners because it shows an open-weight model equaling or exceeding a frontier agent on a focused vulnerability task while remaining cost-effective. The story is not industry-shaking, but it prompts a re-evaluation of prototype versus harness investments.
