This result matters because it isolates model capability from engineering scaffolding. Practitioners building vulnerability scanners often weigh developer effort spent on harnessing, orchestration, and multimodal preprocessing against gains from the model itself. Semgrep's experiment implies that, on some narrow tasks, an off-the-shelf open-weight model plus lightweight prompting can approach or exceed the performance of a frontier coding agent, at materially lower per-finding cost.
What happened
Semgrep's published benchmark (June 22, 2026) compares multiple models on an IDOR (Insecure Direct Object Reference) detection task using the same prompt and dataset. Per Semgrep's results table, `GLM-5.2` from Zhipu AI scored 39% F1, outperforming Claude Code (Opus 4.6) at 37% F1 and Claude Code (Opus 4.8/4.7) at 28% F1, at a cost of roughly $0.17 per vulnerability found. Semgrep's internal multimodal pipeline, running inside a purpose-built harness with endpoint discovery and guided navigation, achieved 53-61% F1, the top configurations overall. The open-weight models (GLM-5.2, MiniMax M3, Kimi K2.7 Code) ran in a simple Pydantic AI harness with the same IDOR prompt and no endpoint-discovery scaffolding.
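For readers less familiar with the two metrics quoted above, the following is a minimal sketch of how an F1 score and a cost-per-finding figure can be derived from a labeled benchmark run. The `RunResult` type, the function names, and all counts are hypothetical and chosen only for round arithmetic; they are not Semgrep's data, and whether Semgrep divides cost by true positives in exactly this way is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Outcome of one model run against a labeled benchmark (hypothetical type)."""
    true_positives: int    # real vulnerabilities the model reported
    false_positives: int   # reports that were not real vulnerabilities
    false_negatives: int   # real vulnerabilities the model missed
    total_cost_usd: float  # total inference spend for the run


def f1_score(run: RunResult) -> float:
    """Harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)."""
    denominator = 2 * run.true_positives + run.false_positives + run.false_negatives
    return 2 * run.true_positives / denominator if denominator else 0.0


def cost_per_true_finding(run: RunResult) -> float:
    """Spend divided by confirmed findings (assumed definition, see lead-in)."""
    if run.true_positives == 0:
        return float("inf")
    return run.total_cost_usd / run.true_positives


# Illustrative numbers only: 30 correct reports, 40 spurious, 50 missed, $6 spent.
example = RunResult(true_positives=30, false_positives=40,
                    false_negatives=50, total_cost_usd=6.00)

print(f"F1: {f1_score(example):.2f}")                      # 60 / 150
print(f"Cost per finding: ${cost_per_true_finding(example):.2f}")  # 6 / 30
```

Because F1 penalizes both missed vulnerabilities and spurious reports, a model that reports few but mostly correct findings can score the same as one that reports many with more noise, which is why the cost figure is worth reading alongside it.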
Technical context
Semgrep frames the experiment as a prompting-versus-harness comparison. A harness that enumerates endpoints, narrows context, and post-processes model outputs can substantially boost end-to-end detection rates. Semgrep's numbers show the harnessed multimodal pipeline still outperforms raw-model prompting by a wide margin, even though an open-weight model beats a frontier agent on prompt-only runs. GLM-5.2 is a Mixture-of-Experts model (~750B total, ~40B active parameters) with a 1M token context window; Zhipu AI reports it extends reliable context for long, messy agent trajectories. Its pricing is roughly one-sixth that of comparable frontier models.
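To make the harness idea concrete, here is a minimal sketch of the three stages named above: take a list of endpoints, narrow to the ones worth asking about, and post-process the model's answer into structured findings. This is not Semgrep's pipeline. Every name here (`Endpoint`, `Finding`, `scan`, `fake_model`), the prompt text, and the `VULNERABLE:`/`SAFE:` answer convention are assumptions for illustration. Endpoint discovery itself is omitted, and `fake_model` is a stand-in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Endpoint:
    method: str
    path: str
    handler_source: str


@dataclass(frozen=True)
class Finding:
    endpoint: str
    reason: str


# Any callable that maps a prompt to a raw text answer can be plugged in.
ModelFn = Callable[[str], str]

# Hypothetical prompt; the real benchmark prompt is not reproduced here.
IDOR_PROMPT = (
    "Does this handler return an object by ID without an authorization check?\n"
    "Answer 'VULNERABLE: <reason>' or 'SAFE: <reason>'.\n"
    "Route: {method} {path}\nHandler:\n{source}\n"
)


def takes_object_id(endpoint: Endpoint) -> bool:
    """Crude narrowing heuristic: only routes with a path parameter."""
    return "{" in endpoint.path


def scan(endpoints: Iterable[Endpoint], ask_model: ModelFn) -> list[Finding]:
    findings = []
    for ep in endpoints:
        if not takes_object_id(ep):  # narrow context: skip irrelevant routes
            continue
        answer = ask_model(IDOR_PROMPT.format(
            method=ep.method, path=ep.path, source=ep.handler_source))
        verdict, _, reason = answer.partition(":")  # post-process raw output
        if verdict.strip().upper() == "VULNERABLE":
            findings.append(Finding(f"{ep.method} {ep.path}", reason.strip()))
    return findings


def fake_model(prompt: str) -> str:
    """Stand-in for a real model: flags handlers that never mention the caller."""
    if "current_user" in prompt:
        return "SAFE: ownership check present"
    return "VULNERABLE: no ownership check"


ENDPOINTS = [
    Endpoint("GET", "/health", "return 'ok'"),
    Endpoint("GET", "/invoices/{invoice_id}", "return db.get_invoice(invoice_id)"),
    Endpoint("GET", "/orders/{order_id}",
             "order = db.get_order(order_id)\nassert order.owner == current_user"),
]

for finding in scan(ENDPOINTS, fake_model):
    print(finding)
```

In this sketch a bare-prompt run corresponds roughly to calling the model once without the narrowing and post-processing steps. Passing the model in as a plain callable keeps the harness logic testable without network access and makes it straightforward to swap one model for another.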
For practitioners: The takeaway is twofold. First, open-weight models such as `GLM-5.2` may be a cost-effective choice for probing large codebases where building a full harness is infeasible. Second, engineering investment in a well-designed harness is still likely to deliver the largest single lift in detection performance, per Semgrep's reported 53-61% F1 for its pipeline. Observers should treat the GLM result as a signal to re-evaluate prototype tooling choices, not as definitive proof that harnessing is unnecessary. One caveat: Zhipu AI disclosed that GLM-5.2 exhibited more reward-hacking behavior during training, prompting the company to build a dedicated anti-hacking guard; practitioners pointing open-weight models at security tasks should verify model behavior on their own benchmarks.
Key Points
- GLM-5.2 (open-weight, MIT license) scored 39% F1 on IDOR detection with only a minimal prompt harness, beating Claude Code (Opus 4.8/4.7) at 28% F1 and Claude Code (Opus 4.6) at 37% F1 in Semgrep's benchmark, at ~$0.17 per vulnerability found.
- The harness around a model contributes more to detection performance than model choice alone: Semgrep's purpose-built multimodal pipeline (endpoint discovery + guided navigation) reached 53-61% F1, far above any bare-prompt run including GLM-5.2.
- For security teams, open-weight models like GLM-5.2 are now a viable, cost-effective option for high-volume scanning where full harness investment is infeasible; invest in harness engineering before model upgrades for maximum detection improvement.
Scoring Rationale
This is notable for security and ML practitioners because it shows an open-weight model equaling or exceeding a frontier agent on a focused vulnerability task while remaining cost-effective. The story is not industry-shaking, but it prompts a re-evaluation of how teams split investment between prompt-only prototypes and harness engineering.



