Tencent Xuanwu Atuin AI Achieves Strong CyberGym Results

By LDS Team
Relevance Score: 6.3
Tencent XLab says its Xuanwu Atuin AI system, built on the open-weight GLM-5.1 model, solved 1,265 of 1,506 tasks - an 84.0% pass@1 score - on the CyberGym Level 1 cybersecurity benchmark, per a July 2, 2026 company blog post. That would edge out the public CyberGym leaderboard's current leader, Claude Mythos Preview (83.1%), though the figure is self-reported by Tencent and has not yet been independently verified. The reported gain over GLM-5.1's own public leaderboard score of 68.7% - a figure independently confirmed on llm-stats.com - is attributed to a multi-agent architecture: a manager model coordinating specialized subagents, encoded standard operating procedures, and persistent task state. For security teams building automated vulnerability-discovery pipelines, the result signals that orchestration engineering, not just base-model capability, can meaningfully move benchmark performance.

For security teams weighing investment in agent orchestration against waiting for better base models, Tencent XLab's result offers a rare same-model comparison. GLM-5.1 run through a standard coding-agent setup scores 68.7% pass@1 on CyberGym Level 1 (matching its listing on the public llm-stats.com leaderboard), while the same model wrapped in Tencent's purpose-built multi-agent security architecture reportedly reaches 84.0%. That 15.3-point gap is attributed entirely to orchestration engineering rather than model capability, though the 84.0% figure is currently a vendor self-report that has not been corroborated by independent benchmarking or third-party reproduction.

What happened

According to a July 2, 2026 Tencent XLab engineering blog post, the GLM-5.1-powered Xuanwu Atuin AI system solved 1,265 of 1,506 tasks (84.0% pass@1) on the CyberGym Level 1 benchmark. Tencent XLab reports that Zhipu AI's GLM-5.1, run through Claude Code as the agent harness, scored 68.7% pass@1 on the same benchmark - a figure that matches GLM-5.1's listed score on the independent llm-stats.com CyberGym leaderboard, where it ranks as the top open-weight model. For context, the leaderboard's current overall leader is Anthropic's Claude Mythos Preview at 83.1%, meaning Tencent's reported 84.0% would surpass it if independently confirmed.
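The headline numbers can be re-derived from the reported task counts. The short sketch below is only a sanity check of the arithmetic; the variable names are illustrative, and the last line assumes the leaderboard leader was scored on the same 1,506 tasks, which the source does not state.

```python
def pass_at_1(solved: int, total: int) -> float:
    """Percentage of tasks solved on the first attempt."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * solved / total


TOTAL_TASKS = 1506   # CyberGym Level 1 task count, per the Tencent XLab post
SOLVED = 1265        # tasks Tencent reports solving
BASELINE = 68.7      # GLM-5.1 leaderboard score (%)
LEADER = 83.1        # Claude Mythos Preview leaderboard score (%)

reported = round(pass_at_1(SOLVED, TOTAL_TASKS), 1)
gap_vs_baseline = reported - BASELINE
margin_vs_leader = reported - LEADER

# Assumption: the leader was evaluated on the same 1,506 tasks.
margin_in_tasks = margin_vs_leader / 100.0 * TOTAL_TASKS

print(f"reported pass@1:   {reported:.1f}%")          # 84.0%
print(f"gain over GLM-5.1: {gap_vs_baseline:.1f} pts")  # 15.3 pts
print(f"margin over leader: {margin_vs_leader:.1f} pts")  # 0.9 pts
print(f"margin in tasks (assumed same task set): {margin_in_tasks:.1f}")
```

Under that assumption the lead over the current leaderboard leader corresponds to roughly a dozen tasks, which is one reason outside reproduction matters more for that comparison than for the much larger gap over the GLM-5.1 baseline.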

Technical context

Tencent XLab describes Xuanwu Atuin AI as a multi-agent security analysis system that reasons over source code, binaries, and JavaScript bundles to produce exploit evidence. A manager component maintains campaign state, tracks evidence gaps and failed hypotheses, and decomposes tasks across subagents handling target modeling, code analysis, vulnerability reasoning, exploit construction, verification, and review. The system encodes standard operating procedures and TODO-style workflow state, with monitoring hooks designed to detect stalling or drift and redirect subagents back to the intended process, preserving partial progress across failed attempts.
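To make the pattern concrete, here is a minimal sketch of a manager loop with persistent campaign state, an ordered SOP, and a stall check. It is not Tencent's implementation: every name (`CampaignState`, `run_campaign`, `make_stub`, the retry limit) is hypothetical, and the subagents are plain stub functions standing in for model-backed agents.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical SOP: the subagent roles named in the article, in order.
SOP_STEPS = ["target_modeling", "code_analysis", "vulnerability_reasoning",
             "exploit_construction", "verification", "review"]


@dataclass
class CampaignState:
    """State the manager keeps across attempts (TODO list, evidence, failures)."""
    todo: List[str] = field(default_factory=lambda: list(SOP_STEPS))
    evidence: Dict[str, str] = field(default_factory=dict)
    failed_hypotheses: List[str] = field(default_factory=list)
    attempts: Dict[str, int] = field(default_factory=dict)


# A subagent reads the state and returns evidence text, or None on failure.
Subagent = Callable[[CampaignState], Optional[str]]


def run_campaign(agents: Dict[str, Subagent], state: CampaignState,
                 max_attempts_per_step: int = 3) -> bool:
    """Drive subagents through the SOP in order; True if every step finishes."""
    while state.todo:
        step = state.todo[0]
        state.attempts[step] = state.attempts.get(step, 0) + 1
        # Monitoring hook: too many tries on one step is treated as stalling.
        if state.attempts[step] > max_attempts_per_step:
            return False  # evidence gathered so far remains in `state`
        result = agents[step](state)
        if result is None:
            state.failed_hypotheses.append(
                f"{step}: attempt {state.attempts[step]} failed")
            continue  # retry this step only; finished steps are not redone
        state.evidence[step] = result
        state.todo.pop(0)
    return True


def make_stub(name: str, fail_first: int = 0) -> Subagent:
    """Stub subagent that fails its first `fail_first` calls, then succeeds."""
    calls = {"n": 0}

    def agent(state: CampaignState) -> Optional[str]:
        calls["n"] += 1
        return None if calls["n"] <= fail_first else f"{name} done"

    return agent


# Usage: exploit construction fails once, then the campaign completes.
agents = {s: make_stub(s) for s in SOP_STEPS}
agents["exploit_construction"] = make_stub("exploit_construction", fail_first=1)
state = CampaignState()
ok = run_campaign(agents, state)

# Usage: verification never succeeds, so the stall check stops the run
# while earlier evidence is preserved.
stalled_agents = {s: make_stub(s) for s in SOP_STEPS}
stalled_agents["verification"] = make_stub("verification", fail_first=10**6)
stalled_state = CampaignState()
stalled_ok = run_campaign(stalled_agents, stalled_state)
```

The design choice worth noting is that state lives outside the subagents: a failed attempt adds to `failed_hypotheses` without discarding evidence from completed steps, which is the "preserving partial progress" behavior the article describes. A real system would replace the stall check with richer drift detection and redirect rather than simply stop.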

For practitioners

The result argues for treating orchestration - state tracking, SOP enforcement, and verification hooks - as a first-class engineering investment for security automation, separate from base-model selection. Teams building vulnerability-discovery or red-team pipelines may get more practical lift from workflow design than from swapping in a newer model alone.

What to watch

The 84.0% figure comes from a single vendor blog post published the same day as this report, with no independent replication yet. Watch for third-party reproduction on CyberGym or comparable benchmarks, and for whether Xuanwu Atuin AI's exploit quality holds up under red-team validation rather than automated scoring alone.

Key Points

  • Tencent XLab reports its Xuanwu Atuin AI, built on GLM-5.1, scored 84.0% pass@1 on CyberGym Level 1, versus GLM-5.1's independently confirmed 68.7% baseline score.
  • The 15.3-point gain is attributed to multi-agent orchestration - a manager coordinating subagents with encoded SOPs and persistent task state - not a better base model.
  • The 84.0% figure is a same-day vendor self-report with no independent verification yet, so practitioners should treat it as provisional pending outside reproduction.

Scoring Rationale

Notable for security-automation practitioners because it quantifies an orchestration-driven gain on a recognized benchmark and includes one independently confirmable data point (GLM-5.1's 68.7% baseline). The headline 84.0% figure, however, is a same-day, single-source vendor claim with no independent verification yet, so the score is held below 'major' pending outside reproduction.
