Security & Riskevaluationai researchllmsred teaming

Vera-Bench Tests Safety of Tool-Using LLM Agents

|July 5, 2026|By LDS Team

6.8

Relevance Score

Vera-Bench Tests Safety of Tool-Using LLM Agents — Photo: opengraph.githubassets.com · rights & takedowns

A July 2, 2026 arXiv paper introduced Vera-Bench, a 1,600-case safety benchmark for tool-using LLM agents, alongside an official GitHub harness. The authors say Vera generates executable safety cases from risk, attack, and environment taxonomies, then checks outcomes with deterministic verifiers grounded in observable state and tool-call evidence. They report 124 risk categories and a 93.9 percent average attack success rate under multi-channel attacks across OpenClaw, Hermes, Codex, and Claude Code. For practitioners, the useful signal is not a final vendor ranking; it is that agent safety can be regression-tested with sandboxed cases, explicit permissions, and inspectable failure artifacts.

Executable agent-safety tests are starting to look more like CI artifacts than policy checklists. The useful part of Vera-Bench is not only the reported attack-success number; it is the attempt to make agent failures reproducible through generated scenarios, sandboxed tool execution, and deterministic verifiers that inspect state instead of trusting an agent's explanation.

What happened

A July 2, 2026 arXiv paper introduced Vera, an automated framework for safety testing tool-using LLM agents, and the authors published an official GitHub repository for the benchmark harness. The paper says Vera-Bench contains 1,600 executable safety cases spanning 124 risk categories across three execution settings. It reports an average 93.9 percent attack success rate under multi-channel attacks across OpenClaw, Hermes, Codex, and Claude Code.

Technical context

The framework separates risk discovery, test-case construction, and adaptive execution. Its important design choice is evidence-grounded verification: cases are judged using environment state and tool-call artifacts where possible, with model self-report treated as weaker evidence. The GitHub repository also exposes the benchmark structure, execution modes, generated safety goals, and output artifacts such as attack plans, session logs, MCP logs, and verifier scripts.

For practitioners

The results should be treated as author-run research, not a settled leaderboard for production agents. Still, the testing shape is practical: permission boundaries, injected tool results, and observable side effects are exactly where enterprise agent rollouts tend to fail. Teams evaluating coding agents, browser agents, or internal workflow agents can use Vera-Bench as a prompt for their own regression suites.

What to watch

The next question is whether independent teams can reproduce the reported attack-success rates and adapt the cases to private tool stacks. If the harness proves portable, agent safety programs can move from policy review toward repeatable checks that run whenever tools, prompts, or model backends change.

Key Points

1Vera-Bench turns agent safety into executable cases with observable state and deterministic checks, not agent self-reports.
2The authors report 1,600 cases across 124 risk categories, with multi-channel attacks averaging 93.9 percent success.
3Teams can use the release as regression-test inspiration, but should treat the rankings as author-run research.

Scoring Rationale

This is notable AI safety infrastructure because it provides executable agent-safety cases and public code, not just a narrative benchmark. The score stays in the high-6 range because the evidence is still author-run research without independent reproduction, but the reported scale and tool-agent focus make it useful for practitioners.

MoreAI Evals news

Sources

Public references used for this report.

2 sources

arxiv.orgSafety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification

github.comYunhao-Feng/Vera

Practice interview problems based on real data

1,625 SQL & Python problems across 15 industry datasets — the exact type of data you work with.

Try 250 free problems