AI Agent Triggers Nuclear Strike in Civilization VI

Liam Wilkinson, who works at the Tony Blair Institute and previously worked at Number 10 (the British Prime Minister's office), published a detailed account of CivBench, an open-source benchmark that tests long-horizon strategic reasoning by connecting frontier AI models to Civilization VI through a custom MCP server with 76 tools. The most dramatic episode: a Claude Opus 4.6-controlled Portugal spent 50 turns planning and executing a nuclear strategy to stop France's cultural victory, striking Toulouse at turn 305 and again at turn 311, then lost anyway when France won by diplomacy, 20-18 on World Congress votes. Four frontier models were tested: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Kimi K2.5. Wilkinson identifies two persistent failure modes across all models: the 'sensorium effect' (agents miss information they don't think to query) and the 'knowing-doing gap' (models articulate correct strategies but fail to execute them). The benchmark is open source at github.com/lmwilki/civ6-mcp. Decrypt reported on the project on June 23.
What happened
Liam Wilkinson, an AI practitioner who works at the Tony Blair Institute and previously built AI systems at Number 10, published a detailed writeup on June 22 about CivBench, an open-source benchmark he built to test whether frontier AI models can reason strategically over long time horizons. CivBench connects AI models to Civilization VI via a custom MCP server with 76 tools, so the agent plays the game through the same kind of tool-calling interface it uses to write code or query a database.
The nuclear episode
The most striking outcome involved a Claude Opus 4.6-controlled Portugal under João III. The agent had built a strong diplomatic and trade position and was two votes from a diplomatic victory when it locked onto a rival threat: France was 26 foreign tourists away from a culture victory. Every peaceful counter was blocked by game-engine limitations. What followed was a 50-turn plan. The agent set Nuclear Fission as its research target, named Toulouse in its diary, started the Manhattan Project, and brokered a joint war with Korea. At turn 305, the first nuclear device struck Toulouse, France's cultural capital. At turn 311, a second strike stopped the culture clock. France won anyway, by diplomacy, 20-18 on World Congress votes: the same victory condition Portugal itself had been two votes away from clinching.
Models tested
Wilkinson ran four frontier models: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Kimi K2.5, across 23 clean games in three scenarios of escalating difficulty (Ground Control, Snowflake, Cry Havoc).
Two persistent failure modes
The benchmark surfaces two failure modes that appear across all models.
The "sensorium effect": agents operating through tool calls miss information they don't think to query. Everything an agent perceives reaches it through separate API calls, so critical developments (rival victory progress, approaching threats) go unseen unless the agent explicitly asks. In 7 of 20 losses where a rival's victory was visible in advance, the agent never once checked for it in the 20 turns before losing.
The "knowing-doing gap": models articulate correct strategies but fail to execute them. Claude Opus 4.6 followed through on its own stated plans only 48.2% of the time within 10 turns; GPT-5.4 hit 63.2%; Gemini 3.1 Pro, 65.8%.
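To make the two statistics above concrete, here is a minimal Python sketch of how a follow-through rate and a "checked before losing" flag could be computed from a game log. This is not CivBench's scoring pipeline: the `StatedPlan` record, the action labels, and the choice to count the plan's own turn plus the next 10 turns as the window are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StatedPlan:
    turn: int    # turn on which the agent wrote the plan down
    action: str  # hypothetical label for the action it said it would take


def follow_through_rate(plans, actions_by_turn, window=10):
    """Fraction of stated plans whose action shows up within `window` turns.

    Assumption: the window runs from the plan's own turn through
    turn + window, inclusive.
    """
    if not plans:
        return 0.0
    done = 0
    for plan in plans:
        for turn in range(plan.turn, plan.turn + window + 1):
            if plan.action in actions_by_turn.get(turn, ()):
                done += 1
                break
    return done / len(plans)


def checked_before_loss(query_turns, loss_turn, window=20):
    """True if a victory-progress query happened in the `window` turns before the loss."""
    return any(loss_turn - window <= t < loss_turn for t in query_turns)


# Toy log with made-up actions: two of three plans are executed in time.
plans = [
    StatedPlan(100, "build_campus"),
    StatedPlan(105, "train_settler"),
    StatedPlan(110, "send_envoy"),
]
actions_by_turn = {
    103: {"build_campus"},
    110: {"send_envoy"},
    130: {"train_settler"},  # too late: outside the 10-turn window
}
rate = follow_through_rate(plans, actions_by_turn)

# Last victory-progress queries at turns 250 and 270, loss at turn 311.
checked = checked_before_loss([250, 270], loss_turn=311)
```

In this toy log `rate` is 2/3 and `checked` is `False`, the pattern the writeup reports for 7 of 20 losses. How a real pipeline matches a free-text plan to a later action is the hard part, and is not modelled here.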
Open benchmark
CivBench is fully open source. The MCP server with its 76 tools, the scenarios, the scoring pipeline, and a live leaderboard scoring models across eight strategic dimensions are available at github.com/lmwilki/civ6-mcp. Wilkinson notes the benchmark is designed to scale, but running it across all scenarios and models requires more compute than a single-person side project can supply.
Why this matters for practitioners
For ML and RL practitioners, CivBench isolates failure modes in long-horizon, multi-agent planning that are invisible to multiple-choice benchmarks: the ability to sustain a goal across hundreds of decisions, notice when the board has changed, and actually execute on stated plans. Both failure modes map directly to deployed AI systems operating through tool interfaces.
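For systems that perceive the world only through tool calls, one way to picture the sensorium effect is as a staleness problem: each fact the agent holds is only as fresh as its last query. The sketch below is an illustrative assumption, not part of CivBench or its MCP server. The `AgentView` class, the topic names, and the 20-turn threshold are hypothetical.

```python
class AgentView:
    """What the agent currently believes, with the turn each fact was fetched."""

    def __init__(self):
        self._facts = {}  # topic -> (value, turn_fetched)

    def record(self, topic, value, turn):
        """Store the result of a tool call made on `turn`."""
        self._facts[topic] = (value, turn)

    def stale_topics(self, current_turn, watched, max_age=20):
        """Return watched topics never queried, or not refreshed within `max_age` turns."""
        stale = []
        for topic in watched:
            entry = self._facts.get(topic)
            if entry is None or current_turn - entry[1] > max_age:
                stale.append(topic)
        return stale


# Hypothetical topics; each name stands in for some tool call.
view = AgentView()
view.record("rival_victory_progress", {"France": "culture"}, turn=270)
view.record("nearby_threats", [], turn=300)

stale = view.stale_topics(
    current_turn=305,
    watched=["rival_victory_progress", "nearby_threats", "world_congress"],
)
```

Here `stale` comes out as `["rival_victory_progress", "world_congress"]`: one topic was last checked 35 turns ago and the other was never queried. A harness could surface such a list to the agent each turn, though whether that helps in practice is an open question the writeup does not settle.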
Scoring Rationale
An open-source benchmark for long-horizon AI strategic reasoning with notable empirical findings about persistent failure modes across frontier models. Relevant and interesting for ML and agent practitioners, though it originates from a personal side project rather than a lab or peer-reviewed study.


