LLMs Exhibit Static Level-k Behavior in Games
An arXiv preprint by Po Han Teo (arXiv:2606.27845, "LLM Agents as Static Level-k Players in Behavioural Games") reports that LLM choices in two classic behavioral-economics games, a p-beauty contest and a public goods game, reproduce the spread of human choices but not the underlying strategic reasoning. Across a 360-cell factorial sweep varying temperature, model scale (0.5-32B), quantization, instruct-vs-base mode, and framing, the paper finds that LLMs behave as static, category-retrieved level-k players whose apparent reasoning depth correlates with model scale, rather than as agents that update beliefs or backward-induct across rounds.
This matters to ML practitioners who use LLMs as stand-ins for human agents in simulations, multi-agent training, or behavioral experiments: matching aggregate choice dispersion is not the same as reproducing iterative strategic reasoning.
What happened
The preprint (arXiv:2606.27845) tests LLM behavior in two classical behavioral games, a p-beauty contest and a public goods game. It evaluates five deployment dimensions jointly in a 360-cell factorial design: temperature, model scale (0.5-32B), quantization, instruct vs base mode, and framing. The paper compares each cell's distribution of choices to published human choice distributions and reports several regularities. The deployment settings, with the exception of quantization, affect different aspects of fidelity, and while the dispersion observed in human play can be partly recovered, the strategic process cannot. According to the preprint, LLMs act like static, category-retrieved level-k players, with the retrieved level k correlating with model scale, and they do not perform within-game belief-updating or backward induction across multi-round horizons. In the public goods experiment, human contributions decayed across rounds, whereas LLM contributions stayed flat or rose. LLMs were more cooperative under an indefinite horizon than under a finite one, and they did not show last-round defection.
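To see why last-round defection is the benchmark prediction, consider the standard linear public goods payoff, in which each player keeps whatever they do not contribute and receives a fixed share of the common pot. The sketch below is illustrative only: the endowment, group size, and per-unit return (`mpcr`) are assumed values, not parameters taken from the preprint.

```python
def payoff(own, others, endowment=20.0, mpcr=0.4):
    """Linear public goods payoff for one player in one round.

    endowment and mpcr (marginal per-capita return) are illustrative
    assumptions, not values reported in the preprint.
    """
    pot = own + sum(others)
    return endowment - own + mpcr * pot


# Hypothetical group of four: three other players each contribute 10.
others = [10.0, 10.0, 10.0]
free_ride = payoff(0.0, others)    # contribute nothing
full_coop = payoff(20.0, others)   # contribute the whole endowment

# With mpcr < 1, each unit contributed returns less than one unit to the
# contributor, so contributing nothing is individually best in a one-shot
# (or known final) round, even though everyone contributing beats
# everyone defecting.
print(free_ride > full_coop)
print(payoff(20.0, [20.0] * 3) > payoff(0.0, [0.0] * 3))
```

Under these assumptions both comparisons hold, which is the tension backward induction resolves toward defection in a known final round, and which the preprint reports LLM agents do not act on.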
Technical context
Behavior consistent with a retrieved level-k profile implies that a model reproduces a strategy category rather than carrying out an iterative reasoning process. For practitioners, that distinction affects how to interpret multi-round simulations: a model that selects a strategy class at call time will not necessarily update that strategy based on observed in-game history unless it is prompted to simulate belief-updating explicitly. The paper's full-factorial sweep across temperature and scale (0.5-32B) is methodologically useful because it separates dispersion-tuning, for example via temperature, from qualitative strategy generation tied to scale.
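For readers unfamiliar with level-k models, the following minimal sketch shows how level-k choices are conventionally defined in a p-beauty contest and how an observed choice can be mapped to its nearest level. It assumes the textbook setup (p = 2/3, guesses in 0-100, a level-0 anchor of 50); the preprint's actual parameters and classification procedure may differ, and the function names are hypothetical.

```python
def level_k_choice(k, p=2/3, level0=50.0):
    """Choice of a level-k player who best-responds to level-(k-1).

    p and level0 are textbook assumptions, not values from the preprint.
    """
    return level0 * p ** k


def nearest_level(choice, p=2/3, level0=50.0, max_k=10):
    """Return the level k whose predicted choice is closest to `choice`."""
    return min(
        range(max_k + 1),
        key=lambda k: abs(level_k_choice(k, p, level0) - choice),
    )


# Predicted choices for the first few levels.
for k in range(4):
    print(k, round(level_k_choice(k), 2))

# A static level-k player repeats the choice for its level every round,
# regardless of what the previous round's target turned out to be.
print(nearest_level(22.0))
```

A player who updated beliefs would instead move toward the previous round's target, so choices would fall across rounds; a static level-k player, as the preprint describes LLMs, would not.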
What to watch
Industry observers and researchers reusing LLMs as agents should treat single-call outputs as reflecting retrieved strategy archetypes rather than dynamically updated beliefs. Follow-up experiments probing prompt engineering or fine-tuning interventions that explicitly encode belief-updating, backward induction, or memory across rounds would test whether the observed static level-k pattern is an artifact of prompting and deployment or a more fundamental modeling limitation.
Key Points
- LLMs can match aggregate choice dispersion but often lack iterative belief-updating, limiting fidelity for multi-round strategic simulations.
- Model scale from 0.5B to 32B aligns with retrieved level-k categories, suggesting scale shapes strategy archetype rather than dynamic reasoning.
- Deployment settings like temperature affect choice dispersion, but they do not substitute for procedural belief-updating or backward induction.
Scoring Rationale
A targeted, verified arXiv evaluation of LLMs as strategic-game stand-ins, relevant to practitioners running agent simulations or behavioral experiments. The 360-cell factorial design is methodologically sound and the source is confirmed. Impact is narrow, at the intersection of behavioral game theory and LLM evaluation, with practical takeaways mostly limited to simulation design.


