Agentic AI Systems Match Human Economists in Causal Tasks

A new comparison study finds agentic AI systems produce median causal effect estimates similar to human economists, while humans show wider dispersion in estimates. The paper runs replicated causal-inference tasks and an AI review tournament in which reviewer models Gemini 3.1 Pro Preview, Opus 4.6, and GPT-5.4 rank submissions. Rankings place Codex GPT-5.4 first, Codex GPT-5.3-Codex second, Claude Code Opus 4.6 third, and human researchers last. Results imply agentic AI can both perform and reliably evaluate empirical economics workflows, suggesting a path to scaling reproducible research and automated peer review, though variability across model instances and evaluation design remain important caveats for practitioners.
What happened
The paper compares agentic AI systems and human economists on identical causal-inference tasks and finds similar median causal effect estimates, with humans exhibiting wider-tailed estimate distributions. The study also runs an AI review tournament: reviewer models Gemini 3.1 Pro Preview, Opus 4.6, and GPT-5.4 each evaluate the same 300 comparison groups of submissions. Average rankings are consistent across reviewer models, ordering Codex GPT-5.4 first, Codex GPT-5.3-Codex second, Claude Code Opus 4.6 third, and human researchers fourth.
Technical details
The experiment focuses on replicated causal-inference tasks and two evaluation phases: estimation and reviewer-based ranking. Key experimental elements reported include:
- multiple independent replications per AI system across tasks, establishing within-model dispersion and replicability
- cross-model reviewer evaluations, where each reviewer produces comparison reports for the same 300 submission groups
- ranking aggregation that produces a stable ordering of model-generated vs. human-generated research artifacts (see the sketch after this list)
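To make the aggregation step concrete, here is a minimal sketch of one common approach: averaging each system's rank position across reviewers and comparison groups. The input format, system names, and data are hypothetical placeholders for illustration; the paper's actual aggregation protocol may differ.

```python
from collections import defaultdict

# Hypothetical per-reviewer rankings for a few comparison groups
# (not the paper's data). Each inner list orders submissions from
# best (rank 1) to worst.
reviewer_rankings = {
    "reviewer_a": [["codex-gpt", "claude-code", "human"],
                   ["codex-gpt", "human", "claude-code"]],
    "reviewer_b": [["codex-gpt", "claude-code", "human"],
                   ["claude-code", "codex-gpt", "human"]],
}

def average_ranks(rankings_by_reviewer):
    """Aggregate rankings by averaging each system's rank position
    across all reviewers and comparison groups (lower is better)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for groups in rankings_by_reviewer.values():
        for ordering in groups:
            for position, system in enumerate(ordering, start=1):
                totals[system] += position
                counts[system] += 1
    avg = {system: totals[system] / counts[system] for system in totals}
    return sorted(avg.items(), key=lambda kv: kv[1])

for system, mean_rank in average_ranks(reviewer_rankings):
    print(f"{system}: mean rank {mean_rank:.2f}")
```

A mean-rank aggregation like this is stable when reviewers largely agree, which is the consistency property the paper reports across its three reviewer models.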
Why it matters
Practitioners building empirical pipelines should note two operational implications. First, agentic AI systems can produce point estimates and analysis code that match human performance at the median, reducing a core bottleneck in empirical work. Second, AI reviewers achieve consistent cross-model adjudication, which enables automated triage, ranking, and scaling of reproducibility checks. This points to practical automation for literature screening, pre-analysis code review, and synthetic replication.
Limitations and caveats
Dispersion across model instances is substantial, so per-instance variance matters for deployment (a minimal way to quantify it is sketched below). Human distributions had wider tails, which signals occasional extreme judgments rather than systematic inferiority. Evaluation depended on the chosen tasks, prompt protocols, and reviewer models, so external validity to other econometric problems or domains is not guaranteed.
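For practitioners who want to check this before deployment, here is a minimal sketch of quantifying dispersion across replicated runs. The estimate arrays are hypothetical, not data from the paper; the median, IQR, and percentile-span metrics are standard robust statistics, not the paper's exact protocol.

```python
import numpy as np

def dispersion_report(estimates: np.ndarray) -> dict:
    """Summarize the spread of replicated causal-effect estimates.

    Median captures central tendency, IQR captures bulk dispersion,
    and the 5th-95th percentile span is a simple tail-width proxy.
    """
    q05, q25, q50, q75, q95 = np.percentile(estimates, [5, 25, 50, 75, 95])
    return {
        "median": q50,
        "iqr": q75 - q25,          # bulk dispersion
        "tail_span": q95 - q05,    # sensitive to extreme replications
        "n_runs": len(estimates),
    }

# Hypothetical replicated estimates (not from the paper): similar
# medians, but the second set has heavier tails.
ai_runs = np.array([0.48, 0.51, 0.50, 0.47, 0.52, 0.49])
human_runs = np.array([0.50, 0.45, 0.62, 0.49, 0.30, 0.51])

print("AI:   ", dispersion_report(ai_runs))
print("Human:", dispersion_report(human_runs))
```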
What to watch
Watch for reproduction of these results across additional tasks, larger model pools, and open benchmarks. Pay attention to how prompt engineering, chain-of-thought control, and calibration methods affect per-instance variance and hallucination rates.
Scoring Rationale
The paper demonstrates meaningful, reproducible parity between agentic AI and human economists on causal tasks and shows AI-based review can reliably rank submissions. This has practical implications for scaling empirical research, but it is not a paradigm-shifting model release, so it rates as a notable contribution to methods and evaluation.