Law Professors Rate AI Answers Higher in Blinded Study

According to a Stanford Law School press release and a working-paper draft by Julian Nyarko et al., a blinded study of short-answer tutoring in contracts courses found that law professors preferred answers from large language models over answers written by their peers in head-to-head comparisons. Sixteen contracts professors from fourteen U.S. law schools authored 40 representative questions and judged 2,918 anonymized pairwise comparisons; the paper draft quoted in Reason reports an average LLM win rate of 75.33%, while Stanford summarized the result as AI winning 75% of matchups. The draft also reports that professors flagged LLM responses as pedagogically harmful in 3.53% of cases versus 12.06% for peer-written answers. "We were frankly surprised by the magnitude of the results," Julian Nyarko said in Stanford's press release. Editorial analysis: This experiment tests LLMs on open-ended legal reasoning rather than single-answer tasks, raising practical questions for legal pedagogy and evaluation methods.
What happened
According to a Stanford Law School press release and a working-paper draft by Julian Nyarko and coauthors, researchers conducted a blinded evaluation of short-answer tutoring in contracts courses. Sixteen contracts professors from fourteen U.S. law schools authored 40 representative questions and judged 2,918 anonymized, forced-choice comparisons between peer-written answers and responses produced by large language models. The paper draft quoted in Reason reports an average LLM win rate of 75.33%; Stanford's press release summarized the outcome as AI winning 75% of head-to-head matchups. The draft further reports that professors flagged LLM responses as pedagogically harmful in 3.53% of cases compared with 12.06% for peer answers.
Technical details
Per the draft excerpts published by Reason and the Stanford press release, the study used questions spanning four instructional categories: recall of case or code, recall of doctrine, hypotheticals, and policy. Each participating professor wrote answers to a subset of the question pool; evaluators then performed blinded, forced-choice rankings of anonymized answer pairs. The authors selected Google's models for the experiment, including a stock Gemini 2.5 Pro and a retrieval-augmented NotebookLM configured with access to the course casebook, according to the draft. The comparisons were preference-based rather than ground-truth scoring, which the authors argue is appropriate for domains where defensible, competing arguments exist.
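For readers unfamiliar with preference-based scoring, the sketch below shows one way a win rate could be computed from forced-choice judgments and broken down by question category. It is an illustration only, not the authors' analysis code: the records are made-up toy data, the category keys and function names are hypothetical, and the Wilson interval assumes independent comparisons, which may not hold when the same professors judge many pairs.

```python
from collections import defaultdict
from math import sqrt

# Toy, made-up judgments for illustration only; these are NOT the study's data.
# Each record: (question category, which answer the blinded judge preferred).
JUDGMENTS = [
    ("case_recall", "llm"),
    ("case_recall", "peer"),
    ("doctrine_recall", "llm"),
    ("doctrine_recall", "llm"),
    ("hypothetical", "peer"),
    ("hypothetical", "llm"),
    ("policy", "llm"),
    ("policy", "peer"),
]


def win_rate(judgments, side="llm"):
    """Share of forced-choice comparisons won by `side`."""
    if not judgments:
        raise ValueError("no judgments supplied")
    wins = sum(1 for _, winner in judgments if winner == side)
    return wins / len(judgments)


def win_rate_by_category(judgments, side="llm"):
    """Win rate for `side`, computed separately for each question category."""
    grouped = defaultdict(list)
    for category, winner in judgments:
        grouped[category].append((category, winner))
    return {cat: win_rate(rows, side) for cat, rows in grouped.items()}


def wilson_interval(wins, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion.

    Assumes the n comparisons are independent draws (a simplification).
    """
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half
```

Calling `win_rate(JUDGMENTS)` gives the pooled share of comparisons the LLM side won, and `win_rate_by_category(JUDGMENTS)` gives the same figure per category. A pooled rate of this kind can differ from an average of per-professor rates, so it should not be read as a reconstruction of the 75.33% figure.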
Context and significance
Editorial analysis: Preference-based evaluations have become a common method to rank outputs where multiple plausible answers can meet professional standards. For pedagogical evaluation in law, the study shows that experienced instructors applying latent professional standards favored LLM outputs in roughly three-quarters of head-to-head comparisons. This outcome intersects with ongoing debates about LLM capabilities on tasks that require synthesis, argumentation, and contextualization rather than simple factual recall. The lower rate at which professors flagged LLM answers as harmful (3.53%, versus 12.06% for peer answers) is notable; however, the study does not by itself measure downstream learning outcomes, classroom dynamics, or long-term retention.
Limitations reported and methodological notes
The claims above are drawn from a press release and a working-paper draft quoted by Reason; the draft documents experimental design choices, model selection, and the preference-based evaluation method. The study focused on a single course area, contracts, using a common casebook across participants. The authors note variation across question types, with recall-style questions being more amenable to ground-truth evaluation and hypotheticals or policy prompts testing argumentative strength. The draft also reports that the LLMs performed comparably to the best instructor in the sample on aggregated preference metrics.
What to watch
Editorial analysis: Observers should look for replication across other legal subjects and across different instructor samples and casebooks. Practitioners and researchers will likely track:
- peer-reviewed publication of the full paper with appendices and code or evaluation data
- whether similar preference results hold using different model families or fine-tuned systems
- randomized classroom trials measuring student learning and retention when LLMs are used as tutors
- qualitative studies of how faculty and students interpret and use LLM feedback
Regulators and institutions may also focus on academic integrity, disclosure policies, and how grading practices adapt to AI-produced assistance.
Representative quotes from the reporting
Stanford quoted lead author Julian Nyarko: "We were frankly surprised by the magnitude of the results." Coauthor Sarath Sanga, quoted in Stanford's release, framed the domain challenge: "In most fields where AI gets tested, there's a right answer. In law, there often isn't."
Bottom line
Editorial analysis: The study adds a rigorously documented preference-based evaluation to the literature on LLMs in professional education. It does not resolve whether LLM tutors improve student learning outcomes, nor does it address institutional policy choices. Nonetheless, the methodological framing and reported effect sizes make this a high-priority paper for anyone studying LLM evaluation, educational deployment, or assessments of reasoning in domains with open standards.
Scoring Rationale
A well-documented, blinded study showing large preference margins for LLM outputs in legal tutoring is notable for researchers and educators. It advances evaluation methods for open-ended reasoning tasks but stops short of measuring student learning outcomes.


