Law Professors Rate AI Answers Higher in Blinded Study

June 2, 2026 | By LDS Team

Relevance Score: 7.3

A blinded study from Stanford Law School, now posted as a working paper on SSRN (Salinas, Frieders, Guha, and colleagues, with Julian Nyarko), found that law professors preferred large language model answers over peer-written answers in short-answer contracts tutoring. Sixteen contracts professors from fourteen U.S. law schools wrote 40 representative questions and judged 2,918 anonymized, forced-choice comparisons. The paper reports an average LLM win rate of 75.33%, which Stanford summarized as AI winning about 75% of matchups, and notes that professors flagged LLM responses as pedagogically harmful in 3.53% of cases versus 12.06% for peer answers. The study used Google's Gemini 2.5 Pro and a retrieval-augmented NotebookLM. "We were frankly surprised by the magnitude of the results," Nyarko said in Stanford's release. The study tests open-ended legal reasoning rather than single-answer recall, raising practical questions for legal pedagogy and LLM evaluation.

What happened

According to a Stanford Law School press release, the Stanford Report, and a working paper now posted on SSRN, researchers ran a blinded evaluation of short-answer tutoring in contracts courses. Sixteen contracts professors from fourteen U.S. law schools authored 40 representative questions and judged 2,918 anonymized, forced-choice comparisons between peer-written answers and responses produced by large language models. The paper reports an average LLM win rate of 75.33%; Stanford summarized the outcome as AI winning about 75% of head-to-head matchups. The paper further reports that professors flagged LLM responses as pedagogically harmful in 3.53% of cases, compared with 12.06% for peer answers.

Technical details

Per the paper and Stanford's coverage, questions spanned four instructional categories: recall of case or code, recall of doctrine, hypotheticals, and policy. Each participating professor wrote answers to a subset of the pool; evaluators then performed blinded, forced-choice rankings of anonymized answer pairs, with LLM responses calibrated to match the length and structure of human answers. The authors used two Google models: a stock Gemini 2.5 Pro and a retrieval-augmented NotebookLM configured with access to the course casebook. The comparisons were preference-based rather than scored against a ground truth, which the authors argue suits domains where defensible, competing arguments exist.
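As a rough illustration of how a forced-choice win rate can be summarized, the sketch below computes the share of comparisons won by the LLM answer and a Wilson score interval around it. It is not the authors' analysis: the function names and the toy data are hypothetical, and the interval treats comparisons as independent, which a real design with repeated judgments from the same professors would not satisfy.

```python
from math import sqrt


def win_rate(outcomes):
    """Fraction of forced-choice comparisons won by the LLM answer.

    `outcomes` is an iterable of booleans: True when the evaluator
    preferred the LLM answer, False when they preferred the peer answer.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one comparison")
    return sum(outcomes) / len(outcomes)


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%).

    Assumes independent comparisons, which is a simplification.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Toy data for illustration only, NOT taken from the study:
# eight comparisons, six of them won by the LLM answer.
toy = [True, True, False, True, True, False, True, True]
rate = win_rate(toy)
low, high = wilson_interval(sum(toy), len(toy))
print(f"win rate = {rate:.2f}, interval = ({low:.2f}, {high:.2f})")
```

With only eight toy comparisons the interval is wide; with thousands of comparisons it narrows, though clustering by evaluator and question would still call for a more careful model than this one.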

Context and significance

Editorial analysis: Preference-based evaluation is a common method for ranking outputs where multiple plausible answers meet professional standards. Here, experienced instructors applying latent professional standards favored LLM outputs in roughly three of four comparisons, a result that bears on debates about LLM capability in synthesis and argumentation rather than single-fact recall. The relatively low rate at which professors flagged LLM answers as harmful is notable, but the study does not by itself measure downstream learning, classroom dynamics, or long-term retention.

Limitations reported

The study focused on a single course area, contracts, using a common casebook across participants. The authors note variation across question types, with recall-style questions more amenable to ground-truth evaluation and hypotheticals or policy prompts testing argumentative strength. Claims here derive from the SSRN paper and Stanford's press materials.

Representative quotes

Stanford quoted Julian Nyarko, one of the paper's authors: "We were frankly surprised by the magnitude of the results." Coauthor Sarath Sanga, in Stanford's release, framed the domain challenge: "In most fields where AI gets tested, there's a right answer. In law, there often isn't."

What to watch

Look for replication across other legal subjects, instructor samples, and casebooks; peer-reviewed publication with appendices and evaluation data; whether similar results hold for other model families or fine-tuned systems; and randomized classroom trials measuring student learning when LLMs are used as tutors. Institutions may also focus on academic integrity, disclosure policies, and how grading adapts to AI-assisted work.

Bottom line

The study adds a rigorously documented preference-based evaluation to the literature on LLMs in professional education. It does not resolve whether LLM tutors improve learning outcomes, but the methodology and reported effect sizes make it a high-priority read for anyone studying LLM evaluation or educational deployment.

Key Points

1. In a blinded, forced-choice evaluation, contracts professors preferred LLM answers in about 75% of 2,918 comparisons, per the SSRN paper and Stanford.
2. Professors flagged LLM responses as pedagogically harmful in 3.53% of cases versus 12.06% for peer-written answers; models used were Gemini 2.5 Pro and a retrieval-augmented NotebookLM.
3. Editorial analysis: the study measures preference, not student learning outcomes; replication across courses, instructors, and model families is needed before classroom conclusions.

Scoring Rationale

A blinded, preference-based study in which sixteen contracts professors favored LLM answers in about 75% of nearly 3,000 comparisons is a notable, well-documented result for researchers and legal educators and advances evaluation methods for open-ended reasoning. It stops short of measuring student learning outcomes, which caps it below the top tier.

Sources

Primary source and supporting public references used for this report.

Primary source: reason.com, "Eventually, the Steam Drill Always Wins: 'Law Professors Prefer AI Over Peer Answers'"
