AI Screeners Prefer AI-Written Resumes Over Human Ones

According to an arXiv preprint by researchers Jiannan Xu (University of Maryland), Gujie Li (National University of Singapore), and Jane Yi Jiang (Ohio State University), AI-powered applicant-screening models systematically prefer resumes rewritten by large language models. Per reporting on the study, the authors tested 2,245 human-written resumes, produced LLM-rewritten counterfactuals using models including GPT-4o, LLaMA-3.3-70B, Qwen-2.5-72B, and DeepSeek-V3, and simulated hiring pipelines across 24 occupations. The paper states, "LLMs, when used as evaluators, systematically prefer resumes they generated themselves over equivalent resumes written by humans," and reports shortlisting lifts of 23% to 60% for candidates whose resumes matched the evaluator LLM. Some outlets reported model-level self-preference rates (for example, one report credited GPT-4o with picking its own rewrites up to 97.6% of the time), indicating large variation across models and roles. The study raises fairness questions about automated screening in hiring.
What happened
The preprint reports that automated resume-evaluation systems built from large language models display systematic self-preference for AI-generated resumes. The authors tested 2,245 real human-written resumes and created AI-rewritten counterfactuals for each using several state-of-the-art LLMs (GPT-4o, LLaMA-3.3-70B, Qwen-2.5-72B, and DeepSeek-V3), as described in news coverage of the preprint. The paper states, "LLMs, when used as evaluators, systematically prefer resumes they generated themselves over equivalent resumes written by humans," and reports that simulated hiring pipelines across 24 occupations produced shortlisting lifts between 23% and 60% for candidates whose resumes matched the screening LLM.
Technical details
Editorial analysis: As reported, the study used a within-profile counterfactual design: the underlying candidate qualifications were held constant and only the resume text varied between the human original and the AI rewrite. The reported model lineup included GPT-4o, LLaMA-3.3-70B, Qwen-2.5-72B, and DeepSeek-V3. Multiple outlets cited model-level self-preference statistics; one report attributed a 97.6% self-preference rate to GPT-4o, while others reported lower but still substantial rates (meta-coverage cited figures near 80% in some tests). The authors simulated 24 occupational pipelines and measured the change in shortlisting probability when the evaluator was the same LLM that produced the resume.
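To make the design concrete, here is a minimal audit sketch. The `score_resume` function is a toy keyword-matching stand-in for an LLM evaluator, and the threshold and sample pairs are invented for illustration; none of this is the paper's actual pipeline. The point is only to show how holding the candidate fixed while varying the text isolates the evaluator's preference for wording.

```python
# Illustrative sketch of the within-profile counterfactual design.
# `score_resume` is a toy stand-in for an LLM evaluator (naive keyword
# matching against a job posting); it is NOT the paper's pipeline.

from statistics import mean

def score_resume(resume_text: str, job_keywords: set[str]) -> float:
    """Toy fit score: fraction of job keywords present in the resume."""
    words = set(resume_text.lower().split())
    return len(words & job_keywords) / len(job_keywords)

def shortlisting_lift(pairs, job_keywords, threshold=0.5):
    """Shortlist-rate gap between LLM rewrites and human originals.

    `pairs` holds (human_text, rewrite_text) for the SAME candidate,
    so qualifications are constant and only the wording varies.
    """
    human_rate = mean(
        score_resume(h, job_keywords) >= threshold for h, _ in pairs
    )
    rewrite_rate = mean(
        score_resume(r, job_keywords) >= threshold for _, r in pairs
    )
    return rewrite_rate - human_rate  # positive => evaluator favors rewrites

# Invented example pairs for the same two candidates:
pairs = [
    ("built data pipelines in python",
     "engineered scalable python data pipelines"),
    ("managed shipping logistics",
     "optimized end-to-end logistics operations"),
]
print(shortlisting_lift(pairs, {"python", "pipelines", "logistics", "scalable"}))
```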
Context and significance
Algorithmic self-preference creates an emergent fairness and market-friction issue distinct from traditional keyword-matching bias. When evaluation and generation share similar modeling priors and tokenization patterns, scoring functions can favor stylistic or token-level artifacts introduced by a particular LLM even when those artifacts do not improve substantive fit. Notably, in at least one outlet's account, human raters judged the human-written resumes clearer or more effective even when AI evaluators preferred the AI rewrites, highlighting a disconnect between automated screening signals and human assessment.
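One way to probe this hypothesized mechanism (not an analysis the paper is reported to have run) is to check whether a model assigns lower perplexity to LLM rewrites than to human originals, since lower perplexity means the text sits closer to the model's own generation distribution. The sketch below uses Hugging Face transformers with GPT-2 as an arbitrary lightweight stand-in:

```python
# Sketch: gauge how "native" a resume's wording is to a given model by
# computing perplexity under that model. Lower perplexity on a rewrite
# than on the human original is consistent with the stylistic-proximity
# explanation. GPT-2 is an arbitrary lightweight stand-in here.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the model (no gradient tracking)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the forward pass return mean cross-entropy.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

human = "Managed a team of five analysts and cut reporting time by 30%."
rewrite = "Led a five-person analyst team, reducing reporting turnaround by 30%."
print(perplexity(human), perplexity(rewrite))
```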
Implications for practitioners and organizations
Editorial analysis: Organizations deploying LLM-based screening should be aware that off-the-shelf evaluator models may overweight features correlated with their own generation process. Recruitment teams, audit programs, and vendors could examine whether evaluator models were trained or fine-tuned on outputs similar to applicants' rewritten resumes, and whether blind evaluation or cross-model ensembles reduce model-specific favoritism. Independent audits that compare human-rater outcomes with model shortlists can surface mismatches between automated selection and hiring goals.
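As a rough sketch of the cross-model idea, the snippet below averages fit scores from several independent evaluators and ranks resumes blind to authorship. The per-model scorers are hypothetical placeholders behind a common interface; this is one plausible mitigation, not a validated fix.

```python
# Sketch of a cross-model ensemble screen. The individual scorers are
# hypothetical placeholders for calls to different evaluator models; the
# idea is that averaging heterogeneous evaluators dilutes any single
# model's stylistic favoritism, and ranking is blind to authorship.

from statistics import mean
from typing import Callable

Scorer = Callable[[str], float]  # resume text -> fit score in [0, 1]

def ensemble_score(resume_text: str, scorers: list[Scorer]) -> float:
    """Average fit score across independent evaluator models."""
    return mean(scorer(resume_text) for scorer in scorers)

def shortlist(resumes: list[str], scorers: list[Scorer], k: int) -> list[str]:
    """Top-k shortlist ranked only by the ensemble score."""
    ranked = sorted(resumes, key=lambda r: ensemble_score(r, scorers),
                    reverse=True)
    return ranked[:k]

# Usage (hypothetical): shortlist(resumes, [gpt4o_scorer, llama_scorer,
# qwen_scorer], k=10), where each scorer wraps a different screening model.
```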
What to watch
For practitioners: Monitor whether vendors disclose the underlying LLM used for screening, and whether vendors or customers adopt cross-model validation, red-teaming, or human-in-the-loop checks that explicitly test for generator-evaluator alignment. Observers should also watch whether regulators or corporate procurement policies begin to require fairness reporting for LLM-based hiring tools, and whether academic follow-ups replicate the reported 23% to 60% shortlisting lifts on other datasets and job families.
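A minimal version of such a generator-evaluator alignment check might look like the following: given head-to-head comparisons in which an evaluator chose between its own rewrite and a human original, test whether it prefers its own output more often than chance. The counts are placeholder values, not figures from the study.

```python
# Minimal generator-evaluator alignment check: across n head-to-head
# comparisons where an evaluator chose between its own rewrite and the
# human original, test whether it picks its own output more often than
# chance (50%). The counts below are placeholders, not study figures.

from scipy.stats import binomtest

own_rewrite_wins = 83    # hypothetical: times evaluator picked its own rewrite
total_comparisons = 100

result = binomtest(own_rewrite_wins, total_comparisons, p=0.5,
                   alternative="greater")
print(f"self-preference rate: {own_rewrite_wins / total_comparisons:.1%}")
print(f"p-value vs. chance:   {result.pvalue:.4g}")
```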
Scoring Rationale
The finding has direct operational and fairness implications for a widely used application area, automated hiring. It is notable for practitioners building or procuring screening tools, though it is not a frontier-model breakthrough.