OpenAI Built the First Real Test of AI Scientists. Its Best Model Passed 36%.

LDS Team
Let's Data Science
LifeSciBench asked frontier models to do 750 real drug-discovery tasks written by 173 PhD scientists. OpenAI's own GPT-Rosalind led the field at a 36.1% pass rate, and every model got dramatically worse the moment it had to read an actual data file.

Here is one of the 750 questions OpenAI used to test the smartest AI models on Earth.

A biotech team is preparing for a Type B meeting with the FDA about an experimental gene therapy for Duchenne muscular dystrophy. They have Western blot data, immunofluorescence results, a 48-week functional readout, safety signals from twelve boys, and a single question: does this package actually support accelerated approval? The AI is handed the raw numbers and asked to pressure-test the case the way a skeptical regulator would, item by item, and say where the evidence falls apart.

This is not a multiple-choice question. There is no answer key to match against. It is the kind of judgment call a working drug-development scientist makes under real stakes, and on June 17, OpenAI published a benchmark built entirely out of problems like it.

The benchmark is called LifeSciBench, and the headline result is sobering. The best-performing model OpenAI tested, its own life-sciences system GPT-Rosalind, passed just 36.1% of the tasks. Every other frontier model did worse. Nearly two out of three research-level problems still defeat the most capable AI available.

A Benchmark Designed to Break the Usual Grading Trick

Most AI biology benchmarks look like a standardized exam: multiple-choice questions, one correct option, a clean string to match against. That format is easy to grade and almost nothing like what a scientist does all day. Researchers reconcile conflicting results, design experiments under uncertainty, troubleshoot failed assays, and decide what to do next when the evidence is incomplete.

LifeSciBench was built to close that gap. OpenAI assembled 750 expert-authored tasks spanning seven research workflows and seven biological domains, from genomics and medicinal chemistry through clinical and translational science. The tasks were written by 173 scientists, each with PhD-level training and direct experience advancing drug programs in biotech or pharma, in collaboration with Tacit Labs, a startup focused on feedback loops for drug development.

The grading is where the benchmark gets technically interesting. Rather than checking a final answer, each task carries a detailed rubric that breaks the expected response into specific claims, calculations, caveats, and decisions. Across the benchmark, those rubrics contain 19,020 individual criteria, an average of about 25 per task. A model earns partial credit for each criterion it satisfies, and a task counts as "passed" only when the model clears 70% of the available rubric points.
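The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not OpenAI's grader: the Criterion class, its fields, the point weights, and the example rubric are all hypothetical, and treating "clears 70%" as greater-than-or-equal is an assumption.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 0.70  # from the article: 70% of available rubric points


@dataclass
class Criterion:
    """One rubric line. Field names and weights are hypothetical."""
    description: str
    points: float
    satisfied: bool


def score_task(criteria):
    """Return (normalized_score, passed) for a single task's rubric."""
    available = sum(c.points for c in criteria)
    if available <= 0:
        raise ValueError("rubric has no available points")
    earned = sum(c.points for c in criteria if c.satisfied)
    normalized = earned / available
    # Assumption: reaching the threshold exactly counts as a pass.
    return normalized, normalized >= PASS_THRESHOLD


# Made-up rubric: the conclusion is right, but two caveats are missed.
rubric = [
    Criterion("identifies the key assay limitation", 3, False),
    Criterion("reaches the correct high-level conclusion", 4, True),
    Criterion("reports the effect size with units", 2, True),
    Criterion("flags the small sample size", 1, False),
]

normalized, passed = score_task(rubric)
print(f"{normalized:.2f} {passed}")  # 0.60 False
```

In this made-up case the response earns 6 of 10 points: real partial credit, but short of the 70% bar, so the task does not count as passed.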

That design captures something a single accuracy number cannot. A response can reach the right high-level conclusion and still be marked down for missing a key assay limitation. A partial answer can contain genuinely good reasoning without solving the task. LifeSciBench scores both: a normalized rubric reward, and a separate task pass rate. The methodology echoes the rubric-and-judge approach that has become standard in serious model evaluation, the same family of techniques covered in the LDS guide to LLM evaluation with RAGAS and LLM-as-judge.
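To see why the two metrics can diverge, here is a rough sketch that computes both from a list of per-task normalized scores. The summarize function and the five scores are invented for illustration, and the 70% threshold is applied as greater-than-or-equal by assumption.

```python
from statistics import mean


def summarize(normalized_scores, pass_threshold=0.70):
    """Return (mean normalized reward, task pass rate) for a set of tasks."""
    if not normalized_scores:
        raise ValueError("need at least one task score")
    reward = mean(normalized_scores)
    passed = sum(score >= pass_threshold for score in normalized_scores)
    return reward, passed / len(normalized_scores)


# Made-up scores: several decent partial answers, only one clears the bar.
scores = [0.90, 0.65, 0.60, 0.55, 0.20]
reward, pass_rate = summarize(scores)
print(f"mean reward {reward:.2f}, pass rate {pass_rate:.0%}")
# mean reward 0.58, pass rate 20%
```

A model can collect more than half the rubric points on average while passing only a small share of tasks, which is the shape the reported results take.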

The benchmark is also unusually multimodal. More than half of the tasks (53%) require the model to interpret at least one attached artifact, and the full set includes 1,062 artifacts: figures, PDFs, tables, genomic sequence files, chemical structure files, and web references. Roughly 79% of tasks demand multiple reasoning steps, averaging four steps each.

The Leaderboard Tells a Modest Story

Five models were evaluated in a single-turn setting, each seeing the prompt and any attachments once, with internet browsing allowed. GPT-Rosalind led, but the absolute numbers are low across the board.

Model             Normalized Score    Task Pass Rate
GPT-Rosalind      0.576               36.1%
GPT-5.5           0.519               25.7%
Gemini 3.1 Pro    0.515               23.6%
GPT-5.4           0.479               20.7%
Grok 4.3          0.399               13.0%

GPT-Rosalind improved on OpenAI's general-purpose GPT-5.5 by more than ten percentage points, a real jump. But the ranking hides as much as it reveals. GPT-Rosalind led on 386 of 750 tasks, while Gemini 3.1 Pro uniquely topped the field on 214 of them, a reminder that an aggregate score can bury task-specific strengths. Anthropic's Claude models were not included in the evaluation.

The hardest part of the benchmark stayed out of reach for everyone. On 171 tasks no model passed at all, and on 261 tasks the best model managed a pass rate below 20%. Roughly a third of LifeSciBench is currently a wall that no frontier system can reliably climb.
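The task counts quoted in this section are easier to compare as shares of the 750-task benchmark. The snippet below is simple arithmetic on the reported figures; the labels and the share helper are made up for illustration.

```python
TOTAL_TASKS = 750

# Task counts as reported in the article; labels are paraphrases.
counts = {
    "GPT-Rosalind led": 386,
    "Gemini 3.1 Pro uniquely led": 214,
    "no model passed": 171,
    "best pass rate below 20%": 261,
}


def share(count, total=TOTAL_TASKS):
    """Percentage of the benchmark that a task count represents."""
    return 100 * count / total


for label, count in counts.items():
    print(f"{label}: {count}/{TOTAL_TASKS} = {share(count):.1f}%")
```

The last line comes out at 34.8%, which is where the "a third of the benchmark" framing comes from.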

Where the models did show strength was telling. Frontier AI is best at structured judgment: scientific communication, translation, and evidence synthesis, the categories where the output format is stable and the task is to organize and explain. It is weakest exactly where research gets creative and constrained. Design, optimization, and prediction was one of the hardest workflows, with GPT-Rosalind clearing only 30.7%; analysis was slightly lower still, at 30.3%.

The Most Important Number Is the One About Data Files

For anyone deciding whether to put an AI model anywhere near a real research pipeline, the single most useful finding in LifeSciBench is not the 36% headline. It is what happens when you hand the model a file.

GPT-Rosalind's pass rate falls from 45.1% on text-only tasks to 28.1% on tasks that require reading at least one attached artifact, a drop of 17 percentage points. GPT-5.5 shows the same pattern on a smaller scale, sliding from 29.9% to 21.9%, a drop of 8 points.

Task Type                  GPT-Rosalind Pass Rate    GPT-5.5 Pass Rate
Text-only                  45.1%                     29.9%
With attached artifacts    28.1%                     21.9%
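As a sanity check, the split figures can be blended back into an overall pass rate using the article's statement that 53% of tasks carry at least one artifact. This is a back-of-the-envelope sketch: the 53% share is a rounded figure, the helper names are invented, and the blend assumes every task falls into exactly one of the two groups.

```python
ARTIFACT_SHARE = 0.53  # share of tasks with an artifact (rounded in the article)

# model: (text-only, with artifacts, reported overall), all in percent
pass_rates = {
    "GPT-Rosalind": (45.1, 28.1, 36.1),
    "GPT-5.5": (29.9, 21.9, 25.7),
}


def artifact_gap(text_only, with_artifacts):
    """Drop in pass rate, in percentage points, once a file is involved."""
    return text_only - with_artifacts


def blended_pass_rate(text_only, with_artifacts, artifact_share=ARTIFACT_SHARE):
    """Task-weighted average of the two subset pass rates."""
    return (1 - artifact_share) * text_only + artifact_share * with_artifacts


for model, (text_only, with_artifacts, reported) in pass_rates.items():
    gap = artifact_gap(text_only, with_artifacts)
    blended = blended_pass_rate(text_only, with_artifacts)
    print(f"{model}: gap {gap:.1f} points, "
          f"blended {blended:.1f}% (reported {reported}%)")
```

Under these assumptions the blended values land at about 36.1% and 25.7%, matching the leaderboard, so the split figures and the headline pass rates are at least consistent with each other.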

This matters because of where scientific data actually lives. Almost nothing in drug discovery exists as plain prose. The signal sits in figures, assay output files, spreadsheets, spatial transcriptomics datasets, and chemical structure databases. A model that answers text questions competently but stumbles when handed a genomic sequence file has a gap precisely in the part of the workflow that carries the real information.

The picture gets worse for tasks that demand exact outputs. GPT-Rosalind reached only 14.8% on numeric tasks and 24.0% on tasks requiring exact sequence or structure outputs, the kind of precision needed for CRISPR donor design or siRNA design, where a small error makes the answer useless. Models often got partway there: in roughly 14% of tasks, a model earned at least half the rubric credit while still failing the pass threshold, identifying the right evidence but missing a constraint, using the wrong data, or never connecting reasoning to a usable decision.

The Conflict of Interest Nobody Should Ignore

There is a structural problem at the center of LifeSciBench, and OpenAI does not hide it: the company that designed and administered the benchmark is the same company whose model tops the leaderboard. A 36.1% score from OpenAI about an OpenAI model is not the same thing as independent validation.

The concern is well documented. A peer-reviewed analysis published in Nature Medicine on June 12, examining OpenAI's earlier HealthBench evaluation, concluded that industry-built benchmarks can systematically favor their creators' systems and called for independently constructed tests. Early reactions to LifeSciBench raised similar objections about opaque expert selection and rival-focused framing.

OpenAI built in guardrails that are worth taking seriously. The tasks were written by scientists external to the company. A separate cohort of 453 reviewers, 97% of them holding doctorates with an average of twelve years of experience, validated the task set and reached more than 96% agreement on whether tasks were realistic, well-grounded, and useful. Each task averaged six automated review cycles and at least two rounds of expert review, with 90% reviewer agreement required per domain before acceptance.

Whether those safeguards fully offset a self-administered benchmark is a question the evaluation community will have to answer as outside researchers examine the preprint and the task set. The honest position is that LifeSciBench is a strong piece of measurement work with a built-in incentive that readers should keep in view. This is the same skepticism that served the field well when OpenAI made earlier scientific claims, including the contested math result the company later had to defend.

What a 36% Score Actually Means for Practitioners

The temptation is to read 36% as either a failure or a triumph. It is neither. It is a calibration.

A PitchBook analysis from January found that more than $17 billion has gone into AI drug discovery since 2019, and not one AI-developed drug has yet entered large-scale clinical trials. The distance between a benchmark score and a drug approval is long and poorly understood. LifeSciBench does not shrink that distance. What it does is mark, with unusual precision, the specific workflow categories, artifact types, and reasoning steps where frontier AI fails often enough that unsupervised use would be reckless.

That is genuinely useful to a research director or an enterprise buyer. The benchmark says, in effect: lean on these systems for literature synthesis, protocol drafting, and structured scientific communication, where they score highest, and keep a human firmly in the loop for experimental design, quantitative analysis, and anything that depends on reading a data file correctly. It is the practical inverse of the hype, and it lands close to the lesson of Humanity's Last Exam, the other benchmark built to find the ceiling rather than celebrate the floor.

GPT-Rosalind itself, the model at the top of this leaderboard, was built specifically for drug discovery and is already in use at Amgen, Moderna, and Novo Nordisk. OpenAI does not oversell it. Joy Jiao, OpenAI's life sciences research lead, said at the model's April launch that the company does not believe AI can yet create new disease treatments on its own. The benchmark released this week is the receipt for that caution.

The Bottom Line

OpenAI spent enormous effort building a test its own flagship model would visibly fail most of, and that is the most credible thing about it. A benchmark where the best score is 36% cannot be accused of being designed to flatter. The artifact gap is the finding that should travel: the moment these models stop reading text and start reading the figures, files, and structures where science actually happens, their reliability falls off a cliff.

The number that matters next is not a benchmark score at all. It is whether any of this translates into a real research program moving faster, a question OpenAI itself says can only be answered by studying the models in live labs over months, not in a single-turn eval. Until then, the most honest reading of LifeSciBench is the one a working scientist would give: useful collaborator, unreliable authority, and absolutely not ready to be left alone with the data.
