OpenAI Introduces GeneBench-Pro for Computational Biology Reasoning

By LDS Team
Relevance Score: 7.4

For teams building AI-for-science systems, the bottleneck is no longer recalling facts or running a fixed pipeline; it is deciding which analysis a messy dataset can actually support. GeneBench-Pro, released by OpenAI on June 30, 2026, is built to measure exactly that judgment. The benchmark presents an agent with 129 synthetic problems across genomics, quantitative biology, and translational medicine, each pairing a realistic and deliberately noisy dataset with a target estimand tied to a downstream decision. Reviewers estimated each problem would take a human expert 20 to 40 hours. Because every problem is generated from a known causal structure, correctness is graded deterministically, sidestepping the rubric variability that weakens many long-horizon science benchmarks. OpenAI reports that its strongest model, GPT-5.6 Sol, solves 28.7 percent of problems at the highest reasoning level and 31.5 percent with Pro mode, up sharply from below 5 percent for GPT-5 when the original GeneBench was built. OpenAI frames the gap to open-weight models such as GLM 5.2 as evidence that open systems are tuned more for coding than for broad scientific reasoning.

Why it matters

Most frontier benchmarks reward fact recall or execution of a known workflow, both of which top models have largely saturated. GeneBench-Pro instead probes what OpenAI calls research taste: the chain of judgment calls about which questions a dataset can support, when early diagnostics should change the model, and when a result is decision-ready. For practitioners deploying agents against real lab data, this is the capability that determines whether an agent assists a scientist or quietly produces a confidently wrong answer.

What OpenAI released

The benchmark contains 129 problems spanning 10 domains and 21 sub-domains, from statistical and population genetics to clinical pharmacogenomics and cancer genomics. Each task hands the agent an isolated workspace with data files, a short experimental context, and a standard bioinformatics stack including Python and PLINK 2.0. Critically, every problem is built synthetically from a known data-generating process, so OpenAI can grade against ground truth, tune difficulty, and verify through ablations that plausible-but-wrong analyses fail. OpenAI says it audited drafts for information leakage and shortcut solutions, and sent 82 problems to external domain experts for review. Ten representative questions and a 50-question subset are being opened for third-party benchmarking.
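The grading idea described above can be illustrated with a toy example. This is a minimal sketch, not OpenAI's harness: the simulator, the effect sizes, the tolerance, and every function name here are hypothetical, chosen only to show how a known data-generating process permits deterministic grading and lets an ablation confirm that a plausible-but-wrong analysis fails.

```python
import random
import statistics

TRUE_EFFECT = 0.5  # hypothetical ground-truth estimand, fixed by the simulator


def simulate(n=20000, seed=0):
    """Draw (confounder, exposure, outcome) rows from a known causal structure."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0, 1)                  # confounder (think: ancestry)
        x = 0.8 * z + rng.gauss(0, 1)        # exposure depends on the confounder
        y = TRUE_EFFECT * x + 1.5 * z + rng.gauss(0, 1)
        rows.append((z, x, y))
    return rows


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs (with intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx


def residualize(vs, zs):
    """Remove the linear effect of zs from vs."""
    b = slope(zs, vs)
    mv, mz = statistics.fmean(vs), statistics.fmean(zs)
    return [v - mv - b * (z - mz) for v, z in zip(vs, zs)]


def naive_estimate(rows):
    """Plausible-but-wrong analysis: ignores the confounder."""
    _, xs, ys = zip(*rows)
    return slope(xs, ys)


def adjusted_estimate(rows):
    """Adjusts for the confounder by residualizing both variables on it."""
    zs, xs, ys = zip(*rows)
    return slope(residualize(xs, zs), residualize(ys, zs))


def grade(answer, truth=TRUE_EFFECT, tol=0.05):
    """Deterministic pass/fail: no rubric, just distance to ground truth."""
    return abs(answer - truth) <= tol


rows = simulate()
print("naive passes:", grade(naive_estimate(rows)))
print("adjusted passes:", grade(adjusted_estimate(rows)))
```

In this toy setup the naive slope absorbs the confounder's contribution and lands far from the true effect, while the adjusted estimate recovers it within tolerance. The same seed always yields the same data and the same verdict, which is the property that removes rubric variability.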

The results

GPT-5.6 Sol attains a 28.7 percent pass rate at the highest reasoning level and 31.5 percent with Pro mode, versus below 5 percent for GPT-5 when the original GeneBench was built. The results also illustrate test-time compute scaling: at the lowest reasoning level Sol scores in the single digits, while at the highest it solves roughly six times as many questions as GPT-5.2 using about two-thirds the tokens. OpenAI states that the performance gap to open-weight models such as GLM 5.2 is larger than coding benchmarks would predict.
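A quick back-of-the-envelope check makes these figures more concrete. The sketch below is illustrative arithmetic only: it treats "roughly six times" and "about two-thirds" as exact ratios, assumes both refer to the same set of problems, and uses hypothetical helper names. The implied counts need not be whole numbers, since reported pass rates may be averaged over several runs.

```python
from fractions import Fraction

N_PROBLEMS = 129  # benchmark size reported in the article


def implied_solved(pass_rate, n=N_PROBLEMS):
    """Approximate number of problems solved at a given pass rate."""
    return pass_rate * n


def solved_per_token_ratio(solved_ratio, token_ratio):
    """Relative problems-solved-per-token, given two relative ratios."""
    return solved_ratio / token_ratio


# Assumption: the article's rounded phrases taken as exact values.
solved_ratio = Fraction(6, 1)   # "roughly six times as many questions"
token_ratio = Fraction(2, 3)    # "about two-thirds the tokens"

print("implied solved at 28.7%:", round(implied_solved(0.287), 1))
print("implied solved at 31.5%:", round(implied_solved(0.315), 1))
print("solved-per-token ratio:", solved_per_token_ratio(solved_ratio, token_ratio))
```

Under those assumptions, the two ratios compound to about a ninefold gain in problems solved per token, which is a derived estimate rather than a figure OpenAI reports.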

Practitioner takeaways

Three patterns are worth internalizing. First, deterministic, simulation-backed grading is a credible answer to benchmark gaming and rubric noise, and it is a design other evaluation teams can copy. Second, the failure mode reviewers flagged, that agents are not cautious enough about data irregularities such as ancestry swaps or ancient-DNA bias, mirrors what breaks production data pipelines rather than anything specific to exotic science. Third, the economics already favor partial automation: each problem represents thousands of dollars of human-expert labor but only a few dollars of inference, so even a pass rate below one third carries real economic value before reliability improves.
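The economic point can be made concrete with a rough calculation. The hourly rate and per-attempt inference cost below are hypothetical placeholders, not figures from OpenAI; only the 20 to 40 hour estimate and the 28.7 percent pass rate come from the article. The sketch also ignores the cost of checking which agent answers are actually correct, which in practice could dominate.

```python
HOURS_LOW, HOURS_HIGH = 20, 40  # reviewer estimate per problem (from the article)


def human_cost_range(hourly_rate):
    """Expert labor cost per problem at an assumed hourly rate."""
    return (HOURS_LOW * hourly_rate, HOURS_HIGH * hourly_rate)


def inference_cost_per_solved(cost_per_attempt, pass_rate):
    """Expected inference spend per correctly solved problem."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_attempt / pass_rate


# Hypothetical inputs, chosen only to match the article's orders of magnitude.
ASSUMED_HOURLY_RATE = 100.0     # dollars per expert hour (assumption)
ASSUMED_ATTEMPT_COST = 5.0      # dollars of inference per attempt (assumption)

low, high = human_cost_range(ASSUMED_HOURLY_RATE)
per_solved = inference_cost_per_solved(ASSUMED_ATTEMPT_COST, 0.287)
print(f"human expert: ${low:,.0f} to ${high:,.0f} per problem")
print(f"inference: about ${per_solved:,.2f} per solved problem")
```

Dividing the per-attempt cost by the pass rate accounts for the failed attempts that must be paid for along the way, so the comparison stays fair even when most attempts fail.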

Watch next

OpenAI says it will provide a 50-question subset to Artificial Analysis for independent verification and expects the benchmark could approach saturation by year end, a pace that would compress the window in which GeneBench-Pro remains a useful discriminator.

Key Points

  • OpenAI released GeneBench-Pro, a 129-problem benchmark testing AI judgment over messy, realistic computational biology datasets across ten scientific domains.
  • Synthetic, simulation-backed construction lets OpenAI grade deterministically against ground truth, resisting the benchmark gaming and rubric noise that weaken many science evals.
  • GPT-5.6 Sol solves 28.7 percent at the highest reasoning level, signaling fast progress yet ample headroom for AI-for-science automation.

Scoring Rationale

An important, well-constructed benchmark from a frontier lab targeting scientific judgment in computational biology, a capability current evals largely miss. The simulation-backed construction, domain-expert validation of 82 problems, and concrete frontier-model scores (GPT-5.6 Sol at 28.7%) make it immediately relevant to practitioners building AI-for-science agents and to evaluation teams seeking gaming-resistant designs. Domain scope keeps it below industry-shaking tier.
