Why it matters
Most frontier benchmarks reward fact recall or execution of a known workflow, both of which top models have largely saturated. GeneBench-Pro instead probes what OpenAI calls research taste: the chain of judgment calls about which questions a dataset can support, when early diagnostics should change the model, and when a result is decision-ready. For practitioners deploying agents against real lab data, this is the capability that determines whether an agent assists a scientist or quietly produces a confidently wrong answer.
What OpenAI released
The benchmark contains 129 problems spanning 10 domains and 21 sub-domains, from statistical and population genetics to clinical pharmacogenomics and cancer genomics. Each task hands the agent an isolated workspace with data files, a short experimental context, and a standard bioinformatics stack including Python and PLINK 2.0. Critically, every problem is built synthetically from a known data-generating process, so OpenAI can grade against ground truth, tune difficulty, and verify through ablations that plausible-but-wrong analyses fail. OpenAI says it audited drafts for information leakage and shortcut solutions, and sent 82 problems to external domain experts for review. Ten representative questions and a 50-question subset are being opened for third-party benchmarking.
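To make the idea of simulation-backed grading concrete, here is a minimal sketch in Python. It is not OpenAI's harness: the data-generating process, the effect size, the tolerance, and every function name (`simulate`, `naive_estimate`, `adjusted_estimate`, `grade`) are hypothetical and chosen only for illustration. Because the true effect is planted in the simulated data, a grader can check a submitted number deterministically, and an ablation can confirm that a plausible-but-wrong analysis, here one that ignores a hidden ancestry confounder, fails.

```python
import random

TRUE_EFFECT = 0.5  # assumption: the ground-truth effect planted by the simulator


def simulate(n=5000, true_effect=TRUE_EFFECT, seed=0):
    """Toy data-generating process with a hidden confounder (ancestry group)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        ancestry = rng.random() < 0.5            # two ancestry groups
        freq = 0.7 if ancestry else 0.2          # allele frequency differs by group
        genotype = sum(rng.random() < freq for _ in range(2))  # 0, 1 or 2 copies
        # The trait depends on genotype AND on ancestry, so ancestry confounds.
        trait = true_effect * genotype + 2.0 * ancestry + rng.gauss(0, 1)
        rows.append((ancestry, genotype, trait))
    return rows


def slope(pairs):
    """Ordinary least-squares slope of y on x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx


def naive_estimate(rows):
    """Plausible-but-wrong analysis: regress trait on genotype, ignore ancestry."""
    return slope([(g, t) for _, g, t in rows])


def adjusted_estimate(rows):
    """Center genotype and trait within each ancestry group, then pool."""
    centered = []
    for group in (False, True):
        sub = [(g, t) for a, g, t in rows if a == group]
        mean_g = sum(g for g, _ in sub) / len(sub)
        mean_t = sum(t for _, t in sub) / len(sub)
        centered += [(g - mean_g, t - mean_t) for g, t in sub]
    return slope(centered)


def grade(estimate, truth=TRUE_EFFECT, tol=0.1):
    """Deterministic pass/fail check against the known ground truth."""
    return abs(estimate - truth) <= tol


if __name__ == "__main__":
    rows = simulate()
    print("naive passes:   ", grade(naive_estimate(rows)))
    print("adjusted passes:", grade(adjusted_estimate(rows)))
```

The point of the design is that the grader needs no rubric or judge model: it compares one number against a value the task author already knows, and the confounded analysis serves as a built-in check that shortcuts do not pass.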
The results
GPT-5.6 Sol attains a 28.7 percent pass rate at the highest reasoning level and 31.5 percent with Pro mode, versus below 5 percent for GPT-5 on the original GeneBench. The data also illustrate test-time compute scaling: at the lowest reasoning level Sol scores in the single digits, and at the highest it solves roughly six times as many questions as GPT-5.2 while using about two-thirds as many tokens. OpenAI states that the performance gap to open-weight models such as GLM 5.2 is larger than coding benchmarks would predict.
Practitioner takeaways
Three patterns are worth internalizing. First, deterministic, simulation-backed grading is a credible answer to benchmark gaming and rubric noise, and it is a design other evaluation teams can copy. Second, the failure mode reviewers flagged, that agents are not cautious enough about data irregularities such as ancestry swaps or ancient-DNA bias, mirrors what breaks production data pipelines rather than exotic science. Third, the economics are already notable: each problem represents thousands of dollars of human-expert labor but only a few dollars of inference, so even a pass rate below one-third suggests partial automation carries real economic value before reliability improves.
Watch next
OpenAI says it will provide a 50-question subset to Artificial Analysis for independent verification and expects the benchmark could approach saturation by year end, a pace that would compress the window in which GeneBench-Pro remains a useful discriminator.
Key Points
- OpenAI released GeneBench-Pro, a 129-problem benchmark testing AI judgment over messy, realistic computational biology datasets across ten scientific domains.
- Synthetic, simulation-backed construction lets OpenAI grade deterministically against ground truth, resisting the benchmark gaming and rubric noise that weaken many science evals.
- GPT-5.6 Sol solves 28.7 percent at the highest reasoning level, signaling fast progress yet ample headroom for AI-for-science automation.
Scoring Rationale
An important, well-constructed benchmark from a frontier lab targeting scientific judgment in computational biology, a capability current evals largely miss. The simulation-backed construction, external domain-expert review of 82 problems, and concrete frontier-model scores (GPT-5.6 Sol at 28.7%) make it immediately relevant to practitioners building AI-for-science agents and to evaluation teams seeking gaming-resistant designs. Its narrow domain scope keeps it below the industry-shaking tier.

