OpenAI has unveiled GeneBench‑Pro, a benchmark that pushes artificial‑intelligence agents to tackle the judgment‑heavy analyses that real‑world computational biologists routinely perform. The new suite builds on the original GeneBench by adding 129 synthetic problems that span genomics, quantitative biology and translational medicine.

GeneBench‑Pro asks a model to decide whether a dataset can answer a question, to adjust its analysis plan when the data reveal unexpected patterns, and to determine when a result is ready for decision‑making. Each problem delivers a realistic, messy data file, a brief experimental context and a target estimand that ties the analysis to a downstream decision. The model must explore the data, pick a suitable statistical approach, iterate on the analysis and produce a final answer.

Real scientific work is not a single, deterministic pipeline: it requires interpreting noisy data, revising assumptions and choosing among multiple viable paths. To capture this complexity, the benchmark’s developers simulate each dataset from a known causal structure, so the correct answer is known by construction. The synthetic design lets evaluators verify that only correct analytical pathways lead to the expected answer, while plausible but incorrect analyses fail.
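A toy sketch can make the idea of simulating from a known causal structure concrete. The example below is not taken from GeneBench‑Pro: the variable names, effect sizes, tolerance and the `passes` check are all hypothetical, chosen only to show how a plausible but incorrect analysis can be made to fail while a correct one succeeds.

```python
import numpy as np

def simulate(n=5000, true_effect=2.0, seed=0):
    """Simulate data where a confounder z drives both exposure x and outcome y.

    All coefficients here are illustrative assumptions, not benchmark values.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                               # hidden driver, e.g. a batch effect
    x = 1.5 * z + rng.normal(size=n)                     # exposure depends on z
    y = true_effect * x + 3.0 * z + rng.normal(size=n)   # outcome depends on x and z
    return x, y, z

def ols_slope(y, *covariates):
    """Least-squares coefficient on the first covariate, with an intercept."""
    design = np.column_stack([np.ones_like(y), *covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

def passes(estimate, truth, tol=0.1):
    """Hypothetical grader: accept an answer within tol of the simulated truth."""
    return abs(estimate - truth) <= tol

x, y, z = simulate()
naive = ols_slope(y, x)        # plausible but wrong: ignores the confounder
adjusted = ols_slope(y, x, z)  # correct: adjusts for the confounder

print(f"naive={naive:.2f} passes={passes(naive, 2.0)}")
print(f"adjusted={adjusted:.2f} passes={passes(adjusted, 2.0)}")
```

Because the true effect (2.0 here) is fixed by the simulation, the grader can separate the two analyses: the unadjusted slope lands well above the truth, while the adjusted one recovers it to within sampling noise.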

To ensure realism, 82 of the 129 problems were sent for review to domain experts, including graduate students, postdoctoral researchers, industry scientists and professors. One reviewer noted that the problems would challenge a graduate student working without iterative feedback from a supervisor, because they contain technical and quality‑control issues that require careful analysis.

OpenAI tested GeneBench‑Pro with its frontier GPT models. The top performer, GPT‑5.6 Sol, achieved a pass rate of 28.7% at the highest reasoning level and 31.5% when run in Pro mode. By comparison, GPT‑5 scored below 5% on the original GeneBench. Results also show that scaling test‑time compute improves performance and that newer models use it more efficiently: at the highest reasoning level, GPT‑5.6 Sol solves nearly six times as many questions as GPT‑5.2 while using about two‑thirds as many tokens.

Comparisons across model families indicate that GPT models remain the strongest performers on high‑level scientific reasoning under quantitative uncertainty. Open‑source models such as GLM‑5.2 lag behind by a wider margin than coding benchmarks would suggest. The developers acknowledged that frontier GPT models were used to harden the problems, but competitor models, each compared with the corresponding GPT model at its release, generally fell short.

GeneBench‑Pro is partially public. OpenAI has released 10 representative questions on Hugging Face, complete with an interactive web interface, and will provide a 50‑question subset to Artificial Analysis for independent benchmarking.

The benchmark underscores the cost gap between human and AI analysis. Reviewers estimate that a typical GeneBench‑Pro problem would take a human expert 20–40 hours to complete, which comes to roughly $4,000 to $8,000 at a conservative $200 per hour. In contrast, inference costs for an AI agent are only a few dollars per problem. Even partial automation at current capabilities could therefore deliver significant economic and scientific value.

OpenAI’s statement frames the benchmark as an initial effort to evaluate the abstract skills required for good scientific judgment. If AI agents can reliably automate this class of analysis, they could accelerate hypothesis triage, target follow‑up and the iterative cycle between data generation and decision‑making in biomedical research.

The release comes at a time when sequencing costs have fallen sharply and biobank‑scale datasets link molecular, phenotypic and health‑record information. The bottleneck is increasingly the downstream computation and analysis, and GeneBench‑Pro provides a tool for measuring progress toward AI systems that can perform the complex, judgment‑heavy analyses that teams of human experts currently carry out.

In summary, GeneBench‑Pro offers a rigorous, synthetic benchmark that tests AI agents on the high‑level reasoning required in computational biology. Current frontier models solve fewer than a third of the problems, indicating substantial room for improvement. The benchmark will help identify where models fail and guide future research aimed at closing the inferential loop.