A new reasoning model from OpenAI has achieved record-breaking scores on graduate-level STEM evaluations, outperforming the average human expert by roughly 14 points across biology, chemistry, physics, and mathematics. Researchers are now debating what this means for AI's role in accelerating scientific discovery.
OpenAI's latest reasoning model has achieved unprecedented performance on a battery of graduate-level STEM benchmarks, outscoring the average PhD-level human expert by a margin researchers describe as 'statistically significant and practically meaningful.'
The model, evaluated across biology, chemistry, physics, and mathematics domains, demonstrated the ability to construct multi-step logical chains, identify errors in peer-reviewed papers, and generate novel experimental hypotheses — capabilities that were considered the exclusive province of trained scientists as recently as 2023.
What the benchmarks actually measure
The evaluation suite used in this study is notably different from standard LLM benchmarks. Rather than testing recall or pattern matching, it presents unseen research problems that require genuine reasoning, sourced from recent PhD qualifying exams and unpublished research-competition material chosen to prevent training-data contamination.
Across 847 novel benchmark problems, the model achieved a mean score of 71.3%, compared with a human expert average of 57.1%. The gap was largest in chemistry (18 percentage points) and smallest in theoretical physics (6 points).
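To put the headline gap in context, the sketch below shows one way a 14-point difference over 847 problems could be checked for statistical significance: bootstrap the difference in mean scores and see whether the confidence interval excludes zero. The per-problem 0/1 scores and the bootstrap procedure are illustrative assumptions; the article does not describe how the researchers ran their own analysis.

```python
# Illustrative sketch only: hypothetical per-problem scores consistent with
# the reported means (model ~71.3%, experts ~57.1% over 847 problems).
# This is not OpenAI's evaluation code or statistical method.
import numpy as np

rng = np.random.default_rng(0)
n_problems = 847

# Simulated binary (correct/incorrect) outcomes per problem.
model_scores = rng.random(n_problems) < 0.713
expert_scores = rng.random(n_problems) < 0.571

observed_gap = model_scores.mean() - expert_scores.mean()

# Bootstrap the gap by resampling problems with replacement.
boot_gaps = []
for _ in range(10_000):
    idx = rng.integers(0, n_problems, n_problems)
    boot_gaps.append(model_scores[idx].mean() - expert_scores[idx].mean())
boot_gaps = np.array(boot_gaps)

ci_low, ci_high = np.percentile(boot_gaps, [2.5, 97.5])
print(f"observed gap: {observed_gap:.3f}")
print(f"95% bootstrap CI: [{ci_low:.3f}, {ci_high:.3f}]")
# If the interval excludes zero, the gap is unlikely to be sampling noise alone.
```

A bootstrap is used here rather than a t-test only because it makes no distributional assumptions about binary per-problem scores; with 847 problems, either approach would flag a 14-point gap as far larger than sampling error.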
Critics of the evaluation methodology point out that benchmark performance does not directly translate to real-world scientific productivity. Dr. Sarah Chen, a computational biologist at MIT, notes that 'solving a structured exam question is meaningfully different from the messy, open-ended nature of live research.' That said, she acknowledges the results are 'harder to dismiss than previous AI science claims.'
Implications for research pipelines
Several biotech and pharmaceutical companies have already deployed early versions of reasoning-capable models as part of their drug discovery pipelines. The practical question is no longer whether AI can reach expert-level performance on isolated tasks — it's whether it can operate reliably in noisy, real-world research environments where ground truth is often unknown.
If the trajectory holds, the implications for scientific labor markets, publication norms, and peer review processes are substantial. Researchers who specialize in literature synthesis and hypothesis generation may find their roles shifting sooner than anticipated.
OpenAI has not yet published the full model card or system prompt used during evaluation. Independent replication attempts are already underway at Stanford and DeepMind.