“Andrew White, a chemist at FutureHouse, a non-profit organization in San Francisco that focuses on how AI can be applied to molecular biology, says that observers have been surprised and disappointed by a general lack of improvement in chatbots’ ability to support scientific tasks over the past year and a half, since the public release of GPT-4. The o1 series, he says, has changed that.
Strikingly, o1 has become the first large language model to beat PhD-level scholars on the hardest series of questions, the ‘diamond’ set, in a test called the Graduate-Level Google-Proof Q&A Benchmark (GPQA). OpenAI says that the scholars scored just under 70% on GPQA Diamond, and o1 scored 78% overall, with a particularly high score of 93% in physics…
OpenAI also tested o1 on a qualifying exam for the International Mathematics Olympiad. Its previous best model, GPT-4o, correctly solved only 13% of the problems, whereas o1 scored 83%.”
From Nature.