“In one literature-review benchmark known as ScholarQA-CS, GPT-5 ‘performs well’ when it is allowed to access the web, says Akari Asai, an AI researcher at the Allen Institute for Artificial Intelligence, based in Seattle, Washington, who ran the tests for Nature. In producing answers to open-ended computer-science questions, for example, the model performed marginally better than human experts did, achieving a correctness score of 55% (based on measures such as how well its statements are supported by citations) compared with 54% for scientists, but it fell just behind a version of the institute’s own LLM-based literature-review system, OpenScholar, which scored 57%.
However, GPT-5’s performance suffered when the model was unable to get online, says Asai. The ability to cross-check claims against academic databases is a key feature of most AI-powered systems designed to help with literature reviews. Without Internet access, GPT-5 fabricated or muddled half as many citations as one of its predecessors, GPT-4o, did, but it still got them wrong 39% of the time, she says.
On the LongFact benchmark, which tests accuracy in long-form responses to prompts, OpenAI reported that GPT-5 hallucinated 0.8% of claims in responses about people or places when it was allowed to browse the web, compared with 5.1% for OpenAI’s reasoning model o3. Performance dropped when browsing was not permitted: GPT-5’s error rate climbed to 1.4% and o3’s to 7.9%. Both models fared worse than the non-reasoning model GPT-4o, which had an error rate of 1.1% when offline.”
From Nature.