The generative AI pioneer OpenAI released an open-source benchmark for researchers to evaluate the factuality of AI models.
OpenAI posted SimpleQA to GitHub this week to help developers measure the ability of large language models (LLMs) to answer short, "fact-seeking" questions, the company said.
OpenAI is sharing the benchmark code to be transparent about the "accuracy" numbers it publishes alongside its latest models.
The SimpleQA benchmark is designed to be a "challenge" to foundation models and features 4,326 questions on a wide range of topics, from science and technology to TV shows and video games.
Researchers can use the benchmark to measure the calibration of LLMs by directly asking a model to state its confidence in its answer. A "perfectly" calibrated model's accuracy would match its stated confidence: answers it gives with 80 percent confidence would be correct about 80 percent of the time.
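As a rough sketch (not OpenAI's own evaluation code), that stated-confidence check could be scored like this in Python; the `results` input, a hypothetical list of (stated confidence, graded correct) pairs, is assumed for illustration:

```python
# Sketch of a calibration check based on stated confidence (illustrative only).
# results: list of (stated_confidence in [0, 1], is_correct bool) pairs,
# one per graded question.
from collections import defaultdict

def calibration_by_stated_confidence(results, bucket_size=0.1):
    n_buckets = int(round(1 / bucket_size))
    buckets = defaultdict(list)
    for confidence, is_correct in results:
        # Group answers into confidence buckets (0-10%, 10-20%, ...).
        bucket = min(int(confidence / bucket_size), n_buckets - 1)
        buckets[bucket].append((confidence, is_correct))
    report = {}
    for bucket, items in sorted(buckets.items()):
        mean_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        # Perfect calibration: accuracy tracks mean stated confidence in every bucket.
        report[bucket] = {"mean_confidence": mean_confidence,
                          "accuracy": accuracy,
                          "count": len(items)}
    return report
```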
Researchers using the benchmark can also ask the model the same question 100 times. A "well-calibrated" model's accuracy would match how often it repeats a given answer: a response it produces in, say, 80 of 100 attempts should be correct about 80 percent of the time.
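A minimal sketch of that repeated-sampling check, again assuming a hypothetical `ask_model` function and simple exact-match grading against reference answers rather than OpenAI's actual grader:

```python
# Sketch of a calibration check based on answer frequency (illustrative only).
# ask_model(question) -> answer string; reference_answers holds the ground truth.
from collections import Counter

def calibration_by_answer_frequency(ask_model, questions, reference_answers,
                                    n_samples=100):
    records = []
    for question, reference in zip(questions, reference_answers):
        # Sample the same question many times and count identical responses.
        counts = Counter(ask_model(question) for _ in range(n_samples))
        top_answer, count = counts.most_common(1)[0]
        records.append({
            "question": question,
            "answer_frequency": count / n_samples,  # frequency as a confidence proxy
            "is_correct": top_answer == reference,  # exact match for brevity
        })
    # Well calibrated: among questions whose top answer appears ~X% of the time,
    # roughly X% of those top answers should be correct.
    return records
```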
OpenAI hopes that open-sourcing SimpleQA will drive research on "more trustworthy and reliable AI."
A research team at OpenAI also published a 14-page paper on the benchmark: "Measuring Short-Form Factuality in Large Language Models."