The generative AI pioneer OpenAI released an open-source benchmark for researchers to evaluate the factuality of AI models.
OpenAI posted SimpleQA to GitHub this week to help developers measure the ability of large language models (LLMs) to answer short, "fact-seeking" questions, the company said.
OpenAI is sharing the benchmark code to be transparent about the "accuracy" numbers it publishes alongside its latest models.
The SimpleQA benchmark is designed to be a "challenge" to foundation models and features 4,326 questions on a wide range of topics, from science and technology to TV shows and video games.
Researchers can use the benchmark to measure the calibration of LLMs by directly asking a model to state its confidence in its answer. A "perfectly" calibrated model's accuracy would match its stated confidence: answers it gives with 80 percent confidence would be correct about 80 percent of the time.
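As a rough sketch (not OpenAI's own evaluation code), that stated-confidence check could be scored like this in Python; the `results` input, a hypothetical list of (stated confidence, graded correct) pairs, is assumed for illustration:

```python
# Sketch of a calibration check based on stated confidence (illustrative only).
# results: list of (stated_confidence in [0, 1], is_correct bool) pairs,
# one per graded question.
from collections import defaultdict

def calibration_by_stated_confidence(results, bucket_size=0.1):
    n_buckets = int(round(1 / bucket_size))
    buckets = defaultdict(list)
    for confidence, is_correct in results:
        # Group answers into confidence buckets (0-10%, 10-20%, ...).
        bucket = min(int(confidence / bucket_size), n_buckets - 1)
        buckets[bucket].append((confidence, is_correct))
    report = {}
    for bucket, items in sorted(buckets.items()):
        mean_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        # Perfect calibration: accuracy tracks mean stated confidence in every bucket.
        report[bucket] = {"mean_confidence": mean_confidence,
                          "accuracy": accuracy,
                          "count": len(items)}
    return report
```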
Researchers using the benchmark can also ask the model the same question 100 times. A "well-calibrated" model's accuracy would match how often it repeats a given answer: a response it produces in, say, 80 of 100 attempts should be correct about 80 percent of the time.
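A minimal sketch of that repeated-sampling check, again assuming a hypothetical `ask_model` function and simple exact-match grading against reference answers rather than OpenAI's actual grader:

```python
# Sketch of a calibration check based on answer frequency (illustrative only).
# ask_model(question) -> answer string; reference_answers holds the ground truth.
from collections import Counter

def calibration_by_answer_frequency(ask_model, questions, reference_answers,
                                    n_samples=100):
    records = []
    for question, reference in zip(questions, reference_answers):
        # Sample the same question many times and count identical responses.
        counts = Counter(ask_model(question) for _ in range(n_samples))
        top_answer, count = counts.most_common(1)[0]
        records.append({
            "question": question,
            "answer_frequency": count / n_samples,  # frequency as a confidence proxy
            "is_correct": top_answer == reference,  # exact match for brevity
        })
    # Well calibrated: among questions whose top answer appears ~X% of the time,
    # roughly X% of those top answers should be correct.
    return records
```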
OpenAI hopes that open-sourcing SimpleQA will drive research on "more trustworthy and reliable AI."
A research team at OpenAI also published a 14-page paper on the benchmark: "Measuring Short-Form Factuality in Large Language Models."