OpenAI Shares MLE-bench Benchmark for AI Agents

By Chris Ehrlich

What tasks does the benchmark evaluate?

SAN FRANCISCO — The generative AI pioneer OpenAI released an open-source benchmark to evaluate AI agents.

OpenAI posted MLE-bench to GitHub yesterday to help developers determine how well AI agents perform machine learning (ML) engineering, the company said.

The company is sharing the benchmark code to "facilitate future research" aimed at understanding AI agent capabilities on these tasks.

The MLE-bench benchmark features a set of "challenging tasks" that test real-world ML engineering skills, such as training models, preparing data sets and running experiments.
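For readers unfamiliar with the workflow, the toy Python sketch below illustrates the kind of end-to-end task the article describes: preparing a data set, training a model and running an evaluation. It uses scikit-learn's built-in breast cancer data set purely for illustration and is not code from MLE-bench itself.

```python
# Illustrative only: a toy version of the kind of ML engineering task
# MLE-bench is described as testing (prepare data, train a model, run an
# evaluation). Not code from the benchmark.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Prepare the data set: load features and labels, hold out a validation split.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train a model: a simple scaled logistic-regression pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Run an experiment: score the held-out split with a Kaggle-style metric.
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"Validation AUC: {val_auc:.3f}")
```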

To develop its MLE-bench benchmark, OpenAI curated 75 ML engineering-related competitions from Kaggle.

OpenAI established human baselines for each competition, using Kaggle's publicly available leaderboards.
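As a hypothetical illustration of how a public leaderboard can serve as a human baseline, the sketch below ranks an agent's competition score against a list of human scores. The leaderboard values, the agent score and the leaderboard_percentile helper are invented for this example and are not part of OpenAI's benchmark.

```python
# Hypothetical sketch: compare an agent's score against a public leaderboard
# to derive a human-relative baseline. All numbers below are made up.
from bisect import bisect_left

def leaderboard_percentile(agent_score: float, human_scores: list[float]) -> float:
    """Return the fraction of human leaderboard entries the agent outscores.

    Assumes higher scores are better.
    """
    ranked = sorted(human_scores)
    # Count of human scores strictly below the agent's score.
    beaten = bisect_left(ranked, agent_score)
    return beaten / len(ranked)

# Made-up leaderboard scores and agent result for demonstration.
human_scores = [0.71, 0.74, 0.78, 0.80, 0.83, 0.86, 0.88, 0.90, 0.92, 0.95]
agent_score = 0.87

pct = leaderboard_percentile(agent_score, human_scores)
print(f"Agent outperforms {pct:.0%} of the human leaderboard")
```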

A team of OpenAI researchers also addresses resource scaling for AI agents and the impact of contamination from pre-training in a companion 27-page paper on the benchmark: "MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering."

About the Author
Chris Ehrlich

Chris Ehrlich is the former editor in chief and a co-founder of VKTR. He's an award-winning journalist with over 20 years in content, covering AI, business and B2B technologies. His versatile reporting has appeared in over 20 media outlets. He's an author and holds a B.A. in English and political science from Denison University.

Main image: By Christina @ wocintechchat.com.