SAN FRANCISCO — The generative AI pioneer OpenAI released an open-source benchmark to evaluate AI agents.
OpenAI said yesterday that it posted MLE-bench to GitHub to help developers determine how well AI agents perform machine learning (ML) engineering.
The company is sharing the benchmark code to "facilitate future research" into AI agents' capabilities on such tasks.
MLE-bench features a set of "challenging tasks" that test real-world ML engineering skills, such as training models, preparing data sets and running experiments.
To develop MLE-bench, OpenAI curated 75 ML engineering-related competitions from Kaggle.
OpenAI established human baselines for each competition, using Kaggle's publicly available leaderboards.
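As a rough illustration of how a leaderboard-derived baseline can be used (the file name, "score" column and higher-is-better assumption below are hypothetical, not taken from OpenAI's released code), the following sketch compares an agent's score against a public Kaggle leaderboard:

```python
# Illustrative sketch only: assumes a hypothetical exported leaderboard CSV
# with a "score" column where higher scores are better. MLE-bench's actual
# grading logic may differ per competition.
import csv


def fraction_of_humans_beaten(agent_score: float, leaderboard_csv: str) -> float:
    """Return the share of human leaderboard entries the agent's score exceeds."""
    with open(leaderboard_csv, newline="") as f:
        human_scores = [float(row["score"]) for row in csv.DictReader(f)]
    if not human_scores:
        raise ValueError("leaderboard is empty")
    beaten = sum(1 for s in human_scores if agent_score > s)
    return beaten / len(human_scores)


if __name__ == "__main__":
    # Hypothetical usage: an agent scoring 0.87 on a competition whose public
    # leaderboard was exported to example-competition-leaderboard.csv.
    pct = fraction_of_humans_beaten(0.87, "example-competition-leaderboard.csv")
    print(f"Agent beats {pct:.1%} of human leaderboard entries")
```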
In a companion 27-page paper on the benchmark, "MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering," a team of OpenAI researchers also addresses resource scaling for AI agents and the impact of contamination from pre-training.