SAN FRANCISCO — The generative AI pioneer OpenAI released an open-source benchmark to evaluate AI agents.
OpenAI said yesterday that it posted MLE-bench to GitHub to help developers determine how well AI agents perform machine learning (ML) engineering.
The company is sharing the benchmark code to "facilitate future research" into AI agents' capabilities on such tasks.
MLE-bench features a set of "challenging tasks" that test real-world ML engineering skills, such as training models, preparing data sets and running experiments.
To develop MLE-bench, OpenAI curated 75 ML engineering-related competitions from Kaggle.
OpenAI established human baselines for each competition, using Kaggle's publicly available leaderboards.
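As a rough illustration of how a leaderboard-derived baseline can be used (the file name, "score" column and higher-is-better assumption below are hypothetical, not taken from OpenAI's released code), the following sketch compares an agent's score against a public Kaggle leaderboard:

```python
# Illustrative sketch only: assumes a hypothetical exported leaderboard CSV
# with a "score" column where higher scores are better. MLE-bench's actual
# grading logic may differ per competition.
import csv


def fraction_of_humans_beaten(agent_score: float, leaderboard_csv: str) -> float:
    """Return the share of human leaderboard entries the agent's score exceeds."""
    with open(leaderboard_csv, newline="") as f:
        human_scores = [float(row["score"]) for row in csv.DictReader(f)]
    if not human_scores:
        raise ValueError("leaderboard is empty")
    beaten = sum(1 for s in human_scores if agent_score > s)
    return beaten / len(human_scores)


if __name__ == "__main__":
    # Hypothetical usage: an agent scoring 0.87 on a competition whose public
    # leaderboard was exported to example-competition-leaderboard.csv.
    pct = fraction_of_humans_beaten(0.87, "example-competition-leaderboard.csv")
    print(f"Agent beats {pct:.1%} of human leaderboard entries")
```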
In a companion 27-page paper on the benchmark, "MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering," a team of OpenAI researchers also addresses resource scaling for AI agents and the impact of contamination from pre-training.