
Former OpenAI Researcher Stirs Fair-Use Debate

By Chris Ehrlich
Does AI model training fall under fair use?

A researcher who worked for almost four years at OpenAI, including on ChatGPT, is questioning both how generative AI companies train their models on copyrighted data and the fair-use legal defense they rely on.

Suchir Balaji has drawn reactions online from some in the AI community with two pieces: his recent interview with The New York Times, which is suing OpenAI for copyright infringement, for the article "Former OpenAI Researcher Says the Company Broke Copyright Law," and his blog post "When Does Generative AI Qualify for Fair Use?"

Balaji started his career at OpenAI, where he worked from 2020 to 2024. He lives in San Francisco and holds a B.A. in computer science from UC Berkeley.

In his post, Balaji compares legal fair-use factors with AI model training practices.

He claims the process of training a generative model "involves making copies of copyrighted data."

"If these copies are unauthorized, this could potentially be considered copyright infringement, depending on whether or not the specific use of the model qualifies as 'fair use,'" Balaji says.

Balaji claims the training inputs for a model are "full copies of copyrighted data, so the 'amount used' is the entirety of the copyrighted work."

AI products can then "create substitutes that compete with the data they're trained on," Balaji says in an X post.

On the wave of content licensing agreements signed by GenAI companies, Balaji says in the blog post that "it’s unclear why these agreements would be signed if training on this data was fair use."

"Given the existence of a data licensing market, training on copyrighted data without a similar licensing agreement is also a type of market harm, because it deprives the copyright holder of a source of revenue," Balaji says.

From his perspective, Balaji closes by saying that "none" of the key legal factors seem to "weigh in favor of ChatGPT being a fair use of its training data."

OpenAI offers this response in the Times article:

“We build our AI models using publicly available data, in a manner protected by fair use and related principles and supported by longstanding and widely accepted legal precedents. We view this principle as fair to creators, necessary for innovators and critical for U.S. competitiveness.”

On X, most of the comments on Balaji's post support his position.

Most of the professionals sharing the Times article on LinkedIn are asking if AI model training is a fair-use practice, and they're siding with creators.

See more: AI Copyright Infringement Quandary: Generative AI on Trial

About the Author
Chris Ehrlich

Chris Ehrlich is the former editor in chief and a co-founder of VKTR. He's an award-winning journalist with over 20 years in content, covering AI, business and B2B technologies. His versatile reporting has appeared in over 20 media outlets. He's an author and holds a B.A. in English and political science from Denison University.

Main image: By Patrik Göthe.