ChatGPT maker says 'it would be impossible' to train models without violating copyright

The New York Times in December filed a lawsuit alleging rampant copyright infringement against OpenAI and Microsoft.

Jan 9, 2024 - 03:30

In a Monday blog post, OpenAI — the company behind ChatGPT — published a lengthy response to the New York Times lawsuit filed against the company in late December.

The lawsuit alleges rampant copyright infringement in both the input and output of ChatGPT, which the Times argued represents a significant threat to its business.

OpenAI's position, however, is that it is already collaborating with other news organizations; copyright-infringing output is a "rare bug" and the company is working on reducing its frequency; training is fair use; and the Times is "not telling the full story."

At the core of the dispute between OpenAI and the Times are two different interpretations of the "fair use" doctrine, a component of copyright law that permits limited use of copyrighted work without the rights holder's permission.

The U.S. Copyright Office, which said in August it is undertaking a study of the law to better understand where generative AI fits in, declined to comment on the Times' lawsuit.

OpenAI's argument is that training its models on the internet at large is fair use.

"We view this principle as fair to creators, necessary for innovators and critical for U.S. competitiveness," the company said in a statement.

It is a view shared by many technologists, including computer scientist Andrew Ng, who recently said that, just as humans are allowed to learn from information on the internet, "AI should be allowed to do so, too."

If training on the open internet were treated as fair use, Ng said, "society will be better off." He did not elaborate on that point.

Melanie Mitchell (@MelMitchell1) wrote on January 8, 2024: "On the topic of AI training on copyrighted data, many people have echoed the argument made by Andrew Ng below. But it would be interesting to think about what copyright law would be like if humans had the ability to memorize entire books and recite them when prompted to do so."

But the issue is less about barring training on publicly available information than about requiring licenses for content that powers commercial models, which have so far generated enormous returns for investors.

OpenAI, which was founded in 2015, is now valued at a minimum of $86 billion and is reportedly in talks to raise funds at a valuation of $100 billion. Microsoft, its top investor, has a market cap of nearly $3 trillion and has poured $13 billion into OpenAI.

"The AI companies are working in a mental space where putting things into technology blenders is always okay," copyright expert and Cornell professor of digital and information law James Grimmelmann told TheStreet. "The media companies have never fully accepted that. They've always taken the view that 'if you're training or doing something with our works that generates value we should be entitled to part of it.'"

OpenAI: "It would be impossible" to train without violating copyright

OpenAI, according to the Daily Telegraph, submitted a statement to the House of Lords communications and digital committee explaining that, since copyright covers everything from blog posts to pictures and government documents, "it would be impossible to train today's leading AI models without using copyrighted materials."