The New York Times Wants ChatGPT Gone. Nice Try

Copyright murkiness is about the only sure thing that lies ahead

[Illustration: lit squares forming the letters GPT, with wires intertwined with the logo of The New York Times. Getty Images/IEEE Spectrum]

The battle between copyright holders and generative AI companies is heating up, and The New York Times is leading the charge.

The publication recently filed a lawsuit against OpenAI and Microsoft that claims copyright infringement, trademark dilution, and unfair competition. And it’s not pulling its punches. The suit seeks not just monetary compensation but also the destruction of all of the defendants’ LLMs and training data, as well as a halt to unlicensed training on the publication’s articles.

“When you have these big technology shifts, the law has to adjust,” says Cecilia Ziniti, a Silicon Valley attorney. “This case is so historic because The New York Times has millions and millions of words that are used for training. So, extrapolating out, now is the time this is going to be regulated and get looked at.”

The showdown between copyright holders and AI companies intensifies

The New York Times’s lawsuit against OpenAI and Microsoft is the latest in a string of complaints against generative AI companies. Getty Images filed suit against StabilityAI, creator of the image generation tool Stable Diffusion, in early 2023, and several music publishers filed suit against Anthropic, creator of the chatbot Claude, in October.

But The New York Times’s suit is notable for its scope. It accuses the defendants of “copying and using millions of The Times’s copyrighted [articles].” The claim is supported by 100 examples of ChatGPT reproducing near-exact copy from New York Times articles.

“Whenever you have a verbatim copy, that’s a replacement, and that’s going to be pretty colorable [plausible to the court],” says Ziniti. “The New York Times also has enough of a library, going back to 1851, that they can actually say some percentage of the training data was New York Times.”

[Image: A selection of documents from The New York Times v. OpenAI/Microsoft lawsuit, demonstrating that ChatGPT can reproduce sections of text that closely resemble articles published by The New York Times. The New York Times’s lawsuit against OpenAI and Microsoft provides examples of ChatGPT producing text similar to the publication’s articles. The New York Times]

Even so, the suit’s victory isn’t certain. Mike Masnick, founder and editor of the technology policy publication Techdirt, points out that prior cases, such as the Authors Guild lawsuit against Google Books, set a precedent that may protect the use of copyrighted data to train AI.

“I go back to the most important similar case, which is the Google Books case,” says Masnick. “[Google] scanned books in order to create a giant search engine of books. That was very much a commercial entity, for a commercial purpose...that involved scanning entire copyrighted books and building a massive index of all those works.” Google argued that Google Books was transformative fair use and prevailed.

And there’s yet another complication: recent agreements between OpenAI and other publishers, such as Axel Springer and the Associated Press. The exact terms of the deals are unknown, but a press release from OpenAI states that its deal with Axel Springer will help the publisher summarize “selected global news,” and that OpenAI will in turn obtain content from Axel Springer to use for future AI training.

“The fact that OpenAI made deals with others shows there is a market for this particular use for data,” says Ziniti. Masnick is more skeptical that these agreements will be a factor, but notes that “a judge can decide whatever they want.... It’s not very predictable.”

Prepare for years of copyright confusion

The lack of clarity on where the rights of copyright holders end, and those of companies creating generative AI begin, makes it impossible to know precisely where the law will settle. That’s not going to change overnight. Authors Guild, Inc. v. Google, Inc. took a decade to resolve.

Legislation can move more quickly. The European Parliament has reached a “provisional agreement” for an Artificial Intelligence Act, which includes restrictions on how training data can be obtained and used. But even this act is not yet law, and it’s unclear when the regulations it outlines will come into effect.

In the meantime, companies and organizations training AI face a potential minefield—and may want to keep an eye on the source of data used for training. “From an engineering point of view, when you add a new source to your dataset to train on, you keep track of that, figure out where you looked at it, note the terms of use for the website,” says Ziniti. She notes that training from sources that include a wide variety of data, such as Common Crawl, is more defensible than training on a narrow dataset.
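Ziniti’s record-keeping advice can be sketched in code. The snippet below is a minimal, hypothetical example of logging provenance for each training source: where it came from, when it was accessed, and what the terms of use were at the time. The field names and structure are illustrative assumptions, not an industry standard.

```python
# Hypothetical provenance log for training-data sources, as a sketch of the
# record-keeping described in the article. Fields are illustrative only.
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class SourceRecord:
    name: str           # human-readable dataset or site name
    url: str            # where the data was obtained
    date_accessed: str  # ISO date the download or crawl happened
    terms_of_use: str   # note on the site's stated terms at that time
    license: str        # license the data was obtained under, if known


def log_source(registry: list, record: SourceRecord) -> None:
    """Append a provenance record to an in-memory registry."""
    registry.append(asdict(record))


registry: list = []
log_source(registry, SourceRecord(
    name="Common Crawl (subset)",
    url="https://commoncrawl.org",
    date_accessed=date(2024, 1, 15).isoformat(),
    terms_of_use="Terms reviewed and archived on access date",
    license="Varies by underlying page; crawl metadata retained",
))

# Persisting the registry alongside the dataset keeps an audit trail.
print(json.dumps(registry, indent=2))
```

In practice such a registry would be written to durable storage next to the dataset itself, so that anyone later asked “where did this data come from?” has a documented answer.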

Developers accessing a third-party AI model (like GPT-4) to build an app or service have less reason to be concerned. Some major providers of AI models, including OpenAI, have offered to defend customers against copyright claims that result from using their models. End users probably won’t find themselves on the business end of a lawsuit, either, though the details matter. Purposefully prompting an AI tool to infringe on a copyrighted work isn’t a great idea.

Whenever it’s resolved, the impact of The New York Times’s suit will be significant. If the suit fails, those looking to train new AI models can proceed much as they have in the past, with far less concern about future lawsuits. If it succeeds, training won’t halt entirely, but licensing fees could push the cost of training beyond the means of all but the most well-funded companies.

Masnick says if training is deemed not subject to fair use protections, then only the big players will be able to afford it. “Suddenly you’ve wiped out a bunch of smaller players,” he says, “and potentially wiped out open-source AI models.”
