November 13, 2023

An Analogy for the Current Wave of AI Copyright Lawsuits

McDonnell Boehnen Hulbert & Berghoff LLP

We are at the beginning of what promises to be a wave (potentially a tsunami) of complaints filed against the companies behind generative AI models (e.g., OpenAI). Recent lawsuits from Paul Tremblay and Mona Awad (Tremblay and Awad v. OpenAI Inc. et al. -- Northern District of California, No. 3:23-cv-03223), Sarah Silverman (Silverman v. OpenAI, Inc. -- Northern District of California, No. 3:23-cv-03416-AMO), and the Authors Guild (Authors Guild et al v. OpenAI Inc. et al. -- Southern District of New York, No. 1:23-cv-8292)[1] contend that OpenAI and others have hoovered up thousands of copyrighted publications, including those of the named plaintiffs, and used them to train large language models (LLMs) such as GPT-4. As these initial cases proceed, and possibly go up on appeal, they are likely to define the contours of how copyright law applies to the new world of generative AI and whether it is proper to train such models on copyrighted works without permission to do so.

The authors' theories of infringement vary, as do their ancillary claims. While acknowledging the risk of oversimplifying complex issues, we can boil the merits of these cases down to two main questions:

1) Is the ingestion of a copyrighted work into the training process of an LLM without the author's permission an infringement of the copyright?

2) What if an LLM trained in this fashion produces a new work that is substantially similar to the copyrighted work?

These questions can be thought of in terms of pigs and sausage.[2] Pigs can be turned into sausage, but it is generally accepted to be impossible to turn sausage back into a pig. Mathematicians would consider the transformation from pig to sausage to be an irreversible one-way function.

It is important to understand that all computer data are just organized collections of numbers. This includes digital copies of books, images, audio, video, web sites, etc. When a machine learning model such as an LLM is trained on a digital book, the arrangement of numbers representing the words, punctuation, front matter, and so on is transformed into a different arrangement of numbers -- weights in a complex set of neural networks.

In most cases, there is no one-to-one mapping between the numbers used before and after transformation. One cannot point to a particular set of numbers in an LLM and identify a Game of Thrones novel. Indeed, the weights in an LLM are a complex amalgam of most or all data on which it was trained. Even the entities that design and build LLMs have yet to provide an understanding of what the weights actually represent.
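To make the "no one-to-one mapping" point concrete, here is a deliberately tiny sketch in Python. It is not how an LLM is actually trained: the "model" is only a table of adjacent-byte frequencies, and the function names (text_to_numbers, train_toy_model) are our own illustrative inventions. It shows just two things from the discussion above: text is stored as numbers, and a training-like transformation can map different inputs to identical "weights," so the input cannot be recovered from the weights alone.

```python
def text_to_numbers(text: str) -> list[int]:
    """A digital text is just a sequence of numbers (here, its UTF-8 bytes)."""
    return list(text.encode("utf-8"))


def train_toy_model(corpus: list[str]) -> dict[tuple[int, int], float]:
    """Toy stand-in for training (an assumption, not a real LLM procedure).

    The 'weights' are the relative frequencies of adjacent number pairs,
    pooled across every document in the corpus.
    """
    counts: dict[tuple[int, int], int] = {}
    total = 0
    for document in corpus:
        numbers = text_to_numbers(document)
        for pair in zip(numbers, numbers[1:]):
            counts[pair] = counts.get(pair, 0) + 1
            total += 1
    return {pair: count / total for pair, count in counts.items()}


# Two different "books" produce exactly the same weights, so the weights
# alone cannot say which book the toy model was trained on.
weights_a = train_toy_model(["abacab"])
weights_b = train_toy_model(["acabab"])
print(weights_a == weights_b)  # True
```

Real neural-network weights are vastly more complex than this table, and, as discussed below, a large model may still retain enough to reproduce portions of a work. The sketch only illustrates why one cannot point to a particular set of weights and identify a particular book.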

So this leads to a likely answer to the first question. A similar set of facts was considered by the Second Circuit in Authors Guild, Inc. v. Google, Inc., in the context of using copyrighted books for search purposes. The Court ultimately ruled that the conversion of the copyrighted content into a form useful for searching was highly transformative, displaying small portions of the books was fair use, and such search and display did not provide a significant market substitute for the original works. Therefore, the mere use of a copyrighted work to train an LLM, even without permission, is unlikely to be a winning fact pattern.

But the emergent magic of LLMs is that they might know enough about an ingested Game of Thrones novel to be able to produce its plot summary, a list of main characters, and even quote a section or two.[3] These uses might also fall under the Second Circuit's definition of fair use.

But an LLM may be able to produce significant portions of the work or the work as a whole.[4] Or, the LLM may be able to generate alternative endings to the novel, new works in the style of the author, or new works involving the same characters and relying on the author's world-building.

Thus, the answer to the second question is not clear, though it seems that, to infringe, the LLM's output would have to reproduce "more than just a little" of the copyrighted work. For example, copyright famously protects actual works and not styles. This issue may boil down to whether an LLM can reverse the transformation function and turn sausage back into a reasonable semblance of a pig, as well as whether an LLM operator can successfully prevent it from doing so.

As noted, the cases currently being litigated may provide some clarity -- or, depending on how they proceed, maybe not. Also, Congress may step in and define new causes of action that specifically target LLMs and similar fact patterns.

Authors may ultimately have their strongest positions where they can argue that the operator of the LLM is unjustly enriching itself on the backs of the authors' labor or effectively competing in the same marketplace as the authors. At first blush it seems that imaging tools based on generative AI (e.g., DALL-E, Midjourney, and others), the use of which can eliminate the need for human illustrators, might be a better target for such claims.

[1] Here, the group of authors named in the complaint includes John Grisham, George R. R. Martin, Jodi Picoult, and Scott Turow.

[2] Vegans should feel free to replace "pigs" with "plant-based protein."

[3] OpenAI appears to be aware of the issues that this capability might raise. If you ask ChatGPT 4 to "provide a Jon Snow quote from Game of Thrones," it falls back on a Bing search to do so.

[4] This is theoretically possible, though OpenAI and others have put guardrails in place in attempts to prevent their models from such blatant infringement.

DISCLAIMER: Because of the generality of this update, the information provided herein may not be applicable in all situations and should not be acted upon without specific legal advice based on particular situations.

Written by:

McDonnell Boehnen Hulbert & Berghoff LLP

Michael Borella Ph.D.
