The New York Times Case against OpenAI is Different. Here's Why.

McDonnell Boehnen Hulbert & Berghoff LLP
On December 27, 2023, The New York Times Company ("The Times") sued several OpenAI entities and their stakeholder Microsoft ("OpenAI") in the Southern District of New York for copyright infringement, vicarious copyright infringement, contributory copyright infringement, violation of the Digital Millennium Copyright Act (DMCA), unfair competition, and trademark dilution (complaint). Unlike other high-profile copyright actions brought against OpenAI (e.g., by the Authors Guild, Julian Sancton et al., Michael Chabon et al., Sarah Silverman et al., and Paul Tremblay and Mona Awad, et al.), The Times' allegations exhibit a remarkable degree of specificity. This will make it difficult for OpenAI to establish that (i) its generative AI models were not trained on The Times' copyrighted content, and (ii) it was engaging in fair use if and when it did so.

The complaint centers around OpenAI's large language model (LLM) chatbot, ChatGPT. As described by The Times:

An LLM works by predicting words that are likely to follow a given string of text based on the potentially billions of examples used to train it . . . . LLMs encode the information from the training corpus that they use to make these predictions as numbers called "parameters." There are approximately 1.76 trillion parameters in the GPT-4 LLM. The process of setting the values for an LLM's parameters is called "training." It involves storing encoded copies of the training works in computer memory, repeatedly passing them through the model with words masked out, and adjusting the parameters to minimize the difference between the masked-out words and the words that the model predicts to fill them in. After being trained on a general corpus, models may be further subject to "finetuning" by, for example, performing additional rounds of training using specific types of works to better mimic their content or style, or providing them with human feedback to reinforce desired or suppress undesired behaviors.

Once trained, LLMs may be provided with information specific to a use case or subject matter in order to "ground" their outputs. For example, an LLM may be asked to generate a text output based on specific external data, such as a document, provided as context. Using this method, Defendants' synthetic search applications: (1) receive an input, such as a question; (2) retrieve relevant documents related to the input prior to generating a response; (3) combine the original input with the retrieved documents in order to provide context; and (4) provide the combined data to an LLM, which generates a natural-language response.
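As a rough illustration of the four-step "grounding" flow the complaint describes, consider the Python sketch below. It is a minimal approximation for explanatory purposes only; the naive keyword-overlap retriever and the generic llm.generate() interface are assumptions made for the example, not a description of OpenAI's or Microsoft's actual systems.

# Minimal sketch of the four-step "grounded" generation flow described above.
# The keyword-overlap retriever and the llm.generate() interface are
# illustrative assumptions, not OpenAI's actual implementation.

def retrieve(question: str, documents: list[str], top_k: int = 3) -> list[str]:
    """Step 2: rank candidate documents by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def grounded_answer(question: str, documents: list[str], llm) -> str:
    # Step 1: receive an input, such as a question.
    # Step 2: retrieve documents relevant to that input.
    context_docs = retrieve(question, documents)
    # Step 3: combine the original input with the retrieved documents as context.
    prompt = ("Context:\n" + "\n\n".join(context_docs)
              + f"\n\nQuestion: {question}\nAnswer:")
    # Step 4: provide the combined data to an LLM, which generates the response.
    return llm.generate(prompt)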

Put another way, the parameters of an LLM like ChatGPT can be thought of as a compressed amalgam of its training data, represented in a way that preserves the wording, grammar, and semantic meaning of the original works. When queried, ChatGPT produces output consistent with this compressed representation.
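To make that description more concrete, the toy Python (PyTorch) sketch below trains a tiny next-word predictor on a single sentence, repeatedly adjusting its parameters to shrink the gap between predicted and actual words. The model, data, and scale are invented for illustration and bear no resemblance to GPT-4's actual architecture or training pipeline; the point is only that, after enough passes, the parameters come to encode the training text.

import torch
import torch.nn as nn

# Toy "training corpus" standing in for copyrighted articles (purely illustrative).
corpus = "the times sued openai for copyright infringement".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
tokens = torch.tensor([word_to_id[w] for w in corpus])

# A tiny next-word predictor: its parameters are the embedding and linear weights.
model = nn.Sequential(nn.Embedding(len(vocab), 16), nn.Linear(16, len(vocab)))
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):
    inputs, targets = tokens[:-1], tokens[1:]   # predict each word from the one before it
    logits = model(inputs)                      # the model's guesses for the next word
    loss = loss_fn(logits, targets)             # gap between guesses and the actual words
    optimizer.zero_grad()
    loss.backward()                             # compute how to adjust each parameter
    optimizer.step()                            # adjust parameters to shrink the gap

# After enough passes, the parameters effectively encode the training sentence:
# prompted with "the", the model now predicts "times", and so on.
print(vocab[model(tokens[:1]).argmax().item()])  # expected: "times"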

Based on publicly available information, The Times alleges that a relatively large portion of the content used to train various versions of GPT came from its website, amounting to an estimated millions of individual works. Further, and even more compellingly, The Times provides numerous samples of ChatGPT generating near-verbatim copies of its articles. One such example is reproduced below:

[Side-by-side comparison from the complaint, showing ChatGPT output alongside the corresponding text of an original Times article.]

This comparison is stunning. The Times alleges that it got ChatGPT to produce the output with "minimal prompting" but did not provide the specific prompt or series of prompts that it used to do so.[1] The output suggests that content emphasized in the training process can be represented in a nearly uncompressed fashion in the resulting model. Thus, even if it is hard to point to exactly where the "copy" of an article resides amongst the 1.76 trillion parameters, the existence of such a copy should not be in question.
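For readers wondering how such memorization might be demonstrated empirically (as distinct from how The Times actually prepared its exhibits, which the complaint does not detail), one simple approach is to prompt a model with the opening of an article and measure how much of the continuation matches the original verbatim. A minimal Python sketch of such a check, with the "llm" object standing in as a hypothetical interface:

from difflib import SequenceMatcher

def longest_verbatim_run(model_output: str, original_article: str) -> int:
    """Length (in characters) of the longest passage shared verbatim by both texts."""
    matcher = SequenceMatcher(None, model_output, original_article)
    match = matcher.find_longest_match(0, len(model_output), 0, len(original_article))
    return match.size

# Hypothetical usage: prompt a model with an article's opening words, then compare
# its continuation against the full original text. A shared run spanning most of
# the article would indicate near-verbatim regurgitation.
# output = llm.generate(article[:200])          # 'llm' is a stand-in, not a real API
# print(longest_verbatim_run(output, article))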

OpenAI responded publicly to the complaint in a January 8, 2024 blog post, stating that:

Memorization is a rare failure of the learning process that we are continually making progress on, but it's more common when particular content appears more than once in training data, like if pieces of it appear on lots of different public websites. So we have measures in place to limit inadvertent memorization and prevent regurgitation in model outputs. We also expect our users to act responsibly; intentionally manipulating our models to regurgitate is not an appropriate use of our technology and is against our terms of use.

Interestingly, the regurgitations The New York Times induced appear to be from years-old articles that have proliferated on multiple third-party websites. It seems they intentionally manipulated prompts, often including lengthy excerpts of articles, in order to get our model to regurgitate. Even when using such prompts, our models don't typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts.

This is a strange response. OpenAI is essentially admitting to copying The Times' articles in question, while making the non-legal arguments of "Hey, it was just a bug" and "The Times had to work hard to manipulate our model." Like saying "the dog ate my homework," neither of these excuses is likely to hold up under scrutiny.

Why is OpenAI seemingly shooting itself in the foot regarding actual copying? Because it is putting all of its eggs in the fair use basket.

Fair use is an affirmative defense written into the copyright statute that allows limited use of copyrighted material without permission from the copyright holder. It recognizes that rigid copyright laws can stifle dissemination of knowledge. Therefore, it attempts to balance copyright holders' interests in their creative works with the public's interest in the advancement of knowledge and education. Thus, the fair use doctrine acknowledges that not all uses of copyrighted material harm the copyright owner and that some uses can be beneficial to society at large.

Even so, OpenAI has a long and uncertain road ahead of it. Fair use is a notoriously malleable four-factor test that can be applied inconsistently from court to court. Furthermore, the interpretive contours of the test have evolved since its first appearance in the statute almost 50 years ago. Even the U.S. Copyright Office admits that "[fair use] fact patterns and the legal application have evolved over time . . . ."[2]

Predicting the outcome of a fair use dispute is often a fool's errand, even for those well-versed in copyright law. For example, the Supreme Court recently found fair use in the copying of 11,500 lines of computer code but not in the artistic reproduction of a photograph.[3] The outcome of a case can ride on which fair use factors the judge or judges find to be most relevant to the facts of the case and how they interpret these factors.

Fair use might not be a legal sniff test, but it comes close. Nonetheless, let's take a look at each of the factors in order to understand the difficulties that OpenAI might run into when relying on this defense.

(1) The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes.

Courts often view unlicensed copying for nonprofit education or noncommercial purposes as more likely to be fair use than those that are for commercial gain. In doing so, courts look to whether the use is transformative, in that it changes the original work in some manner, adding new expression or meaning, and does not just replace the original use.

OpenAI runs a for-profit business and charges for end-user access to its models. Further, the examples provided by The Times are much closer to verbatim copying than any type of transformative use. Therefore, this factor weighs against OpenAI.

(2) The nature of the copyrighted work.

This factor examines how closely the use of the work aligns with copyright's goal of promoting creativity. So, using something that requires a lot of creativity, like a book, film, or music, might not strongly back up a fair use claim compared to using something based on facts, like a technical paper or a news report.

Here, OpenAI has an angle, as The Times produces a great deal of news reporting and cannot claim a copyright over basic facts. However, The Times' content includes many detailed articles explaining events and other facts in its writers' ostensibly creative voices. Moreover, investigative reporting is the uncovering and tying together of facts, which requires creative effort. At best, this factor is neutral for OpenAI.

(3) The amount and substantiality of the portion used in relation to the copyrighted work as a whole.

In considering this factor, courts examine how much and what part of the copyrighted work is used. If a significant portion is used, it is less likely to be seen as fair use. But using a smaller piece makes fair use more probable. Copying even a small portion of a work might not qualify as fair use if it includes a critical or central part thereof.

This factor also weighs against OpenAI if we take as given The Times' allegations and evidence of almost-exact reproduction of its works.

(4) The effect of the use upon the potential market for or value of the copyrighted work.

This fourth factor may end up being the most important. The inquiry is whether the unauthorized use negatively affects the market for the copyright owner's original work. Courts look at whether the use decreases sales relating to the original work or has the potential to cause significant damage to its market if such use were to become common.

OpenAI will have a tough time establishing that it is not effectively free-riding off of The Times' investment in journalism, especially since GPT-4 is being integrated into its minority owner Microsoft's Bing search engine. Once this integration matures, Bing will generate answers to search queries, and might not even link back to the websites (like that of The Times) from which it gleaned the underlying information used to formulate its answers. This could be a devastating blow to The Times' revenue, as the company relies on subscriptions that give users unlimited access to paywalled articles going back decades, as well as on advertising to those users.

To reiterate, fair use analyses are unpredictable. Judges can place virtually all of their emphasis on as little as one factor. Still, it is hard to imagine a scenario in which OpenAI wins a fair use dispute if the facts cited in the complaint hold up. A more likely result is that The Times and OpenAI quietly settle before such a decision is made.

[1] Trying to get the current versions of ChatGPT to produce this or any article from The Times is quite difficult and may not be possible. This may be due to OpenAI recently putting in place guardrails that prevent the model from producing near-verbatim output.

[2] https://www.copyright.gov/fair-use/

[3] See Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183 (2021) and Andy Warhol Found. for the Visual Arts, Inc. v. Goldsmith, 143 S. Ct. 1258 (2023).


DISCLAIMER: Because of the generality of this update, the information provided herein may not be applicable in all situations and should not be acted upon without specific legal advice based on particular situations.

© McDonnell Boehnen Hulbert & Berghoff LLP | Attorney Advertising
