Since OpenAI released ChatGPT in November 2022, quickly followed by other generative artificial intelligence (GenAI) platforms from other providers, these tools have been almost constantly in the news. GenAI systems are algorithms that can create new content, such as text, images, video, audio, and even computer code, in response to prompts input by the user. These AI systems have already demonstrated their vast capability to produce new content, and, depending on one’s perspective, have the potential to aid or devastate human beings. For example, ChatGPT can respond in natural, human-like language to any question or request that a user poses. DALL-E can generate images in response to the user’s prompts. Thus, GenAI can help the creative and other industries by expanding their ability to produce new material quickly, increasing productivity and efficiency. But it can also be seen as an existential threat to them by making human creators largely obsolete.
Regardless of how one views GenAI, there is no denying that many legal issues attend it, one of the most important being intellectual property, specifically copyright. GenAI models, such as the large language model behind ChatGPT and the image generator DALL-E, are trained on vast amounts of written human language and textual and graphic data obtained by web scraping, feeding on materials including books, essays, articles, Wikipedia, images, and other web pages. For example, GPT-3, an OpenAI model that preceded ChatGPT, was reportedly trained using a massive 570 GB of data from various sources of content on the internet. It ingested approximately 300 billion words, and later iterations continue to evolve. This training allows the model to “understand” various topics and respond to human-input prompts in natural-sounding conversations that mimic human thinking and speech, producing new content generated based on the training data.
With GenAI, there are two points where copyright issues arise: at the point of input of the training data into the AI model and at the point of output of AI-generated content based on the training data.
As stated above, the massive training data is scraped from the internet (copied and fed into the AI software for “machine learning”) and includes many protected works, such as whole books. Using such material without permission of the copyright holders would arguably violate the exclusive rights of authors set forth in 17 U.S.C. § 106. Considering the vast amount of data scraped from the internet for training, it is hard to imagine how AI developers would contact each and every copyright holder to obtain permission or negotiate a license to use their copyrighted material. Instead, the AI developers assert that their use constitutes a fair use of such material. Fair use is enshrined in 17 U.S.C. § 107. It allows using another’s copyright-protected material without committing infringement under certain circumstances, based on consideration of the factors outlined in that section. Whether a given use in a particular case qualifies as fair use depends on applying these factors to the facts of that case. One factor is “the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes.” Serving nonprofit educational purposes would push the use toward a finding of fair use. Of course, this has to be balanced against findings on the remaining factors, as no single factor is dispositive.
ChatGPT is used in commerce. Even though the basic version is currently free, OpenAI, its developer, also offers a premium version, ChatGPT Plus, that one can subscribe to for a monthly fee. OpenAI is monetizing what has become one of the most talked-about technological advancements of recent years. Another factor to consider in a fair use determination is “the amount and substantiality of the portion used in relation to the copyrighted work as a whole.” When an entire book is ingested as training data, there is no need to question the substantiality of the portion used. This would push the use away from a finding of fair use.
In Thomson Reuters Enterprise Centre GmbH et al. v. ROSS Intelligence Inc., 20-cv-00613 (D. Del.), the parties filed cross-motions for summary judgment on the fair use defense. This case involves allegations that ROSS used proprietary content from Thomson’s Westlaw research database without authorization to train its GenAI legal research tool. ROSS asserts fair use, claiming that such use is for research purposes in creating its GenAI model. Further, ROSS claims that its taking of the headnotes and key numbers from the Westlaw database has no adverse effect on the marketability of Westlaw’s database. The effect on the market for the copyrighted material is another factor in a fair use determination.
Thomson counters and asserts that ROSS’s purpose is, instead, to create a competing commercial legal research tool and that its action adversely affects the market value of Westlaw subscription services. A use harming the market value of the copyrighted material would push the use away from fair use. The nature of the copyrighted work being used, creative or factual, is another factor. Many people will be watching the court’s ruling.
Aside from the Thomson case, which has identified parties and identified data, Tremblay v. OpenAI, Inc., case 3:23-cv-03223 (N.D. Cal.), filed on June 28, 2023, involves two named plaintiffs who are authors and are bringing a class action against OpenAI on behalf of themselves and “[A]ll persons or entities domiciled in the United States that own a United States copyright in any work that was used as training data for the OpenAI Language Models ---.” They claim that OpenAI infringed the copyrights in their registered books, and in those of the putative class of authors, when it used the books in the ChatGPT training data set without the authors’ authorization and without compensating them. The plaintiffs seek statutory and actual damages and attorneys’ fees under the Copyright Act, as well as a permanent injunction.
Another case asserting copyright infringement in training large language models is Silverman et al. v. OpenAI, Inc. et al., case 3:23-cv-03416 (N.D. Cal.), filed on July 7, 2023. This case received more mention in the news, perhaps due to the celebrity of one of the plaintiffs, Sarah Silverman, a writer and performer. The plaintiffs here also allege, on their own behalf and on behalf of “all others similarly situated,” copyright infringement on a mass scale by OpenAI in using their copyrighted books as a training data set.
With GenAI, copyright issues are present not only regarding the input data used for training but also regarding the content output by the GenAI. The product of GenAI in response to prompts could very well contain snippets or even larger portions of the data on which it was trained. If the training data was copyrighted, then portions of it reproduced or included in the AI’s output without the copyright holder’s authorization could constitute infringement. Further, even if the training data is not exactly reproduced, output containing modified versions of it can be considered a derivative work. Again, without permission of the author, this would violate the author’s exclusive right to create derivative works.
In late 2022, a couple of anonymous coders sued GitHub, Inc. on their own behalf and on behalf of “all others similarly situated” in relation to Copilot, an AI coding tool for software development. J. Doe1 and J. Doe2 v. GitHub, Inc., case 4:22-cv-06823-JST (N.D. Cal.). Copilot generates code by making suggestions in real time as the user writes code, rather like auto-completion. Plaintiffs asserted that Copilot output often contained copyrighted material from its training data, sometimes quoted verbatim, but did not comply with the terms of applicable licenses, including open-source licenses, under which users of GitHub had placed their lines of code on public repositories. Such noncompliance included failure to include a copyright notice or any attribution to the code's original author and failure to include the terms of the applicable licenses in the output. The defendant countered that the terms of the GitHub license granted GitHub the authority to store, use, and share the content of the public repositories with other users. The court denied the plaintiffs’ demands for damages but said they could continue pursuing injunctive relief against reproduction without proper attribution. This case was more a breach-of-contract case than a copyright infringement case. However, with GenAI, copyright issues are never far away.
Another aspect of copyright relating to the output of AI is whether any such output is eligible for copyright protection under U.S. copyright law. The Compendium of U.S. Copyright Office Practices states that the Office will register an original work of authorship “provided that the work was created by a human being.” It states further that “[B]ecause copyright law is limited to ‘original intellectual conceptions of the author,’ the Copyright Office will refuse to register a claim if it determines that a human being did not create the work.” In this spirit, the U.S. Copyright Office (USCO) rejected the registration of an image titled “A Recent Entrance to Paradise,” generated entirely by an AI system called the Creativity Machine, a creation of Stephen Thaler. Thaler filed a lawsuit against Shira Perlmutter in her official capacity as Register of Copyrights and Director of the USCO and subsequently filed a motion for summary judgment, asking the court to compel the USCO to set aside its refusal to register the work. The defendant filed a cross-motion for summary judgment. There being no facts in dispute, the only question facing the court was whether AI-generated works (created with no human input) are copyrightable as a matter of law.
On August 18, 2023, the U.S. District Court for the District of Columbia granted summary judgment in favor of the USCO and held that copyright protection was not available for a work created autonomously and entirely by an AI program because it lacked the human authorship necessary for copyright protection. The court did reserve, however, the question of whether a human AI user could hold copyright in a work produced by the AI if the user sufficiently contributed to its creation by providing detailed prompts and other post-production efforts.
Things became even more complicated with “Zarya of the Dawn,” a graphic novel created by an artist who used Midjourney (an image-generating AI) to generate images to accompany her text. The USCO initially granted registration for the full work, but when it became aware that AI had created part of the work, it revoked copyright protection for that part. The USCO determined that copyright protection was not available for the images produced by Midjourney because the process of their creation lacked sufficient human involvement. After inputting text prompts, the user of the AI had no control over how the creation occurred and could not predict what images the AI would produce in response to the prompt. However, the USCO did say that the artist’s original text and the particular arrangement of the text with selected images among those created by the AI are copyrightable because this involved sufficient human creative control. Accordingly, the USCO opined that only the expressive material the artist created is eligible for copyright registration.
Without a doubt, copyright in the age of GenAI is a fast-evolving area. In an effort to provide some guidance, the Copyright Office has issued a policy statement describing how it examines and registers works that contain AI-generated material and how it applies the requirement for human authorship when examining such works. The USCO also conducted several listening sessions in 2023, where participants were encouraged to express their hopes, concerns, and questions about GenAI and copyright law. Additionally, the USCO has been conducting webinars as part of its initiative to examine copyright law and policy issues raised by AI technology, specifically addressing the use of copyrighted materials in AI training and the scope of copyright protection in works generated using AI tools.
In the fast-changing world of GenAI, where coverage since ChatGPT’s debut has been nothing short of breathless, opinions differ as to how existing copyright law applies to the technology and how the law should develop to keep pace with its rapid advancements. One thing is clear: at present, the question of whether data scraped from the internet can legally be used to train AI models without the consent of the owners of copyright in the scraped material remains unanswered. So does the question of whether GenAI's output can qualify for copyright protection.
We await further guidance from the courts and the U.S. Copyright Office, and legislative direction from Congress.