In July of this year, comedian and author Sarah Silverman and similarly situated co-plaintiffs filed a class action complaint in the United States District Court for the Northern District of California against OpenAI and related entities. Silverman and her co-plaintiffs alleged that OpenAI infringed their copyrights under 17 U.S.C. § 106. The lawsuit quickly made headlines, as commentators noted that the case’s outcome may have far-reaching implications for the future of artificial intelligence.

OpenAI and ChatGPT

OpenAI is an artificial intelligence research laboratory founded by some of the most important names in tech and venture capital—including Sam Altman of the legendary start-up incubator Y Combinator, Peter Thiel, and Elon Musk. OpenAI opened shop as a nonprofit to ensure that “artificial general intelligence—highly autonomous systems that outperform humans at most economically valuable work—benefits all humanity.”

The Importance of Safety of AI Systems

By concentrating talent and cutting-edge research, OpenAI would leverage its influence and research to ensure the safety of AI systems. “To be effective at addressing AGI’s impact on society,” reads the OpenAI Charter, “OpenAI must be on the cutting edge of AI capabilities—policy and safety advocacy alone would be insufficient.”

Eight years after its founding, OpenAI has become one of the leading outfits developing artificial intelligence technology. In late 2022, OpenAI stunned the world with its release of ChatGPT. This interactive chatbot had been trained on extremely large datasets and plausibly resembled a human conversationalist or research assistant. While earlier iterations of the GPT technology had been impressive, ChatGPT became a cultural phenomenon and showed laypeople the potential power of artificial intelligence.

Sarah Silverman’s Lawsuit

To understand the claims made by Sarah Silverman and her co-plaintiffs, first, it is essential to understand how ChatGPT works. ChatGPT is one iteration of a larger suite of artificial intelligence systems called “large language models,” or LLMs.

How do LLMs work?

LLMs use extremely large datasets, made possible by the vast amount of information digitally stored on the internet. Millions of books, papers, and social media posts, representing a significant portion of human writing output, have been rendered legible to machines.

Using these large datasets, the LLMs are trained to predict the next word in a sequence. With advanced fine-tuning and training, this predictive power can resemble comprehension, understanding, and complex thought.
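To make "predicting the next word" concrete, here is a minimal sketch in Python. It is a toy word-count model, not how ChatGPT is built: real LLMs use neural networks trained on vastly more text. The function names and the one-sentence training text are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1
    return following


def predict_next_word(model, word):
    """Return the word most often seen after `word`, or None if unseen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Hypothetical toy training text; a real training set spans millions of documents.
toy_corpus = "the cat sat on the mat and the cat slept"
model = train_bigram_model(toy_corpus)
print(predict_next_word(model, "the"))  # prints "cat"
```

Even in this toy version, the model's predictions depend entirely on the text it was trained on, which is why the contents of a training set matter so much to the plaintiffs' claims.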

Because the datasets are so large, invariably, there is material that is copyrighted or otherwise protected. Indeed, Silverman and her co-plaintiffs allege that their copyrighted works were included in the training set.

What Are Shadow Libraries?

Collections of digitally rendered complete books can be extremely valuable. If a book was not originally published in digital format, it must be scanned page by page. Books are particularly useful for training LLMs because they represent long strings of comprehensible, complex information, which is perfect for training an LLM to mimic human thought and speech.

Books are not the only valuable source for AI training data. Scholarly papers, institutional research, blogs, journal articles, and essays serve the same purpose.

But these collections must come from somewhere. Often, the best sources are file-sharing websites that bypass traditional gatekeeping measures. These sites are to the written word what Napster was to music: unregulated and free, allowing access to books and articles for which one would normally have to pay. 

These online collections, which render freely accessible previously difficult-to-acquire digital books and articles, are called “shadow libraries.”

Library Genesis and Silverman’s Lawsuit

Content that would normally be protected by copyright or placed behind paywalls, expensive subscriptions, or institutional access is instead made widely available to the public. One such shadow library is Library Genesis (LibGen), one of the largest on the internet. Indeed, as of 2019, LibGen claimed that it hosted more than 4 million fiction and nonfiction titles.

While an individual may find LibGen useful, Silverman and her co-plaintiffs have alleged that OpenAI used LibGen—or a similar shadow library—to build its “Books2” dataset, estimated to have 300,000 unique titles. ChatGPT and its prior iterations, alleges Silverman, were trained on this “shadow library” material, including her own book and the copyrighted material belonging to other class members. 

LibGen’s Infringement on Copyrights

Consequently, Silverman alleges that the class of plaintiffs represented by her suit—all those who hold copyrights to materials used in OpenAI’s training dataset—were injured by OpenAI’s unauthorized use of their work.

As the plaintiffs note in their complaint, “[p]laintiffs never authorized OpenAI to make copies of their books, make derivative works, publicly display copies…or distribute copies…All those rights belong exclusively to Plaintiffs under copyright law.”

Silverman’s complaint offers no direct proof that OpenAI used LibGen or another shadow library to compile its dataset. Instead, the plaintiffs reason that OpenAI could only have built such a large training set by using shadow libraries, as these are the only sources for such a wide range, variety, and number of digitized titles.

Fair Use? OpenAI’s Possible Defenses

Whether Silverman is correct that OpenAI used her book to train its AI, and whether OpenAI in fact drew on shadow libraries for its datasets, is perhaps beside the point. Given their sheer size, OpenAI’s datasets almost certainly contain copyrighted material. A training set that used only non-copyrighted materials would be far less expansive and diverse and, therefore, less useful in training LLMs.

What can be considered fair use?

The question then becomes one of fair use. The Copyright Act allows for the “fair use” of copyrighted work for specific purposes, such as:

  • Criticizing;
  • Commenting;
  • Reporting News;
  • Teaching;
  • Scholarship; and/or
  • Research

The law specifies four factors a court should consider in determining whether a use is fair. The first is the purpose and character of the use. Specifically, courts look to whether the alleged infringer is using the copyrighted material commercially or for nonprofit educational purposes. The other factors are:

  • The nature of the copyrighted work;
  • The amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
  • The effect using the copyrighted work will have on the potential market for (or value of) the copyrighted work.

Fair Use and OpenAI

In this, OpenAI may have the upper hand. As the Supreme Court noted in a well-publicized copyright case in the term just ended, “the first fair use factor…considers the reasons for, and nature of, the copier’s use of an original work. The ‘central’ question it asks is ‘whether the new work merely “supersedes the objects” of the original creation…or instead adds something new, with a further purpose or different character.’”

Is the work transformative?

The first factor in fair use doctrine gets to the purposes of copyright in the first place: to protect original, creative work. A use of copyrighted material that results in a new and transformative work would not run afoul of the spirit of copyright law. If the new work could not easily substitute for the original, then its transformative quality furthers the goals of copyright in “enriching public knowledge.”

In their complaint, Silverman and her co-plaintiffs allege that OpenAI “made copies” of their books. Furthermore, since the language models “cannot function without the expressive information extracted from Plaintiffs’ works…the OpenAI Language Models are themselves infringing derivative works…”

Does the work resemble the original copyrighted work?

On the other hand, OpenAI’s product, the generative AI called “ChatGPT,” bears no resemblance to the copyrighted works. Indeed, unless one knew for certain that a specific book was in OpenAI’s training set, one could not tell, merely from interacting with ChatGPT, whether that book was used at all. ChatGPT is not a mere regurgitation of the texts on which it was trained. Instead, it is something new, emergent from its training on the datasets but not identical to them.

Does it further a purpose?

As Justice Sotomayor wrote in Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith, “the first fair use factor considers whether the use of a copyrighted work has a further purpose or different character, which is a matter of degree… if an original work and a secondary use share the same or highly similar purposes, and the secondary use is of a commercial nature, the first factor is likely to weigh against fair use…”

ChatGPT is not a copy of The Bedwetter, Silverman’s book, but a completely different product. In many derivative works that use copyrighted material, the copying is clear on its face. The same can hardly be said of ChatGPT and the works on which it was trained.

Future Litigation: All but Assured?

Silverman’s class action has opened the door for others. Copyright infringement is notoriously case-specific, and a defense verdict in one case may not dissuade future plaintiffs from filing their own causes of action. As with any infant industry, the spectacular advances in artificial intelligence will come with growing pains as courts, legislatures, and litigants attempt to fit our current statutory and legal framework to new realities.

Rulings in cases like Silverman’s will undoubtedly have a deep impact on the future of large language models. Indeed, this issue will likely make it to the Supreme Court before long. For now, all eyes are on the Northern District of California.

Questions?

If you have questions about your copyrighted work and potential infringement from AI, or if you are using AI to create your own work, contact us today for a free case evaluation.

About the Author
Patrick Ivy knows the goal of a contract in day-to-day operations is to achieve commercial objectives on time and on budget while also managing risk.