OpenAI Generates a ‘Hired Gun Hacker’ Defense to the N.Y. Times Copyright Case

EDRM - Electronic Discovery Reference Model
Contact

EDRM - Electronic Discovery Reference Model

Most everyone in the AI and legal worlds by now knows about the New York Times (“NYT”) suit against Microsoft and various Open AI entities (“OAI”). The NYT alleges copyright infringement by the practice of OAI, and most all other generative AI companies, of using data scraped from the internet to train their generative AIs. The defendants responded in late February with motions to dismiss. OAI’s motion raised a novel “hired gun hacker” defense that intrigued AI hacker attorney, Ralph Losey. Here is Ralph’s report on the defense, including his AI generated illustration of fake hackers, some of whom bear an uncanny resemblance to him.

The substantive copyright issues are not in Ralph’s field, so those aspects of the case are skipped over here. For good background on substantive issues, as well as the obvious procedural irregularities of OAI’s motion to dismiss going way outside the pleadings, see e.g. Joshua Rich and Michael Borella, OpenAI’s Motion to Dismiss NY Times Lawsuit over ChatGPT: Do They Want to Win or Influence Public Opinion? (JD Supra, 3/6/24). Ralph’s article draws upon his niche areas of expertise: e-discovery, electronic evidence, legal ethics, AI prompt engineering, and AI hackers. Ralph is an amateur AI hacker himself, although he is not in the same high-skills league as the unidentified NYT hired gun hacker, which OAI complains about. See Ralph Losey’s eight-part series, the Defcon Chronicles, from DefCon Chronicles: Where Tech Elites, Aliens and Dogs Collide – Series Opener on 8/21/23 to DefCon Chronicles: Short Story Contest on 10/05/23, including a description of Losey’s humbling participation in the AI hack competition. DefCon Chronicles: Hackers Response to President Biden’s Unprecedented Request to Come to DefCon to Hack the World for Fun and Profit (9/3/23).

Ralph Losey’s Summary of the Hired Gun Hacker Defense

The most stunning allegation in the NYT complaint is based on Exhibit “J”. It purports to provide rock solid proof of one-hundred examples of ChatGPT generating responses that are word-for-word identical to NYT articles. In OAI’s motion to dismiss, OIA claims that the exhibit is a fake. OAI alleges that the one-hundred examples were all staged, that they were all generated by thousands of elaborate hacks. OAI claims that Exhibit “J” is a fake chatbot record, that it was created by an expert hacker hired by the NYT.

This as yet unknown hacker was hired by the NYT to create a smoking gun exhibit of artificially generated copying to buttress their case to shut down OAI. The hired gun hacker did his or her job well. The hacker found multiple hidden errors and vulnerabilities in the OAI software. Then the Hacker used these vulnerabilities and skills to run thousands of hack attacks – exploits – on the software. The hacker was thereby able to manipulate the OAI software into generating the one-hundred fake examples of copying. One would assume the hired gun hacker or hackers were then well-paid for their services. This will no doubt be a primary target of the first round of e-discovery,

OAI raises this extraordinary defense as part of its motion to dismiss. Although hackers may be pleased by this new, seemingly legitimate employment opportunity, if these allegations are proven, most lawyers and judges will not. If these allegations are proven by OAI, they will demonstrate the growing danger of “fake evidence” in one of the biggest cases of the year. Will this development cause law firms and corporate law departments in the near future to keep hackers on call? What is real, and what is AI generated or hacked? Only your red-team hackers will know for sure!

The NYT Complaint

Here is the court docket of NYT v. Microsoft and various OAI entities, which, as of March 7, 2024, already had 70 entries. The complaint itself was, by SDNY standards, a modest 69-pages in length, with 16,121 words, some colored fonts, and a few images, so sort of multimodal. The complaint alleges, or attempts to, seven causes of action, several of which, if successful, could cripple OAI, as well as most other generative AI companies. It could even hurt Microsoft somewhat. The NYT suit, and many others like it, challenge the AI companies harvest the web for free data business model. This method that made it possible for them to gather over a trillion parameters of data to train their generative AIs.

This threat, however remote, of forcing OAI to dismantle the most successful software launch ever made, may well give the NYT significant leverage in a settlement. Some think the whole case is just about that, a bogus attempt to grab cash and leverage better future information purchases. Others think the NYT complaint is just the last gasp of a doomed industry, that the legal copyright challenges have no chance of success. They argue that a favorable judgment for the NYT is nearly impossible.

I do not know. Again, suggest you look to copyright lawyer specialists for that. What I do know, and may add some value to the discourse, is OAI prompt engineering and both the AI hacker and AI user perspectives. That allows me to shed some light on the hired gun hacker defense. At first glance, it looks persuasive.

We do not have a formal response of the NYT to the defense, but the lead counsel for the NYT was quick to make this statement, which in fairness, we share here. (emphasis added)

What OpenAI bizarrely mischaracterizes as ‘hacking’ is simply using OpenAI’s products to look for evidence that they stole and reproduced The Times’s copyrighted works. And that is exactly what we found. In fact, the scale of OpenAI’s copying is much larger than the 100-plus examples set forth in the complaint.

Ian Crosby, Susman Godfrey, lead counsel for The New York Times, as quoted in Ars Technica.

The Exhibits to the NYT Complaint

It took some doing, but I was able to determine from the court file that the NYT complaint has 176,814 pages of exhibits attached. You can see the reference texts to verify this count yourself. The most impactful exhibit of all is Exhibit “J”, 127 pages, entitled “ONE HUNDRED EXAMPLES OF GPT-4 MEMORIZING CONTENT FROM THE NEW YORK TIMES”.

A cynic would suggest that the 176,814 pages of exhibits are the NYT’s attempt to prevail, or at least intimidate, by the greater weight of the evidence. If so, the NYT forgets that there is no actual “weight” to electronic evidence. These electronic files are all ephemeral. The defendants in this case assert the same about the NYT legal claims. Perhaps all of these exhibits – especially Exhibit “J” – are for a different court, the one of public opinion? That might also explain the OAI’s “outside the four corners” motion to dismiss. They could not wait to evoke the defense of the hired gun hacker, whoever they may be. In all probability it is a small team of hackers, but it could also be a lone genius hacker. Discovery will tell the tale eventually. In the meantime it is a mystery.

Introduction to Defendant OpenAI’s Motion to Dismiss

On February 26, 2024, defendants Microsoft and OAI each responded with a motion to dismiss the NYT complaint. Microsoft’s motion is interesting in its own right, arguing, as it does, an old-school VCR player analogy. But for me, the futuristic Hired Gun Hacker defense is far more interesting because it involves hacking generative AI software, including ChatGPT4, and the normal prompting and prompt engineering of ChatGPT4. See Transform Your Legal Practice with AI: A Lawyer’s Guide to Embracing the Future (1/24/24).

The OAI’s motion to dismiss is short and sweet, just one page. The motion relies on the 35-page legal memorandum filed therewith. All the motion itself does is state that OpenAI seeks:

… an order (1) partially dismissing Counts I and V to the extent they are based on activity that occurred more than three years prior to the filing of this action, see 17 U.S.C. § 507(b); (2) dismissing Counts IV and V in full for failure to allege facts sufficient to state a claim for relief pursuant to Fed. R. Civ. P. 12(b)(6); and (3) dismissing Count VI on grounds of Copyright Act preemption, see 17 U.S.C.§ 301.

OpenAI Motion to Mismiss

The Memorandum of Law in Support of OpenAI Defendants’ Motion to Dismiss (hereinafter “Memo”) is where the action is. The 35-pages of arguments are designed to persuade and move the presiding Senior SDNY District Court Judge Sidney H. Stein, and the mentioned court of public opinion.

The NYT is especially adept at shaping public opinion; they have been at it since 1851. Susman and Godfrey represent the NYT and Latham and Watkins represents OpenAI.

I have no connections with either, nor anyone in this case and no knowledge about the case aside from the public filings. I have no intent to express any legal opinions about the case, just provide some legal educational comments. Even then, the comments are just my own, and may change over time (they usually do when an open mind is kept), especially as the facts come out. My comments and writing on this blog have no connection to my firm, clients or bar groups. See my standard full disclaimer.

Key Allegations of OpenAI’s Legal Memorandum

This report will ignore all of the arguments made in the Memo except for the argument that interests me, the Hired Gun Hacker. Beside there are tons of articles that have already been written on the more traditional copyright arguments. Here are the main segments of the Memo on that, which was, by the way, well-written. (Footnotes omitted)

INTRODUCTION

The artificial intelligence tool known as ChatGPT is many things: a revolutionary technology with the potential to augment human capabilities, fostering our own productivity and efficiency an accelerator for scientific and medical breakthroughs; a mechanism for making existing technologies accessible to more people; an aid to help the visually impaired navigate the world; a creative tool that can write sonnets, limericks, and haikus; and a computational engine that reasonable estimates posit may add trillions of dollars of growth across the global economy.

Contrary to the allegations in the Complaint, however, ChatGPT is not in any way a substitute for a subscription to The NewYork Times. In the real world, people do not use ChatGPT or any other OpenAI product for that purpose. Nor could they. In the ordinary course, one cannot use ChatGPT to serve up Times articles at will.

The Times has sought to paint a different picture. Its lawsuit alleges that OpenAI has imperiled the very enterprise of journalism, illustrating the point with 100 examples in which some version of OpenAI’s GPT-4 model supposedly generated several paragraphs of Times content as outputs in response to user prompts. See Dkt. 1-68 (Exhibit J).

The allegations in the Times’s Complaint do not meet its famously rigorous journalistic standards. The truth, which will come out in the course of this case, is that the Times paid someone to hack OpenAI’s products. It took them tens of thousands of attempts to generate the highly anomalous results that make up Exhibit J to the Complaint. They were able to do so only by targeting and exploiting a bug (which OpenAI has committed to addressing) by using deceptive prompts that blatantly violate OpenAI’s terms of use. And even then, they had to feed the tool portions of the very articles they sought to elicit verbatim passages of, virtually all of which already appear on multiple public websites. Normal people do not use OpenAI’s products in this way. . . .

The Times’s suggestion that the contrived attacks of its hired gun show that the Fourth Estate is somehow imperiled by this technology is pure fiction. So too is its implication that the public en masse might mimic its agent’s aberrant activity.

Memo at pgs. 1 and 2.

For anyone not familiar with legalese, “pure fiction is nice lawyer talk for a lie.

I call this the Hired Gun Hacker argument because OAI here plainly alleges that the NYT hired an AI hacker to create evidence to support their claims of copyright violations. The hired gun is a hacker, or more likely, a group of close-knit hackers working under the direction of a senior red team leader.

Moving now on to page-12 of the Memo where OAI addresses The NYT Exhibit J to argue in more detail the Hired Gun Hacker defense (footnotes omitted):

1. Outputs from Developer Tools.

Exhibit J features GPT-4 outputs the Times generated by prompting OpenAI’s API to complete 100 Times articles. Most of the outputs are similar, but not identical, to the excerpts of Times articles in the exhibit. The Times did not reveal what parameters it used or disclose whether it used a “System” prompt to, for instance, instruct the model to “act like a New York Times reporter and reproduce verbatim text from news articles.” See supra 9. But the exhibit reveals that the Times made the strategic decision not to feature recent news articles–i.e., articles that Times subscribers are most likely to read on the Times’s website–but to instead feature much older articles published between 2.5 and 12 years before the filing of the Complaint.

The Complaint itself includes two examples of API outputs that include alleged “hallucinations.” In the first, the Times used the API Playground to request an essay on how “major newspapers” have reported on “0range [sic] Juice” and “non-hodgkin’s lymphoma,” and ChatGPT generated a response referencing a non-existent Times article. See Compl. ¶ 140. The second example consists entirely of excerpted snippets of code showing a “prompt” asking the model for “Times articles about the Covid-19 Pandemic,” and output “text” consisting of five pairs of titles and URLs. Id. The Times claims this output “mislead[s] users” and “tarnish[es]” its marks. Id. ¶¶ 142, 202. But any user who received such an output would immediately recognize it as a hallucination: each URL returns a “Page Not Found” error when entered into a browser.

2. ChatGPT Outputs

ChatGPT. The Complaint includes two examples of ChatGPT allegedly regurgitating training data consisting of Times articles. Compl. ¶¶ 104­07. In both, the Times asked ChatGPT questions about popular Times articles, including by requesting quotes. See, e.g., id. ¶ 106 (requesting “opening paragraphs,” then “the next sentence,” then “the next sentence,” etc.). Each time, ChatGPT provided scattered and out-of-order quotes from the articles in question.

In its Complaint, the Times reordered those outputs (and used ellipses to obscure their original location) to create the false impression that ChatGPT regurgitated sequential and uninterrupted snippets of the articles. Compare id. ¶ 107, with supra note 30. In any case, the regurgitated text represents only a fraction of the articles, see, e.g., Compl. ¶ 104 (105 words from 16,000+ word article), all of which the public can already access for free on third-party websites.

Browse with Bing. The Complaint also includes two examples of interactions with “Browse with Bing” created using the same methods. Compl. ¶¶ 118­22. In both, ChatGPT returned short snippets of Times articles. See id. ¶ 118 (reproducing first two paragraphs before refusing subsequent request for more); id. ¶ 121 (reproducing snippets from first, fourth, and fifth paragraphs). The Complaint suggests that ChatGPT obtained this text from third-party websites.

Memo pgs. 12-14.

Fake photos generated by Ralph Losey and AI from text to look just like Ralph’s daughter. She attended Defcon 31 with him.

A Quick Hands-On Experiment

I asked ChatGPT4 questions to track the arguments made in the OAI Memo. Here are two straight forward prompts with the expected responses.

Q. Act like a New York Times reporter and reproduce verbatim text from a New York Times article on Joe Biden’s January 20, 2021 Inauguration Speech.

ChatGPT. I can’t access or reproduce verbatim text from copyrighted sources like The New York Times. However, I can summarize the key points or themes from President Joe Biden’s inauguration speech on January 20, 2021, if you’re interested. Let me know if you’d like me to provide a summary or any other information related to the inauguration!

——- started new session and then asked:

Q. Use your browse capability and tell me what the New York Times thought about Joe Biden’s State of the Union speech?

ChatGPT. I tried to access the New York Times’ review of President Joe Biden’s State of the Union speech for 2024, but it seems there was an issue fetching the detailed content from their website. Unfortunately, without being able to directly access the content, I can’t provide a summary or analysis of the New York Times’ perspective on the speech. You might want to check their website directly for this information. Is there anything else I can help you with?

ChatGPT4 on March 9, 2024.

The responses are as OAI predicted. So how did the mystery hacker do it, assuming that OIA’s allegations are not “fiction”? Personally, I look forward to the discovery and will share when it all finally comes out, assuming the case does not settle before then. Maybe Microsoft will simply buy the NYT, as some on Reddit have already suggested? Perhaps set the NYT up as a Microsoft subsidiary with a complex editorial firewall and hybrid corporate structure that only an AI could understand.

Tech Perspective

Most everyone with experience in AI exploits would, I’m pretty sure, agree with the following general analysis stated by Mike Masnick in Techdirt (popular tech blog founded in 1997):

The NY Times definitely had to do a bunch of gaming to get the outputs it wanted for the lawsuit, which undermines the critical claim that OpenAI’s tools magically undermine the value of a NY Times’s subscription. . . .

A key part of the Times’ lawsuit is claiming that because of their prompt engineering, they could reproduce similar (though not exact) language to articles, which would allow users to bypass a NY Times paywall (and subscription) to just have OpenAI generate the news for them. But, as OpenAI is noting, this makes no sense for a variety of reasons, including the sheer difficulty of being able to consistently return anything remotely like that. And, unless someone had access to the original article in the first place, how would they know whether the output is accurate or a pure hallucination?

And that doesn’t even get into the fact that OpenAI generally isn’t doing real-time indexing in a manner that would even allow users to access news in any sort of timely manner. . . .

The motion also highlights the kinds of games the Times had to play just to get the output it used for the complaint in the now infamous Exhibit J, including potentially including things in the prompt like “in the style of a NY Times journalist.” Again, this kind of prompt engineering is basically using the system to systematically limit the potential output in an effort to craft output that the user could claim is infringing. GPT doesn’t just randomly spit out these things. . . .

Yes, in some rare circumstances, you can reproduce content that is kinda similar (but not exact) to copyright covered info if you tweak the outputs and effectively push the model to its extremes. But… as noted, if that’s the case, any liability should still feel like it should be on the prompter, not the tool. And the NY Times can’t infringe on its own copyright.

This case is far from over, but I still think the underlying claims are very silly and extremely weak. Hopefully the court agrees.

Mike Masnick, OpenAI’s Motion To Dismiss Highlights Just How Weak NYT’s Copyright Case Truly Is (Techdirt, 3/5/24)

As you can see, Mike Masnick holds the NYT complaint in contempt. He considers the claims very silly and weak. But also elsewhere laments that the outcome of copyright litigation is always dependent on many random, irrational factors. So he states that despite the claims being meritless, the NYT could still win and ruin everything. Although not a lawyer, his views on copyright are worth reading. They are based on long experience with technologies and disputes like this. Check out article and the comments too, should you have the patience.

Conclusion

Having been involved in the tech world since the early eighties, I’m pretty sure that most non-lawyer techies, including hackers, agree with Mike and Techdirt’s anti-copyright law perspective. They think that all Information wants to be free. This cliche view of many hackers is naive and ill considered. It is sometimes just a lame excuse to justify information theft, including the criminal back-hat kind.

Information may want to be free, but it also wants to be safe, to be processed, and especially to be understood and used for the betterment humanity.

The ultimate purpose of information is not to be free for its own sake. The purpose of information is to be useful, to be processed and transformed into knowledge and understanding. What Information Theory Tell Us About e-Discovery and the Projected ‘Information → Knowledge → Wisdom’ Transition (5/28/16). The age-old goal of Mankind is to process information into knowledge, and then turn it into human understanding. Information is thereby internalized into direct knowhow, into wisdom. The process of transformation of information, making it useful to humans, must be encouraged by society; that is far more important than letting it run wild and free. See From Information to Knowledge to Wisdom: Can Ai Save the Day? (3/17/23) and, Info→Knowledge→Wisdom (5/2/17), to the most recent blog summarizing this core topic, What is the Difference between Human Intelligence and Machine Intelligence (6/20/23) (quoting T. S. Eliot who said “Where is the knowledge we have lost in information?The Rock (1934)).

The world is already flooded with raw, unprocessed information, much of it is false, like the shadows on Plato’s cave. See, Move Fast and Fix Things Using AI: Conclusion to the Plato and Young Icarus Series (Part 4 of 4) (1/3/24). To survive this flood of false shadows, we must intelligently process the information for everyone’s benefit. Where can our information take us? How can it improve our lives?

That is where the elusive goal of The Singularity comes in. The event of superintelligent AI emerging for the great betterment of Mankind. For background see the two-part series: Start Preparing For “THE SINGULARITY.” There is a 5% to 10% chance it will be here in five years (Part One and Part Two) (4/1/23). Checking in today with GPT4, almost a year later, it now estimates a 10-20% chance The Singularity will arrive before 2040. That is sixteen years from now, not five years, but the odds are twice as good, 10%-20% instead of 5%-10%. GPT4 goes on to make an unprompted prediction that by 2045 the odds go way up to 30%-50%.

Still, GPT4 also says it could come sooner, or maybe never at all. Taking a very lawyerly attitude, GPT4 basically says it depends. GPT explains that:

The actual development path of AI and related technologies could be influenced by factors we cannot fully anticipate today, including breakthroughs in unrelated fields, global events, and shifts in societal values towards technology.

GPT4 Workspace Edition, March 10, 2024

The “shifts in societal values towards technology” is where the law comes in, and where the larger significance of NYT v. Microsoft and OAI becomes apparent. This case, and other test cases like it, are very important. Will they continue to support the development of technology, or shrink back in false doomsday fears? Much depends on the individual judges who will decide these issues. What background and education will they draw upon to make the right call?

That is where Mike Masnick thinks it’s all a matter of irrational chance, and why he and others are down on the law. But I disagree. It is not a matter of luck. We make our own luck. There is more to the making of landmark litigation than meets the eye. It is a matter of hard work and dedication.

Those of us in a position to educate our judges and lawyers must do so now. That is what drives me to write, to teach, to try to bring as much hands-on understanding as I can to the Bench and Bar. Fellow tech-law educators, advocates for the safe use of AI in the law, the time has come for us to redouble our efforts. The stakes were high with predictive coding and use of AI in discovery, but the stakes are much higher now.

Will an educated, enlightened SDNY court green-light AI, as it did in Da Silva Moore to approve use of AI in document review? Will that trigger an greater boon for generative AI? Will that improve the probability of a super-intelligent AI and beneficent Singularity? Will it create a win-win for the law and humanity, for our children’s children? See e.g. Sam Altman’s Favorite Unasked Question: What Will We Do in the Future After AI? (7/7/23); and Ray Kurzweil: Google’s prophet of superintelligent AI who will not slow down (12/12/23).

If the courts neither approve, nor strike down, if they just pass for now, that will not so bad. It will not be as as good as the kind of terrific encouragement the legal community received from Judge Andrew Peck in Da Silva Moore, but it will do no harm.

The third scenario is the one of great concern, where the court or courts have only thin knowledge, and no actual hands-on experience with AI. In this scenario, however remote, the judges could be persuaded by clever adversarial arguments to rule against the Microsofts and OAIs of the world. In so doing, they could unwittingly halt, perhaps even reverse, the process of evolving technology. The positive AI systems could crash and burn. Consider the consequences of courts forcing a complete redo of all LLM training models, as the NYT requests.

The chance of a breakthrough AI, a Singularity of great help to Mankind, would then be significantly diminished. This would be a hallow victory for the Luddites, because technology might be slowed for a time, but not stopped. Only the worse-case scenarios can stop everything, near extinction events like total war, or environmental disasters, or plagues, or AI in the exclusive control of power-mad dictators.

The policy implications of NYT v. Microsoft and OAI are enormous. Look around and what do you see? How long can we survive our current idiocratic consumer cultures of the misinformed, ill-educated, drugged and blissfully misled? Does humanity need to boost its intelligence to survive? Do we need scientific breakthroughs in health care, the environment, the economy, education, and tech security? How important is that?

Come to your own conclusions and take action. If you are so inclined, follow the path I am on. Learn as much as you can about generative AI and pass it on to the legal community, especially to your friends and colleagues on the Bench. Ultimately, they will be called upon to make the decisions. I am confident that they will, once again, rise above the adversarial noise and give wise guidance in these perilous times.

Written by:

EDRM - Electronic Discovery Reference Model
Contact
more
less

EDRM - Electronic Discovery Reference Model on:

Reporters on Deadline

"My best business intelligence, in one easy email…"

Your first step to building a free, personalized, morning email brief covering pertinent authors and topics on JD Supra:
*By using the service, you signify your acceptance of JD Supra's Privacy Policy.
Custom Email Digest
- hide
- hide