Anthropic And Perplexity Mired In Copyright Cases

Posted by John Werner, Contributor


In recent weeks, there’s been a lot of attention on a case before Judge William Alsup of the U.S. District Court for the Northern District of California, in which class action litigants accuse Anthropic of pirating their copyrighted works to train its LLM and of hurting their livelihoods with the information gleaned. The original plaintiffs, Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson, claim that Anthropic used BitTorrent to obtain their works, and as the case moved forward, other plaintiffs appeared, including additional authors and holders of music copyrights.

Settling the Case

Reports from Aug. 26 make clear that progress is being made toward resolving the case.

“Anthropic has reached a preliminary settlement in a class action lawsuit brought by a group of prominent authors, marking a major turn in one of the most significant ongoing AI copyright lawsuits in history,” writes Kate Knibbs at Wired. “The move will allow Anthropic to avoid what could have been a financially devastating outcome in court.”

Specifically, Knibbs cited Anthropic’s use of LibGen, a shadow library of pirated books, which the judge found constituted piracy, and she laid out how much that could have cost the defendant, writing:

“Statutory damages for this kind of piracy start at $750 per infringed work, according to US copyright law. Because the library of books amassed by Anthropic was thought to contain approximately 7 million works, the AI company was potentially facing court-imposed penalties amounting to billions, possibly more than $1 trillion dollars.”
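For context on those numbers: even the $750 statutory minimum applied across roughly 7 million works would come to about $5.25 billion, and because US copyright law permits statutory damages of up to $150,000 per work for willful infringement, the theoretical ceiling approaches $1.05 trillion, which is presumably the source of the trillion-dollar figure.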

She also quoted remarks by Santa Clara University law professor Edward Lee on the case.

“It’s a stunning turn of events, given how Anthropic was fighting tooth and nail in two courts in this case,” Lee said. “And the company recently hired a new trial team … But they had few defenses at trial, given how Judge Alsup ruled. So Anthropic was starting at the risk of statutory damages in ‘doomsday’ amounts.”

So that’s one prominent example of the struggle between content creators and AI companies, but it’s not the only one on the docket.

There are also new wrinkles in a case where multiple parties are suing AI firm Perplexity for using copyrighted content to feed its “answer engine.”

The Perplexity of Fair Use

News publishers are claiming that Perplexity’s use of a Retrieval-Augmented Generation (RAG) model constitutes copyright infringement, and that the AI company has unfairly used their work, harming their business models. But to understand who’s bringing this case, you have to parse the hallways of a particular corporate labyrinth.

The plaintiffs are Dow Jones, which publishes the Wall Street Journal, and NYP Holdings, which publishes the New York Post. Both firms are owned by the parent company News Corp.

In any case, the publishers outline three components of the harm they attribute to Perplexity. All three are documented in the latest filings in the case, including a court opinion and order filed Aug. 25.

Triple Threat

First, Dow Jones and NYP Holdings are arguing that Perplexity violates their copyrights by putting massive amounts of news into its RAG repository, the database that it uses to feed the LLM.
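For readers unfamiliar with the architecture, here’s a rough sketch of how a RAG pipeline stores and retrieves text. Everything below is illustrative: the class and function names are hypothetical, and the toy word-count “embedding” stands in for the neural embedding models real systems use. Nothing here describes Perplexity’s actual implementation.

```python
# Minimal, hypothetical sketch of a RAG pipeline.
# The bag-of-words "embedding" is a toy stand-in for a real
# neural embedding model; names are illustrative only.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: word counts. Real systems use learned vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class RagStore:
    def __init__(self):
        self.docs = []  # (embedding, original article text)

    def ingest(self, article: str):
        # The step at issue in the suit: article text is stored verbatim.
        self.docs.append((embed(article), article))

    def retrieve(self, query: str, k: int = 3):
        # Return the k stored articles most similar to the query.
        q = embed(query)
        ranked = sorted(self.docs, key=lambda p: -cosine(q, p[0]))
        return [doc for _, doc in ranked[:k]]

store = RagStore()
store.ingest("Example news article text about markets ...")
context = store.retrieve("What happened in the markets today?")
prompt = "Answer using these sources:\n" + "\n---\n".join(context)
# The prompt, including the retrieved article text, is then sent to an LLM.
```

The legally interesting step is the ingestion: the repository holds the publishers’ article text itself, not just facts distilled from it, which is why the plaintiffs single out the database.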

Next, the plaintiffs are claiming that some of the LLM’s results “contain full or partial verbatim reproductions of Plaintiffs’ copyrighted articles.” There’s anecdotal evidence of users getting Perplexity to spit out whole NYP articles verbatim, which doesn’t look good for the defendant.

The third argument is also laid out well in the filing documents, where plaintiffs allege that the LLM is “generat[ing] made-up text (hallucinations) in its outputs and attribut[ing] that text to Plaintiffs’ publications using Plaintiffs’ trademarks.”

In other words, the LLM is essentially putting words in the mouths of the plaintiffs, which sounds like potential defamation. The court document states:

“Plaintiffs also provide photo examples of hallucinations generated by Perplexity, in which Perplexity ‘fabricated’ information not actually contained in the New York Post and Wall Street Journal articles that Perplexity cited. Plaintiffs claim that hallucinations such as these are ‘likely to cause confusion or mistake’ for consumers.”

It’s not hard to imagine that evidence like that could be very damaging, indeed.

Moving Forward

As things stand right now, the court’s recent opinion allows the plaintiffs to move forward with the case. It’s interesting to consider how all of this works in practice, given that it’s a new kind of challenge involving digital content that is, in so many ways, publicly available despite being under copyright.

In other words, if the content is on the Internet, and not paywalled, isn’t it fair to read it? And if the LLM “reads” it and “remembers” it, can’t the LLM provide that information to users? Isn’t that just like a human expert giving his or her research-based knowledge to a client? Is it the lodging of the data in the RAG database that’s a problem?

One argument would be that Perplexity is charging people to use the answer engine. But it seems strange to say that it’s out of bounds for the LLM to use publicly reported facts that are on the Internet for human readers. If someone reads something and then charges for a lecture, is that a violation, too?

Frankly, it seems like there’s a sort of philosophical component to this, where we have to figure out free and fair use standards for the internet and for AI. Maybe these two cases, and others like them, are doing that for us.


