Judge Failla’s Opinion in Dow Jones v. Perplexity: RAG as Mechanism of Infringement

Judge Failla’s opinion in Dow Jones v. Perplexity doesn’t just keep the case alive—it frames RAG itself as the act of copying, and raises the specter of inducement liability under Grokster.

Although Judge Katherine Polk Failla’s August 21, 2025 opinion in Dow Jones & Co. v. Perplexity is technically a procedural ruling denying Perplexity’s motions to dismiss or transfer, it offers an unusually candid window into how the Court may view the substance of the case. In particular, her treatment of retrieval-augmented generation (RAG) is striking: rather than treating it as Perplexity’s background plumbing, she identified it as the mechanism by which copyright infringement and trademark misattribution allegedly occur.

Remember, Perplexity’s CEO described the company to Forbes this way: “It’s almost like Wikipedia and ChatGPT had a kid.” I’m still looking for that attribution under the Wikipedia Creative Commons license.

As readers may recall, I’ve been very interested in RAG as an open door for infringement actions, so naturally this discussion caught my eye. So we’re all on the same page: retrieval-augmented generation (RAG) uses a “vector database” to expand an AI system’s knowledge beyond what is locked in its training data, to include recent news sources, for example.

When you prompt a RAG-enabled model, it first searches the database for context, then weaves that information into its generated answer. This architecture makes outputs more accurate, current, and domain-specific, but it also raises questions about copyright, data governance, and intentional use of third-party content, largely because RAG may rely on information outside the model’s training data. For example, if I queried “single bullet theory,” the AI might have a copy of the Warren Commission report in its training data, but it would need to go out on the web for the latest declassified JFK materials, or news reports about those materials, to give a complete answer.
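For the technically curious, here is a minimal sketch of that retrieve-then-generate loop. Everything in it is my own illustration, not Perplexity’s code: the bag-of-words “embedding,” the stub llm() function, and the sample documents are placeholders for the learned embeddings, vector database, and model call a production system would use. The control flow, though, is the RAG pattern the opinion describes.

```python
# A toy retrieve-then-generate loop. The bag-of-words "embedding," the stub
# llm(), and the sample documents are invented for illustration; production
# systems use learned dense vectors, a real vector database, and an actual
# model call, but the control flow is the same.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a learned embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def llm(prompt: str) -> str:
    # Stand-in for the generation step (normally a model or API call).
    return f"[answer generated from]\n{prompt}"

# Assume source articles were already ingested into this "vector database."
index = [
    {"text": "Newly declassified JFK files were released this week.", "source": "news"},
    {"text": "The Warren Commission's single bullet theory holds that one bullet caused multiple wounds.", "source": "report"},
]
for doc in index:
    doc["vector"] = embed(doc["text"])

def answer(prompt: str, top_k: int = 2) -> str:
    # Step 1: retrieval. Rank the stored documents against the prompt.
    query = embed(prompt)
    ranked = sorted(index, key=lambda d: cosine(d["vector"], query), reverse=True)
    context = "\n\n".join(d["text"] for d in ranked[:top_k])
    # Step 2: generation. The retrieved text is placed in the model's prompt,
    # steering it toward specific source material.
    return llm(f"Using only this context:\n{context}\n\nAnswer: {prompt}")

print(answer("What is the single bullet theory?"))
```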

You can also think of Google Search or Bing as a kind of RAG index—and you can see how that would give the search engine operators a big leg up in the AI race, even though none of their various safe harbors, Creative Commons licenses, Google Books, or direct licenses were granted for this RAG purpose. So there’s that.

Judge Failla’s RAG Analysis

As Judge Failla explained, Perplexity’s system “relies on a retrieval-augmented generation (‘RAG’) database, comprised of ‘content from original sources,’ to provide answers to users,” with the indices “comprised of content that [Perplexity] want[s] to use as source material from which to generate the ‘answers’ to user prompts and questions.” The model then “repackages the original, indexed content in written responses … to users,” with the RAG technology “tell[ing] the LLM exactly which original content to turn into its ‘answer.’” Or as another judge once said, “One who distributes a device with the object of promoting its use to infringe copyright, as shown by clear expression or other affirmative steps taken to foster infringement, going beyond mere distribution with knowledge of third-party action, is liable for the resulting acts of infringement by third parties using the device, regardless of the device’s lawful uses.” Or something like that.

On that basis, Judge Failla recognized Plaintiffs’ claim that the wrongdoing occurred at multiple points in the process: “first, by ‘copying a massive amount of Plaintiffs’ copyrighted works as inputs into its RAG index’; second, by providing consumers with outputs that ‘contain full or partial verbatim reproductions of Plaintiffs’ copyrighted articles’; and third, by ‘generat[ing] made-up text (hallucinations) … attribut[ed] … to Plaintiffs’ publications using Plaintiffs’ trademarks.’” In her jurisdictional analysis, Judge Failla stressed that these “inputs are significant because they cause Defendant’s website to produce answers that are reproductions or detailed summaries of Plaintiffs’ copyrighted works,” thus tying the alleged misconduct directly to Perplexity’s business activities in New York, although she was not making a substantive ruling in this instance.

What is RAG and Why It Matters

Retrieval-augmented generation is a method that pairs two steps: (1) retrieval of content from external databases or the open web, and (2) generation of a synthetic answer using a large language model. Instead of relying solely on the model’s pre-training, RAG systems point the model toward selected source material (news articles, scientific papers, legal databases) and instruct it to weave that content into an answer.

From a user perspective, this can produce more accurate, up-to-date results. But from a legal perspective, the same pipeline can directly copy or closely paraphrase copyrighted material, often without attribution, and can even misattribute hallucinated text to legitimate sources. This dual role of RAG (retrieving copyrighted works as inputs and reproducing them as outputs) is exactly what made it central to Judge Failla’s opinion procedurally, and it may also show where she is headed substantively.
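To make the input side concrete, here is a sketch of the ingestion step, again my own illustration with hypothetical names rather than any lab’s actual pipeline. The thing to notice is that each index entry carries a literal copy of the source passage; the vector is just a lookup key pointing back to that stored copy, which is why “copying … as inputs” is a natural way to describe building the index.

```python
# Ingestion sketch: chunking an article into a RAG index. Each index entry
# stores the article's text verbatim; the vector is just a retrieval key
# that points back to that stored copy. All names here are hypothetical.
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a learned embedding model.
    return Counter(text.lower().split())

def chunk(text: str, size: int = 40) -> list:
    # Split the article into fixed-size word windows.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

index = []

def ingest(article_text: str, source_url: str) -> None:
    for passage in chunk(article_text):
        index.append({
            "vector": embed(passage),  # the lookup key
            "text": passage,           # a literal copy of the source text
            "source": source_url,
        })

# Hypothetical usage with an invented article and URL.
ingest("Quarterly profits rose sharply at the company.", "https://example.com/article")
print(len(index), "entry; stored text:", index[0]["text"])
```

Whatever the production details, any pipeline of this shape necessarily holds reproductions of the ingested works on the way in.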

RAG in Frontier Labs

RAG is not a niche technique. It has become standard practice at nearly every frontier AI lab:

– OpenAI uses retrieval plug-ins and Bing integrations to ground ChatGPT answers.
– Anthropic deploys RAG pipelines in Claude for enterprise customers.
– Google DeepMind integrates RAG into Gemini and search-linked models.
– Meta builds retrieval into LLaMA applications and its Meta AI assistant.
– xAI grounds Grok with live retrieval from X posts and web search.
– Microsoft has made Copilot fundamentally a RAG product, pairing Bing with GPT.
– Cohere, Mistral, and other independents market RAG as a service layer for enterprises.

Why Dow Jones Matters Beyond Perplexity

Perplexity just happens to be the first reported opinion on the issue, as far as I know. The technical structure of its answer engine—indexing copyrighted content into a RAG system, then repackaging it for users—is not unique. It mirrors how the rest of the frontier labs are building their flagship products. What makes this case important is not that Perplexity is an outlier, but that it illustrates the legal vulnerability inherent in the RAG architecture itself.

Is RAG the Low-Hanging Fruit?

What makes this case so consequential is not just that Judge Failla recognized, at least for purposes of this ruling, that RAG is one mechanism of infringement, but that RAG cases may be easier to prove than disputes over model training inputs. Training claims often run into evidentiary hurdles: plaintiffs must show that their works were included in massive, opaque training corpora, that those works influenced model parameters, and that the resulting outputs are “substantially similar.” That chain of proof can be complex and indirect.

By contrast, RAG systems operate in the open. They index specific copyrighted articles, feed them directly into a generation process, and sometimes output verbatim or near-verbatim passages. Plaintiffs can point to before-and-after evidence: the copyrighted article itself, the RAG index that ingested it, and the system’s generated output reproducing it. That may make copyright infringement far more straightforward to prove than in a pure training case.
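That before-and-after comparison is mechanical enough to script. The sketch below is my own illustration using Python’s standard-library difflib with invented sample strings; a real expert report would use more robust text-alignment tooling, but the idea is the same: pull the longest verbatim passage shared by the source article and the system’s output.

```python
# Find the longest verbatim passage shared by a source article and an AI
# output, using only the standard library. The sample strings are invented.
from difflib import SequenceMatcher

def longest_verbatim_overlap(article: str, output: str) -> str:
    matcher = SequenceMatcher(None, article, output, autojunk=False)
    m = matcher.find_longest_match(0, len(article), 0, len(output))
    return article[m.a : m.a + m.size]

article = "the committee voted 5-4 to approve the merger after a contentious hearing."
ai_output = "according to reports, the committee voted 5-4 to approve the merger."
print(repr(longest_verbatim_overlap(article, ai_output)))
# Prints the shared span: 'the committee voted 5-4 to approve the merger'
```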

For that reason, Perplexity just happened to be first, but it will not be the last. Nearly every frontier lab (OpenAI, Anthropic, Google, Meta, Microsoft) is relying on RAG as the architecture of choice to ground its models. If RAG is the legal weak point, this opinion could mark the opening salvo in a much broader wave of litigation aimed at AI platforms, with courts treating RAG not as a technical curiosity but as a direct, provable conduit for infringement.

And lurking in the background is a bigger question: is Grokster going to be Judge Failla’s roundhouse kick? That irony is delicious.  By highlighting how Perplexity (and the others) deliberately designed its system to ingest and repackage copyrighted works, the opinion sets the stage for a finding of intentionality that could make RAG the twenty-first-century version of inducement liability.

From Plutonium to Prompt Engineering: Big Tech’s Land Grab at America’s Nuclear Sites–and Who’s Paying for It?

In a twist of post–Cold War irony, the same federal sites that once forged the isotopes of nuclear deterrence are now poised to fuel the arms race of artificial intelligence under the leadership of Special Government Employee and Silicon Valley Viceroy David Sacks. Under a new Department of Energy (DOE) initiative, 16 legacy nuclear and lab sites — including Savannah River, Idaho National Lab, and Oak Ridge, Tennessee — are being opened to private companies to host massive AI data centers. That’s right: Tennessee, where David Sacks is riding roughshod over the ELVIS Act.

But as this techno-industrial alliance gathers steam, one question looms large: Who benefits — and how will the American public be compensated for leasing its nuclear commons to the world’s most powerful corporations? Spoiler alert: We won’t.

A New Model, But Not the Manhattan Project

This program is being billed in headlines as a “new Manhattan Project for AI.” But that comparison falls apart quickly. The original Manhattan Project was:
– Owned by the government
– Staffed by public scientists
– Built for collective defense

Today’s AI infrastructure effort is:
– Privately controlled
– Driven by monopolies and venture capital
– Structured to avoid transparency and public input
– Built on free leases of public land, with private nuclear reactors

Call it the Manhattan Project in reverse — not national defense, but national defense capture.

The Art of the Deal: Who gets what?

What Big Tech Is Getting

– Access to federal land already zoned, secured, and wired
– Exemption from state and local permitting
– Bypass of grid congestion via nuclear-ready substations
– DOE’s help fast-tracking small modular reactors (SMRs)
– Potential sovereign AI training enclaves, shielded from export controls and oversight

And all of it is being made available to private companies called the “Frontier labs”: Microsoft, Oracle, Amazon, OpenAI, Anthropic, xAI — the very firms at the center of the AI race.

What the Taxpayer Gets (Maybe)

Despite this extraordinary access, almost nothing is disclosed about how the public is compensated. No known revenue-sharing models. No guaranteed public compute access. No equity. No royalties.

– Land lease payments? Not disclosed. Probably none.
– Local tax revenue? Minimal (federal lands are exempt).
– Infrastructure benefit sharing? Unclear or limited.

It’s all being negotiated quietly, under vague promises of “national competitiveness.”

Why AI Labs Want DOE Sites

Frontier labs like OpenAI and Anthropic — and their cloud sponsors — need:
– Gigawatts of energy
– Secure compute environments
– Freedom from export rules and Freedom of Information Act requests
– Permitting shortcuts and national branding

The DOE sites offer all of that — plus built-in federal credibility. The same labs currently arguing in court that their training practices are “fair use” now claim they are defenders of democracy training AI on taxpayer-built land.

This Isn’t the Manhattan Project — It’s the Extraction Economy in a Lab Coat

The tech industry loves to invoke patriotism when it’s convenient — especially when demanding access to federal land, nuclear infrastructure, or diplomatic cover from the EU’s AI Act. But let’s be clear:

This isn’t the Manhattan Project. Or rather, we should hope it isn’t, because that one didn’t end well and still hasn’t ended.
It’s not public service.
It’s Big Tech lying about fair use, wrapped in an American flag — and for all we know, it might be the first time David Sacks ever saw one.

When companies like OpenAI and Microsoft claim they’re defending democracy while building proprietary systems on DOE nuclear land, we’re not just being gaslit — we’re being looted.

If the AI revolution is built on nationalizing risk and privatizing power, it’s time to ask whose country this still is — and who gets to turn off the lights.

Chronology: The Week in Review: Search is Ripping Us Off Again, this time for AI; #FreeJimmyLai; the MLC loves the MLC almost as much as the NMPA

In case you think that infringement of copyrights in AI training happens only to rock stars, photographers, record companies or movie studios, let me give you a personal anecdote that might cause you to reconsider. I asked Bing’s Copilot AI tool to list the deal points in a record producer agreement, and this is what I got back:

Copilot

Certainly! When negotiating a producer agreement, several key points should be considered. Here are some common negotiation points from the artist’s perspective:

  1. Scope of Engagement:
     – The producer is generally engaged for one album project at a time.
     – Clarify the scope of work, including the number of songs or recordings to be produced.
  2. Budget and Compensation:
     – Determine the budget for the project (e.g., $10,000 in your case).
     – Agree on the producer’s compensation, which may include an upfront fee, points on sales, or a combination.
  3. Ownership and Rights:
     – Address ownership of the master recordings:
       – Will the artist own and control the masters?
       – Does the producer retain any rights?
  4. Credits and Royalties:
     – Specify how the producer will be credited on releases.
     – Discuss any royalties or backend points for the producer.
  5. Recording Schedule:
     – Set a timeline for recording sessions.
     – Define the two-week period for your project.
  6. Logistics:
     – Determine where the recordings will take place.
     – Outline the artist’s approval process.
     – Define the producer’s deliverables.
  7. Termination Clause:
     – Include provisions for termination if either party breaches the agreement.
     – Address payment for work completed up to that point.

Remember that each negotiation is unique, and it’s essential to tailor the agreement to your specific needs and circumstances. Legal advice is recommended to ensure compliance with local laws and industry standards [1][3][4].

Source(s)

1. An Artist’s Guide to Negotiating Record Producer Agreements – Medium

2. Negotiating a Music Producer Agreement: 7 Key Issues

3. 5 Things to Look for in a Music Producer Agreement – Lawyer Drummer

4. How to Negotiate: Steps, Phrases & Tips to Win the Deal – Selling Signals

Notice that Copilot lists “Source(s)” and that source #1 is an article from Medium.

My article.

I’ve written a number of articles about negotiating producer agreements, and I can categorically say I haven’t given permission or been asked to give permission for any of those articles to be scraped and used to train AI. So how did my article come to be in Copilot’s training materials?

My guess is that Medium (and probably a lot of other sites) was crawled by Bing’s spider and that copies of Medium pages were stored in Bing’s (or Microsoft’s) data centers. When Microsoft launched its Copilot AI tool, one could imagine that the temptation would have been strong to use those copies as part of the training data for Copilot. In fact, The Verge reports that “companies around the web are using your site and its data to build massive sets of training data, in order to build models and products that may not acknowledge your existence at all.”

Are you surprised then that two of the biggest operators in the AI space are the search engine operators Google and Microsoft? This is another example of how Big Tech helps itself to your data and work product without you even knowing it’s happening. So now what? Now I know I’m being ripped off, and I’m wondering if Medium is in on it.

The Verge tells us:

The ability to download, store, organize, and query the modern internet gives any company or developer something like the world’s accumulated knowledge to work with. In the last year or so, the rise of AI products like ChatGPT, and the large language models underlying them, have made high-quality training data one of the internet’s most valuable commodities. That has caused internet providers of all sorts to reconsider the value of the data on their servers, and rethink who gets access to what. 

Ya think?

#FreeJimmyLai

In case you were wondering if the Chinese Communist Party is serious about using TikTok to collect data as they please, Hong Kong publisher Jimmy Lai is a perfect example of what happens when a company refuses to cooperate. The CCP took over Jimmy’s Apple Daily newspaper and arrested him. Jimmy has been in prison for three years and has still not come to “trial” (whatever that means under the CCP).

Blake Morgan sounds off in Hypebot on how TikTok uses music as a honeypot to hook innocent users.

The MLC Reup

The MLC gets a five-year review of how it is doing. The Copyright Office conducts that review of both the MLC and the DLC. The public (that’s you) gets a chance to weigh in now because the MLC and the DLC have filed their respective written statements about their respective awesomeness.

One of the most interesting exchanges happened between Graham Davies (who is essentially the public face of the DLC and is the head of the Digital Media Association) and the NMPA. Graham offered muted criticism of the MLC, which irked the easily irked NMPA, which offered a rather scathing response.

The most interesting thing about the exchange was that it was the NMPA–not the MLC–that responded to Graham. Tell you anything?

Chronology: The Week in Review, Eric Schmidt Spills on his “Bait” to UK PM, Musk on AI Training and other news

Elon Musk Calls Out AI Training

We’ve all heard the drivel coming from Silicon Valley that AI training is fair use. During his interview with Andrew Ross Sorkin at the DealBook conference, Elon Musk (who ought to know, given his involvement with AI) said straight up that anyone who says AI doesn’t train on copyrighted works is lying.

The UK Government “Took the Bait”: Eric Schmidt Says the Quiet Part Out Loud on Biden AI Executive Order and Global Governance

There are a lot of moves being made in the US, UK and Europe right now that will affect copyright policy for at least a generation. Google’s former chairman Eric Schmidt has been working behind the scenes for at least the last two years to establish US artificial intelligence policy. Those efforts produced the “Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence,” the longest executive order in history. That EO was signed into effect by President Biden on October 30, so it’s done. (It is very unlikely that that EO was drafted entirely at Executive Branch agencies.)

You may ask, how exactly did this sweeping Executive Order come to pass? Who was behind it? Because someone always is. As you will see in his own words, Eric Schmidt, Google, and unnamed senior engineers from the existing AI platforms are quickly making the rules, and they essentially drafted the Executive Order that President Biden signed on October 30. That EO was then presented as what Mr. Schmidt calls “bait” to the UK government—which hosted a global AI safety conference convened by His Excellency Rishi Sunak (the UK’s tech bro Prime Minister) that just happened to start on November 1, the day after President Biden signed the EO, at Bletchley Park in the UK (see Alan Turing). (See “Excited schoolboy Sunak gushes as mentor Musk warns of humanoid robot catastrophe.”)

Remember, an executive order is an administrative directive from the President of the United States that addresses the operations of the federal government, particularly the vast Executive Branch. In that sense, Executive Orders are anti-majoritarian and are about as close to a royal decree or Executive Branch legislation as we get in the United States (see Separation of Powers, Federalist 47 and Montesquieu). Executive orders are not legislation; they require no approval from Congress, and Congress cannot simply overturn them.

So you can see how, if the special interests wanted to slide something by the people that would be difficult to undo or difficult to pass in the People’s House, an executive order would be the vehicle. And based on Eric Schmidt’s recent interview with Mike Allen at the Axios AI+ summit (available here), this appears to be exactly what happened with the sweeping and vastly concerning AI Executive Order. I strongly recommend that you watch Mike Allen’s “interview” with Mr. Schmidt, which fortunately is the first conversation in the rather long video of the entire event. I put “interview” in scare quotes because whatever it is, it isn’t the kind of interview that prompts probing questions that might put Mr. Schmidt on the spot. That’s understandable, because Axios is selling a conference and you simply won’t get senior corporate executives to attend if you put them on the spot. Not a criticism, but understand that you have to find value where you can. Mr. Schmidt’s ego provides plenty of value; it just doesn’t come from the journalists.

Crucially, Congress is not involved in issuing an executive order. Congress may refuse to fund the subject of an EO, which could make it difficult to give the order effect as a practical matter, but Congress cannot overturn an EO. Only a sitting U.S. President may overturn an existing executive order. In Mr. Schmidt’s interview at AI+, he tells us how all this regulatory activity happened:

The tech people along with myself have been meeting for about a year. The narrative goes something like this: We are moving well past regulatory or government understanding of what is possible, we accept that. [Remember, the antecedent of “we” is Schmidt and “the tech people,” or more broadly the special interests, not you, me or the American people.]

Strangely…this is the first time that the senior leaders who are engineers have basically said that they want regulation, but we want it in the following ways…which as you know never works in Washington [unless you can write an Executive Order and get the President to sign it because you are the biggest corporation in commercial history].

There is a complete agreement that there are systems and scenarios that are dangerous. [Agreement by or with whom? No one asks.] And in all of the big [AI platforms with which] you are familiar like GPT…all of them have groups that look at the guard rails [presumably internal groups of managers] and they put constraints on [their AI platform in their silo]. They say “thou shalt not talk about death, thou shalt not talk about killing”. [Anthropic, which received a $300 million investment from Google] actually trained the model with its own constitution [see “Claude’s Constitution“] which they did not just write themselves, they hired a bunch of people [actually Claude’s Constitution was crowdsourced] to design a “constitution” for an AI, so it’s an interesting idea.

The problem is none of us believe this is strong enough….Our opinion at the moment is that the best path is to build some IPCC-like environment globally that allows accurate information of what is going on to the policy makers. [This is a step toward global governance for AI (and probably the Internet) through the United Nations. IPCC is the Intergovernmental Panel on Climate Change.]

So far we are on a win, the taste of winning is there. If you look at the UK event which I was part of, the UK government took the bait, took the ideas, decided to lead, they’re very good at this, and they came out with very sensible guidelines. Because the US and UK have worked really well together—there’s a group within the National Security Council here that is particularly good at this, and they got it right, and that produced this EO which I think is the longest EO in history, that says all aspects of our government are to be organized around this.

Mr. Schmidt may say, aw shucks, dictating the rules to the government never works in Washington, but of course that’s simply not true if you’re Google, in which case it always works. And that’s how Mr. Schmidt got his EO and will now export it to other countries.

It’s not Just Google: Microsoft Is Getting into the Act on AI and Copyright

Be sure to read Joe Bambridge (Politico’s UK editor) on Microsoft’s moves in the UK. You have to love the “don’t make life too difficult for us” line–as in, don’t make life difficult by asking them to respect copyright.

Google and New Mountain Capital Buy BMI: Now what?

Careful observers of the BMI sale were not led astray by BMI’s Thanksgiving week press release that was dutifully written up as news by most of the usual suspects except for the fabulous Music Business Worldwide and…ahem…us. You may think we’re making too much out of the Google investment through its CapitalG side fund, but judging by how hard BMI tried to hide that investment, I’d say Google’s post-sale involvement is probably proportional to how deeply the lede was buried. Not to mention the culture clash over the ageism so common at Google: if you’re a BMI employee who is over 30 and didn’t go to Carnegie Mellon, good luck.

And songwriters? Get ready to jump if you need to.

Spotify Brings the Streaming Monopoly to Uruguay

After Uruguay became the first Latin American country to pass a streaming remuneration law to protect artists, Spotify threw its toys out of the pram and threatened to go home. Can we get that in writing? A Spotify exit would probably be the best thing that ever happened to increase local competition in a Spanish-language country. Also, this legislation has been characterized as “equitable remuneration,” which it really isn’t; it’s its own thing (see the paper I wrote for WIPO with economist Claudio Feijoo). Complete Music Update’s Chris Cooke suggested that a likely result of Spotify paying the royalty would be that they would simply do a cram down with the labels in the next round of license negotiations. If that’s not prohibited in the statute, it should be, and it’s really not “paying twice for the same music” anyway. The streaming remuneration is compensation for the streamers’ use of and profit from the artists’ brand (both featured and nonfeatured), e.g., as stated in the International Covenant on Economic, Social and Cultural Rights and many other human rights documents:

The Covenant recognizes everyone’s right — as a human right — to the protection and the benefits from the protection of the moral and material interests derived from any scientific, literary or artistic production of which he or she is the author. This human right itself derives from the inherent dignity and worth of all persons.