Deduplication and Discovery: The Smoking Gun in the Machine

WINSTON

“Wipe up all those little pieces of brains and skull”

From Pulp Fiction, screenplay by Quentin Tarantino and Roger Avary

Deduplication—the process of removing identical or near-identical content from AI training data—is a critical yet often overlooked indicator that AI platforms actively monitor and curate their training sets. It is exactly the kind of process one would expect given the “scrape, ready, aim” business practices of AI platforms that have ready access to large amounts of fairly high quality data from users of other products placed into commerce by the platforms’ business affiliates or confederates.

For example, Google Gemini could have access to Gmail, YouTube, at least “publicly available” Google Docs, Google Translate, or Google for Education, and then of course one of the great scams of all time, Google Books. Microsoft uses Bing searches, MSN browsing, the consumer Copilot experience, and ad interactions. Amazon uses Alexa prompts, Facebook uses “public” posts, and so on.

This kind of hoovering up of indiscriminate amounts of “data” in the form of your baby pictures posted on Facebook and your user-generated content on YouTube is bound to produce duplicates. After all, how many users have posted their favorite Billie Eilish or Taylor Swift music video? AI doesn’t need 10,000 versions of “Shake It Off”; it probably just needs the official video. Enter deduplication, which by definition means the platform knows what it has scraped and also knows what it wants to get rid of.

“Get rid of” is a relative concept. In many systems—particularly in storage environments like backup servers or object stores—deduplication means keeping only one physical copy of a file. Any other instances of that data don’t get stored again; instead, they’re represented by pointers to the original copy. This approach, known as inline deduplication, happens in real time and minimizes storage waste without actually deleting anything of functional value. It requires knowing what you have, knowing you have more than one version of the same thing, and being able to tell the system where to look to find the “original” copy without disturbing the process and burning compute inefficiently.
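For the technically curious, here is a minimal sketch of how inline deduplication can work using ordinary content hashes. The store and pointer structures are illustrative assumptions for exposition, not any particular platform’s storage layer.

```python
import hashlib

# Illustrative inline deduplication (an assumption for exposition, not any
# platform's actual storage layer): each unique blob of content is stored
# once, keyed by its SHA-256 hash; repeat ingests become pointers.
store = {}      # content hash -> the single physical copy
pointers = {}   # item id -> content hash it points to

def ingest(item_id: str, data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    if digest not in store:
        store[digest] = data       # first sighting: keep the physical copy
    pointers[item_id] = digest     # every later copy is just a pointer
    return digest

ingest("upload-1", b"official music video bytes ...")
ingest("upload-2", b"official music video bytes ...")   # identical re-upload
print(len(store), "physical copy,", len(pointers), "logical items")
```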

In other cases, such as post-process deduplication, the system stores data initially, then later scans for and eliminates redundancies. Again, the AI platform knows there are two or more versions of the same thing, say the book Being and Nothingness, knows where to find the copies and has been trained to keep only one version. Even here, the duplicates may not be permanently erased—they might be archived, versioned, or logged for auditing, compliance, or reconstruction purposes.
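A post-process pass looks more like the following sketch: everything is stored first, then scanned; one copy per unique hash is kept, and the extras are set aside and logged rather than erased. The structures and the “keep the first copy” rule are illustrative assumptions, not a vendor’s actual tooling.

```python
import hashlib
from collections import defaultdict

# Illustrative post-process deduplication (assumed workflow): data is stored
# first, then scanned; one copy per hash is kept, the rest are archived, and
# each decision is logged rather than destroyed outright.
stored = {
    "scan-001": b"Being and Nothingness, full text ...",
    "scan-002": b"Being and Nothingness, full text ...",   # a second copy
    "scan-003": b"an entirely different book ...",
}

by_hash = defaultdict(list)
for item_id, data in stored.items():
    by_hash[hashlib.sha256(data).hexdigest()].append(item_id)

kept, archived, dedup_log = {}, {}, []
for digest, item_ids in by_hash.items():
    keep, *extras = sorted(item_ids)           # keep one copy per unique hash
    kept[keep] = digest
    for extra in extras:
        archived[extra] = digest               # set aside, not destroyed
        dedup_log.append({"kept": keep, "archived": extra, "hash": digest})

print(dedup_log)
```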

In AI training contexts, deduplication usually means removing redundant examples from the training set, both to improve training efficiency and to reduce the risk that a model memorizes and regurgitates repeated material verbatim—which is itself a copyright exposure. The duplicate content may be discarded from the training pipeline but often isn’t destroyed. Instead, AI companies may retain it in a separate filtered corpus or keep hashed fingerprints to ensure future models don’t retrain on the same material unknowingly.
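A hashed fingerprint filter of that kind can be as simple as the sketch below; the workflow is an assumption for illustration, not any company’s documented practice.

```python
import hashlib

# Illustrative fingerprint filter (assumed practice, not a documented
# pipeline): hashes of previously filtered documents are retained so a later
# training run can skip the same material without keeping the text itself.
filtered_fingerprints = {
    hashlib.sha256(doc.encode()).hexdigest()
    for doc in ["text of a duplicate removed from an earlier run ..."]
}

def admit_to_training(doc: str) -> bool:
    return hashlib.sha256(doc.encode()).hexdigest() not in filtered_fingerprints

print(admit_to_training("text of a duplicate removed from an earlier run ..."))  # False
print(admit_to_training("a new, never-seen document"))                           # True
```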

So they know what they have, and likely know where it came from. They just don’t want to tell any plaintiffs.

Ultimately, deduplication is less about destruction and more about optimization. It’s a way to reduce noise, save resources, and improve performance—while still allowing systems to track, reference, or even rehydrate the original data if needed.

Its existence directly undermines claims that companies are unaware of which copyrighted works were ingested. Indeed, it only makes sense that one of the hidden consequences of the indiscriminate scraping that underpins large-scale AI training is the proliferation of duplicated data. Web crawlers ingest everything they can access—news articles republished across syndicates, forum posts echoed in aggregation sites, Wikipedia mirrors, boilerplate license terms, spammy SEO farms repeating the same language over and over. Without any filtering, this avalanche of redundant content floods the training pipeline.

This is where deduplication becomes not just useful, but essential. It’s the cleanup crew after a massive data land grab. The messier and more indiscriminate the scraping, the more aggressively the platform must filter for quality, relevance, and uniqueness to avoid training inefficiencies or—worse—model behaviors skewed by repetition. If a model sees the same phrase or opinion thousands of times, it might assume it’s authoritative or universally accepted, even if it’s just a meme bouncing around low-quality content farms.

Deduplication is sort of the Winston Wolf of AI. And if the cleaner shows up, somebody had to order the cleanup. It is a direct response to the excesses of indiscriminate scraping. It’s both a technical fix and a quiet admission that the underlying data collection strategy is, by design, uncontrolled. But while the scraping may be uncontrolled, grabbing copies of as much of your data as they can lay hands on (often by cleverly changing their terms-of-use boilerplate so it all happens under the effluvia of legality), they send in the cleaner to take care of the crime scene.

So to summarize: To deduplicate, platforms must identify content-level matches (e.g., multiple copies of Being and Nothingness by Jean-Paul Sartre). This process requires tools that compare, fingerprint, or embed full documents—meaning the content is readable and classifiable, and, oh, yes, discoverable.
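To make the point concrete, here is a toy sketch of content-level matching. The word-shingle fingerprint and the 0.8 threshold are illustrative assumptions; the takeaway is that the pipeline has to read and normalize the full text before it can decide two files are the same work.

```python
# Toy content-level matching sketch (5-word shingles and a 0.8 threshold are
# illustrative assumptions): the text must be readable to be fingerprinted.
def shingles(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 0))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

copy_1 = "being and nothingness an essay on phenomenological ontology scanned copy one"
copy_2 = "being and nothingness an essay on phenomenological ontology scanned copy two"

similarity = jaccard(shingles(copy_1), shingles(copy_2))
# Two full scans of the same book would score near 1.0; these toy strings
# just show the mechanics.
print(round(similarity, 2), similarity >= 0.8)
```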

Platforms may choose the ‘cleanest’ copy to keep, showing knowledge and active decision-making about which version of a copyrighted work is retained. And, big finish, removing duplicates only makes sense if operators know which datasets they scraped and what those datasets contain.

Drilling down on a platform’s deduplication tools and practices may prove up knowledge and intent to a precise degree—contradicting arguments of plausible deniability in litigation. “Johnny ate the cookies” isn’t going to fly. There’s a market-clearing level of record-keeping necessary for deduping to work at all, so it’s likely that there are internal deduplication logs or tooling pipelines that are discoverable.

When AI platforms object to discovery about deduplication, plaintiffs can often overcome those objections by narrowing their focus. Rather than requesting broad details about how a model deduplicates its entire training set, plaintiffs should ask a simple, specific question: Were any of these known works—identified by title or author—deduplicated or excluded from training?

This approach avoids objections about overbreadth or burden. It reframes discovery as a factual inquiry, not a technical deep dive. If the platform claims the data was not retained, plaintiffs can ask for existing artifacts—like hash filters, logs, or manifests—or seek a sworn statement explaining the loss and when it occurred. That, in turn, opens the door to potential spoliation arguments.

If trade secrets are cited, plaintiffs can propose a protective order limiting access to outside counsel or experts, as we’ve done 100,000 times before in other cases. And if the defendant claims “duplicate” is too vague, plaintiffs can define it functionally—as content that’s identical or substantially similar, whether by hash, tokens, or vectors.
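Written as code, that functional definition fits in a few lines. The three tests and the thresholds below are illustrative assumptions, not a proposed industry standard.

```python
import hashlib
import math

# Illustrative functional definition of "duplicate": identical by hash, or
# substantially similar by token overlap or by embedding-vector cosine
# similarity (thresholds are assumptions for the sake of the example).
def same_hash(a: str, b: str) -> bool:
    return hashlib.sha256(a.encode()).digest() == hashlib.sha256(b.encode()).digest()

def token_jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def is_duplicate(a: str, b: str, vec_a=None, vec_b=None) -> bool:
    if same_hash(a, b):
        return True                                   # identical content
    if token_jaccard(a, b) >= 0.8:
        return True                                   # substantially similar tokens
    if vec_a is not None and vec_b is not None and cosine(vec_a, vec_b) >= 0.95:
        return True                                   # substantially similar vectors
    return False

print(is_duplicate("the same text", "the same text"))  # True
```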

Most importantly, deduplication is relevant. If a platform identified a plaintiff’s work and trained on it anyway, that speaks to volitional use, copying, and lack of care—key issues in copyright and fair use analysis. And if they lied about it, particularly to the court—Helloooooo Harper & Row. Discovery requests that are focused, tailored, and anchored in specific works stand a far better chance of surviving objections and yielding meaningful evidence.

Creators Rally Behind Cyril Vetter’s Termination Rights Case in the Fifth Circuit

Songwriter and publisher Cyril Vetter is at the center of a high-stakes copyright case over his song “Double Shot of My Baby’s Love,” with massive implications for authors’ termination rights under U.S. law. His challenge to Resnik Music Group has reached the Fifth Circuit Court of Appeals, and creators across the country are showing up in force, with a wave of amicus briefs filed in support, including one from the Artist Rights Institute. Let’s consider the case on appeal.

At the heart of Vetter’s case is a crucial question: When a U.S. author signs a U.S. contract governed by U.S. law and the author (or the author’s heirs) later invokes the 35-year termination right under Sections 203 and 304 of the U.S. Copyright Act, does that termination recover only U.S. rights (the conventional wisdom), or the entire copyright, including worldwide rights? Vetter argued for worldwide rights at trial, over the strenuous objections of the music publisher.

Judge Shelly Dick of the U.S. District Court for the Middle District of Louisiana agreed. Her ruling made clear that a grant of worldwide rights under a U.S. contract is subject to U.S. termination; to hold otherwise would defeat the statute’s purpose, a point that seems obvious.

I’ve known Vetter’s counsel Tim Kappel since he was a law student and have followed this case closely. Tim built a strong record in the District Court and secured a win against tough odds. MTP readers may recall our interviews with him about the case, which attracted considerable attention. Tim’s work with Cyril has energized a creator community long skeptical of the industry’s ‘U.S. rights only’ narrative—a narrative more tradition than law, an artifact of smoke-filled rooms and backroom lawyers.

The Artist Rights Institute (David Lowery, Nikki Rowling, and Chris Castle), along with allies including Abby North (daughter-in-law of the late film composer Alex North), Blake Morgan (#IRespectMusic), and Angela Rose White (daughter of the late television composer and music director David Rose), filed a brief supporting Vetter. The message is simple: Congress did not grant a second bite at half the apple. Termination rights are meant to restore the full copyright—not just fragments.

As we explained in our brief, Vetter’s original grant of rights was typical: worldwide and perpetual, sometimes described as ‘throughout the universe.’ The idea that termination lets an author reclaim only U.S. rights—leaving the rest with the publisher—is both absurd and dangerous.

This case is a wake-up call. Artists shouldn’t belong to the ‘torturable class’—doomed to accept one-sided deals as normal. Termination was Congress’s way of correcting those imbalances: a second bite at the whole apple, not just half of it.

Stay tuned—we’ll spotlight more briefs soon. Until then, here’s ours for your review.

How Google’s “AI Overviews” Product Exposes a New Frontier in Copyright Infringement and Monopoly Abuse: Lessons from the Chegg Lawsuit

In February 2025, Chegg, Inc.—a Santa Clara education technology company—filed what I think will be a groundbreaking antitrust lawsuit against Google and Alphabet over Google’s use of “retrieval augmented generation” or “RAG.” Chegg alleges that the search monopolist’s new AI-powered search product, AI Overviews, is the latest iteration of its longstanding abuse of monopoly power.

The Chegg case may be the first major legal test of how RAG tools, like those powering Google’s AI search features, can be weaponized to maintain dominance in a core market—while gutting adjacent industries.

What Is at Stake?

Chegg’s case is more than a business dispute over search traffic. It’s a critical turning point in how regulators, courts, and the public understand Google’s dual role as:
– The gatekeeper of the web, and
– The competitor to every content publisher, educator, journalist, or creator whose material feeds its systems.

According to Chegg, Google’s AI Overviews scrapes and repackages publisher content—including Chegg’s proprietary educational explanations—into neatly summarized answers, which are then featured prominently at the top of search results. These AI responses provide zero compensation and little visibility for the original source, effectively diverting traffic and revenue from publishers who are still needed to produce the underlying content. Very Googley.
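For readers who want the mechanics, here is a bare-bones sketch of the retrieval-augmented generation pattern at issue: retrieve publisher text that matches the query, then generate an answer from it. The corpus, the overlap scoring, and the generate() stub are hypothetical; this shows the general shape of the technique, not Google’s implementation.

```python
# Bare-bones RAG sketch (hypothetical corpus and scoring; not Google's code).
corpus = {
    "chegg.com/pythagorean": "The Pythagorean theorem states that a^2 + b^2 = c^2 ...",
    "example.org/sourdough": "A recipe for sourdough bread ...",
}

def retrieve(query: str, k: int = 1):
    # Rank documents by crude word overlap with the query; real systems use
    # search indexes or embeddings, but the shape is the same.
    q = set(query.lower().split())
    ranked = sorted(corpus.items(),
                    key=lambda kv: len(q & set(kv[1].lower().split())),
                    reverse=True)
    return ranked[:k]

def generate(query: str, passages) -> str:
    # Stand-in for the language-model call: retrieved publisher text goes
    # into the prompt and comes back out as a summarized answer, with no
    # need for the user to click through to the source.
    context = " ".join(text for _, text in passages)
    return f"Answer to '{query}': " + context[:60] + "..."

query = "explain the pythagorean theorem"
print(generate(query, retrieve(query)))
```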

Chegg alleges it has experienced a 49% drop in non-subscriber traffic from Google searches, directly attributing the collapse to the introduction of AI Overviews. Google, meanwhile, offers its usual “What, Me Worry?” defense and insists its AI summaries enhance the user experience and are simply the next evolution of search—not a monopoly violation. Yeah, right, that’s the ticket.

But the implications go far beyond Chegg’s case.

Monopoly Abuse, Evolved for AI

The Chegg lawsuit revives a familiar pattern from Google’s past:

– In the 2017 Google Shopping case, the EU fined Google €2.42 billion for self-preferencing—boosting its own comparison shopping service in search while demoting rivals.
– In the U.S. DOJ monopoly case (2020–2024), a federal court found that Google illegally maintained its monopoly by locking in default search placement on mobile browsers and devices.

Now with AI Overviews, Google is not just favoring its own product in the search interface—it is repurposing the product of others to power that offering. And unlike traditional links, AI Overviews can satisfy a query without any click-through, undermining both the economic incentive to create content and the infrastructure of the open web.

Critically, publishers who have opted out of AI training via robots.txt or Google’s own tools like Google-Extended find that this does not block RAG-based uses in AI Overviews—highlighting a regulatory gap that Google exploits. This should come as no surprise given Google’s long history of loophole-seeking arbitrage.
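You can see the shape of that gap with nothing more than Python’s standard robots.txt parser: disallowing the Google-Extended AI-training token does not restrict the ordinary Googlebot crawl that feeds search, and, as noted above, it is the ordinary crawl that feeds AI Overviews. The publisher policy and URL below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical publisher robots.txt: the AI-training token is blocked, but
# the regular search crawler is not.
robots_txt = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "https://publisher.example/explainer"
print(parser.can_fetch("Google-Extended", url))  # False: AI-training crawler blocked
print(parser.can_fetch("Googlebot", url))        # True: search crawl still allowed
```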

Implications Under EU Law

The European Union should take note. Article 102 of the Treaty on the Functioning of the European Union (TFEU) prohibits dominant firms from abusing their market position to distort competition. The same principles that justified the €2.42B Google Shopping fine and the 2018 €4.34B Android fine apply here:

– Leveraging dominance in general search to distort competition in education, journalism, and web publishing.
– Self-preferencing and vertical integration via AI systems that cannibalize independent businesses.
– Undermining effective consent mechanisms (like AI training opt-outs) to maintain data advantage.

Chegg’s case may be the canary in the coal mine for what’s to come globally as more AI systems become integrated into dominant platforms. Google’s strategy with AI Overviews represents not just feature innovation, but a structural shift in how monopolies operate: they no longer just exclude rivals—they absorb them.

A Revelatory Regulatory Moment

The Chegg v. Google case matters because it pushes antitrust law into the AI litigation arena. It challenges regulators to treat search-AI hybrids as more than novel tech. They are economic chokepoints that extend monopoly control through invisible algorithms and irresistible user interfaces.

Rights holders, U.S. courts, and the European Commission should watch closely: this is not just a copyright fight—it’s a competition law flashpoint.

How RAG Affects Different Media and Web Publishers

Note: RAG systems can use audiovisual content, but typically through textual intermediaries like transcripts, not by directly retrieving and analyzing raw audio/video files. But that could be next.

| Category | Examples of Rights Holders | How RAG Uses the Content |
| --- | --- | --- |
| Film Studios / Scriptwriters | Paramount, Amazon, Disney | Summarizes plots, reviews, and character arcs (e.g., ‘What happens in Oppenheimer?’) |
| Music Publishers / Songwriters | Universal, Concord, Peer / Taylor Swift / Bob Dylan / Kendrick Lamar | Displays lyrics, interpretations, and credits (e.g., ‘Meaning of Anti-Hero by Taylor Swift’) |
| News Organizations | CNN, Reuters, BBC | Generates summaries from live news feeds (e.g., ‘What’s happening in Gaza today?’) |
| Book Publishers / Authors | HarperCollins, Hachette, Macmillan | Synthesizes themes, summaries, and reviews (e.g., ‘Theme of Beloved by Toni Morrison’) |
| Gaming Studios / Reviewers | GameFAQs, IGN, Reddit | Explains gameplay strategies using fan walkthroughs (e.g., ‘How to defeat Fire Giant in Elden Ring’) |
| Visual Artists / Photojournalists | ArtNet, Museum Sites, Personal Portfolios | Explains style and methods from exhibition texts and bios (e.g., ‘How does Banksy create his art?’) |
| Podcasters / Transcription Services | Podcast transcripts, show notes | Pulls quotes and summaries from transcript databases (e.g., ‘What did Ezra Klein say about AI regulation?’) |
| Educational Publishers / EdTech | Khan Academy, Chegg, Pearson | Delivers step-by-step solutions and concept explanations (e.g., ‘Explain the Pythagorean Theorem’) |
| Science and Medical Publishers | Mayo Clinic, MedlinePlus, PubMed | Answers medical questions with clinical and scientific data (e.g., ‘Symptoms of lupus’) |