The Sync Apocalypse Case Against Suno

Democracy Made Them Do It

The new Poseidon Wave Media LLC lawsuit against Suno may become another important fair use case in generative AI music because it goes straight at the weak point Judge Vince Chhabria identified in Kadrey v. Meta regarding books: market substitution and market dilution under factor four that can trump the overused “transformative” analysis.



In Kadrey, Judge Chhabria ruled for Meta on fair use, but he did not give AI companies a clean bill of health. Quite the opposite. He suggested that generative AI training may often fail fair use where plaintiffs build a real record showing that the model floods or dilutes the market for the plaintiffs’ works. Like what is happening in real life with synthetic music, and in particular with synthetic music produced using Suno.

Aside from my being suspicious of a grown man voluntarily calling himself “Mikey”, there’s a lot to work with in the public statements of Suno CEO Mikey Shulman. In a widely panned venture capital podcast, “Mikey” argued that traditional music creation is too difficult and time-consuming for most people, claiming that “the majority of people don’t enjoy the majority of the time they spend making music.” He framed AI music generation as a way to democratize creativity by removing the need for years of practice or technical skill.

Yes, that’s right. He’s doing it for, like, democracy, you see. Just like Daniel Ek (who is currently occupying himself after Spotify with another autonomous weapon that again violates international treaties).

Shulman also acknowledged that using copyrighted works in AI training is effectively industry standard, stating that “every AI company does” copyright infringement when building generative AI systems. His comments triggered backlash from musicians, composers and industry observers who viewed the statements as dismissive of artistic labor and revealing about AI companies’ attitudes toward copyright and human creativity. We’ll come back to this bit.

When Did Noah Build the Ark? Before the flood….

But the flooding of markets using Suno is what makes Poseidon different from other cases I’ve seen so far, kind of like the eggshell skull case lawyers study on the first day of Torts. One could argue that Poseidon’s lawsuit against Suno resembles the classic “eggshell skull” rule because AI companies may be liable for the full downstream harm caused by training on copyrighted works even if they claim they did not anticipate the scale of damage. If Suno’s infringement helped create systems that flood markets and substitute for human creators, defendants take the creative marketplace “as they find it.” You could also find it reasonably foreseeable that if one AI lab’s executives knew that “everyone was doing it,” the flip side is that “everyone” can cause a good deal of market harm to everyone else.

Plaintiff Poseidon Wave Media, the entity behind the instrumental duo The American Dollar, alleges that Suno copied and ingested 236 recordings and compositions covered by 164 copyright registrations. More importantly, Poseidon alleges that its licensing revenue has fallen by nearly 80% since Suno launched.

That is not a vibes-based fair use objection. That is a market-harm theory with a ledger attached. The plaintiffs still have to prove their case, but it sounds like a pretty good starting place.

The complaint targets the precise market most vulnerable to AI substitution: sync and production music. Cinematic instrumental catalogs are valuable because they supply mood, pacing, emotional texture, and audiovisual utility. Music supervisors are concerned that fully synthetic music undermines the basic trust and clearance infrastructure on which film, television and advertising music depends. They are often pitched an apparent “artist” that turns out to be AI-generated, raising immediate clearance and provenance concerns for the supervisors, their clients and E&O insurance carriers. AI-generated tracks are fundamentally incompatible with the human authorship and rights verification required for professional sync licensing. So in addition to the human cost, there’s a broader aspect to destroying the human market: synthetic music could flood the marketplace with unverifiable works, creating legal uncertainty and making it harder for supervisors to assess ownership, permissions and creative authenticity.

Generative AI does not have to spit out an identical American Dollar track to destroy the market for American Dollar licenses. It only has to produce infinite near-substitutes at lower cost, faster speed, and no meaningful bargaining friction. That is market dilution.

That is factor four. And that is happening at a devastating rate in our business.

The Sync Apocalypse Extends the Kadrey Theory

This is also why Poseidon extends the Kadrey analysis beyond books. In the book cases, market harm may appear more abstract. In sync music, the substitution pathway is far cleaner. The buyer has a practical production need. The AI output can satisfy that need if the music supervisor looks the other way, at least for a while, particularly for commercials, “source” music, and other background uses. The original license disappears. Mikey wants you to believe that’s a good thing, because democracy.

Poseidon’s allegation that licensing income collapsed after Suno launched is therefore not just damages evidence. It may be the whole fair use fight.

Suno will likely argue transformation: the model learns from recordings to generate new outputs. But Kadrey already shows why transformation is not enough if factor four turns decisively against the defendant and the plaintiff’s lawyers put on the right case. Judge Chhabria made it clear that this observation applied broadly to all fair use cases: “Generative AI has the potential to flood the market with endless amounts of images, songs, articles, books, and more.” Kadrey v. Meta Platforms, Inc., No. 23-cv-03417-VC, slip op. at 1–2 (N.D. Cal. June 25, 2025).

That makes Poseidon dangerous for Suno. The complaint does not need to prove that every Suno output is a counterfeit. It needs to show that Suno used copyrighted works to build a machine that competes directly in the licensing market with those works it ripped off.

That is the sync apocalypse theory:

First, copy the catalog.
Then, train the machine.
Then, flood the licensing market with synthetic substitutes.
Then, tell the original musicians there is no market harm because the outputs are not exact copies. Because democracy demands it.

Factor four was built for this problem, even without the democracy part. And Poseidon may be the case that forces courts to say so. And as far as the democracy part goes, I think Mikey may have taken the wrong turn on his way to Collectivism class. In our legal tradition, there’s another idea that has far greater purchase:

“The right of property… [is] that sole and despotic dominion which one man claims and exercises… in total exclusion of the right of any other individual in the universe.”
— Sir William Blackstone, Commentaries on the Laws of England, Book II, ch. 1. 

Could Suno’s Executives Be Added Personally?

One question hovering over the Poseidon complaint is whether Suno’s executives and investors could eventually be added as individual defendants. What did they know and when did they know it?

In copyright cases, corporate officers can face personal liability where they personally participated in the infringement, directed it, authorized it, or had the right and ability to supervise the infringing conduct while receiving a financial benefit from it, as we saw in a couple of leading cases: “All persons and corporations who participate in, exercise control over or benefit from an infringement are jointly and severally liable as copyright infringers.” Gershwin Publ’g Corp. v. Columbia Artists Mgmt., Inc., 443 F.2d 1159, 1162 (2d Cir. 1971); “One who distributes a device with the object of promoting its use to infringe copyright… is liable for the resulting acts of infringement by third parties.” MGM Studios Inc. v. Grokster, Ltd., 545 U.S. 913, 936–37 (2005). See also Broad. Music, Inc. v. Hartmarx Corp., 1988 WL 128691, at *3 (N.D. Ill. Nov. 22, 1988) (“A corporate officer who directs, controls, ratifies, participates in, or is the moving force behind the infringing activity, is personally liable…”); Columbia Pictures Indus., Inc. v. Fung, 710 F.3d 1020 (9th Cir. 2013) (operator liability tied to inducement and encouragement of infringement); and then my personal favorite, Arista Records LLC v. Lime Group LLC, 784 F. Supp. 2d 398 (S.D.N.Y. 2011) (evidence of executive knowledge and encouragement relevant to secondary liability).

If Suno’s leadership approved the acquisition, copying, ingestion, or retention of copyrighted sound recordings for model training, plaintiffs may argue that the executives were not passive corporate managers. They were decision-makers in the alleged infringement pipeline.

If discovery shows that senior executives knew copyrighted commercial recordings were being copied, discussed licensing risk, chose not to license, or treated infringement exposure as a cost of doing business, the case could begin to look more like direct participation or inducement than ordinary corporate oversight. For example, Complete Music Update quotes Mikey as like “…admitting to using copyright protected music in his company’s AI training data, something that he describes as ‘stock standard’ practice that ‘every AI company does.’” He evidently said this as part of an interview he gave to the leading venture capital industry podcast The Twenty Minute VC. Now I’m not saying that statement alone is enough to close a case, but it certainly is one of those whatchamacallits, an admission against interest.


Shulman’s statement is significant because it is not merely a generalized industry observation. It is an admission by a senior corporate officer that his company Suno used copyrighted works in AI training and that the practice was understood internally at Suno as normal operating procedure. In civil discovery, that seems more than enough to justify targeted subpoenas designed to identify the scope, intent and commercial exploitation of the alleged infringement. And who else participated in the policy implementation.

Courts permit broad discovery where a plaintiff can show a reasonable basis to believe relevant evidence exists. Here, the CEO publicly acknowledged both (1) use of copyrighted music in training data and (2) awareness that such conduct implicated copyright law. The statement therefore supports discovery into knowledge, willfulness, inducement and commercial benefit under cases like Grokster, Fung, and Lime Group.

The quote particularly supports subpoenas for:

  • Training datasets and provenance records identifying sound recordings, compositions, stems, embeddings, fingerprints, metadata or source libraries used in model training;
  • Internal communications discussing ingestion of copyrighted music, licensing avoidance, fair use strategy, risk assessments or litigation exposure, including with members of the Suno board of directors;
  • Board materials and investor presentations discussing training practices, copyright risk, or competitive advantages derived from unlicensed datasets;
  • Engineering documents concerning scraping pipelines, dataset assembly, deduplication, filtering and retention of copyrighted material;
  • Financial records showing revenues, subscriptions, enterprise deals or valuations tied to models trained on copyrighted works;
  • Communications with third-party dataset providers, cloud vendors or contractors involved in obtaining or processing music files;
  • Prompt/output testing records showing whether models could reproduce recognizable musical expression, styles, voices or commercially substitutive outputs;
  • Policies regarding removal requests, provenance tracking, watermarking or rights management; and
  • Executive communications, including those involving Shulman personally, concerning decisions to proceed despite known copyright objections.

The statement also strengthens arguments for discovery into willful infringement. Saying that infringement is “stock standard” and that “every AI company does” it can be framed not as innocence, but as evidence of conscious normalization of unlawful conduct. Plaintiffs could argue this reflects industry-wide deliberate disregard for licensing obligations rather than accidental or technically unavoidable copying.

Finally, the quote helps establish proportionality. Suno itself has publicly placed copyright infringement at the center of its business model and competitive narrative. Once the CEO publicly admits the conduct, defendants have a much harder time arguing that subpoenas directed at training records, executive knowledge or dataset provenance are speculative fishing expeditions.

Naming executives can sharpen the willfulness theory. It can support discovery into board materials, investor pitches, licensing discussions, data-acquisition plans, and internal risk assessments.

These claims also may open the door to the boardroom. If discovery shows that Suno’s training strategy, licensing posture, or infringement-risk tolerance was discussed at the board level, plaintiffs may seek board materials, investor communications, voting agreements, consent rights, and other governance documents. Yes, the entire odious apparatus.

That may be exceptionally relevant and productive especially if major investors had approval rights, information rights, veto rights, or board seats tied to key business decisions. In that scenario, the inquiry may not stop with management. It could reach the investors who helped authorize, finance, or control the strategy that made the alleged infringement commercially valuable.

Public reporting identifies Menlo, Lightspeed, Matrix, Founder Collective, Nat Friedman, Daniel Gross, NVentures/Nvidia, and Hallwood Media as Suno investors. I have not found a public source confirming which, if any, hold board seats or board-observer rights. Given the size and lead-investor status of Menlo and Lightspeed, board or observer rights would be plausible and even typical, but that should be confirmed through charter documents, investor rights agreements, board minutes, cap table materials, or other discovery.

Notably, many of these same issues are already surfacing in the book publisher plus Scott Turow litigation against Meta and Mark Zuckerberg, including the allegations raised in the Elsevier-related AI copyright cases and the broader author lawsuits against Meta.

Plaintiffs in those matters have increasingly focused not only on the existence of infringing training datasets, but on executive-level awareness, internal discussions concerning licensing risk, data acquisition strategy, and decisions to proceed despite known copyright concerns.

The same dynamics may emerge in the Suno litigation if discovery reveals board-level discussions, investor oversight, or strategic decisions concerning whether copyrighted music catalogs would be licensed, copied without permission, or treated as a litigation risk worth taking.

The Potential Shareholder Suit

Developing a detailed factual record against Mikey Shulman (or Mark Zuckerberg) could significantly increase the risk of a future shareholder derivative suit because it potentially transforms the case from “the company made aggressive legal bets” into “management knowingly exposed the company to massive liability while failing to fulfill fiduciary duties.”

A derivative case would likely center on fiduciary duty theories under Delaware law — particularly the duties of loyalty, oversight (Caremark), disclosure, and good faith.

The pathway looks something like this:

  1. Public admissions establish scienter groundwork

    Shulman’s statements that using copyrighted works was “stock standard” and that “every AI company does” infringement could be framed as evidence that senior management understood the conduct implicated copyright law from the outset. Plaintiffs in a derivative action would argue this was not inadvertent infringement or a technical edge case, but a conscious business strategy. Of course, it would also be interesting to see if we could find out exactly what made Mikey say such things? Any meetings he’d like to discuss? All like very democratic, I’m like so sure.
  2. Discovery in copyright litigation creates the evidentiary record

    The underlying copyright cases are what really matter. If discovery uncovers:
    • internal discussions acknowledging piracy risks,
    • deliberate avoidance of licensing,
    • executive-level approval of infringing datasets,
    • warnings from counsel or employees, or
    • efforts to conceal provenance,

    then plaintiffs’ firms would likely use that material to argue the board failed to exercise oversight or knowingly permitted unlawful conduct.
  3. Massive enterprise risk can trigger Caremark-style claims

    Delaware courts increasingly recognize that boards must monitor “mission critical” legal risks. For Suno, copyright compliance is not peripheral — it is existential. The entire company depends on ingesting copyrighted music. If plaintiffs could show there were inadequate controls over training data provenance, licensing, or infringement risk, they could argue the board ignored core compliance obligations.
  4. Investor disclosures become vulnerable

    Once litigation and discovery mature, shareholders may ask whether fundraising materials accurately described legal risks. If management portrayed datasets as compliant, transformative, or low-risk while internally acknowledging likely infringement, that creates exposure around disclosure duties and securities-related claims.
  5. Personal enrichment allegations amplify pressure

    Derivative plaintiffs often focus on:
    • executive compensation,
    • liquidity events,
    • fundraising rounds,
    • valuation increases, and
    • insider sales.
    The theory becomes: executives increased enterprise value through unlawful conduct while externalizing legal risk onto the corporation and shareholders.
  6. Insurance and indemnification issues emerge

    Findings of willful misconduct or bad faith can create disputes over D&O insurance coverage and indemnification rights. That dramatically increases settlement pressure and board conflict concerns.

The important strategic point is that copyright plaintiffs do not need to bring the derivative suit themselves. They only need to build the factual record. Once discovery produces emails, board materials, or executive communications suggesting knowing infringement or oversight failures, shareholder firms may step in independently.

That is why executive statements like, you know, matter so much. Public comments can later be connected to internal documents to argue that management knew exactly what it was doing, understood the legal exposure, and proceeded anyway because rapid AI scaling and market capture were prioritized over licensing compliance.

And who wants to bet that the board was leading the charge?

The Constitutional Shadow of the White House AI Framework: Law Without Law

One of the most important things about the White House AI framework released last week is what it is not.

It is not an executive order.

That may sound like a technical distinction, but it is doing an enormous amount of work here. Because by avoiding the form of an executive order, the framework avoids something even more important: Judicial review.

An executive order that attempted to declare AI training on copyrighted works lawful—or to constrain Congress from acting—would immediately invite challenge in the very judicial branch the framework also seeks to influence. Oh, that would be fun.

It would raise Administrative Procedure Act questions. It would trigger separation-of-powers scrutiny. It would likely be litigated within days.

This framework does none of that and is not susceptible to judicial challenge.

Instead, it achieves much of the same practical effect—shaping legal outcomes, constraining policy space, and signaling preferred doctrine—without creating a justiciable action. It is, in effect, law without law, and outcomes by positioning. Silicon Valley’s favorite.

Takings by Policy, Not Statute

Start with the most obvious constitutional issue: the Takings Clause of the Fifth Amendment to the U.S. Constitution, which states that “private property [cannot] be taken for public use, without just compensation.”

Copyright is a form of property. That is not controversial. It is a statutory property right grounded in the Constitution’s Intellectual Property Clause, and it carries exclusive rights that have long been understood as economically valuable.

Now consider what the White House framework does.

It declares AI training (the mass, indiscriminate ingestion of copyrighted works) to be lawful. It does so without requiring compensation. And it does so in a context where the resulting systems can substitute for, or diminish the market for, the original works.

If that official policy position of the Executive Branch were enacted into law, it would raise a straightforward question:

Has the government authorized the use of private property for public and commercial purposes without compensation? Or more directly, has the Executive Branch just announced that it will not prosecute that indiscriminate ingestion for any reason? Can we expect to see amicus briefs from the Solicitor General opposing copyright owners pursuing their rights in court?

That is sounding a lot like a taking.

But because the framework is not law, it avoids the moment where that question must be answered. It does not extinguish rights formally. It renders them economically hollow in practice, while leaving the formal structure intact.

That is the key move: functional elimination without formal abolition.

Ex Post Facto in Everything but Name

The framework also raises a second, less discussed issue: the logic of ex post facto lawmaking.

The Ex Post Facto Clause technically applies to criminal law. But the underlying principle is broader: the government should not change the legal consequences of past conduct to benefit favored actors or disadvantage others. Of course, copyright owners raising this argument will have the Spotify retroactive safe harbor in Title I of the Music Modernization Act thrown in their face as rank hypocrisy, which they would richly deserve, although as any 10-year-old can tell you, two wrongs don’t make a right, at least in theory.

Here, the timeline matters.

  • Massive datasets have already been scraped.
  • Models have already been trained.
  • The conduct that enabled this may, in many instances, have been legally questionable—and in cases of willful infringement, potentially criminal under federal copyright law. Or if you listen to me, the largest case of criminal copyright infringement in history.

Now comes the policy, years after the fact and in the face of over 150 AI lawsuits all based on copyright infringement to one degree or another:

Training is lawful.

That looks less like interpretation and more like retroactive validation.

Even if framed as civil doctrine, the effect is similar to retroactive decriminalization of conduct tied to vested rights. It sends a clear message: conduct that may have been unlawful when undertaken will be treated as lawful because it is now economically indispensable to the broligarchs.

That is not how the rule of law is supposed to work.

Separation of Powers by Suggestion

The framework’s treatment of Congress is equally striking. It does not say Congress lacks authority to legislate. The President cannot say that. Well…he can, but there’s no foundation for the statement. The Constitution is clear: Congress defines copyright.

Instead, the framework says Congress should not act in ways that would affect judicial resolution of the training question.

That is an unusual formulation. Congress legislates in areas under litigation all the time. Indeed, it is often expected to clarify statutory ambiguity.

What the framework is doing is more subtle: It is attempting to shape the legislative field without formally constraining it.

And it pairs that with an implicit second message:

  • Legislation that restricts training or mandates licensing is inconsistent with executive policy.
  • Such legislation is therefore unlikely to be signed by the President. So why bring it?

That is a veto signal—delivered without the political cost of an actual veto.

Judicial Signaling Without Command

The same dynamic applies to the courts.

The framework claims to “defer” to the judiciary. But it simultaneously declares a preferred outcome: training is lawful.

That is not deference. That is signaling.

Judges are, of course, independent. But they do not operate in a vacuum. They are aware of executive priorities, legislative inaction, and market realities. When all three align around a single policy direction, it creates an interpretive gravitational force that is difficult to ignore.

And the signal travels further.

To lawyers.
To regulators.
To anyone whose career may intersect with executive appointment.

It normalizes what counts as a “reasonable” position within the current policy environment.

Prosecutorial Silence as Policy

There is also a more immediate, practical consequence.

While the framework does not have the force of law, it functions as an indirect directive to the Department of Justice. By declaring training lawful as a matter of policy, it signals that federal enforcement resources should not be used to pursue cases premised on the opposite view.

In effect, it tells prosecutors:

Do not spend time considering criminal enforcement for large-scale copyright violations tied to AI training. Do not spend time considering antitrust enforcement against the broligarchs. In fact, don’t spend any time prosecuting anyone regarding AI.

That matters because, for example, willful copyright infringement at scale can, in certain circumstances, give rise to criminal liability. I mean if that doesn’t, what does? Yet under this framework, even the possibility of such enforcement is quietly set aside.

This is not formal immunity. But in practice, it can look very similar.

Why “Not an Executive Order” Matters

If this were an executive order, all of these issues would be front and center:

  • Is this a taking?
  • Does it exceed executive authority?
  • Does it interfere with Congress?
  • Does it interfere with the Judiciary?

Because it is not an EO, these important issues remain in the background, present but untested.

That is the genius, and the danger, of the approach.

It allows the executive branch to:

  • Shape doctrine
  • Influence courts
  • Constrain Congress
  • Guide enforcement priorities
  • Normalize contested conduct

—all without triggering the mechanisms designed to check it.

The Constitutional Shadow

The AI framework does not violate the Constitution in any formal sense.

It does something more complicated.

It operates in the constitutional shadow—where policy can reshape rights, incentives, and expectations without ever crossing the line that would allow a court to say no.

But shadows matter.

Because by the time the law catches up—if it ever does—the world the Constitution was meant to govern and protect may already have changed.

What Would Freud Do? The Unconscious Is Not a Database — and Humans Are Not Machines

What would Freud do?

It’s a strange question to ask about AI and copyright, but a useful one. When generative-AI fans insist that training models on copyrighted works is merely “learning like a human,” they rely on a metaphor that collapses under even minimal scrutiny. Psychoanalysis—whatever one thinks of Freud’s conclusions—begins from a premise that modern AI rhetoric quietly denies: the unconscious is not a database, and humans are not machines.

As Freud wrote in The Interpretation of Dreams, “Our memory has no guarantees at all, and yet we bow more often than is objectively justified to the compulsion to believe what it says.” No AI truthiness there.

Human learning does not involve storing perfect, retrievable copies of what we read, hear, or see. Memory is reconstructive, shaped by context, emotion, repression, and time. Dreams do not replay inputs; they transform them. What persists is meaning, not a file.

AI training works in the opposite direction—obviously. Training begins with high-fidelity copying at industrial scale. It converts human expressive works into durable statistical parameters designed for reuse, recall, and synthesis for eternity. Where the human mind forgets, distorts, and misremembers as a feature of cognition, models are engineered to remember as much as possible, as efficiently as possible, and to deploy those memories at superhuman speed. Nothing like humans.

Calling these two processes “the same kind of learning” is not analogy—it is misdirection. And that misdirection matters, because copyright law was built around the limits of human expression: scarcity, imperfection, and the fact that learning does not itself create substitute works at scale.

Dream-Work Is Not a Training Pipeline

Freud’s theory of dreams turns on a simple but powerful idea: the mind does not preserve experience intact. Instead, it subjects experience to dream-work—processes like condensation (many ideas collapsed into one image), displacement (emotional significance shifted from one object to another), and symbolization (one thing representing another, allowing humans to create meaning and understanding through symbols). The result is not a copy of reality but a distorted, overdetermined construction whose origins cannot be cleanly traced.

This matters because it shows what makes human learning human. We do not internalize works as stable assets. We metabolize them. Our memories are partial, fallible, and personal. Two people can read the same book and walk away with radically different understandings, and neither “contains” the book afterward in any meaningful sense. There is no Rashomon effect for an AI.

AI training is the inverse of dream-work. It depends on perfect copying at ingestion, retention of expressive regularities across vast parameter spaces, and repeatable reuse untethered from embodiment, biography, or forgetting. If Freud’s model describes learning as transformation through loss, AI training is transformation through compression without forgetting.

One produces meaning. The other produces capacity.

The Unconscious Is Not a Database

Psychoanalysis rejects the idea that memory functions like a filing cabinet. The unconscious is not a warehouse of intact records waiting to be retrieved. Memory is reconstructed each time it is recalled, reshaped by narrative, emotion, and social context. Forgetting is not a failure of the system; it is a defining feature.

AI systems are built on the opposite premise. Training assumes that more retention is better, that fidelity is a virtue, and that expressive regularities should remain available for reuse indefinitely. What human cognition resists by design—perfect recall at scale—machine learning seeks to maximize.

This distinction alone is fatal to the “AI learns like a human” claim. Human learning is inseparable from distortion, limitation, and individuality. AI training is inseparable from durability, scalability, and reuse.

In The Divided Self, R. D. Laing rejects the idea that the mind is a kind of internal machine storing stable representations of experience. What we encounter instead is a self that exists only precariously, defined by what Laing calls “ontological security” or its absence: the sense of being real, continuous, and alive in relation to others. Experience, for Laing, is not an object that can be detached, stored, or replayed; it is lived, relational, and vulnerable to distortion. He warns repeatedly against confusing outward coherence with inner unity, emphasizing that a person may present a fluent, organized surface while remaining profoundly divided within. That distinction matters here: performance is not understanding, and intelligible output is not evidence of an interior life that has “learned” in any human sense.

Why “Unlearning” Is Not Forgetting

Once you understand this distinction, the problem with AI “unlearning” becomes obvious.

In human cognition, there is no clean undo. Memories are never stored as discrete objects that can be removed without consequence. They reappear in altered forms, entangled with other experiences. Freud’s entire thesis rests on the impossibility of clean erasure.

AI systems face the opposite dilemma. They begin with discrete, often unlawful copies, but once those works are distributed across parameters, they cannot be surgically removed with certainty. At best, developers can stop future use, delete datasets, retrain models, or apply partial mitigation techniques (none of which they are willing to even attempt). What they cannot do is prove that the expressive contribution of a particular work has been fully excised.

This is why promises (especially contractual promises) to “reverse” improper ingestion are so often overstated. The system was never designed for forgetting. It was designed for reuse.

Why This Matters for Fair Use and Market Harm

The “AI = human learning” analogy does real damage in copyright analysis because it smuggles conclusions into fair-use factor one (transformative purpose and character) and obscures factor four (market harm).

Learning has always been tolerated under copyright law because learning does not flood markets. Humans do not emerge from reading a novel with the ability to generate thousands of competing substitutes at scale. Generative models do exactly that—and only because they are trained through industrial-scale copying.

Copyright law is calibrated to human limits. When those limits disappear, the analysis must change with them. Treating AI training as merely “learning” collapses the very distinction that makes large-scale substitution legally and economically significant.

The Pensieve Fallacy

There is a world in which minds function like databases. It is a fictional one.

In Harry Potter and the Goblet of Fire, wizards can extract memories, store them in vials, and replay them perfectly using a Pensieve. Memories in that universe are discrete, stable, lossless objects. They can be removed, shared, duplicated, and inspected without distortion. As Dumbledore explained to Harry, “I use the Pensieve. One simply siphons the excess thoughts from one’s mind, pours them into the basin, and examines them at one’s leisure. It becomes easier to spot patterns and links, you understand, when they are in this form.”

That is precisely how AI advocates want us to imagine learning works.

But the Pensieve is magic because it violates everything we know about human cognition. Real memory is not extractable. It cannot be replayed faithfully. It cannot be separated from the person who experienced it. Arguably, Freud’s work exists because memory is unstable, interpretive, and shaped by conflict and context.

AI training, by contrast, operates far closer to the Pensieve than to the human mind. It depends on perfect copies, durable internal representations, and the ability to replay and recombine expressive material at will.

The irony is unavoidable: the metaphor that claims to make AI training ordinary only works by invoking fantasy.

Humans Forget. Machines Remember.

Freud would not have been persuaded by the claim that machines “learn like humans.” He would have rejected it as a category error. Human cognition is defined by imperfection, distortion, and forgetting. AI training is defined by reproduction, scale, and recall.

To believe AI learns like a human, you have to believe humans have Pensieves. They don’t. That’s why Pensieves appear in Harry Potter—not neuroscience, copyright law, or reality.

Deduplication and Discovery: The Smoking Gun in the Machine

WINSTON

“Wipe up all those little pieces of brains and skull”

From Pulp Fiction, screenplay by Quentin Tarantino and Roger Avary

Deduplication, the process of removing identical or near-identical content from AI training data, is a critical yet often overlooked indicator that AI platforms actively monitor and curate their training sets. It is exactly the kind of process one would expect given the “scrape, ready, aim” business practices of AI platforms that have ready access to large amounts of fairly high-quality data from users of other products placed into commerce by their business affiliates or confederates.

For example, Google Gemini could have access to Gmail, YouTube, at least “publicly available” Google Docs, Google Translate, or Google for Education, and then of course one of the great scams of all time, Google Books. Microsoft uses Bing searches, MSN browsing, the consumer Copilot experience, and ad interactions. Amazon uses Alexa prompts, Facebook uses “public” posts and so on.

This kind of hoovering up of indiscriminate amounts of “data” in the form of your baby pictures posted on Facebook and your user-generated content on YouTube is bound to produce duplicates. After all, how many users have posted their favorite Billie Eilish or Taylor Swift music video? An AI developer doesn’t need 10,000 versions of “Shake It Off”; it probably just needs the official video. Enter deduplication, which by definition means the platform knows what it has scraped and also knows what it wants to get rid of.

“Get rid of” is a relative concept. In many systems, particularly storage environments like backup servers or object stores, deduplication means keeping only one physical copy of a file. Any other instances of that data don’t get stored again; instead, they’re represented by pointers to the original copy. When this happens in real time, as the data is written, it is known as inline deduplication, and it minimizes storage waste without actually deleting anything of functional value. It requires knowing what you have, knowing you have more than one version of the same thing, and being able to tell the system where to look to find the “original” copy without disturbing the process and burning compute inefficiently.
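To make the pointer idea concrete, here is a minimal sketch of inline deduplication in Python. It is a toy built on assumptions, not any platform’s actual storage layer: the class name InlineDedupStore, its methods, and the file paths are all hypothetical. The point it illustrates is simply that the store must hash every item as it arrives and keep a record of every name that points at the single surviving copy.

```python
import hashlib


class InlineDedupStore:
    """Toy content-addressed store (hypothetical): one physical copy per unique byte string."""

    def __init__(self):
        self.blobs = {}     # digest -> bytes (the single physical copy)
        self.pointers = {}  # logical name -> digest (pointer to that copy)

    def put(self, name, data):
        """Store data under name; return True only if a new physical copy was written."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data
        # Duplicates are never stored twice, but the pointer is always recorded.
        self.pointers[name] = digest
        return is_new

    def get(self, name):
        """Follow the pointer back to the one stored copy."""
        return self.blobs[self.pointers[name]]

    def names_for(self, data):
        """List every logical name that points at this content."""
        digest = hashlib.sha256(data).hexdigest()
        return sorted(n for n, d in self.pointers.items() if d == digest)


store = InlineDedupStore()
video = b"official music video bytes"  # stand-in content for illustration
store.put("uploads/user_a/video.mp4", video)
store.put("uploads/user_b/reupload.mp4", video)
```

After the two uploads, the store holds one blob and two pointers, and names_for(video) lists both upload paths. Even in this toy, the system cannot deduplicate without also being able to say who supplied each copy.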

In other cases, such as post-process deduplication, the system stores data initially, then later scans for and eliminates redundancies. Again, the AI platform knows there are two or more versions of the same thing, say the book Being and Nothingness, knows where to find the copies and has been trained to keep only one version. Even here, the duplicates may not be permanently erased—they might be archived, versioned, or logged for auditing, compliance, or reconstruction purposes.
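A post-process pass of the kind described above might look something like the following Python sketch. It assumes exact-match hashing on text and uses hypothetical source labels; the function name post_process_dedup and the shape of the audit log are illustrative, not drawn from any real pipeline. What matters for discovery is the by-product: a log of what was dropped and which retained copy it duplicated.

```python
import hashlib


def post_process_dedup(records):
    """Scan already-stored (source, text) records and drop exact repeats.

    Returns (kept, audit_log). The audit_log layout is a hypothetical
    illustration of the kind of artifact such a pass can leave behind.
    """
    first_seen = {}  # digest -> source of the copy that was kept
    kept = []
    audit_log = []
    for source, text in records:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in first_seen:
            audit_log.append(
                {"dropped": source, "duplicate_of": first_seen[digest], "sha256": digest}
            )
        else:
            first_seen[digest] = source
            kept.append((source, text))
    return kept, audit_log


# Hypothetical crawl sources holding the same text twice.
records = [
    ("crawl-a/sartre.txt", "Being and Nothingness, full text"),
    ("crawl-b/mirror.txt", "Being and Nothingness, full text"),
    ("crawl-c/other.txt", "Some other document"),
]
kept, audit_log = post_process_dedup(records)
```

Here the mirror copy is dropped from the kept set, but the log still records that it existed, where it came from, and which copy it matched.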

In AI training contexts, deduplication usually means removing redundant examples from the training set, both to save compute and to reduce the memorization of repeated material that heightens copyright risk. The duplicate content may be discarded from the training pipeline but often isn’t destroyed. Instead, AI companies may retain it in a separate filtered corpus or keep hashed fingerprints to ensure future models don’t retrain on the same material unknowingly.

So they know what they have, and likely know where it came from. They just don’t want to tell any plaintiffs.
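The “hashed fingerprints” idea can be sketched in a few lines of Python. This is an assumption-laden illustration, not a description of any company’s tooling: the fingerprint function, the FingerprintFilter class, and the normalization rule (lowercasing and collapsing whitespace) are all hypothetical choices. It shows that even when the text itself is discarded, a persistent set of fingerprints remains that can be checked against a known work.

```python
import hashlib


def fingerprint(text):
    """Hash a lightly normalized form of the text (normalization rule is an assumption)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class FingerprintFilter:
    """Hypothetical filter that remembers fingerprints, not the documents themselves."""

    def __init__(self):
        self.seen = set()

    def admit(self, text):
        """Return True if the text is new; False if its fingerprint was already recorded."""
        fp = fingerprint(text)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        return True


filt = FingerprintFilter()
first = filt.admit("Being and Nothingness  chapter one")
second = filt.admit("being and nothingness chapter   one")  # same text, different spacing and case
third = filt.admit("A different document")
```

In this sketch the second submission is rejected because its fingerprint matches the first, and the retained set is itself a record of what passed through the pipeline.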

Ultimately, deduplication is less about destruction and more about optimization. It’s a way to reduce noise, save resources, and improve performance—while still allowing systems to track, reference, or even rehydrate the original data if needed.

Its existence directly undermines claims that companies are unaware of which copyrighted works were ingested. Indeed, it only makes sense that one of the hidden consequences of the indiscriminate scraping that underpins large-scale AI training is the proliferation of duplicated data. Web crawlers ingest everything they can access—news articles republished across syndicates, forum posts echoed in aggregation sites, Wikipedia mirrors, boilerplate license terms, spammy SEO farms repeating the same language over and over. Without any filtering, this avalanche of redundant content floods the training pipeline.

This is where deduplication becomes not just useful, but essential. It’s the cleanup crew after a massive data land grab. The more messy and indiscriminate the scraping, the more aggressively the model must filter for quality, relevance, and uniqueness to avoid training inefficiencies or—worse—model behaviors that are skewed by repetition. If a model sees the same phrase or opinion thousands of times, it might assume it’s authoritative or universally accepted, even if it’s just a meme bouncing around low-quality content farms.

Deduplication is sort of the Winston Wolfe of AI. And if the cleaner shows up, somebody had to order the cleanup. It is a direct response to the excesses of indiscriminate scraping. It’s both a technical fix and a quiet admission that the underlying data collection strategy is, by design, uncontrolled. But while the scraping may be uncontrolled, designed to get copies of as much of your data as they can lay hands on, even by cleverly changing their terms of use boilerplate so they can do all this under the effluvia of legality, they send in the cleaner to take care of the crime scene.

So to summarize: To deduplicate, platforms must identify content-level matches (e.g., multiple copies of Being and Nothingness by Jean-Paul Sartre). This process requires tools that compare, fingerprint, or embed full documents—meaning the content is readable and classifiable–and, oh, yes, discoverable.

Platforms may choose the ‘cleanest’ copy to keep, showing knowledge and active decision-making about which version of a copyrighted work is retained. And–big finish–removing duplicates only makes sense if operators know which datasets they scraped and what those datasets contain.

Drilling down on a platform’s deduplication tools and practices may prove up knowledge and intent to a precise degree, contradicting arguments of plausible deniability in litigation. “Johnny ate the cookies” isn’t going to fly. There’s a market-clearing level of record keeping necessary for deduping to work at all, so it’s likely that there are internal deduplication logs or tooling pipelines that are discoverable.

When AI platforms object to discovery about deduplication, plaintiffs can often overcome those objections by narrowing their focus. Rather than requesting broad details about how a model deduplicates its entire training set, plaintiffs should ask a simple, specific question: Were any of these known works—identified by title or author—deduplicated or excluded from training?

This approach avoids objections about overbreadth or burden. It reframes discovery as a factual inquiry, not a technical deep dive. If the platform claims the data was not retained, plaintiffs can ask for existing artifacts—like hash filters, logs, or manifests—or seek a sworn statement explaining the loss and when it occurred. That, in turn, opens the door to potential spoliation arguments.
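To show how small the factual inquiry really is, here is a hedged Python sketch of the question put to a manifest. Everything about the manifest is an assumption for illustration: the field names (title, author, source, action), the source labels, and the function find_named_works are hypothetical, and a real platform’s records would look different. The narrowed request still reduces to a lookup of named works against whatever artifacts exist.

```python
def find_named_works(manifest, titles=(), authors=()):
    """Return manifest rows whose title or author matches a named work (case-insensitive)."""
    wanted_titles = {t.casefold() for t in titles}
    wanted_authors = {a.casefold() for a in authors}
    return [
        row
        for row in manifest
        if row["title"].casefold() in wanted_titles
        or row["author"].casefold() in wanted_authors
    ]


# Hypothetical manifest rows; the schema is assumed, not taken from any real system.
manifest = [
    {"title": "Being and Nothingness", "author": "Jean-Paul Sartre",
     "source": "crawl-a/0001", "action": "kept"},
    {"title": "Being and Nothingness", "author": "Jean-Paul Sartre",
     "source": "crawl-b/0417", "action": "deduplicated"},
    {"title": "Unrelated Forum Post", "author": "anonymous",
     "source": "crawl-c/9932", "action": "kept"},
]
hits = find_named_works(manifest, titles=["being and nothingness"])
```

The result answers the interrogatory directly: which copies of the named work were seen, and what was done with each.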

If trade secrets are cited, plaintiffs can propose a protective order limiting access to outside counsel or experts, like we’ve done 100,000 times before in other cases. And if the defendant claims “duplicate” is too vague, plaintiffs can define it functionally: as content that’s identical or substantially similar, by hash, tokens, or vectors.
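A functional definition along those lines can be written down quite literally. The Python sketch below is one possible reading, under stated assumptions: the thresholds (0.8 and 0.9) are arbitrary illustrations, the three-word shingles are a common but not mandatory choice, and the “vector” test uses simple word-count vectors as a stand-in for the learned embeddings a production system would more likely use. The function names are hypothetical.

```python
import hashlib
import math
from collections import Counter


def same_hash(a, b):
    """Identical by hash: byte-for-byte equality of the two texts."""
    return hashlib.sha256(a.encode("utf-8")).digest() == hashlib.sha256(b.encode("utf-8")).digest()


def token_jaccard(a, b, n=3):
    """Similar by tokens: Jaccard overlap of n-word shingles (n=3 is an assumption)."""
    def shingles(text):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Similar by vectors: cosine of word-count vectors (stand-in for learned embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def is_duplicate(a, b, jaccard_min=0.8, cosine_min=0.9):
    """Functional definition: identical, or substantially similar under either measure."""
    return same_hash(a, b) or token_jaccard(a, b) >= jaccard_min or cosine(a, b) >= cosine_min


original = "the quick brown fox jumps over the lazy dog"
recased = "The quick brown fox jumps over the lazy dog"
unrelated = "an entirely different sentence about copyright discovery"
```

In this sketch, the recased copy fails the exact-hash test but is still caught by the token and vector tests, while the unrelated sentence fails all three. A definition built this way gives the defendant little room to argue that “duplicate” is too vague to answer.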

Most importantly, deduplication is relevant. If a platform identified a plaintiff’s work and trained on it anyway, that speaks to volitional use, copying, and lack of care, all key issues in copyright and fair use analysis. And if they lied about it, particularly to the court: Helloooooo Harper & Row. Discovery requests that are focused, tailored, and anchored in specific works stand a far better chance of surviving objections and yielding meaningful evidence.