What Would Freud Do? The Unconscious Is Not a Database — and Humans Are Not Machines

What would Freud do?

It’s a strange question to ask about AI and copyright, but a useful one. When generative-AI fans insist that training models on copyrighted works is merely “learning like a human,” they rely on a metaphor that collapses under even minimal scrutiny. Psychoanalysis—whatever one thinks of Freud’s conclusions—begins from a premise that modern AI rhetoric quietly denies: the unconscious is not a database, and humans are not machines.

As Freud wrote in The Interpretation of Dreams, “Our memory has no guarantees at all, and yet we bow more often than is objectively justified to the compulsion to believe what it says.” No AI truthiness there.

Human learning does not involve storing perfect, retrievable copies of what we read, hear, or see. Memory is reconstructive, shaped by context, emotion, repression, and time. Dreams do not replay inputs; they transform them. What persists is meaning, not a file.

AI training works in the opposite direction—obviously. Training begins with high-fidelity copying at industrial scale. It converts human expressive works into durable statistical parameters designed for reuse, recall, and synthesis for eternity. Where the human mind forgets, distorts, and misremembers as a feature of cognition, models are engineered to remember as much as possible, as efficiently as possible, and to deploy those memories at superhuman speed. Nothing like humans.

Calling these two processes “the same kind of learning” is not analogy—it is misdirection. And that misdirection matters, because copyright law was built around the limits of human expression: scarcity, imperfection, and the fact that learning does not itself create substitute works at scale.

Dream-Work Is Not a Training Pipeline

Freud’s theory of dreams turns on a simple but powerful idea: the mind does not preserve experience intact. Instead, it subjects experience to dream-work—processes like condensation (many ideas collapsed into one image), displacement (emotional significance shifted from one object to another), and symbolization (one thing representing another, allowing humans to create meaning and understanding through symbols). The result is not a copy of reality but a distorted, overdetermined construction whose origins cannot be cleanly traced.

This matters because it shows what makes human learning human. We do not internalize works as stable assets. We metabolize them. Our memories are partial, fallible, and personal. Two people can read the same book and walk away with radically different understandings, and neither “contains” the book afterward in any meaningful sense. There is no Rashomon effect for an AI.

AI training is the inverse of dream-work. It depends on perfect copying at ingestion, retention of expressive regularities across vast parameter spaces, and repeatable reuse untethered from embodiment, biography, or forgetting. If Freud’s model describes learning as transformation through loss, AI training is transformation through compression without forgetting.

One produces meaning. The other produces capacity.

The Unconscious Is Not a Database

Psychoanalysis rejects the idea that memory functions like a filing cabinet. The unconscious is not a warehouse of intact records waiting to be retrieved. Memory is reconstructed each time it is recalled, reshaped by narrative, emotion, and social context. Forgetting is not a failure of the system; it is a defining feature.

AI systems are built on the opposite premise. Training assumes that more retention is better, that fidelity is a virtue, and that expressive regularities should remain available for reuse indefinitely. What human cognition resists by design—perfect recall at scale—machine learning seeks to maximize.

This distinction alone is fatal to the “AI learns like a human” claim. Human learning is inseparable from distortion, limitation, and individuality. AI training is inseparable from durability, scalability, and reuse.

In The Divided Self, R. D. Laing rejects the idea that the mind is a kind of internal machine storing stable representations of experience. What we encounter instead is a self that exists only precariously, defined by what Laing calls “ontological security” or its absence: the sense of being real, continuous, and alive in relation to others. Experience, for Laing, is not an object that can be detached, stored, or replayed; it is lived, relational, and vulnerable to distortion. He warns repeatedly against confusing outward coherence with inner unity, emphasizing that a person may present a fluent, organized surface while remaining profoundly divided within. That distinction matters here: performance is not understanding, and intelligible output is not evidence of an interior life that has “learned” in any human sense.

Why “Unlearning” Is Not Forgetting

Once you understand this distinction, the problem with AI “unlearning” becomes obvious.

In human cognition, there is no clean undo. Memories are never stored as discrete objects that can be removed without consequence. They reappear in altered forms, entangled with other experiences. Freud’s entire thesis rests on the impossibility of clean erasure.

AI systems face the opposite dilemma. They begin with discrete, often unlawful copies, but once those works are distributed across parameters, they cannot be surgically removed with certainty. At best, developers can stop future use, delete datasets, retrain models, or apply partial mitigation techniques (none of which they are willing to even attempt). What they cannot do is prove that the expressive contribution of a particular work has been fully excised.

This is why promises (especially contractual promises) to “reverse” improper ingestion are so often overstated. The system was never designed for forgetting. It was designed for reuse.

Why This Matters for Fair Use and Market Harm

The “AI = human learning” analogy does real damage in copyright analysis because it smuggles conclusions into fair-use factor one (transformative purpose and character) and obscures factor four (market harm).

Learning has always been tolerated under copyright law because learning does not flood markets. Humans do not emerge from reading a novel with the ability to generate thousands of competing substitutes at scale. Generative models do exactly that—and only because they are trained through industrial-scale copying.

Copyright law is calibrated to human limits. When those limits disappear, the analysis must change with them. Treating AI training as merely “learning” collapses the very distinction that makes large-scale substitution legally and economically significant.

The Pensieve Fallacy

There is a world in which minds function like databases. It is a fictional one.

In Harry Potter and the Goblet of Fire, wizards can extract memories, store them in vials, and replay them perfectly using a Pensieve. Memories in that universe are discrete, stable, lossless objects. They can be removed, shared, duplicated, and inspected without distortion. As Dumbledore explained to Harry, “I use the Pensieve. One simply siphons the excess thoughts from one’s mind, pours them into the basin, and examines them at one’s leisure. It becomes easier to spot patterns and links, you understand, when they are in this form.”

That is precisely how AI advocates want us to imagine learning works.

But the Pensieve is magic because it violates everything we know about human cognition. Real memory is not extractable. It cannot be replayed faithfully. It cannot be separated from the person who experienced it. Arguably, Freud’s work exists because memory is unstable, interpretive, and shaped by conflict and context.

AI training, by contrast, operates far closer to the Pensieve than to the human mind. It depends on perfect copies, durable internal representations, and the ability to replay and recombine expressive material at will.

The irony is unavoidable: the metaphor that claims to make AI training ordinary only works by invoking fantasy.

Humans Forget. Machines Remember.

Freud would not have been persuaded by the claim that machines “learn like humans.” He would have rejected it as a category error. Human cognition is defined by imperfection, distortion, and forgetting. AI training is defined by reproduction, scale, and recall.

To believe AI learns like a human, you have to believe humans have Pensieves. They don’t. That’s why Pensieves appear in Harry Potter—not neuroscience, copyright law, or reality.

Deduplication and Discovery: The Smoking Gun in the Machine

WINSTON

“Wipe up all those little pieces of brains and skull”

From Pulp Fiction, screenplay by Quentin Tarantino and Roger Avary

Deduplication, the process of removing identical or near-identical content from AI training data, is a critical yet often overlooked indicator that AI platforms actively monitor and curate their training sets. It is exactly the process one would expect given the “scrape, ready, aim” business practices of AI platforms that have ready access to large amounts of fairly high-quality data from users of other products placed into commerce by their business affiliates or confederates.

For example, Google Gemini could have access to Gmail, YouTube, at least “publicly available” Google Docs, Google Translate, or Google for Education, and then of course one of the great scams of all time, Google Books. Microsoft uses Bing searches, MSN browsing, the consumer Copilot experience, and ad interactions. Amazon uses Alexa prompts, Facebook uses “public” posts, and so on.

This kind of indiscriminate hoovering up of “data,” in the form of your baby pictures posted on Facebook and your user-generated content on YouTube, is bound to produce duplicates. After all, how many users have posted their favorite Billie Eilish or Taylor Swift music video? AI doesn’t need 10,000 versions of “Shake It Off”; it probably just needs the official video. Enter deduplication, which by definition means the platform knows what it has scraped and also knows what it wants to get rid of.

“Get rid of” is a relative concept. In many systems, particularly storage environments like backup servers or object stores, deduplication means keeping only one physical copy of a file. Any other instances of that data don’t get stored again; instead, they’re represented by pointers to the original copy. When this happens in real time, as data is written, it is known as inline deduplication, and it minimizes storage waste without actually deleting anything of functional value. It requires knowing what you have, knowing you have more than one version of the same thing, and being able to tell the system where to find the “original” copy without disturbing the process or burning compute inefficiently.
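The pointer mechanism described above can be illustrated with a minimal sketch. It is not any platform's actual system: the `DedupStore` class, its method names, and the file names are hypothetical, and SHA-256 is simply assumed as the content hash. The point is that even a toy version has to keep an index of every item and the content it points to.

```python
import hashlib


class DedupStore:
    """Toy store (hypothetical): one physical copy per unique content."""

    def __init__(self):
        self.blobs = {}     # digest -> bytes (the single stored copy)
        self.pointers = {}  # item name -> digest (pointer to that copy)

    def put(self, name: str, data: bytes) -> bool:
        """Store data under name; return True if the content was new."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data   # first copy is physically stored
        self.pointers[name] = digest    # later copies only get a pointer
        return is_new

    def get(self, name: str) -> bytes:
        """Follow the pointer back to the stored copy."""
        return self.blobs[self.pointers[name]]

    def names_for(self, data: bytes) -> list[str]:
        """List every name that points at this exact content."""
        digest = hashlib.sha256(data).hexdigest()
        return sorted(n for n, d in self.pointers.items() if d == digest)


store = DedupStore()
store.put("upload_a.txt", b"the same video description")
store.put("upload_b.txt", b"the same video description")
store.put("upload_c.txt", b"something else")
```

Here `names_for` answers, for any given content, which uploads contained it. That lookup comes for free from the bookkeeping the store already needs.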

In other cases, such as post-process deduplication, the system stores data initially, then later scans for and eliminates redundancies. Again, the AI platform knows there are two or more versions of the same thing, say the book Being and Nothingness, knows where to find the copies, and is configured to keep only one version. Even here, the duplicates may not be permanently erased: they might be archived, versioned, or logged for auditing, compliance, or reconstruction purposes.
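A post-process pass of this kind might look roughly like the sketch below. The record layout, the source labels, and the log fields are all assumptions made for illustration, and exact matching by SHA-256 stands in for whatever comparison a real pipeline uses.

```python
import hashlib


def post_process_dedup(records):
    """Scan already-stored (source, text) records. Keep the first copy of
    each text and log every later copy as a duplicate of the kept one."""
    first_seen = {}  # digest -> source of the copy that was kept
    kept, log = [], []
    for source, text in records:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in first_seen:
            # Dropped from the output, but recorded in the audit log.
            log.append({"dropped": source,
                        "kept": first_seen[digest],
                        "sha256": digest})
        else:
            first_seen[digest] = source
            kept.append((source, text))
    return kept, log


# Hypothetical crawl records; the text is placeholder content.
records = [
    ("crawl/site-a", "Placeholder text standing in for a book chapter."),
    ("crawl/site-b", "An unrelated forum post."),
    ("crawl/mirror-of-a", "Placeholder text standing in for a book chapter."),
]
kept, log = post_process_dedup(records)
```

In this sketch the log entry names both the dropped copy and the copy that was retained, which is the sort of artifact the paragraph above has in mind.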

In AI training contexts, deduplication usually means removing redundant examples from the training set, which improves training efficiency and reduces the chance that a model memorizes and regurgitates repeated text (and, with it, copyright risk). The duplicate content may be discarded from the training pipeline but often isn’t destroyed. Instead, AI companies may retain it in a separate filtered corpus or keep hashed fingerprints to ensure future models don’t retrain on the same material unknowingly.
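To make the idea of retained fingerprints concrete, here is a small sketch under stated assumptions: the normalization step, the function names, and the sample text are hypothetical, and real pipelines may fingerprint documents quite differently. It shows how a set of hashes can survive after the underlying text is gone and still answer whether a given document was seen.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash a case- and whitespace-normalized form of the text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Fingerprints kept after the text itself was dropped from the pipeline.
# The entry below is placeholder content, not a real document.
retained_fingerprints = {
    fingerprint("Placeholder text of a removed document."),
}


def was_seen(candidate: str) -> bool:
    """True if the candidate matches a retained fingerprint."""
    return fingerprint(candidate) in retained_fingerprints
```

Given a copy of a known work, a check like `was_seen` needs only the fingerprint set, not the original scraped files.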

So they know what they have, and likely know where it came from. They just don’t want to tell any plaintiffs.

Ultimately, deduplication is less about destruction and more about optimization. It’s a way to reduce noise, save resources, and improve performance—while still allowing systems to track, reference, or even rehydrate the original data if needed.

Its existence directly undermines claims that companies are unaware of which copyrighted works were ingested. Indeed, it only makes sense that one of the hidden consequences of the indiscriminate scraping that underpins large-scale AI training is the proliferation of duplicated data. Web crawlers ingest everything they can access—news articles republished across syndicates, forum posts echoed in aggregation sites, Wikipedia mirrors, boilerplate license terms, spammy SEO farms repeating the same language over and over. Without any filtering, this avalanche of redundant content floods the training pipeline.

This is where deduplication becomes not just useful, but essential. It’s the cleanup crew after a massive data land grab. The more messy and indiscriminate the scraping, the more aggressively the model must filter for quality, relevance, and uniqueness to avoid training inefficiencies or—worse—model behaviors that are skewed by repetition. If a model sees the same phrase or opinion thousands of times, it might assume it’s authoritative or universally accepted, even if it’s just a meme bouncing around low-quality content farms.

Deduplication is sort of the Winston Wolfe of AI. And if the cleaner shows up, somebody had to order the cleanup. It is a direct response to the excesses of indiscriminate scraping. It’s both a technical fix and a quiet admission that the underlying data collection strategy is, by design, uncontrolled. The scraping may be uncontrolled so that platforms can get copies of as much of your data as they can lay hands on, even by cleverly changing their terms-of-use boilerplate so they can do all this under the effluvia of legality, but they still send in the cleaner to take care of the crime scene.

So to summarize: To deduplicate, platforms must identify content-level matches (e.g., multiple copies of Being and Nothingness by Jean-Paul Sartre). This process requires tools that compare, fingerprint, or embed full documents—meaning the content is readable and classifiable–and, oh, yes, discoverable.

Platforms may choose the ‘cleanest’ copy to keep, showing knowledge and active decision-making about which version of a copyrighted work is retained. And–big finish–removing duplicates only makes sense if operators know which datasets they scraped and what those datasets contain.

Drilling down on a platform’s deduplication tools and practices may prove up knowledge and intent to a precise degree, contradicting arguments of plausible deniability in litigation. “Johnny ate the cookies” isn’t going to fly. Deduping cannot work at all without a baseline level of record keeping, so it’s likely that there are internal deduplication logs or tooling pipelines that are discoverable.

When AI platforms object to discovery about deduplication, plaintiffs can often overcome those objections by narrowing their focus. Rather than requesting broad details about how a model deduplicates its entire training set, plaintiffs should ask a simple, specific question: Were any of these known works—identified by title or author—deduplicated or excluded from training?

This approach avoids objections about overbreadth or burden. It reframes discovery as a factual inquiry, not a technical deep dive. If the platform claims the data was not retained, plaintiffs can ask for existing artifacts—like hash filters, logs, or manifests—or seek a sworn statement explaining the loss and when it occurred. That, in turn, opens the door to potential spoliation arguments.

If trade secrets are cited, plaintiffs can propose a protective order, limiting access to outside counsel or experts like we’ve done 100,000 times before in other cases. And if the defendant claims “duplicate” is too vague, plaintiffs can define it functionally—as content that’s identical or substantially similar, by hash, tokens, or vectors.
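A functional definition along those lines could be written down as in the sketch below. Everything in it is an assumption for illustration: the thresholds are arbitrary, word 3-grams stand in for “tokens,” and simple word-count vectors stand in for the learned embeddings a real system might use.

```python
import hashlib
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased word tokens (a stand-in for a model's tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def same_hash(a, b):
    """Identical: byte-for-byte equal content has the same SHA-256."""
    ha = hashlib.sha256(a.encode("utf-8")).digest()
    hb = hashlib.sha256(b.encode("utf-8")).digest()
    return ha == hb


def token_overlap(a, b, n=3):
    """Jaccard similarity of the two texts' word n-gram sets."""
    def shingles(t):
        toks = tokens(t)
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def vector_similarity(a, b):
    """Cosine similarity of word-count vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def is_duplicate(a, b, token_threshold=0.8, vector_threshold=0.9):
    """Duplicate if identical by hash, or similar by tokens or vectors.
    Both thresholds are hypothetical and would be set by agreement."""
    return (same_hash(a, b)
            or token_overlap(a, b) >= token_threshold
            or vector_similarity(a, b) >= vector_threshold)


a = "The quick brown fox jumps over the lazy dog."
b = "the quick brown fox jumps over the lazy dog"
c = "An entirely different sentence about copyright law."
```

Texts `a` and `b` differ in capitalization and punctuation, so their hashes differ, yet the token and vector tests still treat them as the same content. Text `c` fails all three tests.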

Most importantly, deduplication is relevant. If a platform identified a plaintiff’s work and trained on it anyway, that speaks to volitional use, copying, and lack of care, all key issues in copyright and fair use analysis. And if they lied about it, particularly to the court: Helloooooo Harper & Row. Discovery requests that are focused, tailored, and anchored in specific works stand a far better chance of surviving objections and yielding meaningful evidence.