Sony’s AI Music Attribution Tool: What It Actually Does (and What It Doesn’t)

As generative music systems like Suno and Udio move into the center of copyright debates, one question keeps coming up: Can we actually tell which songs influenced an AI-generated track? And if so, can that determination then be used in a host of other processes, like royalty payments?

Recently, a number of people have pointed to research from Sony AI as evidence that the answer might be yes. Sony has publicly discussed work on tools designed to analyze the relationship between training data and AI-generated music outputs.

But the reality is a little more nuanced. Sony’s work is interesting and potentially important—but it is often misunderstood. What Sony has described is not a magic detector that can listen to a generated song and instantly reveal every recording the model trained on.

Instead, Sony is describing something more modest—and in some ways more useful.

Let’s unpack what the technology appears to do right now.

Two Problems Sony Is Trying to Solve

Sony AI has publicly discussed research in two related areas.

The first is training-data attribution. This means trying to estimate which recordings in a model’s training dataset influenced a generated output.

The second is musical similarity or version matching. This involves detecting when two pieces of music share meaningful musical material even if they are not exact copies of each other.

Sony has framed both efforts as research directions rather than a finished commercial product. In other words, this is still a developing technical approach, not a turnkey system that can produce definitive copyright answers.

Training Data Attribution in Plain English

The most relevant Sony work is a research project titled Large-Scale Training Data Attribution for Music Generative Models via Unlearning.

That title sounds intimidating, but the basic idea is fairly intuitive. It also signals that the project sits within the broader academic field of machine unlearning.

The system does not operate like Shazam. It does not simply listen to an AI-generated song and say:

“This track was trained on Song X, Song Y, and Song Z.”

Instead, the approach works more like this.

Imagine you already know—or at least suspect—which recordings were used to train the model. You have a candidate set of training tracks.

The system then asks:

Among these training recordings, which ones seem most likely to have influenced this generated output?

In other words, the system ranks influence among known candidates.

The research approach borrows from an area of machine learning called machine unlearning, which studies how to remove, and by extension measure, the effect of particular training examples on a model’s behavior. In simplified terms, researchers can test how the model behaves when certain training examples are removed or adjusted. If the output changes meaningfully, that suggests those examples had measurable influence.
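To make that concrete, here is a toy Python sketch of influence ranking among known candidates. Everything in it is hypothetical: the "model" is just a dictionary of per-candidate weights, and loss_on_output and unlearn are crude stand-ins for the real measurements, which in Sony's research involve trained neural networks at scale.

```python
# Toy illustration of influence ranking among known training candidates.
# The "model" is just a dict of per-candidate weights; loss_on_output and
# unlearn are crude stand-ins for the real measurements used in the research.

def loss_on_output(model, _generated_track):
    # Toy stand-in: pretend the model fits the output better as learned weight grows.
    return 1.0 / (1.0 + sum(model.values()))

def unlearn(model, candidate):
    # Toy stand-in for machine unlearning: remove one candidate's contribution.
    return {k: v for k, v in model.items() if k != candidate}

def rank_influence(model, generated_track, candidates):
    base = loss_on_output(model, generated_track)
    scores = {}
    for candidate in candidates:
        ablated = unlearn(model, candidate)
        # A larger loss increase after unlearning suggests stronger influence.
        scores[candidate] = loss_on_output(ablated, generated_track) - base
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

toy_model = {"Song X": 3.0, "Song Y": 0.5, "Song Z": 0.1}
print(rank_influence(toy_model, "generated_track.wav", list(toy_model)))
```

The ranking only ever compares the candidates you feed it, which is exactly the point: it estimates relative influence among known tracks rather than discovering unknown ones.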

The important point is that this is an influence-ranking tool, not a forensic detector.

It tries to answer:

“Which of these known training tracks mattered most?”

Not:

“Tell me every song the model was trained on.”

Sony’s Other Idea: Smarter Music Comparison

Sony has also described work on musical similarity detection.

Traditional audio fingerprinting systems—like those used by Shazam or Audible Magic—are very good at identifying identical recordings. If you upload the same song or a slightly altered version, the system can match it.

But generative AI raises a different problem. An AI output might resemble a song musically without copying the recording itself.

Sony’s research tries to detect those kinds of relationships.

For example, a system might notice that two tracks share melodic fragments, rhythmic patterns, harmonic progressions, or musical phrases even if the arrangement, production, or instrumentation is different.

In plain English, this kind of tool tries to answer a different question:

“Are these two pieces of music related in substance?”

Not:

“Are they the exact same recording?”
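As an illustration only, and not Sony's actual method, here is a small Python sketch of what substance-level comparison can mean: it compares two melodies as pitch-interval n-grams rather than raw audio, so a transposed or re-arranged version still overlaps with the original. The note sequences are assumed inputs; a real system would extract them, or richer features, from the audio itself.

```python
# Hypothetical sketch: compare two melodies by shared pitch-interval n-grams,
# so the match survives transposition and changes in arrangement.
# Note sequences are assumed inputs; real systems would extract features from audio.

def interval_ngrams(midi_notes, n=4):
    intervals = [b - a for a, b in zip(midi_notes, midi_notes[1:])]
    return {tuple(intervals[i:i + n]) for i in range(len(intervals) - n + 1)}

def melodic_overlap(notes_a, notes_b, n=4):
    """Jaccard overlap of interval n-grams: 0.0 = unrelated, 1.0 = identical contour."""
    a, b = interval_ngrams(notes_a, n), interval_ngrams(notes_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

original = [60, 62, 64, 65, 67, 65, 64, 62, 60]   # a simple melody
transposed = [note + 5 for note in original]       # same tune, different key
unrelated = [60, 67, 59, 72, 61, 70, 58, 73, 62]
print(melodic_overlap(original, transposed))       # high overlap (1.0)
print(melodic_overlap(original, unrelated))        # low overlap
```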

The Big Limitation: You Still Need the Training Dataset

Here’s the key limitation that often gets overlooked.

Sony’s attribution approach appears to depend on having access to the candidate training dataset.

The system works by comparing a generated output against recordings that are already known or suspected to have been used during training. It estimates influence among those candidates.

That means the system answers the question:

“Which of these training tracks influenced the output?”

But it does not answer the question:

“What unknown recordings were used to train this model?”

If the training corpus is hidden or undisclosed, the attribution system has nothing to test against.

This makes the technology conceptually similar to many machine-learning research experiments, which measure influence using known datasets. Researchers can test influence among known training examples, but they cannot reconstruct an unknown dataset from outputs alone.

What This Could Look Like in the Real World

If the training corpus were known, a practical workflow might look like this.

First, the recordings in the training corpus would be identified. Audio fingerprinting systems could match those recordings to commercial releases.

That step answers the question:

What copyrighted recordings appear in the training data?

Then an attribution tool like the one Sony describes could be used to analyze generated outputs and estimate which of those known recordings appear to have influenced them.

This would not prove copying in every case. But it could dramatically narrow the analysis—from millions of possible influences to a smaller list of likely candidates.
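Assuming both pieces exist, a fingerprint match list for the training corpus and per-track influence scores from an attribution tool, the narrowing step could be as simple as the following hypothetical sketch:

```python
# Hypothetical sketch: narrow a known training corpus to likely influences on one output.
# Both inputs are assumed: fingerprint matches (training file -> commercial release)
# and influence scores produced by an attribution tool like the one described above.

def likely_influences(fingerprint_matches, influence_scores, threshold=0.1, top_k=10):
    """Return the top-scoring, commercially identified recordings above a threshold."""
    ranked = sorted(influence_scores.items(), key=lambda kv: kv[1], reverse=True)
    shortlist = []
    for training_file, score in ranked:
        release = fingerprint_matches.get(training_file)
        if release is not None and score >= threshold:
            shortlist.append((release, score))
        if len(shortlist) >= top_k:
            break
    return shortlist

matches = {"train_0001.wav": "Artist A - Song X", "train_0002.wav": "Artist B - Song Y"}
scores = {"train_0001.wav": 0.42, "train_0002.wav": 0.03, "train_0003.wav": 0.21}
print(likely_influences(matches, scores))  # only identified, high-influence tracks survive
```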

What Sony Has Not Claimed

Sony’s public statements do not suggest that the attribution problem is solved.

Sony has not announced a system that automatically calculates track-by-track royalty payments for AI-generated songs. Nor has it described a tool that conclusively proves copying, for copyright purposes, from an AI output alone.

Instead, the work is framed as research aimed at improving transparency and accountability in generative music systems.

Why Labels Might Still Be Interested

Even with these limitations, the idea could be attractive to rights holders.

If training datasets were known, attribution tools could theoretically support new ways of analyzing how music catalogs interact with generative AI systems.

For example, such tools might help support:

  • royalty allocation models
  • influence-weighted compensation frameworks
  • catalog analytics
  • AI audit trails showing how repertoire contributes to model behavior

In other words, the technology could potentially become a measurement tool for how music catalogs influence generative systems.

What Sony Did and Did Not Do (Yet)

Sony’s work does not magically reveal every song an AI model trained on. And it does not eliminate the need to know what is in the training dataset.

Instead, its value appears to lie after the training data is known.

Once you have a candidate training corpus, tools like the ones Sony describes may help analyze which recordings influenced particular outputs.

That makes the technology best understood as a post-disclosure attribution layer, not a substitute for knowing what recordings were used in training in the first place.

Infrastructure, Not Aspiration: Why Permissioned AI Begins With a Hard Reset

Paul Sinclair’s framing of generative music AI as a choice between “open studios” and permissioned systems makes a basic category mistake. Consent is not a creative philosophy or a branding position. It is a systems constraint. You cannot “prefer” consent into existence. A permissioned system either enforces authorization at the level where machine learning actually occurs—or it does not exist at all.

That distinction matters not only for artists, but for the long-term viability of AI companies themselves. Platforms built on unresolved legal exposure may scale quickly, but they do so on borrowed time. Systems built on enforceable consent may grow more slowly at first, but they compound durability, defensibility, and investor confidence over time. Legality is not friction. It is infrastructure. It’s a real “eat your vegetables” moment.

The Great Reset

Before any discussion of opt-in, licensing, or future governance, one prerequisite must be stated plainly: a true permissioned system requires a hard reset of the model itself. A model trained on unlicensed material cannot be transformed into a consent-based system through policy changes, interface controls, or aspirational language. Once unauthorized material is ingested and used for training, it becomes inseparable from the trained model. There is no technical “undo” button.

The debate is often framed as openness versus restriction, innovation versus control. That framing misses the point. The real divide is whether a system is built to respect authorization where machine learning actually happens. A permissioned system cannot be layered on top of models trained without permission, nor can it be achieved by declaring legacy models “deprecated.” Machine learning systems do not forget unless they are reset. The purpose of a trained model is remembering—preserving statistical patterns learned from its data—not forgetting. Models persist, shape downstream outputs, and retain economic value long after they are removed from public view. Administrative terminology is not remediation.

Recent industry language about future “licensed models” implicitly concedes this reality. If a platform intends to operate on a consent basis, the logical consequence is unavoidable: permissioned AI begins with scrapping the contaminated model and rebuilding from zero using authorized data only.

Why “Untraining” Does Not Solve the Problem

Some argue that problematic material can simply be removed from an existing model through “untraining.” In practice, this is not a reliable solution. Modern machine-learning systems do not store discrete copies of works; they encode diffuse statistical relationships across millions or billions of parameters. Once learned, those relationships cannot be surgically excised with confidence. It’s not Harry Potter’s Pensieve.

Even where partial removal techniques exist, they are typically approximate, difficult to verify, and dependent on assumptions about how information is represented internally. A model may appear compliant while still reflecting patterns derived from unauthorized data. For systems claiming to operate on affirmative permission, approximation is not enough. If consent is foundational, the only defensible approach is reconstruction from a clean, authorized corpus.

The Structural Requirements of Consent

Once a genuine reset occurs, the technical requirements of a permissioned system become unavoidable.

Authorized training corpus. Every recording, composition, and performance used for training must be included through affirmative permission. If unauthorized works remain, the model remains non-consensual.

Provenance at the work level. Each training input must be traceable to specific authorized recordings and compositions with auditable metadata identifying the scope of permission.

Enforceable consent, including withdrawal. Authorization must allow meaningful limits and revocation, with systems capable of responding in ways that materially affect training and outputs.

Segregation of licensed and unlicensed data. Permissioned systems require strict internal separation to prevent contamination through shared embeddings or cross-trained models.

Transparency and auditability. Permission claims must be supported by documentation capable of independent verification. Transparency here is engineering documentation, not marketing copy.

These are not policy preferences. They are practical consequences of a consent-based architecture.
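For illustration only, here is a minimal sketch of what a work-level provenance and consent record could look like as a data structure; the field names are invented, not an industry standard.

```python
# Hypothetical sketch of a work-level provenance record for an authorized training corpus.
# Field names are illustrative only, not an industry standard.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class TrainingInputRecord:
    recording_id: str                  # e.g., an ISRC for the recording
    composition_id: str                # e.g., an ISWC for the underlying work
    rights_holder: str
    license_scope: str                 # what the permission covers (training, fine-tuning, etc.)
    consent_granted_at: datetime
    consent_revoked_at: Optional[datetime] = None
    audit_notes: list[str] = field(default_factory=list)

    def is_authorized(self, as_of: datetime) -> bool:
        """A record only counts as authorized if consent was granted and not revoked."""
        if self.consent_granted_at > as_of:
            return False
        return self.consent_revoked_at is None or self.consent_revoked_at > as_of

record = TrainingInputRecord(
    recording_id="ISRC-EXAMPLE-00001",
    composition_id="ISWC-EXAMPLE-00001",
    rights_holder="Example Music Group",
    license_scope="model training only",
    consent_granted_at=datetime(2025, 1, 1),
)
print(record.is_authorized(as_of=datetime(2025, 6, 1)))  # True while consent stands
```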

The Economic Reality—and Upside—of Reset

Rebuilding models from scratch is expensive. Curating authorized data, retraining systems, implementing provenance, and maintaining compliance infrastructure all require significant investment. Not every actor will be able—or willing—to bear that cost. But that burden is not an argument against permission. It is the price of admission.

Crucially, that cost is also largely non-recurring. A platform that undertakes a true reset creates something scarce in the current AI market: a verifiably permissioned model with reduced litigation risk, clearer regulatory posture, and greater long-term defensibility. Over time, such systems are more likely to attract durable partnerships, survive scrutiny, and justify sustained valuation.

Throughout technological history, companies that rebuilt to comply with emerging legal standards ultimately outperformed those that tried to outrun them. Permissioned AI follows the same pattern. What looks expensive in the short term often proves cheaper than compounding legal uncertainty.

Architecture, Not Branding

This is why distinctions between “walled garden,” “opt-in,” or other permission-based labels tend to collapse under technical scrutiny. Whatever the terminology, a system grounded in authorization must satisfy the same engineering conditions—and must begin with the same reset. Branding may vary; infrastructure does not.

Permissioned AI is possible. But it is reconstructive, not incremental. It requires acknowledging that past models are incompatible with future claims of consent. It requires making the difficult choice to start over.

The irony is that legality is not the enemy of scale—it is the only path to scale that survives. Permission is not aspiration. It is architecture.

South Korea’s AI Action Plan and the Global Drift Toward “Use First, Pay Later”

South Korea has become the latest flashpoint in a rapidly globalizing conflict over artificial intelligence, creator rights and copyright. A broad coalition of Korean creator and copyright organizations—spanning literature, journalism, broadcasting, screenwriting, music, choreography, performance, and visual arts—has issued a joint statement rejecting the government’s proposed Korea AI Action Plan, warning that it risks allowing AI companies to use copyrighted works without meaningful permission or payment.

The groups argue that the plan signals a fundamental shift away from a permission-based copyright framework toward a regime that prioritizes AI deployment speed and “legal certainty” for developers, even if that certainty comes at the expense of creators’ control and compensation. Their statement is unusually blunt: they describe the policy direction as a threat to the sustainability of Korea’s cultural industries and pledge continued opposition unless the government reverses course.

The controversy centers on Action Plan No. 32, which promotes “activating the ecosystem for the use and distribution of copyrighted works for AI training and evaluation.” The plan directs relevant ministries to prepare amendments—either to Korea’s Copyright Act, the AI Basic Act, or through a new “AI Special Act”—that would enable AI training uses of copyrighted works without legal ambiguity.

Creators argue that “eliminating legal ambiguity” reallocates legal risk rather than resolving it. Instead of clarifying consent requirements or building licensing systems, the plan appears to reduce the legal exposure of AI developers while shifting enforcement burdens onto creators through opt-out or technical self-help mechanisms.

Similar policy patterns have emerged in the United Kingdom and India, where governments have emphasized legal certainty and innovation speed while creative sectors warn of erosion to prior-permission and fair-compensation norms. South Korea’s debate stands out for the breadth of its opposition and the clarity of the warning from cultural stakeholders.

The South Korean government avoids using the term “safe harbor,” but its plan to remove “legal ambiguity” reads like an effort to build one. The asymmetry is telling: rather than eliminating ambiguity by strengthening consent and payment mechanisms, the plan seeks to eliminate ambiguity by making AI training easier to defend as lawful—without meaningful consent or compensation frameworks. That is, in substance, a safe harbor, and a species of blanket license. The resulting “certainty” would function as a pass for AI companies, while creators are left to police unauthorized use after the fact, often through impractical opt-out mechanisms—to the extent such rights remain enforceable at all.

Deduplication and Discovery: The Smoking Gun in the Machine

WINSTON

“Wipe up all those little pieces of brains and skull”

From Pulp Fiction, screenplay by Quentin Tarantino and Roger Avary

Deduplication—the process of removing identical or near-identical content from AI training data—is a critical yet often overlooked indicator that AI platforms actively monitor and curate their training sets. It is exactly the kind of process one would expect given the “scrape, ready, aim” business practices of AI platforms that have ready access to large amounts of fairly high-quality data from users of other products placed into commerce by the platforms’ business affiliates or confederates.

For example, Google Gemini could have access to Gmail, YouTube, at least “publicly available” Google Docs, Google Translate, or Google for Education, and then of course one of the great scams of all time, Google Books. Microsoft uses Bing searches, MSN browsing, the consumer Copilot experience, and ad interactions. Amazon uses Alexa prompts, Facebook uses “public” posts, and so on.

This kind of hoovering up of indiscriminate amounts of “data” in the form of your baby pictures posted on Facebook and your user-generated content on YouTube is bound to produce duplicates. After all, how many users have posted their favorite Billie Eilish or Taylor Swift music video? AI platforms don’t need 10,000 versions of “Shake It Off”; they probably just need the official video. Enter deduplication, which by definition means the platform knows what it has scraped and also knows what it wants to get rid of.

“Get rid of” is a relative concept. In many systems—particularly in storage environments like backup servers or object stores—deduplication means keeping only one physical copy of a file. Any other instances of that data don’t get stored again; instead, they’re represented by pointers to the original copy. This approach, known as inline deduplication, happens in real time and minimizes storage waste without actually deleting anything of functional value. It requires knowing what you have, knowing you have more than one version of the same thing, and being able to tell the system where to look to find the “original” copy without disturbing the process and burning compute inefficiently.
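Here is a toy Python sketch of the inline idea, a simple content-addressable store rather than any vendor's actual implementation: each incoming item is hashed, a new physical copy is stored only if the hash is unseen, and every duplicate is recorded as a pointer to the original.

```python
# Toy sketch of inline deduplication: store one physical copy per content hash,
# and record a pointer for every duplicate instead of storing it again.
import hashlib

class InlineDedupStore:
    def __init__(self):
        self.blobs = {}      # content hash -> the single stored copy
        self.pointers = {}   # incoming item name -> content hash of the original

    def ingest(self, name, data: bytes):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = data    # first copy: actually stored
        self.pointers[name] = digest     # every copy: known and addressable
        return digest

store = InlineDedupStore()
store.ingest("upload_1.mp4", b"official music video bytes")
store.ingest("upload_2.mp4", b"official music video bytes")  # duplicate: pointer only
print(len(store.blobs), len(store.pointers))                  # 1 physical copy, 2 pointers
```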

In other cases, such as post-process deduplication, the system stores data initially, then later scans for and eliminates redundancies. Again, the AI platform knows there are two or more versions of the same thing, say the book Being and Nothingness, knows where to find the copies and has been trained to keep only one version. Even here, the duplicates may not be permanently erased—they might be archived, versioned, or logged for auditing, compliance, or reconstruction purposes.

In AI training contexts, deduplication usually means removing redundant examples from the training set, both to improve training quality and to reduce the risk of memorized, verbatim outputs that create copyright exposure. The duplicate content may be discarded from the training pipeline but often isn’t destroyed. Instead, AI companies may retain it in a separate filtered corpus or keep hashed fingerprints to ensure future models don’t retrain on the same material unknowingly.
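A hypothetical sketch of the record-keeping this implies: even when a duplicate is dropped from the training pipeline, a fingerprint and its source can survive in a manifest so future runs skip the same material. The field names are invented for illustration.

```python
# Hypothetical sketch: drop a duplicate from the training set while keeping a
# fingerprint manifest, so the system knows what it excluded and where it came from.
import hashlib, json
from datetime import datetime, timezone

seen_hashes = set()
exclusion_manifest = []   # the kind of artifact that could prove discoverable in litigation

def filter_example(text: str, source_url: str) -> bool:
    """Return True if the example is kept for training, False if deduplicated away."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        exclusion_manifest.append({
            "sha256": digest,
            "source": source_url,
            "excluded_at": datetime.now(timezone.utc).isoformat(),
            "reason": "exact duplicate",
        })
        return False
    seen_hashes.add(digest)
    return True

filter_example("Being and Nothingness ...", "https://example.org/copy-1")
filter_example("Being and Nothingness ...", "https://example.org/copy-2")
print(json.dumps(exclusion_manifest, indent=2))  # the duplicate is gone; the record remains
```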

So they know what they have, and likely know where it came from. They just don’t want to tell any plaintiffs.

Ultimately, deduplication is less about destruction and more about optimization. It’s a way to reduce noise, save resources, and improve performance—while still allowing systems to track, reference, or even rehydrate the original data if needed.

Its existence directly undermines claims that companies are unaware of which copyrighted works were ingested. Indeed, it only makes sense that one of the hidden consequences of the indiscriminate scraping that underpins large-scale AI training is the proliferation of duplicated data. Web crawlers ingest everything they can access—news articles republished across syndicates, forum posts echoed in aggregation sites, Wikipedia mirrors, boilerplate license terms, spammy SEO farms repeating the same language over and over. Without any filtering, this avalanche of redundant content floods the training pipeline.

This is where deduplication becomes not just useful, but essential. It’s the cleanup crew after a massive data land grab. The more messy and indiscriminate the scraping, the more aggressively the model must filter for quality, relevance, and uniqueness to avoid training inefficiencies or—worse—model behaviors that are skewed by repetition. If a model sees the same phrase or opinion thousands of times, it might assume it’s authoritative or universally accepted, even if it’s just a meme bouncing around low-quality content farms.

Deduplication is sort of the Winston Wolf of AI. And if the cleaner shows up, somebody had to order the cleanup. It is a direct response to the excesses of indiscriminate scraping. It’s both a technical fix and a quiet admission that the underlying data collection strategy is, by design, uncontrolled. But while the scraping may be uncontrolled, grabbing copies of as much of your data as they can lay hands on (even by cleverly changing their terms-of-use boilerplate so they can do all this under the effluvia of legality), they still send in the cleaner to take care of the crime scene.

So to summarize: To deduplicate, platforms must identify content-level matches (e.g., multiple copies of Being and Nothingness by Jean-Paul Sartre). This process requires tools that compare, fingerprint, or embed full documents—meaning the content is readable and classifiable–and, oh, yes, discoverable.

Platforms may choose the ‘cleanest’ copy to keep, showing knowledge and active decision-making about which version of a copyrighted work is retained. And–big finish–removing duplicates only makes sense if operators know which datasets they scraped and what those datasets contain.

Drilling down on a platform’s deduplication tools and practices may prove up knowledge and intent to a precise degree—contradicting arguments of plausible deniability in litigation. “Johnny ate the cookies” isn’t going to fly. There’s a market-clearing level of record-keeping necessary for deduping to work at all, so it’s likely that there are internal deduplication logs or tooling pipelines that are discoverable.

When AI platforms object to discovery about deduplication, plaintiffs can often overcome those objections by narrowing their focus. Rather than requesting broad details about how a model deduplicates its entire training set, plaintiffs should ask a simple, specific question: Were any of these known works—identified by title or author—deduplicated or excluded from training?

This approach avoids objections about overbreadth or burden. It reframes discovery as a factual inquiry, not a technical deep dive. If the platform claims the data was not retained, plaintiffs can ask for existing artifacts—like hash filters, logs, or manifests—or seek a sworn statement explaining the loss and when it occurred. That, in turn, opens the door to potential spoliation arguments.

If trade secrets are cited, plaintiffs can propose a protective order, limiting access to outside counsel or experts like we’ve done 100,000 times before in other cases. And if the defendant claims “duplicate” is too vague, plaintiffs can define it functionally—as content that’s identical or substantially similar, by hash, tokens, or vectors.
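That functional definition can be stated almost mechanically. The sketch below is purely illustrative: it treats two texts as duplicates if their hashes match exactly or if their token shingles overlap above a chosen threshold; a discovery request could reference hashes, tokens, or embedding vectors the same way.

```python
# Illustrative functional definition of "duplicate": exact hash match,
# or token-shingle overlap above a threshold (a crude near-duplicate test).
import hashlib

def shingles(text: str, k: int = 5):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}

def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    if hashlib.sha256(a.encode()).digest() == hashlib.sha256(b.encode()).digest():
        return True                               # identical content
    sa, sb = shingles(a), shingles(b)
    jaccard = len(sa & sb) / len(sa | sb)
    return jaccard >= threshold                   # substantially similar content

print(is_duplicate("the quick brown fox jumps over the lazy dog",
                   "the quick brown fox jumps over the lazy dog today"))
```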

Most importantly, deduplication is relevant. If a platform identified a plaintiff’s work and trained on it anyway, that speaks to volitional use, copying, and lack of care—key issues in copyright and fair use analysis. And if they lied about it, particularly to the court—Helloooooo Harper & Row. Discovery requests that are focused, tailored, and anchored in specific works stand a far better chance of surviving objections and yielding meaningful evidence that can lead to other positive results.