AI training datasets – Music Tech Solutions

As generative music systems like Suno and Udio move into the center of copyright debates, one question keeps coming up: Can we actually tell which songs influenced an AI-generated track? And then can we use that determination in a host of other processes like royalty payments?

Recently a number of people have pointed to research from Sony AI as evidence that the answer might be yes. Sony has publicly discussed work on tools designed to analyze the relationship between training data and AI-generated music outputs.

But the reality is a little more nuanced. Sony’s work is interesting and potentially important—but it is often misunderstood. What Sony has described is not a magic detector that can listen to a generated song and instantly reveal every recording the model trained on.

Instead, Sony is describing something more modest—and in some ways more useful.

Let’s unpack what the technology appears to do right now.

Two Problems Sony Is Trying to Solve

Sony AI has publicly discussed research in two related areas.

The first is training-data attribution. This means trying to estimate which recordings in a model’s training dataset influenced a generated output.

The second is musical similarity or version matching. This involves detecting when two pieces of music share meaningful musical material even if they are not exact copies of each other.

Sony has framed both efforts as research directions rather than a finished commercial product. In other words, this is still a developing technical approach, not a turnkey system that can produce definitive copyright answers.

Training Data Attribution in Plain English

The most relevant Sony work is a research project titled Large-Scale Training Data Attribution for Music Generative Models via Unlearning.

That title sounds intimidating, but the basic idea is fairly intuitive and also suggests the project is part of the broader machine unlearning academic discipline.

The system does not operate like Shazam. It does not simply listen to an AI-generated song and say:

“This track was trained on Song X, Song Y, and Song Z.”

Instead, the approach works more like this.

Imagine you already know—or at least suspect—which recordings were used to train the model. You have a candidate set of training tracks.

The system then asks:

Among these training recordings, which ones seem most likely to have influenced this generated output?

In other words, the system ranks influence among known candidates.

The research approach borrows from an area of machine learning called machine unlearning, which studies how particular training examples affect a model’s behavior. In simplified terms, researchers can test how the model behaves when certain training examples are removed or adjusted. If the output changes meaningfully, that suggests those examples had measurable influence.

The important point is that this is an influence-ranking tool, not a forensic detector.

It tries to answer:

“Which of these known training tracks mattered most?”

Not:

“Tell me every song the model was trained on.”

Sony’s Other Idea: Smarter Music Comparison

Sony has also described work on musical similarity detection.

Traditional audio fingerprinting systems—like those used by Shazam or Audible Magic—are very good at identifying identical recordings. If you upload the same song or a slightly altered version, the system can match it.

But generative AI raises a different problem. An AI output might resemble a song musically without copying the recording itself.

Sony’s research tries to detect those kinds of relationships.

For example, a system might notice that two tracks share melodic fragments, rhythmic patterns, harmonic progressions, or musical phrases even if the arrangement, production, or instrumentation is different.

In plain English, this kind of tool tries to answer a different question:

“Are these two pieces of music related in substance?”

Not:

“Are they the exact same recording?”

The Big Limitation: You Still Need the Training Dataset

Here’s the key limitation that often gets overlooked.

Sony’s attribution approach appears to depend on having access to the candidate training dataset.

The system works by comparing a generated output against recordings that are already known or suspected to have been used during training. It estimates influence among those candidates.

That means the system answers the question:

“Which of these training tracks influenced the output?”

But it does not answer the question:

“What unknown recordings were used to train this model?”

If the training corpus is hidden or undisclosed, the attribution system has nothing to test against.

This makes the technology conceptually similar to many machine-learning research experiments, which measure influence using known datasets. Researchers can test influence among known training examples, but they cannot reconstruct an unknown dataset from outputs alone.

What This Could Look Like in the Real World

If the training corpus were known, a practical workflow might look like this.

First, the recordings in the training corpus would be identified. Audio fingerprinting systems could match those recordings to commercial releases.

That step answers the question:

What copyrighted recordings appear in the training data?

Then an attribution tool like the one Sony describes could be used to analyze generated outputs and estimate which of those known recordings appear to have influenced them.

This would not prove copying in every case. But it could dramatically narrow the analysis—from millions of possible influences to a smaller list of likely candidates.

What Sony Has Not Claimed

Sony’s public statements do not suggest that the attribution problem is solved.

Sony has not announced a system that automatically calculates track-by-track royalty payments for AI-generated songs. Nor has it described a tool that conclusively proves copyright copying from an AI output alone.

Instead, the work is framed as research aimed at improving transparency and accountability in generative music systems.

Why Labels Might Still Be Interested

Even with these limitations, the idea could be attractive to rights holders.

If training datasets were known, attribution tools could theoretically support new ways of analyzing how music catalogs interact with generative AI systems.

For example, such tools might help support:

royalty allocation models
influence-weighted compensation frameworks
catalog analytics
AI audit trails showing how repertoire contributes to model behavior

In other words, the technology could potentially become a measurement tool for how music catalogs influence generative systems.

What Sony did and did not do (yet)

Sony’s work does not magically reveal every song an AI model trained on. And it does not eliminate the need to know what is in the training dataset.

Instead, its value appears to lie after the training data is known.

Once you have a candidate training corpus, tools like the ones Sony describes may help analyze which recordings influenced particular outputs.

That makes the technology best understood as a post-disclosure attribution layer, not a substitute for knowing what recordings were used in training in the first place.

Music Tech Solutions

Chris Castle's Solutions for Music-Tech Entrepreneurs

Tag: AI training datasets

Sony’s AI Music Attribution Tool: What It Actually Does (and What It Doesn’t)

Two Problems Sony Is Trying to Solve

Training Data Attribution in Plain English

Sony’s Other Idea: Smarter Music Comparison

The Big Limitation: You Still Need the Training Dataset

What This Could Look Like in the Real World

What Sony Has Not Claimed

Why Labels Might Still Be Interested

What Sony did and did not do (yet)

Menu

Music Tech Solutions

Chris Castle's Solutions for Music-Tech Entrepreneurs

Two Problems Sony Is Trying to Solve

Training Data Attribution in Plain English

Sony’s Other Idea: Smarter Music Comparison

The Big Limitation: You Still Need the Training Dataset

What This Could Look Like in the Real World

What Sony Has Not Claimed

Why Labels Might Still Be Interested

What Sony did and did not do (yet)

Share this:

Menu