Shilling Like It’s 1999: Ars, Anthropic, and the Internet of Other People’s Things

Ars Technica just ran a piece headlined “AI industry horrified to face largest copyright class action ever certified.”

It’s the usual breathless “innovation under siege” framing—complete with quotes from “public interest” groups that, if you check the Google Shill List submitted to Judge Alsup in the Oracle case and Public Citizen’s Mission Creep-y, have long been in the paid service of Big Tech. Judge Alsup…hmmm…isn’t he the judge in the very Anthropic case that Ars is going on about?

Here’s what Ars left out: most of these so-called advocacy outfits—EFF, Public Knowledge, CCIA, and their cousins—have been doing Google’s bidding for years, rebranding corporate priorities as public interest. It’s an old play: weaponize the credibility of “independent” voices to protect your bottom line.

The article parrots the industry’s favorite excuse: proving copyright ownership is too hard, so these lawsuits are bound to fail. That line would be laughable if it weren’t so tired; it’s like elder abuse. We live in the age of AI deduplication, manifest checking, and robust content hashing—technologies the AI companies themselves use daily to clean, track, and optimize their training datasets. If they can identify and strip duplicates to improve model efficiency, they can identify and track copyrighted works. What they mean is: “We’d rather not, because it would expose the scale of our free-riding.”

That’s the unspoken truth behind these lawsuits. They’re not about “stifling innovation.” They’re about holding accountable an industry that’s built its fortunes on what can only be called the Internet of Other People’s Things—a business model where your creative output, your data, and your identity are raw material for someone else’s product, without permission, payment, or even acknowledgment.

Instead of cross-examining these corporate talking points like you know…journalists…Ars lets them pass unchallenged, turning what could have been a watershed moment for transparency into a PR assist. That’s not journalism—it’s message laundering.

The lawsuit doesn’t threaten the future of AI. It threatens the profitability of a handful of massive labs—many backed by the same investors and platforms that bankroll these “public interest” mouthpieces. If the case succeeds, it could force AI companies to abandon the Internet of Other People’s Things and start building the old-fashioned way: by paying for what they use.

Come on, Ars. Do we really have to go through this again? If you’re going to quote industry-adjacent lobbyists as if they were neutral experts, at least tell readers who’s paying the bills. Otherwise, it’s just shilling like it’s 1999.

When Viceroy David Sacks Writes the Tariffs: How One VC Could Weaponize U.S. Trade Against the EU

David Sacks is a “Special Government Employee”, Silicon Valley insider and a PayPal mafioso who has become one of the most influential “unofficial” architects of AI policy under the Trump administration. No confirmation hearings, no formal role—but direct access to power.

He:
– Hosts influential political podcasts with Musk and Thiel-aligned narratives.
– Coordinates behind closed doors with elite AI companies who are now PRC-style “national champions” (OpenAI, Anthropic, Palantir).
– Has reportedly played a central role in shaping the AI Executive Orders and industrial strategy driving billions in public infrastructure to favored firms.

Under 18 U.S.C. § 202(a), a Special Government Employee is:

  • Temporarily retained to perform limited government functions,
  • For no more than 130 days per year (which for Sacks ends either April 14 or May 30, 2025), unless reappointed in a different role,
  • Typically serves in an advisory or consultative role, or
  • Without holding actual decision-making or operational authority over federal programs or agencies.

SGEs are used to avoid conflict-of-interest entanglements for outside experts while still tapping their expertise for advisory purposes. They are not supposed to wield sweeping executive power or effectively run a government program. Yeah, right.

And like a good little Silicon Valley weasel, Sacks supposedly is alternating between his DC side hustle and his VC office to stay under 130 days. This is a dumbass reading of the statute which says “‘Special Government employee’ means… any officer or employee…retained, designated, appointed, or employed…to perform…temporary duties… for not more than 130 days during any period of 365 consecutive days.” That’s not the same as “worked” 130 days on the time card punch. But oh well.

David Sacks has already exceeded the legal boundaries of his appointment as a Special Government Employee (SGE) both in time served but also by directing the implementation of a sweeping, whole-of-government AI policy, including authoring executive orders, issuing binding directives to federal agencies, and coordinating interagency enforcement strategies—actions that plainly constitute executive authority reserved for duly appointed officers under the Appointments Clause. As an SGE, Sacks is authorized only to provide temporary, nonbinding advice, not to exercise operational control or policy-setting discretion across the federal government. Accordingly, any executive actions taken at his direction or based on his advisement are constitutionally infirm as the unlawful product of an individual acting without valid authority, and must be deemed void as “fruit of the poisonous tree.”

Of course, one of the states that the Trump AI Executive Orders will collide with almost immediately is the European Union and its EU AI Act. Were they 51st? No that’s Canada. 52nd? Ah, right that’s Greenland. Must be 53rd.

How Could David Sacks Weaponize Trade Policy to Help His Constituents in Silicon Valley?

Here’s the playbook:

Engineer Executive Orders

Through his demonstrated access to Trump and senior White House officials, Sacks could promote executive orders under the International Emergency Economic Powers Act (IEEPA) or Section 301 of the Trade Act, aimed at punishing countries (like EU members) for “unfair restrictions” on U.S. AI exports or operations.

Something like this: “The European Union’s AI Act constitutes a discriminatory and protectionist measure targeting American AI innovation, and materially threatens U.S. national security and technological leadership.” I got your moratorium right here.

Leverage the USTR as a Blunt Instrument

The Office of the U.S. Trade Representative (USTR) can initiate investigations under Section 301 without needing new laws. All it takes is political will—and a nudge from someone like Viceroy Sacks—to argue that the EU’s AI Act discriminates against U.S. firms. See Canada’s “Tech Tax”. Gee, I wonder if Viceroy Sacks had anything to do with that one.

Redefine “National Security”

Sacks and his allies can exploit the Trump administration’s loose definition of “national security” claiming that restricting U.S. AI firms in Europe endangers critical defense and intelligence capabilities.

Smear Campaigns and Influence Operations

Sacks could launch more public campaigns against the EU like his attacks on the AI diffusion rule. According to the BBC, “Mr. Sacks cited the alienation of allies as one of his key arguments against the AI diffusion plan”. That’s a nice ally you got there, be a shame if something happened to it.

After all, the EU AI Act does what Sacks despises like protects artists and consumers, restricts deployment of high-risk AI systems (like facial recognition and social scoring), requires documentation of training data (which exposes copyright violations), and applies extraterritorially (meaning U.S. firms must comply even at home).

And don’t forget, Viceroy Sacks actually was given a portfolio that at least indirectly includes the National Security Council, so he can use the NATO connection to put a fine edge on his “industrial patriotism” just as war looms over Europe.

When Policy Becomes Personal

In a healthy democracy, trade retaliation should be guided by evidence, public interest, and formal process.

But under the current setup, someone like David Sacks can short-circuit the system—turning a private grievance into a national trade war. He’s already done it to consumers, wrongful death claims and copyright, why not join war lords like Eric Schmidt and really jack with people? Like give deduplication a whole new meaning.

When one man’s ideology becomes national policy, it’s not just bad governance.

It’s a broligarchy in real time.

Deduplication and Discovery: The Smoking Gun in the Machine

WINSTON

“Wipe up all those little pieces of brains and skull”

From Pulp Fiction, screenplay by Quentin Tarantino and Roger Avary

Deduplication—the process of removing identical or near-identical content from AI training data—is a critical yet often overlooked indicator that AI platforms actively monitor and curate their training sets. This is the kind of process that one would expect given the kind of “scrape, ready, aim” business practices that seems precisely the approach of AI platforms that have ready access to large amounts of fairly high quality data from users of other products placed into commerce by business affiliates or confederates of the AI platforms.

For example, Google Gemini could have access to gmail, YouTube, at least “publicly available” Google Docs, Google Translate, or Google for Education, and then of course one of the great scams of all time, Google Books. Microsoft uses Bing searches, MSN browsing, the consumer Copilot experience, and ad interactions. Amazon uses Alexa prompts, Facebook uses “public” posts and so on.

This kind of hoovering up of indiscriminate amounts of “data” in the form of your baby pictures posted on Facebook and your user generated content on YouTube is bound to produce duplicates. After all, how may users have posted their favorite Billie Eilish or Taylor Swift music video. AI doesn’t need 10000 versions of “Shake it Off” they probably just need the official video. Enter deduplication–which by definition means the platform knows what it has scraped and also knows what it wants to get rid of.

“Get rid of” is a relative concept. In many systems—particularly in storage environments like backup servers or object stores—deduplication means keeping only one physical copy of a file. Any other instances of that data don’t get stored again; instead, they’re represented by pointers to the original copy. This approach, known as inline deduplication, happens in real time and minimizes storage waste without actually deleting anything of functional value. It requires knowing what you have, knowing you have more than one version of the same thing, and being able to tell the system where to look to find the “original” copy without disturbing the process and burning compute inefficiently.

In other cases, such as post-process deduplication, the system stores data initially, then later scans for and eliminates redundancies. Again, the AI platform knows there are two or more versions of the same thing, say the book Being and Nothingness, knows where to find the copies and has been trained to keep only one version. Even here, the duplicates may not be permanently erased—they might be archived, versioned, or logged for auditing, compliance, or reconstruction purposes.

In AI training contexts, deduplication usually means removing redundant examples from the training set to avoid copyright risk. The duplicate content may be discarded from the training pipeline but often isn’t destroyed. Instead, AI companies may retain it in a separate filtered corpus or keep hashed fingerprints to ensure future models don’t retrain on the same material unknowingly.

So they know what they have, and likely know where it came from. They just don’t want to tell any plaintiffs.

Ultimately, deduplication is less about destruction and more about optimization. It’s a way to reduce noise, save resources, and improve performance—while still allowing systems to track, reference, or even rehydrate the original data if needed.

Its existence directly undermines claims that companies are unaware of which copyrighted works were ingested. Indeed, it only makes sense that one of the hidden consequences of the indiscriminate scraping that underpins large-scale AI training is the proliferation of duplicated data. Web crawlers ingest everything they can access—news articles republished across syndicates, forum posts echoed in aggregation sites, Wikipedia mirrors, boilerplate license terms, spammy SEO farms repeating the same language over and over. Without any filtering, this avalanche of redundant content floods the training pipeline.

This is where deduplication becomes not just useful, but essential. It’s the cleanup crew after a massive data land grab. The more messy and indiscriminate the scraping, the more aggressively the model must filter for quality, relevance, and uniqueness to avoid training inefficiencies or—worse—model behaviors that are skewed by repetition. If a model sees the same phrase or opinion thousands of times, it might assume it’s authoritative or universally accepted, even if it’s just a meme bouncing around low-quality content farms.

Deduplication is sort of the Winston Wolf of AI. And if the cleaner shows up, somebody had to order the cleanup. It is a direct response to the excesses of indiscriminate scraping. It’s both a technical fix and a quiet admission that the underlying data collection strategy is, by design, uncontrolled. But while the scraping may be uncontrolled to get copies of as much of your data has they can lay hands on, even by cleverly changing their terms of use boilerplate so they can do all this under the effluvia of legality, they send in the cleaner to take care of the crime scene.

So to summarize: To deduplicate, platforms must identify content-level matches (e.g., multiple copies of Being and Nothingness by Jean-Paul Sartre). This process requires tools that compare, fingerprint, or embed full documents—meaning the content is readable and classifiable–and, oh, yes, discoverable.

Platforms may choose the ‘cleanest’ copy to keep, showing knowledge and active decision-making about which version of a copyrighted work is retained. And–big finish–removing duplicates only makes sense if operators know which datasets they scraped and what those datasets contain.

Drilling down on a platform’s deduplication tools and practices may prove up knowledge and intent to a precise degree—contradicting arguments of plausible deniability in litigation. Johnny ate the cookies isn’t going to fly. There’s a market clearing level of record keeping necessary for deduping to work at all, so it’s likely that there are internal deduplication logs or tooling pipelines that are discoverable.

When AI platforms object to discovery about deduplication, plaintiffs can often overcome those objections by narrowing their focus. Rather than requesting broad details about how a model deduplicates its entire training set, plaintiffs should ask a simple, specific question: Were any of these known works—identified by title or author—deduplicated or excluded from training?

This approach avoids objections about overbreadth or burden. It reframes discovery as a factual inquiry, not a technical deep dive. If the platform claims the data was not retained, plaintiffs can ask for existing artifacts—like hash filters, logs, or manifests—or seek a sworn statement explaining the loss and when it occurred. That, in turn, opens the door to potential spoliation arguments.

If trade secrets are cited, plaintiffs can propose a protective order, limiting access to outside counsel or experts like we’ve done 100,000 times before in other cases. And if the defendant claims “duplicate” is too vague, plaintiffs can define it functionally—as content that’s identical or substantially similar, by hash, tokens, or vectors.

Most importantly, deduplication is relevant. If a platform identified a plaintiff’s work and trained on it anyway, that speaks to volitional use, copying, and lack of care—key issues in copyright and fair use analysis. And if they lied about it, particularly to the court—Helloooooo Harper & Row. Discovery requests that are focused, tailored, and anchored in specific works stand a far better chance of surviving objections and yielding meaningful evidence which hopefully will be useful and lead to other positive results.

AI’s Legal Defense Team Looks Familiar — Because It Is

If you feel like you’ve seen this movie before, you have.

Back in the 2003-ish runup to the 2005 MGM Studios, Inc. v. Grokster, Ltd. Supreme Court case, I met with the founder of one of the major p2p platforms in an effort to get him to go legal.  I reminded him that he knew there was all kinds of bad stuff that got uploaded to his platform.  However much he denied it, he was filtering it out and he was able to do that because he had the control over the content that he (and all his cohorts) denied he had.  

I reminded him that if this case ever went bad, someone was going to invade his space and find out exactly what he was up to. Just because the whole distributed p2p model (unlike Napster, by the way) was built to both avoid knowledge and be a perpetual motion machine, there was going to come a day when none of that legal advice was going to matter.  Within a few months the platform shut down, not because he didn’t want to go legal, but because he couldn’t, at least not without actually devoting himself to respecting other people’s rights.

Everything Old is New Again

Back in the early 2000s, peer-to-peer (P2P) piracy platforms claimed they weren’t responsible for the illegal music and videos flooding their networks. Today, AI companies claim they don’t know what’s in their training data. The defense is essentially the same: “We’re just the neutral platform. We don’t control the content.”  It’s that distorted view of the DMCA and Section 230 safe harbors that put many lawyers’ children through prep school, college and graduate school.

But just like with Morpheus, eDonkey, Grokster, and LimeWire, everyone knew that was BS because the evidence said otherwise — and here’s the kicker: many of the same lawyers are now running essentially the same playbook to defend AI giants.

The P2P Parallel: “We Don’t Control Uploads… Except We Clearly Do”

In the 2000s, platforms like Kazaa and LimeWire were like my little buddy–magically they  never had illegal pornography or extreme violence available to consumers, they prioritized popular music and movies, and filtered out the worst of the web

That selective filtering made it clear: they knew what was on their network. It wasn’t even a question of “should have known”, they actually knew and they did it anyway.  Courts caught on. 

In Grokster,  the Supreme Court side stepped the hosting issue and essentially said that if you design a platform with the intent to enable infringement, you’re liable.

The Same Playbook in the AI Era

Today’s AI platforms — OpenAI, Anthropic, Meta, Google, and others — essentially argue:
“Our model doesn’t remember where it learned [fill in the blank]. It’s just statistics.”

But behind the curtain, they:
– Run deduplication tools to avoid overloading, for example on copyrighted books
– Filter out NSFW or toxic content
– Choose which datasets to include and exclude
– Fine-tune models to align with somebody’s social norms or optics

This level of control shows they’re not ignorant — they’re deflecting liability just like they did with p2p.

Déjà Vu — With Many of the Same Lawyers

Many of the same law firms that defended Grokster, Kazaa, and other P2P pirate defendants as well as some of the ISPs are now representing AI companies—and the AI companies are very often some, not all, but some of the same ones that started screwing us on DMCA, etc., for the last 25 years.  You’ll see familiar names all of whom have done their best to destroy the creative community for big, big bucks in litigation and lobbying billable hours while filling their pockets to overflowing. 

The legal cadre pioneered the ‘willful blindness’ defense and are now polishing it up for AI, hoping courts haven’t learned the lesson.  And judging…no pun intended…from some recent rulings, maybe they haven’t.

Why do they drive their clients into a position where they pose an existential threat to all creators?  Do they not understand that they are creating a vast community of humans that really, truly, hate their clients?  I think they do understand, but there is a corresponding hatred of the super square Silicon Valley types who hate “Hollywood” right back.

Because, you know, information wants to be free—unless they are selling it.  And your data is their new oil. They apply this “ethic” not just to data, but to everything: books, news, music, images, and voice. Copyright? A speed bump. Terms of service? A suggestion. Artist consent? Optional.  Writing a song is nothing compared to the complexities of Biggest Tech.

Why do they do this?  OCPD Much?

Because control over training data is strategic dominance and these people are the biggest control freaks that mankind has ever produced.  They exhibit persistent and inflexible patterns of behavior characterized by an excessive need to control people, environments, and outcomes, often associated with traits of obsessive-compulsive personality disorder.  

So empathy will get you nowhere with these people, although their narcissism allows them to believe that they are extremely empathetic.  Pathetic, yes, empathetic, not so much.  

Pay No Attention to that Pajama Boy Behind the Curtain

The driving force behind AI is very similar to the driving force behind the Internet.   If pajama boy can harvest the world’s intellectual property and use it to train his proprietary AI model, he now owns a simulation of the culture he is not otherwise part of, and not only can he monetize it without sharing profits or credit, he can deny profits and credit to the people who actually created it.

So just like the heyday of Pirate Bay, Grokster & Co.  (and Daniel Ek’s pirate incarnation) the goal isn’t innovation. The goal is control over language, imagery, and the markets that used to rely on human creators.  This should all sound familiar if you were around for the p2p era.

Why This Matters

Like p2p platforms, it’s just not believable that the AI companies do know what’s in their models.  They may build their chatbot interface so that the public can’t ask the chatbot to blow the whistle on the platform operator, but that doesn’t mean  the company can’t tell what they are training on.  These operators have to be able to know what’s in the training materials and manipulate that data daily.  

They fingerprint, deduplicate, and sanitize their datasets. How else can they avoid having multiple copies of books, for example, that would be a compute nightmare.  They store “embeddings” in a way that they can optimize their AI to use only the best copy of any particular book.  They control the pipeline.

It’s not about the model’s memory. It’s about the platform’s intent and awareness.

If they’re smart enough to remove illegal content and prioritize clean data, they’re smart enough to be held accountable.

We’re not living through the first digital content crisis — just the most powerful one yet. The legal defenses haven’t changed much. But the stakes — for copyright, competition, and consumer protection — are much higher now.

Courts, Congress, and the public should recognize this for what it is: a recycled defense strategy in service of unchecked AI power. Eventually Grokster ran into Grokster— and all these lawyers are praying that there won’t be an AI version of the Grokster case.