Machines Don’t Let Machines Do Opt Outs: Why robots.txt won’t get it done for AI Opt Outs

[The following is based on an excerpt from the Artist Rights Institute’s submission to the UK Intellectual Property Office consultation on a UK AI legislative proposal]

The fundamental element of any rights reservation regime is knowing which work is being blocked by which rights owner.  This will require creating a metadata identification regime for all works of authorship, a regime that has never existed and must be created from whole cloth.  As the IPO is aware, metadata for songs is quite challenging, as was demonstrated in the IPO’s UK Industry Agreement on Music Streaming Metadata Working Groups.

Using machine-readable formats for reservations sounds like it would be an easy fix, but it creates an enormous burden on the artist, i.e., the target of the data scraper, and is a major gift to the AI platform delivered by government.  We can look to the experience with robots.txt for guidance.

Using a robots.txt file or similar “do not crawl” file puts far too big a bet on machines getting it right in the silence of the Internet.  Big Tech has used this opt-out mantra for years in a somewhat successful attempt to fool lawmakers into believing that blocking is all so easy.  If only there were a database, even a machine could do it.  And yet massive numbers of webpages are still being copied, and pages that were copied for search (or for the Internet Archive) are now being used to train AI.

It also must be said that a “disallow” signal is designed to work with paths, file types, or folders, not millions of song titles or sound recordings (see GEMA’s lawsuits against AI platforms). For example, the robots.txt code below tells any crawler without a more specific rule to stay out of a “private-directory” folder, allows Googlebot to index the entire site, and blocks Bingbot from indexing images. Note the trap for non-specialists: because a crawler follows only the most specific user-agent group that matches it, Googlebot and Bingbot are not bound by the general rule and remain free to crawl the “private” directory anyway:

User-agent: *
Disallow: /private-directory/

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Disallow: /images/

Theoretically, existing robots.txt files could be configured to block AI crawlers entirely by naming the known AI crawler user agents, such as OpenAI’s GPTBot (the crawler token, as distinct from the ChatGPT product); a sketch of such a file follows the list below.  However, there are many known ways that robots.txt can fail to block web crawlers or AI data scrapers, including:

Malicious or non-compliant crawlers might ignore the rules in a robots.txt file and continue to scrape a website despite the directives.

Incorrect syntax in a robots.txt file can lead to unintended results, such as not blocking the intended paths or blocking too many paths.

Issues with server configuration can prevent the robots.txt file from being correctly read or accessed by crawlers.

Content generated dynamically through JavaScript or AJAX requests might not be blocked if robots.txt is not properly configured to account for these resources.

Crawlers or scrapers not known to the site owner, and therefore never listed by name, may not adhere to the intended rules.

Crawlers using cached versions of a site may bypass the rules in a robots.txt file, particularly rules that were updated after the cache was created.

A robots.txt file applies only to the host it is served from, so rules on the main domain do not cover subdomains, and poorly scoped rules can leave intended subdirectories unblocked.

A robots.txt file has no way to express a reservation over entire lists of songs, recordings, or audiovisual works; it addresses paths on a single website, not catalogs of works that may appear anywhere on the Internet.
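
For illustration, and assuming the user-agent tokens are kept current (they change as new scrapers appear, which is itself part of the problem), a robots.txt file attempting to shut out AI training crawlers might name each known token, for example OpenAI’s GPTBot, Anthropic’s ClaudeBot, Common Crawl’s CCBot, and Google’s Google-Extended:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Every one of those groups has to be added and maintained by hand by every site owner, and a non-compliant scraper is free to ignore all of them.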

While robots.txt and similar techniques are theoretically useful tools for managing crawler access, they are not foolproof. Implementing additional security measures, such as IP blocking, CAPTCHA, rate limiting, and monitoring server logs, can help strengthen a site’s defenses against unwanted scraping.  However, like the other tools that were supposed to level the playing field for artists against Big Tech, none of these tools are free, all of them require more programming knowledge than can reasonably be expected of an individual artist or rights owner, all require maintenance, and at scale, all of them can be gamed or will eventually fail.
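
To give a sense of what “more programming knowledge than can reasonably be expected” means in practice, here is a minimal, hypothetical sketch in Python (standard library only; the blocked user-agent list, the limits, and the port are illustrative assumptions, not a recommended configuration) of the kind of server-side user-agent blocking and rate limiting described above:

import time
from collections import defaultdict, deque
from wsgiref.simple_server import make_server

# Hypothetical list of AI crawler tokens; a real deployment must track new tokens as they appear.
BLOCKED_AGENTS = ("GPTBot", "ClaudeBot", "CCBot", "Google-Extended")
MAX_REQUESTS = 30       # requests allowed per IP address...
WINDOW_SECONDS = 60     # ...within a rolling sixty-second window

_hits = defaultdict(deque)  # IP address -> timestamps of its recent requests

def app(environ, start_response):
    agent = environ.get("HTTP_USER_AGENT", "")
    ip = environ.get("REMOTE_ADDR", "unknown")

    # Refuse requests that identify themselves as known AI crawlers.
    if any(token in agent for token in BLOCKED_AGENTS):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Crawling not permitted."]

    # Crude rate limit: discard timestamps outside the window, then count what is left.
    now = time.time()
    window = _hits[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    window.append(now)
    if len(window) > MAX_REQUESTS:
        start_response("429 Too Many Requests", [("Content-Type", "text/plain")])
        return [b"Slow down."]

    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, human visitor."]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()

Even this toy version has to be kept current as crawler names change, it does nothing against a scraper that simply lies about its user agent, and it has to be written, deployed, and maintained by someone, which is exactly the burden described above.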

It must be said that all of the headaches and expense of keeping Big Tech out exist because Big Tech so desperately wants to get in.

The difference between blocking a search engine crawler and an AI data scraper (which could each be operated by the same company, as in the case of Meta, Microsoft or Google) is that failing to block a search engine crawler is inconvenient for artists, but failing to block an AI data scraper is catastrophic for artists.

Even if the crawlers worked seamlessly, the protection silently fails the moment a folder changes names and the site admin forgets to update the robots.txt file, and expecting that level of vigilance from every website on the Internet is asking a lot.

It must also be said that using machine-readable blocking tools may result in pages being downranked, particularly by AI platforms closely associated with search engines.  Robots.txt blocking already illustrates the problem.  A robots.txt file itself doesn’t directly cause pages to be downranked in search results.  However, it can indirectly affect rankings by limiting search engine crawlers’ access to certain parts of a website. Here’s how:

Restricted Crawling: If you block crawlers from accessing important pages using robots.txt, those pages won’t be indexed. Without indexing, they won’t appear in search results, let alone rank.

Crawl Budget Mismanagement: For large websites, search engines allocate a “crawl budget”—the number of pages they crawl in a given time. If robots.txt doesn’t guide crawlers efficiently, that may randomly leave pages unindexed.

No Content Evaluation: If a page is blocked by robots.txt but still linked elsewhere, search engines might index its URL without evaluating its content. This can result in poor rankings since the page’s relevance and quality can’t be assessed.
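
It is worth spelling out the underlying trap here, because it routinely catches non-specialists: robots.txt is a “do not crawl” signal, while keeping a page out of the index entirely requires a separate “noindex” signal, and that signal only works if the crawler is allowed to fetch the page in the first place. As a hypothetical illustration (the “/catalog/” path is invented for the example), the “do not crawl” rule goes in robots.txt, where the URL can still be indexed if other sites link to it:

User-agent: *
Disallow: /catalog/

while the “do not index” signal is delivered with the page itself, for example as an HTTP response header, and is only seen if crawling is allowed:

X-Robots-Tag: noindex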

The TDM safe harbor is too valuable and potentially too dangerous to leave to machines.

The Delay’s The Thing: Anthropic Leapfrogs Its Own November Valuation Despite Litigation from Authors and Songwriters in the Heart of Darkness

If you’ve read Joseph Conrad’s Heart of Darkness, you’ll be familiar with the Congo Free State, a private colony of Belgian King Leopold II that is today largely the Democratic Republic of the Congo. When I say “private” I mean literally privately owned by his Leopoldness. Why would old King Leo be so interested in owning a private colony in Africa? Why for the money, of course. Leo had to move some pieces around the board and get other countries to allow him to get away with essentially “buying” the place, if “buying” is the right description.

So Leo held an international conference in Berlin to discuss the idea and get international buy-in, kind of like the World Economic Forum with worse food and no skiing. Rather than acknowledging his very for-profit intention to ravage the Congo for ivory (aka slaughtering elephants) and rubber (the grisly extraction of which was accomplished by uncompensated slave labor) with brutal treatment of all concerned, Leo convinced the assembled nations that his intentions were humanitarian and philanthropic. You know, don’t be evil. Just lie.

Of course, however much King Leopold may have foreshadowed our sociopathic overlords from Silicon Valley, it must be said that Leo’s real envy wouldn’t so much be the money as what he could have done with AI himself had he only known. Oh well, he just had to make do with Kurtz.

Which brings us to AI in general and Anthropic in particular. Anthropic’s corporate slogan is equally humanitarian and philanthropic: “Anthropic is an AI research company that focuses on the safety and alignment of AI systems with human values.” Oh yes, all very jolly.

All very innocent and high-minded, until you get punched in the face (to coin a phrase). It turns out, quelle horreur, that Anthropic stands accused of massive copyright infringement rather than lauded for its humanitarianism. Even more shocking? The company’s valuation is going through the stratosphere! These innocents surely must be falsely accused! The VCs are voting with their bucks, so they wouldn’t put their shareholders’ money or their limiteds’ money on the line for a… RACKETEER INFLUENCED AND CORRUPT ORGANIZATION?!?

Not only have authors brought this class action against Anthropic, which is both Google’s stalking horse and cat’s paw (to mix a metaphor), but the songwriters and music publishers have sued the company as well. Led by Concord and Universal, the publishers have sued for largely the same reasons as the authors but for their quite distinct copyrights.

So let’s understand the game that’s being played here. As the Artist Rights Institute submitted in a comment to the UK Intellectual Property Office in the IPO’s current consultation on AI and copyright, the delay is the thing. And thanks to Anthropic, we can now put a price on the delay: since the $4,000,000,000 the company raised in November 2024, it has raised another $3,500,000,000. This one company is now valued at $61.5 billion, roughly half the value of the entire UK creative industries and roughly equal to the entire U.S. music industry. No wonder delay is their business model.

However antithetical, copyright and AI must be discussed together for a very specific reason:  Artificial intelligence platforms operated by Google, Microsoft/OpenAI, Meta and the like have scraped and ingested works of authorship from baby pictures to Sir Paul McCartney as fast and as secretly as possible.  And the AI platforms know that the longer they can delay accountability, the more of the world’s culture they will have devoured, or as they might say, the more data they will have ingested.  Not to mention the billions in venture capital they will have raised, just like Anthropic.  For the good of humanity, of course, just like old King Leo.

As the Hon. Alison Hume MP recently told Parliament, this theft is massive and has already happened, another example of why any “opt out” scheme (as had been suggested by the UK government) fails before it starts:

This week, I discovered that the subtitles from one of my episodes of New Tricks have been scraped and are being used to create learning materials for artificial intelligence.  Along with thousands of other films and television shows, my original work is being used by generative AI to write scripts which one day may replace versions produced by mere humans like me.

This is theft, and it’s happening on an industrial scale.  As the law stands, artificial intelligence companies don’t have to be transparent about what they are stealing.[1]

Any delay[2] in prosecuting AI platforms simply increases their de facto “text and data mining” safe harbor while they scrape ever more of world culture.  As Ms. Hume states, this massive “training” has transferred value to these data-hungry mechanical beasts on an industrial scale that confounds human understanding.  This theft dwarfs even the Internet piracy that drove broadband penetration, Internet advertising and search platforms in the 1999-2010 period.  It must be said that for Big Tech, commerce and copyright are once again inherently linked for even greater profit.

As the Right Honourable Baroness Kidron said in her successful opposition to the UK Government’s AI legislation in the House of Lords:

The Government are doing this not because the current law does not protect intellectual property rights, nor because they do not understand the devastation it will cause, but because they are hooked on the delusion that the UK’s best interests and economic future align with those of Silicon Valley.[3]  

Baroness Kidron identifies a question of central importance that mankind is forced to consider by the sheer political brute force of the AI lobbying steamroller:  What if AI is another bubble like the Dot Com bubble?  AI is, to a large extent, a black box utterly lacking in transparency, much less recordkeeping or performance metrics.  As Baroness Kidron suggests, governments and the people who elect them are making a very big bet that AI is not another ephemeral bubble like the last one.

Indeed, the AI hype has the earmarks of a bubble, just as the Dot Com bubble did.  Baroness Kidron also reminds us of these fallacious economic arguments surrounding AI:

The Prime Minister cited an IMF report that claimed that, if fully realised, the gains from AI could be worth up to an average of £47 billion to the UK each year over a decade. He did not say that the very same report suggested that unemployment would increase by 5.5% over the same period. This is a big number—a lot of jobs and a very significant cost to the taxpayer. Nor does that £47 billion account for the transfer of funds from one sector to another. The creative industries contribute £126 billion per year to the economy. I do not understand the excitement about £47 billion when you are giving up £126 billion.[4]  

As the Hon. Chris Kane MP said in Parliament, the Government runs the risk of enabling a wealth transfer that does not itself produce new value but would make old King Leo feel right at home:

Copyright protections are not a barrier to AI innovation and competition, but they are a safeguard for the work of an industry worth £125 billion per year, employing over two million people.  We can enable a world where much of this value  is transferred to a handful of big tech firms or we can enable a win-win situation for the creative industries and AI developers, one where they work together based on licensed relationships with remuneration and transparency at its heart.


[1] Paul Revoir, AI companies are committing ‘theft’ on an ‘industrial scale’, claims Labour MP – who has written for TV series including New Tricks, Daily Mail (Feb. 12, 2025) available at https://www.dailymail.co.uk/news/article-14391519/AI-companies-committing-theft-industrial-scale-claims-Labour-MP-wrote-TV-shows-including-New-Tricks.html

[2] See, e.g., Kerry Muzzey, [YouTube Delay Tactics with DMCA Notices], Twitter (Feb. 13, 2020) available at https://twitter.com/kerrymuzzey/status/1228128311181578240  (Film composer with Content ID account notes “I have a takedown pending against a heavily-monetized YouTube channel w/a music asset that’s been fine & in use for 7 yrs & 6 days. Suddenly today, in making this takedown, YT decides “there’s a problem w/my metadata on this piece.” There’s no problem w/my metadata tho. This is the exact same delay tactic they threw in my way every single time I applied takedowns against broadcast networks w/monetized YT channels….And I attached a copy of my copyright registration as proof that it’s just fine.”); Zoë Keating, [Content ID secret rules], Twitter (Feb. 6, 2020) available at https://twitter.com/zoecello/status/1225497449269284864  (Independent artist with Content ID account states “[YouTube’s Content ID] doesn’t find every video, or maybe it does but then it has selective, secret rules about what it ultimately claims for me.”).

[3] The Rt. Hon. Baroness Kidron, Speech regarding Data (Use and Access) Bill [HL] Amendment 44A, House of Lords (Jan. 28, 2025) available at https://hansard.parliament.uk/Lords%E2%80%8F/2025-01-28/debates/9BEB4E59-CAB1-4AD3-BF66-FE32173F971D/Data(UseAndAccess)Bill(HL)#contribution-9A4614F3-3860-4E8E-BA1E-53E932589CBF 

[4] Id.