[Following is based on an excerpt from the Artist Rights Institute’s submission to the UK Intellectual Property Office consultation on a UK AI legislative proposal]
The fundamental element of any rights reservation regime is knowing which work is being blocked by which rights owner. This will require creating a metadata identification regime for all works of authorship, a regime that has never existed and must be created from whole cloth. As the IPO is aware, metadata for songs is quite challenging, as was demonstrated in the IPO’s UK Industry Agreement on Music Streaming Metadata Working Groups.
Using machine-readable formats for reservations sounds like it would be an easy fix, but it creates an enormous burden on the artist, i.e., the target of the data scraper, and is a major gift to the AI platforms delivered by government. We can look to the experience with robots.txt for guidance.
Using a robots.txt file or similar “do not index” file puts far too big a bet on machines getting it right in the silence of the Internet. Big Tech has used this opt-out mantra for years in a somewhat successful attempt to fool lawmakers into believing that blocking is all so easy. If only there were a database, even a machine could do it. And yet massive numbers of webpages are still being copied, and pages that were copied for search (or for the Internet Archive) are now being used to train AI.
It also must be said that a “disallow” signal is designed to work with file paths or folders, not millions of song titles or sound recordings (see GEMA’s lawsuits against AI platforms). For example, because each crawler obeys only the most specific user-agent group that matches it, the following robots.txt code blocks every crawler other than Googlebot and Bingbot from a “private-directory” folder, allows Googlebot to index the entire site, and blocks Bingbot only from indexing images:
User-agent: *                  # every crawler not named in a more specific group below
Disallow: /private-directory/  # those crawlers are blocked from this folder only
User-agent: Googlebot          # Googlebot obeys this group instead of the * group
Allow: /                       # the entire site is open to Googlebot
User-agent: Bingbot            # Bingbot obeys this group instead of the * group
Disallow: /images/             # Bingbot is blocked from images only
In theory, an existing robots.txt file could be configured to block AI crawlers entirely by listing known crawler user-agents, such as OpenAI’s GPTBot (a sketch of such a file appears after the list below). In practice, however, there are many known ways in which robots.txt can fail to block web crawlers or AI data scrapers, including:
Malicious or non-compliant crawlers might ignore the rules in a robots.txt file and continue to scrape a website despite the directives.
Incorrect syntax in a robots.txt file can lead to unintended results, such as not blocking the intended paths or blocking too many paths.
Issues with server configuration can prevent the robots.txt file from being correctly read or accessed by crawlers.
Content generated dynamically through JavaScript or AJAX requests might not be blocked if robots.txt is not properly configured to account for these resources.
Unlisted crawlers or scrapers not known to the user may not adhere to the intended rules.
Crawlers using cached versions of a site may bypass rules in a robots.txt file, particularly rules that were updated after the cache was created.
Rules scoped to particular subdomains or subdirectories may fail to block all of the subdomains or subdirectories the rights owner intended to block.
Entire lists of songs, recordings, or audiovisual works may simply be missing from the blocking rules.
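By way of illustration only, and using crawler names in circulation at the time of writing (OpenAI’s GPTBot, Common Crawl’s CCBot and Google’s Google-Extended token are offered here as examples, not as a complete or current list), an attempted AI opt-out via robots.txt might look like the following sketch. Everything in it depends on the rights owner knowing every current crawler name and on every crawler choosing to comply:
# Illustrative only: crawler names change constantly and must be maintained by hand
User-agent: GPTBot           # OpenAI's crawler
Disallow: /
User-agent: CCBot            # Common Crawl, whose archives are widely used for AI training
Disallow: /
User-agent: Google-Extended  # Google's token for controlling use of content in its AI models
Disallow: /
User-agent: *                # any crawler not named above is still allowed everywhere
Allow: /
Any crawler the rights owner has never heard of, or any crawler that simply ignores the file, defeats the entire exercise, which is exactly the problem described in the list above.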
While robots.txt and similar techniques are, in theory, useful tools for managing crawler access, they are not foolproof. Implementing additional security measures, such as IP blocking, CAPTCHA, rate limiting, and monitoring server logs, can help strengthen a site’s defenses against unwanted scraping. However, like the other tools that were supposed to level the playing field for artists against Big Tech, none of these measures is free, all of them require more programming knowledge than can reasonably be expected of artists, all require ongoing maintenance, and at scale all of them can be gamed or will eventually fail.
It must be said that all of the headaches and expense of keeping Big Tech out is because Big Tech so desperately wants to get in.
The difference between blocking a search engine crawler and an AI data scraper (which could each be operated by the same company, as in the case of Meta, Microsoft’s Bing or Google) is that failing to block a search engine crawler is inconvenient for artists, but failing to block an AI data scraper is catastrophic for artists.
Even if the crawlers complied seamlessly, a robots.txt file only protects what it actually names: should any of these folders change names and the site admin forget to update the file, the protection silently disappears. Expecting that level of upkeep from every website on the Internet is asking a lot.
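To make that concrete with a purely hypothetical example (the folder names are invented for illustration), suppose the recordings blocked in the earlier example are moved to a renamed folder and nobody updates the file:
User-agent: *
Disallow: /private-directory/  # stale path: the recordings now live in a renamed folder
# The renamed folder is not listed anywhere, so every crawler may copy it freely
The protocol itself gives the site owner no warning that the rule no longer matches anything.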
It must also be said that using machine-readable blocking tools may result in pages being downranked, particularly by AI platforms closely associated with search engines. Robots.txt blocking already has problems with crawlers and downranking for several reasons. A robots.txt file itself doesn’t directly cause pages to be downranked in search results. However, it can indirectly affect rankings by limiting search engine crawlers’ access to certain parts of a website. Here’s how:
Restricted Crawling: If you block crawlers from accessing important pages using robots.txt, those pages won’t be indexed. Without indexing, they won’t appear in search results, let alone rank.
Crawl Budget Mismanagement: For large websites, search engines allocate a “crawl budget”, meaning the number of pages they will crawl in a given period. If robots.txt doesn’t guide crawlers efficiently, the budget can be spent on unimportant pages, leaving important pages unindexed.
No Content Evaluation: If a page is blocked by robots.txt but still linked elsewhere, search engines might index its URL without evaluating its content. This can result in poor rankings since the page’s relevance and quality can’t be assessed.
The TDM safe harbor is too valuable and potentially too dangerous to leave to machines.