Does “Publicly Available” AI Scraping Mean They Take Everything or Just Anything That’s Not Nailed Down?

Let’s be clear: It is not artificial intelligence as a technology that’s the existential threat. It’s the people who make the decisions about how to train and use artificial intelligence that are the existential threat. Just like nuclear power is not an existential threat, it’s the Czar Bomba that measured 50 megatons on the bangometer that’s the existential threat.

If you think that the tech bros can be trusted not to use your data scraped from their various consumer products for their own training purposes, please point to the five things they’ve done in the last 20 years that give you that confidence? Or point to even one thing.

Here’s an example. Back in the day when we were trying to build a library of audio fingerprints, we first had to rip millions of tracks in order to create the fingerprints. One employee who came to us from a company with a free email service said that there were millions of emails with audio file attachments just sitting there in users’ sent mail folders. Maybe we could just grab those audio files? Obviously that would be off limits for a host of reasons, but he didn’t see it. It’s not that he is an immoral person–immoral people recognize that there are some rules and they just want to break them. He was amoral–he didn’t see the rules and he didn’t think anything was wrong with his suggestion.

But the moral of the story–so to speak–is that I fully believe every consumer product is being scraped. That means that there’s a fairly good chance that Google, Microsoft, Meta/Facebook and probably other Big Tech players are using all of their consumer products to train AI. I would not bet against it.

If you think that’s crazy, I would suggest you think again. While these companies keep that kind of thing fairly quiet, it’s not the first time that the issue has come up–Big Tech telling you one thing, but using you to gain a benefit for something entirely different that you probably would never have agreed to had you known.

Take the Google Books saga. The whole point of Google’s effort at digitizing all the world’s books wasn’t because of some do-gooder desire to create the digital library of Alexandria or even the snippets that were the heart of the case. No–it was the “nondisplay uses” like training Google’s translation engine using “corpus machine translation”. The “corpus” of all the digitized books was the real value and of course was the main thing that Google wouldn’t share with the authors and didn’t want to discuss in the case.

Another random example would be “GOOG-411”. We can thank Marissa Meyer for spilling the beans on that one.

According to PC World back in 2010:

Google will close down 1-800-GOOG-411 next month, saying the free directory assistance service has served its purpose in helping the company develop other, more sophisticated voice-powered technologies.

GOOG-411, which will be unplugged on Nov. 12, was the search company’s first speech recognition service and led to the development of mobile services like Voice Search, Voice Input and Voice Actions.

Google, which recorded calls made to GOOG-411, has been candid all along about the motivations behind running the service, which provides phone numbers for businesses in the U.S. and Canada.

In 2007, Google Vice President of Search Products & User Experience Marissa Mayer said she was skeptical that free directory assistance could be viable business, but that she had no doubt that GOOG-411 was key to the company’s efforts to build speech-to-text services.

GOOG 411 is a prime example of how Big Tech plays the thimblerig, especially the “has been candid all along about the motivations behind running the service.” Doesn’t that phrase just ooze corporate flak? That, as we say in the trade, is a freaking lie.

None of the GOOG-411 collateral ever said, “Hey idiot, come help us get even richer by using our dumbass “free” directory assistance “service”.” Just like they’re not saying, “Hey idiot, use our “free” products so we can train our AI to take your job.” That’s the thimblerig, but played at our expense.

This subterfuge has big consequences for people like lawyers. As I wrote in my 2014 piece in Texas Lawyer:

“A lawyer’s duty to maintain the confidentiality of privileged communications is axiomatic. Given Google’s scanning and data mining capabilities, can lawyers using Gmail comply with that duty without their clients’ informed consent? In addition to scanning the text, senders and recipients, Google’s patents for its Gmail applications claim very broad functionality to scan file attachments. (The main patent is available on Google’s site. A good discussion of these patents is in Jeff Gould’s article, “The Natural History of Gmail Data Mining”, available on Medium.)”

Google has made a science of enticing users into giving up free data for Google to evolve even more products that may or may not be useful beyond the “free” part. Does the world really need another free email program? Maybe not, but Google does need a way to snarf down data for its artificial intelligence platforms–deceptively.

Fast forward ten years or so and here we are with the same problem–except it’s entirely possible that all of the Big Tech AI platforms are using their consumer products to train AI. Nothing has changed for lawyers, and some version of these rules would be prudent to follow for anyone with a duty of confidentiality like a doctor, accountant, stock broker or any of the many licensed professions. Not to mention social workers, priests, and the list goes on. If you call Big Tech on the deception and they will all say that they operate within their privacy policies, “de-identify” user data, only use “public” information, or other excuses.

I think the point of all this is that the platforms have far too many opportunities to cross-collateralize our data for the law to permit any confusion about what data they scrape.

What We Think We Know

Microsoft’s AI Training Practices

Microsoft has publicly stated that it does not use data from its Microsoft 365 products (e.g., Word, Excel, Outlook) to train its AI models. The company wants us to believe they rely on “de-identified” data from sources such as Bing searches, Copilot interactions, and “publicly available” information, whatever that means. Microsoft emphasizes its commitment to responsible AI practices, including removing metadata and anonymizing data to protect user privacy. See what I mean? Given Microsoft takes these precautions, that makes it all fine.

However, professionals using Microsoft’s tools must remain vigilant. While Microsoft claims not to use customer data from enterprise accounts for AI training, any inadvertent sharing of sensitive information through other Microsoft services (e.g., Bing or Copilot) could pose risks for users, particularly people with a duty of confidentiality like lawyers and doctors. And we haven’t even discussed child users yet.

Google’s AI Training Practices

For decades, Google has faced scrutiny for its data practices, particularly with products like Gmail, Google Docs, and Google Drive. Google’s updated privacy policy explicitly allows the use of “publicly available” information and user data for training its AI models, including Bard and Gemini. While Google claims to anonymize and de-identify data, concerns remain about the potential for sensitive information to be inadvertently included in training datasets.

For licensed professionals, these practices raise significant red flags. Google advises users not to input confidential or sensitive information into its AI-powered tools–typically Googlely. The risk of human reviewers accessing “de-identified” data can happen to anyone, but why in the world would you ever trust Google?

Does “Publicly Available” Mean Everything or Does it Mean Anything That’s Not Nailed Down?

These companies speak of “publicly available” data as if data that is publicly available is free to scrape and use for training. So what does that mean?

Based on the context and some poking around, it appears that there is no legally recognizable definition of what “publicly available” actually means. If you were going to draw a line between “publicly available” and the opposite, where would you draw it? You won’t be surprised to know that Big Tech will probably draw the line in an entirely different place than a normal person.

As far as I can tell, “publicly available” data would include data or content that is accessible by a data scraping crawler or by the general public without a subscription, payment, or special access permissions. This likely includes web pages, posts on social media like baby pictures on Facebook or Instagram, or other platforms that do not restrict access to their content through paywalls, registration requirements, or other barriers like terms of service prohibiting data scraping, API or a robots.txt file (which like a lot of other people including Ed Newton-Rex, I’m skeptical of even working).

While discussions of terms of service, notices prohibiting scraping and automated directions to crawlers sound good, in reality there’s no way to stop a determined crawler. The vulpine lust for data and cold hard cash by Big Tech is not realistically possible to stop at this point. Stopping the existential onslaught explains why the world needs to escalate punishment for these violations to a new level that may seem extreme at this point or at least unusually harsh.

Yet the massive and intentional copyright infringement, privacy violations, and who knows what else are so vast they are beyond civil penalties particularly for a defendant that seemingly prints money.



Chronology: The Week in Review: Search is Ripping Us Off Again, this time for AI; #FreeJimmyLai; the MLC loves the MLC almost as much as the NMPA

In case you think that infringement of copyrights in AI training happens only to rock stars, photographers, record companies or movie studios, let me give you a personal anecdote that might cause you to reconsider. I asked Bing’s Copilot AI tool to list the deal points in a record producer agreement, and this is what I got back:

Copilot

Certainly! When negotiating a producer agreement, several key points should be considered. Here are some common negotiation points from the artist’s perspective:

  1. Scope of Engagement:
    1. The producer is generally engaged for one album project at a time.
    1. Clarify the scope of work, including the number of songs or recordings to be produced.
  2. Budget and Compensation:
    1. Determine the budget for the project (e.g., $10,000 in your case).
    1. Agree on the producer’s compensation, which may include an upfront fee, points on sales, or a combination.
  3. Ownership and Rights:
    1. Address ownership of the master recordings:
      1. Will the artist own and control the masters?
      1. Does the producer retain any rights?
  4. Credits and Royalties:
    1. Specify how the producer will be credited on releases.
    1. Discuss any royalties or backend points for the producer.
  5. Recording Schedule:
    1. Set a timeline for recording sessions.
    1. Define the two-week period for your project.
  6. Logistics:
    1. Determine where the recordings will take place.
    1. Outline the artist’s approval process.
    1. Define the producer’s deliverables.
  7. Termination Clause:
    1. Include provisions for termination if either party breaches the agreement.
    1. Address payment for work completed up to that point.

Remember that each negotiation is unique, and it’s essential to tailor the agreement to your specific needs and circumstances. Legal advice is recommended to ensure compliance with local laws and industry standards1 3 4.

Source(s)

1. An Artist’s Guide to Negotiating Record Producer Agreements – Medium

2. Negotiating a Music Producer Agreement: 7 Key Issues

3. 5 Things to Look for in a Music Producer Agreement – Lawyer Drummer

4. How to Negotiate: Steps, Phrases & Tips to Win the Deal – Selling Signals

Notice that Copilot lists “Source(s)” and that source #1 is an article from Medium.

My article.

I’ve written a number of articles about negotiating producer agreements and I can categorically say I haven’t given permission or been asked to give permission for any of those articles to be scraped and used to train AI. So how did it come to be in Copilot’s training materials?

My guess is that Medium (and probably a lot of other sites) were crawled by Bing’s spider and that copies of Medium pages were stored in Bing’s (or Microsoft’s) data centers. When Microsoft launched its Copilot AI tool, one could imagine that the temptation would have been strong to use those copies as part of the training tools for Copilot. In fact, the Verge reports that “companies around the web are using your site and its data to build massive sets of training data, in order to build models and products that may not acknowledge your existence at all.”

Are you surprised then that two of the biggest operators in the AI space are the search engine operators Google and Microsoft? This is another example of how Big Tech helps itself to your data and work product without you even knowing it’s happening. So now what? Now I know I’m being ripped off, and I’m wondering if Medium is in on it.

The Verge tells us:

The ability to download, store, organize, and query the modern internet gives any company or developer something like the world’s accumulated knowledge to work with. In the last year or so, the rise of AI products like ChatGPT, and the large language models underlying them, have made high-quality training data one of the internet’s most valuable commodities. That has caused internet providers of all sorts to reconsider the value of the data on their servers, and rethink who gets access to what. 

Ya think?

#FreeJimmyLai

In case you were wondering if the Chinese Communist Party is serious about using TikTok to collect data as they please, Hong Kong publisher Jimmy Lai is a perfect example of what happens when a company refuses to cooperate. The CCP took over Jimmy’s Apple Daily newspaper and arrested him. Jimmy has been in prison for three years and has still not come to “trial” (whatever that means under the CCP).

Blake Morgan sounds off in Hypebot on how TikTok uses music as a honeypot to hook innocent users

The MLC Reup

The MLC gets a five year review of how they are doing. The Copyright Office conducts that review of both the MLC and the DLC. The public (that’s you) gets a chance to weigh in now because the MLC and the DLC filed their respective written statements about their respective awesomeness.

One of the most interesting exchanges happened between Graham Davies (who essentially is the public face of the DLC and is the head of the Digital Media Association) and the NMPA. Graham offered muted criticism of the MLC which irked the easily irked NMPA who offered a rather scathing response.

The most interesting thing about the exchange was that it was the NMPA–not the MLC–that responded to Graham. Tell you anything?