In an unprecedented move, the Japan Fair Trade Commission on Tuesday issued a cease-and-desist order against Google for violating the country's anti-monopoly law by forcing manufacturers to preinstall the company’s apps on their Android smartphones. https://t.co/ycsxX1s8tR
— Privacy Matters (@PrivacyMatters) April 16, 2025
Does “Publicly Available” AI Scraping Mean They Take Everything or Just Anything That’s Not Nailed Down?
Let’s be clear: It is not artificial intelligence as a technology that’s the existential threat. It’s the people who make the decisions about how to train and use artificial intelligence who are the existential threat. Just like nuclear power is not an existential threat–it’s the Tsar Bomba that measured 50 megatons on the bangometer that’s the existential threat.
If you think that the tech bros can be trusted not to use your data scraped from their various consumer products for their own training purposes, can you point to the five things they’ve done in the last 20 years that give you that confidence? Or even one thing?
Here’s an example. Back in the day when we were trying to build a library of audio fingerprints, we first had to rip millions of tracks in order to create the fingerprints. One employee who came to us from a company with a free email service said that there were millions of emails with audio file attachments just sitting there in users’ sent mail folders. Maybe we could just grab those audio files? Obviously that would be off limits for a host of reasons, but he didn’t see it. It’s not that he was an immoral person–immoral people recognize that there are some rules and they just want to break them. He was amoral–he didn’t see the rules and he didn’t think anything was wrong with his suggestion.
But the moral of the story–so to speak–is that I fully believe every consumer product is being scraped. That means that there’s a fairly good chance that Google, Microsoft, Meta/Facebook and probably other Big Tech players are using all of their consumer products to train AI. I would not bet against it.
If you think that’s crazy, I would suggest you think again. While these companies keep that kind of thing fairly quiet, it’s not the first time that the issue has come up–Big Tech telling you one thing, but using you to gain a benefit for something entirely different that you probably would never have agreed to had you known.
Take the Google Books saga. The whole point of Google’s effort at digitizing all the world’s books wasn’t because of some do-gooder desire to create the digital library of Alexandria or even the snippets that were the heart of the case. No–it was the “nondisplay uses” like training Google’s translation engine using “corpus machine translation”. The “corpus” of all the digitized books was the real value and of course was the main thing that Google wouldn’t share with the authors and didn’t want to discuss in the case.
Another random example would be “GOOG-411”. We can thank Marissa Mayer for spilling the beans on that one.
According to PC World back in 2010:
Google will close down 1-800-GOOG-411 next month, saying the free directory assistance service has served its purpose in helping the company develop other, more sophisticated voice-powered technologies.
GOOG-411, which will be unplugged on Nov. 12, was the search company’s first speech recognition service and led to the development of mobile services like Voice Search, Voice Input and Voice Actions.
Google, which recorded calls made to GOOG-411, has been candid all along about the motivations behind running the service, which provides phone numbers for businesses in the U.S. and Canada.
In 2007, Google Vice President of Search Products & User Experience Marissa Mayer said she was skeptical that free directory assistance could be a viable business, but that she had no doubt that GOOG-411 was key to the company’s efforts to build speech-to-text services.
GOOG-411 is a prime example of how Big Tech plays the thimblerig, especially the “has been candid all along about the motivations behind running the service.” Doesn’t that phrase just ooze corporate flak? That, as we say in the trade, is a freaking lie.

None of the GOOG-411 collateral ever said, “Hey idiot, come help us get even richer by using our dumbass ‘free’ directory assistance ‘service’.” Just like they’re not saying, “Hey idiot, use our ‘free’ products so we can train our AI to take your job.” That’s the thimblerig, but played at our expense.
This subterfuge has big consequences for people like lawyers. As I wrote in my 2014 piece in Texas Lawyer:
“A lawyer’s duty to maintain the confidentiality of privileged communications is axiomatic. Given Google’s scanning and data mining capabilities, can lawyers using Gmail comply with that duty without their clients’ informed consent? In addition to scanning the text, senders and recipients, Google’s patents for its Gmail applications claim very broad functionality to scan file attachments. (The main patent is available on Google’s site. A good discussion of these patents is in Jeff Gould’s article, “The Natural History of Gmail Data Mining”, available on Medium.)”
Google has made a science of enticing users into giving up their data for free so that Google can evolve even more products that may or may not be useful beyond the “free” part. Does the world really need another free email program? Maybe not, but Google does need a way to snarf down data for its artificial intelligence platforms–deceptively.
Fast forward ten years or so and here we are with the same problem–except it’s entirely possible that all of the Big Tech AI platforms are using their consumer products to train AI. Nothing has changed for lawyers, and some version of these rules would be prudent to follow for anyone with a duty of confidentiality like a doctor, accountant, stock broker or any of the many licensed professions. Not to mention social workers, priests, and the list goes on. If you call Big Tech on the deception, they will all say that they operate within their privacy policies, “de-identify” user data, only use “public” information, or offer other excuses.
I think the point of all this is that the platforms have far too many opportunities to cross-collateralize our data for the law to permit any confusion about what data they scrape.
What We Think We Know
Microsoft’s AI Training Practices
Microsoft has publicly stated that it does not use data from its Microsoft 365 products (e.g., Word, Excel, Outlook) to train its AI models. The company wants us to believe they rely on “de-identified” data from sources such as Bing searches, Copilot interactions, and “publicly available” information, whatever that means. Microsoft emphasizes its commitment to responsible AI practices, including removing metadata and anonymizing data to protect user privacy. See what I mean? Given Microsoft takes these precautions, that makes it all fine.
However, professionals using Microsoft’s tools must remain vigilant. While Microsoft claims not to use customer data from enterprise accounts for AI training, any inadvertent sharing of sensitive information through other Microsoft services (e.g., Bing or Copilot) could pose risks for users, particularly people with a duty of confidentiality like lawyers and doctors. And we haven’t even discussed child users yet.
Google’s AI Training Practices
For decades, Google has faced scrutiny for its data practices, particularly with products like Gmail, Google Docs, and Google Drive. Google’s updated privacy policy explicitly allows the use of “publicly available” information and user data for training its AI models, including Bard and Gemini. While Google claims to anonymize and de-identify data, concerns remain about the potential for sensitive information to be inadvertently included in training datasets.
For licensed professionals, these practices raise significant red flags. Google advises users not to input confidential or sensitive information into its AI-powered tools–typically Googley. The risk of human reviewers accessing “de-identified” data applies to everyone, but why in the world would you ever trust Google?
Does “Publicly Available” Mean Everything or Does it Mean Anything That’s Not Nailed Down?
These companies speak of “publicly available” data as if data that is publicly available is free to scrape and use for training. So what does that mean?
Based on the context and some poking around, it appears that there is no legally recognizable definition of what “publicly available” actually means. If you were going to draw a line between “publicly available” and the opposite, where would you draw it? You won’t be surprised to know that Big Tech will probably draw the line in an entirely different place than a normal person.
As far as I can tell, “publicly available” data would include data or content that is accessible by a data-scraping crawler or by the general public without a subscription, payment, or special access permissions. This likely includes web pages, posts on social media like baby pictures on Facebook or Instagram, or other platforms that do not restrict access to their content through paywalls, registration requirements, or other barriers like terms of service prohibiting data scraping, API restrictions, or a robots.txt file (which, like a lot of other people including Ed Newton-Rex, I’m skeptical even works).
While discussions of terms of service, notices prohibiting scraping and automated directions to crawlers sound good, in reality there’s no way to stop a determined crawler. Big Tech’s vulpine lust for data and cold hard cash is not realistically possible to stop at this point. That is why stopping the existential onslaught requires escalating punishment for these violations to a level that may seem extreme, or at least unusually harsh, right now.
Yet the massive and intentional copyright infringement, privacy violations, and who knows what else are so vast they are beyond civil penalties, particularly for a defendant that seemingly prints money.
Machines Don’t Let Machines Do Opt Outs: Why robots.txt won’t get it done for AI Opt Outs
[Following is based on an excerpt from the Artist Rights Institute’s submission to the UK Intellectual Property Office consultation on a UK AI legislative proposal]
The fundamental element of any rights reservation regime is knowing which work is being blocked by which rights owner. This will require creating a metadata identification regime for all works of authorship, a regime that has never existed and must be created from whole cloth. As the IPO is aware, metadata for songs is quite challenging as was demonstrated in the IPO’s UK Industry Agreement on Music Streaming Metadata Working Groups.
Using machine-readable formats for reservations sounds like it would be an easy fix, but it creates an enormous burden on the artist, i.e., the target of the data scraper, and is a major gift to the AI platform delivered by government. We can look to the experience with robots.txt for guidance.
Using a robots.txt file or similar “do not index” file puts far too big a bet on machines getting it right in the silence of the Internet. Big Tech has used this opt-out mantra for years in a somewhat successful attempt to fool lawmakers into believing that blocking is easy–if only there were a database, even a machine could do it. And yet massive numbers of webpages are still copied, and the pages that were copied for search (or the Internet Archive) are now being used to train AI.
It also must be said that a “disallow” signal is designed to work with file types or folders, not millions of song titles or sound recordings (see GEMA’s lawsuits against AI platforms). For example, this robots.txt code blocks unnamed crawlers from a “private-directory” folder, gives Googlebot its own rule allowing the whole site, and blocks Bingbot from indexing images. Note the trap: under the robots exclusion standard, a crawler that matches its own user-agent group ignores the wildcard group entirely, so as written Googlebot is actually free to crawl the “private” directory too:
User-agent: *
Disallow: /private-directory/
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Disallow: /images/
Theoretically, existing robots.txt files could be configured to block AI crawlers entirely by designating known crawlers as user-agents, such as OpenAI’s GPTBot. However, there are many known ways that robots.txt can fail to block web crawlers or AI data scrapers, including:
Non-compliant crawlers: Malicious or non-compliant crawlers might ignore the rules in a robots.txt file and continue to scrape a website despite the directives.
Incorrect syntax: Errors in a robots.txt file can lead to unintended results, such as not blocking the intended paths or blocking too many paths.
Server configuration: Issues with server configuration can prevent the robots.txt file from being correctly read or accessed by crawlers.
Dynamic content: Content generated dynamically through JavaScript or AJAX requests might not be blocked if robots.txt is not properly configured to account for these resources.
Unlisted crawlers: Crawlers or scrapers not known to the user may not adhere to the intended rules.
Cached content: Crawlers using cached versions of a site may bypass rules in a robots.txt file, particularly rules updated since the cache was created.
Subdomains and subdirectories: Rules scoped to the wrong subdomain or subdirectory can leave content unblocked that the site owner intended to block.
Missing works: Because the file operates on paths rather than works, entire lists of songs, recordings, or audiovisual works can simply be missed.
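To illustrate the user-agent point, a site trying to opt out of AI training with the tools available today would add groups like these to its robots.txt. The user-agent tokens below–GPTBot for OpenAI, Google-Extended for Google’s AI training, CCBot for Common Crawl–are tokens the crawlers have published as of this writing, but honoring them is entirely voluntary:
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
And that is the whole game: the “block” is a polite request that a crawler chooses to obey, or not.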
While robots.txt and similar techniques theoretically are useful tools for managing crawler access, they are not foolproof. Implementing additional security measures, such as IP blocking, CAPTCHA, rate limiting, and monitoring server logs, can help strengthen a site’s defenses against unwanted scraping. However, like the other tools that were supposed to level the playing field for artists against Big Tech, none of these tools are free, all of them require more programming knowledge than can reasonably be expected, all require maintenance, and at scale, all of them can be gamed or will eventually fail.
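For the technically curious, here is a short sketch (mine, not from the IPO submission) using Python’s standard-library robots.txt parser to show what a well-behaved crawler does before fetching a page. A scraper that wants the data simply never runs this check:

```python
# Sketch: how a *compliant* crawler consults robots.txt, using Python's
# standard-library parser. A non-compliant scraper skips this step
# entirely -- robots.txt is advisory, not an access control.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private-directory/

User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# An unnamed crawler falls under the "*" group: blocked from the
# private directory, allowed everywhere else.
print(parser.can_fetch("SomeCrawler", "/private-directory/contracts.html"))  # False
print(parser.can_fetch("SomeCrawler", "/lyrics/track01.html"))               # True

# GPTBot matches its own group and is blocked from the whole site --
# but only if it chooses to ask.
print(parser.can_fetch("GPTBot", "/lyrics/track01.html"))                    # False
```

The same call also shows the maintenance trap discussed below: rename the folder without updating the file and the check quietly starts answering “allowed.”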
It must be said that all of the headaches and expense of keeping Big Tech out is because Big Tech so desperately wants to get in.
The difference between blocking a search engine crawler and an AI data scraper (which could each be operated by the same company in the case of Meta, Bing or Google) is that failing to block a search engine crawler is inconvenient for artists, but failing to block an AI data scraper is catastrophic for artists.
Even if the crawlers worked seamlessly, if any of these folders changes names and the site admin forgets to update the robots.txt file, the block silently fails–and expecting that housekeeping from every website on the Internet is asking a lot.
It must also be said that using machine-readable blocking tools may result in pages being downranked, particularly by AI platforms closely associated with search engines. Robots.txt blocking already has problems with crawlers and downranking for several reasons. A robots.txt file itself doesn’t directly cause pages to be downranked in search results. However, it can indirectly affect rankings by limiting search engine crawlers’ access to certain parts of a website. Here’s how:
Restricted Crawling: If you block crawlers from accessing important pages using robots.txt, those pages won’t be indexed. Without indexing, they won’t appear in search results, let alone rank.
Crawl Budget Mismanagement: For large websites, search engines allocate a “crawl budget”—the number of pages they crawl in a given time. If robots.txt doesn’t guide crawlers efficiently, the budget can be wasted and important pages left unindexed.
No Content Evaluation: If a page is blocked by robots.txt but still linked elsewhere, search engines might index its URL without evaluating its content. This can result in poor rankings since the page’s relevance and quality can’t be assessed.
The TDM safe harbor is too valuable and potentially too dangerous to leave to machines.
TikTok Extended
Imagine if the original Napster had received TikTok-level attention from POTUS? Forget I said that. The ongoing divestment of TikTok from its parent company ByteDance has reached yet another critical point with yet another bandaid. Congress originally set a January 19, 2025 deadline for ByteDance to either sell TikTok’s U.S. operations or face a potential ban in the United States as part of the Protecting Americans from Foreign Adversary Controlled Applications Act or “PAFACA” (I guess “covfefe” was taken). The U.S. Supreme Court upheld that law in TikTok v. Garland.
When January 20 came around, President Trump gave ByteDance an extension to April 5, 2025 by executive order. When that deadline came, President Trump granted a further extension by another executive order, providing additional time for ByteDance to finalize a deal to divest. The extended deadline now pushes the timeline for divestment negotiations to July 1, 2025.
This new extension is designed to allow for further negotiation time among ByteDance, potential buyers, and regulatory authorities, while addressing the ongoing trade issues and concerns raised by both the U.S. and Chinese governments.
It’s getting mushy, but I’ll take a stab at the status of the divestment process. I might miss someone as they’re all getting into the act.
I would point out that all these bids anticipate a major overhaul in how TikTok operates which—just sayin’—means it likely would no longer be TikTok as its hundreds of millions of users now know it. I went down this path with Napster, and I would just say that it’s a very big deal to change a platform that has inherent legal issues into one that satisfies a standard that does not yet exist. I always used the rule of thumb that changing old Napster to new Napster (neither of which had anything to do with the service that eventually launched under the “Napster” brand, which bore no resemblance to the original Napster or its DNA) would result in an initial loss of 90% of the users. Just sayin’.
Offers and Terms
Multiple parties have expressed interest in acquiring TikTok’s U.S. operations, but the terms of these offers remain fluid due to ongoing negotiations and the complexity of the deal. Key bidders include:
ByteDance Investors: According to Reuters, the plan calls for “the biggest non-Chinese investors in parent company ByteDance to up their stakes and acquire the short video app’s U.S. operations.” This would involve Susquehanna International Group, General Atlantic, and KKR. ByteDance looks like it would retain a minority ownership position of less than 20%, which I would bet probably means 19.99999999% or something like that. Reuters describes this as the front-runner bid, and I tend to buy into that characterization. From a cap table point of view, this would be the cleanest with the least hocus pocus. However, the Reuters story is based on anonymous sources and doesn’t say how the deal would address the data privacy issues (other than that Oracle would continue to hold the data), or the algorithm. Remember, Oracle has been holding the data, and that evidently has been unsatisfactory to Congress, which is how we got here. Nothing against Oracle, but I suspect this significant wrinkle will have to get ironed out.
Lawsuit by Bidder Company Led by Former Myspace Executive: In a lawsuit in Florida federal court by TikTok Global LLC filed April 3, TikTok Global accuses ByteDance, TikTok Inc., and founder Yiming Zhang of sabotaging a $33 billion U.S. acquisition deal by engaging in fraud, antitrust violations, and breach of contract. The complaint alleges ByteDance misled regulators, misappropriated the “TikTok Global” brand, and conspired to maintain control of TikTok in violation of U.S. government directives. The suit brings six causes of action, including tortious interference and unjust enrichment, underscoring a complex clash over corporate deception and national security compliance.
Oracle and Walmart: This proposal, which nearly closed in 2024 (I guess), involved a sale of TikTok’s U.S. business to a consortium of U.S.-based companies, with Oracle managing data security and infrastructure. ByteDance was to retain a minority stake in the new entity. However, this deal has not closed–who knows why, aside from competition, those trade tariffs, and the need for approval from both U.S. and Chinese regulators, who have to be just so chummy right at the moment.
AppLovin: A preliminary bid has been submitted by AppLovin, an adtech company, to acquire TikTok’s U.S. operations. It appears that AppLovin’s offer includes managing TikTok’s user base and revenue model, with a focus on ad-driven strategies, although further negotiations are still required. According to Pitchbook, “AppLovin is a vertically integrated advertising technology company that acts as a demand-side platform for advertisers, a supply-side platform for publishers, and an exchange facilitating transactions between the two. About 80% of AppLovin’s revenue comes from the DSP, AppDiscovery, while the remainder comes from the SSP, Max, and gaming studios, which develop mobile games. AppLovin announced in February 2025 its plans to divest from the lower-margin gaming studios to focus exclusively on the ad tech platform.” It’s a public company trading as APP and seems to be worth about $100 billion. Call me crazy, but I’m a bit suspicious of a public company with “lovin” in its name. A bit groovy for the complexity of this negotiation, but you watch, they’ll get the deal.
Amazon and Blackstone: Amazon and Blackstone have also expressed interest in acquiring TikTok or a stake in a TikTok spinoff in Blackstone’s case. These offers would likely involve ByteDance retaining a minority interest in TikTok’s U.S. operations, though specifics of the terms remain unclear. Remember, Blackstone owns HFA through SESAC. So there’s that.
Frank McCourt/Project Liberty: The “People’s Bid” for TikTok is spearheaded by Project Liberty, founded by Frank McCourt. This initiative aims to acquire TikTok and change its platform to prioritize user privacy, data control, and digital empowerment. The consortium includes notable figures such as Tim Berners-Lee, Kevin O’Leary, and Jonathan Haidt, alongside technologists and academics like Lawrence Lessig. This one gives me the creeps as readers can imagine; anything with Lessig in it is DOA for me.
The bid proposes migrating TikTok to a new open-source protocol to address concerns raised by Congress while preserving its creative essence. As of now, the consortium has raised approximately $20 billion to support this ambitious vision. Again, these people act like you can just put hundreds of millions of users on hold while this changeover happens. I don’t think so, but I’m not as smart as these city fellers.
PRC’s Reaction
The People’s Republic of China (PRC) has strongly opposed the forced sale of TikTok’s U.S. operations, so there’s that. PRC officials argue that such a divestment would set a dangerous precedent, potentially harming Chinese tech companies’ international expansion. And they’re not wrong about that–it’s kind of the idea. Furthermore, the PRC’s position seems to be that any divestment agreement that involves the transfer of TikTok’s algorithm to a foreign entity requires Chinese regulatory approval. Which I suspect would be DOA.
They didn’t just make that up–the PRC, through the Cyberspace Administration of China (CAC), owns a “golden share” in ByteDance’s main Chinese subsidiary. This 1% stake, acquired in 2021, grants the PRC significant influence over ByteDance, including the ability to influence content and business strategies.
Unsurprisingly, ByteDance must ensure that the PRC government (i.e., the Chinese Communist Party) maintains control over TikTok’s core algorithm, a key asset for the company. PRC authorities have been clear that they will not approve any sale that results in ByteDance losing full control over TikTok’s proprietary technology, complicating the negotiations with prospective buyers.
So a pressing question is whether TikTok without the algorithm is really TikTok from the user’s experience. And then there’s that pesky issue of valuation–is TikTok with an unknown algo worth as much as TikTok with the proven, albeit awful, current algo?
Algorithm Lease Proposal
In an attempt to address both U.S. security concerns and the PRC’s objections, a novel solution has been proposed: leasing TikTok’s algorithm. Under this arrangement, ByteDance would retain ownership of the algorithm, while a U.S.-based company, most likely Oracle, would manage the operational side of TikTok’s U.S. business.
ByteDance would maintain control over its technology, while allowing a U.S. entity to oversee the platform’s operation within the U.S. The U.S. company would be responsible for ensuring compliance with U.S. data privacy laws and national security regulations, while ByteDance would continue to control its proprietary algorithm and intellectual property.
Under this leasing proposal, Oracle would be in charge of managing TikTok’s data security and ensuring that sensitive user data is handled according to U.S. regulations. This arrangement would allow ByteDance to retain its technological edge while addressing American security concerns regarding data privacy.
The primary concern is safeguarding user data rather than the algorithm itself. The proposal aims to address these concerns while avoiding the need for China’s approval of a full sale.
Now remember, the reason we are in this situation at all is that Chinese law requires TikTok to turn over on demand any data it gathers on TikTok users which I discussed on MTP back in 2020. The “National Intelligence Law” even requires TikTok to allow the PRC’s State Security police to take over the operation of TikTok for intelligence gathering purposes on any aspect of the users’ lives. And if you wonder what that really means to the CCP, I have a name for you: Jimmy Lai. You could ask that Hong Konger, but he’s in prison.
This leasing proposal has sparked debate because it doesn’t seem to truly remove ByteDance’s influence over TikTok (and therefore the PRC’s influence). It’s being compared to “Project Texas 2.0,” a previous plan to secure TikTok’s data and operations. I’m not sure how the leasing proposal solves this problem. Or said another way, if the idea is to get the PRC’s hands off of Americans’ user data, what the hell are we doing?
Next Steps
As the revised deadline approaches, I’d expect a few steps, each of which has its own steps within steps:
Finalization of a Deal: This is the biggest one–easy to say, nearly impossible to accomplish. ByteDance will likely continue negotiating with interested parties while they snarf down user data, working to secure an agreement that satisfies both U.S. regulatory requirements and Chinese legal constraints. The latest extension provides runway for both sides to close key issues that are closable, particularly concerning the algorithm lease and ByteDance’s continued role in the business.
Operational Contingency: I suppose at some point the buyer is going to be asked if whatever their proposal is will actually function and whether the fans will actually stick around to justify whatever the valuation is. One of the problems with rich people getting ego involved in a fight over something they think is valuable is that they project all kinds of ideas on it that show how smart they are, only to find that once they get the thing they can’t actually do what they thought they would do. By the time they figure out that it doesn’t work, they’ve moved on to the next episode in Short Attention Span Theater and it’s called Myspace.
China’s Approval: ByteDance will need to secure approval from PRC regulatory authorities for any deal involving the algorithm lease or a full divestment. So why introduce the complexity of the algo lease when you have to go through that step anyway? Without PRC approval, any sale or lease of TikTok’s technology is likely dead, or at best could face significant legal and diplomatic hurdles.
Legal Action: If an agreement is not reached by the new deadline of July 1, 2025, further legal action could be pursued, either by ByteDance to contest the divestment order or by the U.S. government to enforce a ban on TikTok’s operations. I doubt that President Trump is going to keep extending the deadline if there’s no significant progress.
If I were a betting man, I’d bet on the whole thing collapsing into a shut down and litigation, but watch this space.