Both Reddit and Twitter made policy changes over the last couple of months to restrict API access, purportedly due to large increases in LLM training traffic.

Centralized Social Media

I was just refreshing myself on the Reddit situation timeline on Wikipedia. Reuters reported February 14 that Reddit was looking to IPO in the second half of 2023. April 18th brought the announcement that Reddit would be charging for API access after a 60-day notice period (Reddit, The Verge). Mods staged a unified protest, shutting down several of the top subreddits from June 12-14 by making them private. Some subreddits stayed private after the 48-hour “blackout.” Understandably, Reddit was going to do whatever it could to get back to business-as-usual and keep moving toward IPO. Two days later on June 16th, mods got a message from Reddit basically saying “Open up or get replaced.” This reminds me of the Israelite captivity by Nebuchadrezzar of Babylon and how he set up Zedekiah as a puppet king. (2nd Kings 24:17) By the way, that didn’t turn out well for Zedekiah later when he revolted. (2nd Kings 25)

Based on the investigation by Gergely Orosz of The Pragmatic Engineer, it seems like there was a connection between Twitter’s rate limits and cost cutting on GCP ML services. Evidently those services are needed to provide the level of service people have expected from Twitter over the years.

Instagram has seized the opportunity to launch its Threads product. It has not imposed any limits, probably to draw Reddit and Twitter users toward its platform. One analysis I was listening to (pretty sure from the All-In podcast) noted that Facebook and Instagram are both now image- and video-heavy platforms with one-way posting. Interactive discussion, like on Reddit and Twitter, provides much better LLM training material. The Threads app’s data collection policy, pointed out by Jack Dorsey, highlights this.

The [Future?] Fediverse

The Fediverse promises a different model of data ownership. Content creators have more control over their data and can take it with them when switching hosts or servers. Mastodon (Twitter replacement), Lemmy (Reddit replacement), and Bluesky with its AT protocol are all apps in this category. It is too early to tell which ones will get long-term, mainstream adoption. Mastodon.help counts over 11,000 instances currently running.

Let’s look at various options for gathering LLM training data from a decentralized system.

Scrape the Fediverse servers

In this option, extracting the conversations requires connecting to each server. With over 11,000 Mastodon servers, that sounds like it would take quite a bit of coordination. In reality, people would probably track the number of active users and the quality of interactions and just pull from the top 5% of servers. There is probably a long tail of experimental servers that never take off. Even with a small subset of total servers, maintaining the data pipelines seems like a yak shave that takes effort away from using the data for model training. It also takes more time and infrastructure to pull historical data, since capturing only from the current point forward is probably not enough for training needs.
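To make the scraping option concrete, here is a minimal sketch of pulling public posts from a shortlist of large Mastodon instances. The instance list, page counts, and polling approach are assumptions; the `/api/v1/timelines/public` endpoint and `max_id` paging are part of Mastodon's documented REST API.

```python
# Sketch: page backwards through the public timelines of a few "top" instances.
import requests

INSTANCES = ["mastodon.social", "fosstodon.org"]  # hypothetical shortlist of large servers

def fetch_public_statuses(instance: str, pages: int = 3, limit: int = 40) -> list[dict]:
    """Walk an instance's public (local) timeline backwards using max_id paging."""
    statuses, max_id = [], None
    for _ in range(pages):
        params = {"limit": limit, "local": "true"}
        if max_id:
            params["max_id"] = max_id
        resp = requests.get(
            f"https://{instance}/api/v1/timelines/public", params=params, timeout=10
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        statuses.extend(batch)
        max_id = batch[-1]["id"]  # oldest status in this batch; continue from there
    return statuses

if __name__ == "__main__":
    for instance in INSTANCES:
        posts = fetch_public_statuses(instance)
        print(instance, len(posts))
```

Even this toy version hints at the pipeline burden: per-instance rate limits, retries, deduplication, and backfill of history all have to be handled separately for every server you add.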

Badly secured servers could have the majority of traffic coming from bots instead of real users. What does it take to stop this? Anonymous access is easy to turn off, but the question is how much ML it takes to effectively stop data scraping by crawler accounts. Static validations like captchas or mobile device attestation via Private Access Tokens can prove there is a human user at a point in time. Analytics like access pattern outlier detection can provide ongoing behavior monitoring in case a user signs up manually on a real device and then sells the account to a bot network. This could become a big cat-and-mouse game for people running servers.
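As a toy illustration of the “access pattern outlier detection” idea, the sketch below flags accounts whose hourly request volume sits far outside the rest of the population, using a robust median-based score. The feature choice, threshold, and example numbers are all illustrative assumptions, not a production bot-detection model.

```python
# Sketch: flag accounts whose request volume is a robust (MAD-based) outlier.
from statistics import median

def flag_outliers(requests_per_account: dict[str, int], threshold: float = 3.5) -> list[str]:
    """Return accounts whose modified z-score exceeds the threshold."""
    counts = list(requests_per_account.values())
    med = median(counts)
    mad = median(abs(c - med) for c in counts)  # median absolute deviation
    if mad == 0:
        return []
    return [
        account
        for account, count in requests_per_account.items()
        if 0.6745 * (count - med) / mad > threshold
    ]

# Most users make a handful of requests per hour; a crawler account stands out.
hourly = {"alice": 12, "bob": 30, "carol": 25, "crawler-7": 4800}
print(flag_outliers(hourly))  # ['crawler-7']
```

A real deployment would look at richer behavior (read/write ratios, timing regularity, coverage of the whole server), which is exactly where the cat-and-mouse dynamic comes from.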

Honestly, this route sounds like a lose-lose for the people running the infrastructure and the people trying to train the models. All hosting parties have to worry about bot management, and model trainers have to worry about data pipelines.

Add a node on the Fediverse to capture events

Organizations that want to train models could hook into the system and get streaming updates. At first glance, this doesn’t look very promising. The ActivityPub Follow activity seems to be Actor-specific and would require scraper accounts to follow everyone from whom they wanted to pull data. Mastodon has a “local timeline” that includes all posts on that server, but it can’t be subscribed to from another server. This is different from blockchain, where every node on the chain has a full history. Social media has too much data streaming in to make that model tenable. It ends up looking like option #1, where you must visit all the servers and manage those pipelines.
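For reference, this is roughly what the Actor-specific Follow activity looks like, and why a scraper would need one per account it wants to capture. The actor and object URLs below are hypothetical, and this is only the payload: a real Mastodon server would also require the POST to the target actor's inbox to carry an HTTP Signature.

```python
# Sketch: a minimal ActivityPub Follow activity, one per followed actor.
import json

def build_follow_activity(scraper_actor: str, target_actor: str) -> dict:
    return {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Follow",
        "actor": scraper_actor,   # the scraper account's own actor URL
        "object": target_actor,   # the single actor this Follow covers
    }

activity = build_follow_activity(
    "https://scraper.example/users/collector",   # hypothetical scraper actor
    "https://mastodon.example/users/alice",      # hypothetical target actor
)
print(json.dumps(activity, indent=2))
```

There is no “Follow the whole server” activity here, which is why capturing everything means issuing and maintaining a Follow for every actor on every server of interest.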

Rely on a company to give you easily queryable, analytics-ready extracts

If the Fediverse takes off, there will probably be companies that aggregate and sell access to a corpus for model training. This is similar to gateway products like Cloudflare Web3 for Ethereum and IPFS, which provide live API access to blockchain-based systems. The model training providers would be slightly different in that they would serve optimized, compacted historical (up to the present) data sets. Two methods by which this data could be surfaced are file-based access and live query capability.

File-based sharing is the method taken by Snowflake Marketplace and AWS Data Exchange. Sharing only the data, without managing consumer compute, vastly simplifies the provider’s business. In all, it is probably more cost-efficient to have one consumer compute layer than two compute layers: a query engine on the provider side and processing on the consumer side. Spark and other analytics tools are accustomed to dealing with the S3 API. Spark SQL uses Catalogs to configure connections for querying Iceberg tables natively, and Iceberg tables seem to be a popular abstraction for file-based data lakes when not using a more managed data layer like Snowflake. File-based sharing could also be done on one’s own without being part of a data exchange, but onboarding customers could take more effort than following an already established pattern with one of the major data exchanges.
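Here is a minimal PySpark sketch of what the consumer side of file-based sharing could look like: reading a provider's Iceberg tables straight from shared object storage. The catalog name, warehouse path, and table name are assumptions, and it assumes the Iceberg Spark runtime jar is on the classpath; the `spark.sql.catalog.*` settings are Iceberg's documented Spark integration.

```python
# Sketch: query a provider's shared Iceberg tables directly from object storage.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("fediverse-corpus-consumer")
    # Iceberg SQL extensions plus a Hadoop-style catalog pointed at the
    # provider's shared S3 prefix (hypothetical path).
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.corpus", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.corpus.type", "hadoop")
    .config("spark.sql.catalog.corpus.warehouse", "s3a://provider-bucket/warehouse")
    .getOrCreate()
)

# The shared table reads like any other Spark SQL source; names are hypothetical.
posts = spark.sql("""
    SELECT server, author, content, created_at
    FROM corpus.social.posts
    WHERE created_at >= '2023-01-01'
""")
posts.show(5)
```

The appeal of this model is that the provider only publishes files and table metadata; all of the compute above runs on the consumer's cluster.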

A query engine approach would probably look similar to Amazon Athena, where customers pay for managed query compute per GB scanned. This requires more multi-tenant infrastructure management on the part of the data provider. It can take a lot of engineering to cover the security and “noisy neighbor” issues when trying to run multi-tenant query infrastructure. The decision probably comes down to how much flexibility consumers want when consuming the data. File-level access will be cheaper, but a query engine can give consumers a more targeted data set.
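For contrast with the file-based sketch, this is roughly what the pay-per-GB-scanned model looks like from the consumer side, using Athena's API as the reference point. The database, table, and results bucket names are hypothetical; `start_query_execution` and `get_query_execution` are the real boto3 Athena calls.

```python
# Sketch: submit a query to a managed, pay-per-scan engine and wait for it to finish.
import time
import boto3

athena = boto3.client("athena")

def run_query(sql: str) -> str:
    """Submit a query and poll until it reaches a terminal state."""
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "fediverse_corpus"},            # hypothetical
        ResultConfiguration={"OutputLocation": "s3://my-results-bucket/athena/"},
    )
    query_id = execution["QueryExecutionId"]
    while True:
        response = athena.get_query_execution(QueryExecutionId=query_id)
        state = response["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id
        time.sleep(2)

# Consumers pay for bytes scanned, so narrow predicates keep costs down.
query_id = run_query(
    "SELECT author, content FROM posts WHERE created_at >= date '2023-06-01'"
)
print(athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"][:5])
```

The consumer experience is simpler than running Spark, but every one of those queries lands on infrastructure the provider has to secure and capacity-manage across tenants.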