Last week, like other European users of Meta's products, I received an email stating that Meta is "getting ready to expand our AI at Meta experiences to [our] region." In practice, this means that starting June 26, Meta will train its AI models on Facebook and Instagram data from every user who doesn't opt out. The change has driven a number of creators on Instagram to leave the platform in protest.
In the fierce AI competition between Microsoft, Google, and Meta, each company has a clear interest in using the vast amounts of user data on its platforms for AI training. Meta has started offering AI-generated ads in a push to generate revenue from its very expensive foray into generative AI. Meta's new policy is part of a broader shift away from web-scraped datasets and towards privately owned, web-scale datasets. In this post, I'll cover the motivations for this shift, mostly from a privacy and copyright perspective. Big disclaimer upfront: I'm a technical AI researcher, not a lawyer, so be sure to season my opinions appropriately.
Generative AI is intensely data-hungry. I've covered training data for large language models previously: enormous public datasets gathered through web scraping to feed these massive models. Text-to-image models like Midjourney or DALL-E are just as data-hungry as LLMs. Compared to the previous generation of image generation models, Generative Adversarial Networks (GANs), these newer diffusion-based models can scale to much larger datasets. I could do a whole post on diffusion, which is also used for things like weather forecasting; let me know in the comments if that would be of interest.
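For readers who want a concrete picture of what "diffusion" means in code, here is a minimal, unconditional sketch of one DDPM-style training step in PyTorch. The tiny network, noise schedule, and random "images" are toy placeholders of my own; real text-to-image systems use far larger text-conditioned networks and iterate a step like this over billions of image-text pairs, which is exactly why the data question matters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the much larger, text-conditioned networks real systems use.
class TinyDenoiser(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, noisy_images, t):
        # Broadcast the timestep as an extra channel (real models use learned embeddings).
        t_map = t.view(-1, 1, 1, 1).float().expand(-1, 1, *noisy_images.shape[2:])
        return self.net(torch.cat([noisy_images, t_map / 1000.0], dim=1))

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # how much signal survives at step t

model = TinyDenoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(images):
    """One DDPM-style step: noise each image at a random timestep, learn to predict the noise."""
    t = torch.randint(0, T, (images.shape[0],))
    noise = torch.randn_like(images)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a.sqrt() * images + (1.0 - a).sqrt() * noise  # forward (noising) process
    loss = F.mse_loss(model(noisy, t), noise)              # regress the added noise
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Example with a batch of random "images"; training repeats this over an enormous dataset.
print(training_step(torch.randn(8, 3, 32, 32)))
```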
As they did for LLMs, AI companies turned to web scraping to build large datasets of image-text pairs. A prominent example is the LAION-5B dataset, used to train Stable Diffusion. LAION-5B contains 5.85 billion image-text pairs sourced from Common Crawl, a non-profit organization that publishes downloadable snapshots of the public-facing web. However, scraping the whole public-facing web gives you a lot of things, including copyrighted images, private medical information, and depictions of child abuse. The uncurated nature of web-scraped datasets poses ethical and legal challenges, especially around data privacy and copyright.
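It helps to see what such a dataset actually contains. LAION distributes metadata rather than images: each row is an image URL, its alt text, and a few automated scores, and training pipelines fetch the images from their original hosts. Below is a rough sketch of inspecting and crudely filtering one metadata shard with pandas; the file name is made up, and the column names follow LAION's published schema as I understand it, so treat both as assumptions rather than a reference implementation.

```python
import pandas as pd

# Hypothetical local shard of LAION-style metadata; the real dataset spans
# thousands of parquet files like this one.
shard = pd.read_parquet("laion_shard_00000.parquet")

# Each row is just (image URL, alt text) plus automated scores: a CLIP image-text
# similarity and classifier estimates of NSFW and watermark likelihood.
print(shard[["URL", "TEXT", "similarity", "punsafe", "pwatermark"]].head())

# Curation is typically this blunt: keep pairs whose caption roughly matches the
# image and that automatic classifiers judge probably safe and unwatermarked.
keep = shard[
    (shard["similarity"] > 0.3)
    & (shard["punsafe"] < 0.1)
    & (shard["pwatermark"] < 0.5)
]
print(f"kept {len(keep)} of {len(shard)} pairs")
```

Nothing in that pipeline knows whether an image is copyrighted, private, or abusive; the filters only estimate caption relevance and a couple of content risks, which is the core of the curation problem.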
The international benchmark for data privacy law is Europe's General Data Protection Regulation (GDPR). Since it took effect in 2018, the GDPR has resulted in €4.5 billion in fines. One of its fundamental principles is user consent: users have to consent to how their personal data is used. If your personal data was scraped from a site where you gave consent and ended up in a dataset elsewhere without your knowledge, you never consented to that new handling of your data, which runs against the GDPR. LAION, which is based in Germany, has a takedown request form if you don't want your face or medical scans in their dataset. However, models trained on the dataset before an image was removed can still have that personal information encoded in the network's weights. My hope is that the European AI Act will clarify how such cases are handled, but there is a clear privacy problem in how this training data is collected and used.
GDPR compliance is the first argument for companies to ask users for AI training consent, and it has the clearest legal precedent. Under the GDPR, new forms of processing of personal data generally require a new legal basis, often fresh consent, and users must be able to object to the ways their personal data is processed. However, this only applies to personal data, such as names, dates of birth, and other identifying information. Some Instagram posts may well constitute personal data, but that's not what worries artists about the new policy.
The second big issue is copyright. This could also be the subject of a full post later, but briefly: whether AI can be trained on copyrighted data without the copyright holder's consent is being fiercely contested in court right now. There are over 20 ongoing lawsuits about training on copyrighted data, with plaintiffs including Getty Images, authors like George R. R. Martin, and the New York Times. Some of these cases are well into litigation, but there is no clear verdict yet, and different cases will probably produce conflicting rulings. How copyright law applies to AI training will take years to be fully worked out in court, but there is good reason to believe that generative AI training runs afoul of US copyright law.
However, if users give explicit consent for their data to be used for AI training on platforms like Instagram, it becomes much harder to argue that their copyright or data privacy has been violated. This marks a shift from the recent norm, in which newcomers like OpenAI and LAION built massive datasets by scraping the public web, including other companies' platforms like YouTube. The new direction instead has large companies that already hold vast user datasets using them for AI training. And it isn't just Meta: Reddit has deals with both Google and OpenAI to allow training on user data, X's (formerly Twitter's) language model Grok is trained on the platform's tweets, and there's suspicion that Google's Bard and Gemini products use Gmail as training data.
Web scraping for AI training was always ethically dubious, and it was often excused on the grounds that it was only for research, not for commercial products. In 2024, that argument is clearly false, and the shift towards platforms writing AI training permissions directly into their own policies isn't surprising. However, this new trend of blanket agreements that all user data can be used for AI training also concerns me.
First, this shift is happening with little true user consent. Users have reported that the opt-out form for Meta's new policy is incredibly frustrating to fill out, or even to find, and most weren't aware of the policy until the notification to European users last week. Reddit's decision to sell its user data is retroactive, meaning that the many years of posts and comments already stored on the site will be sold without those users' consent.
Second, this shift will degrade data transparency. The one advantage of public datasets like LAION is that they're public: their contents can be studied and their flaws revealed. Have I Been Trained, a website that lets anyone search these datasets, has been instrumental in demonstrating privacy and copyright violations in court cases. With private datasets built from Instagram posts or Reddit comments (not all of which are public-facing), that transparency becomes much harder to achieve.
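That transparency is not just abstract: with public metadata, anyone can check whether their own work was swept up. The sketch below, again assuming locally downloaded LAION-style parquet shards and an illustrative domain name of my invention, scans the URL column for images hosted on an artist's own site; services like Have I Been Trained do roughly this at scale, with image search layered on top. No equivalent audit is possible when the training set is a private pile of Instagram posts.

```python
import glob
import pandas as pd

MY_DOMAIN = "myportfolio.example"  # hypothetical artist's domain

hits = []
# Scan every locally downloaded metadata shard for URLs pointing at that domain.
for path in glob.glob("laion_metadata/*.parquet"):
    shard = pd.read_parquet(path, columns=["URL", "TEXT"])
    hits.append(shard[shard["URL"].str.contains(MY_DOMAIN, na=False)])

matches = pd.concat(hits) if hits else pd.DataFrame(columns=["URL", "TEXT"])
print(f"{len(matches)} images from {MY_DOMAIN} appear in this copy of the dataset")
```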
As this trend continues, expect most platforms to write explicit rights to train AI models on user data into their terms. Those who care about their privacy, or about the rights to the content they put online, are left with little recourse. I hope that Europeans will be mostly shielded by the GDPR and able to opt out on privacy grounds. Platforms like Cara are getting attention because they take a clear stance against AI training. Some artists are turning to tools like Glaze and Nightshade, which add imperceptible perturbations to images that are designed to disrupt AI training on them. Opting out, switching platforms, and post-processing your own images are all burdens that fall on users, though. Without such action, the default for the web, at least for the near future, seems to be serving as a feeding ground for hungry generative AI.