At the beginning of July, Elon Musk announced that Twitter would limit how many posts per day an account can view depending on whether it is verified. Unverified accounts could view only 1,000 posts per day. He justified the decision as a measure against what he said was excessive data scraping, and although he was vague about who exactly was at fault, he blamed artificial intelligence (AI) companies for the problems Twitter faced.
On the one hand, if data scraping activities really hinder Twitter users’ experience to the point that they can’t use the platform properly, then the company can and should decide what measures it wants to take to tackle the problem. Additionally, Musk is right to be concerned about data scraping on social media in general. These platforms contain a lot of personal information, which must be properly protected.
We should ask, however, whether rate limiting really is the best solution to unethical scraping. A blunt instrument invites side effects: although Twitter's new restrictions might impede the data collection practices of profit-seeking businesses, they will likewise hamper organizations and individuals that pursue more noble goals.
Twitter’s Rate Limits and Data Scraping
Elon Musk recently announced that non-verified accounts would be limited to viewing 1,000 posts per day. The policy allegedly aims to combat excessive data scraping that is adversely affecting user experience on the platform. In practice, however, it is unlikely to deter large, for-profit entities while having far more adverse effects on ethical scrapers and individuals.
The Ethical Uses of Scraping
Social media platforms have been a vital source of data for legitimate and ethical web scraping use cases, such as investigative journalism and scientific research. Twitter data has powered valuable academic studies: enriching medical data for cancer treatment research, monitoring depression trends during the pandemic, and broader public health research.
Investigative journalists have gathered and analyzed social media data for socially important causes, from monitoring radical hate groups and illegal gun sales to tracking the activity of lobbyists. Reuters scraped thousands of posts from social media and online forums to uncover systemic child abuse. In such cases, data scraping served as a necessary tool for the functioning of democratic institutions and the rule of law.
Twitter's restriction for unverified accounts means that, in order to collect aggregated data from the platform at scale, one will have to meet verification requirements. This process entails revealing important personal data, such as name, phone number, and profile photo — a troubling measure for investigative journalists due to obvious safety and impartiality implications. At the same time, the decision to raise the price for accessing Twitter’s public API (which can now cost over $40,000 per month for an enterprise account) has already complicated a number of important scientific projects, forcing scientists and even disaster relief groups to pay for vital data.
Why Rate Limiting Isn’t the Answer
Limiting how much content a reader can access is just another technological measure for curbing bot activity, similar to IP blocks and CAPTCHAs. From a legal perspective, however, this decision changes little. In fact, large, for-profit companies that want web-scraped data will probably circumvent the limit with ease, since they are used to bypassing such obstacles.
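To make the mechanism concrete, here is a minimal sketch of how a per-account daily view cap could work. This is a hypothetical illustration (a simple fixed 24-hour window keyed by account), not Twitter's actual implementation, and the class and parameter names are invented for the example:

```python
import time
from collections import defaultdict

class DailyViewLimiter:
    """Hypothetical per-account view cap: each account may register at
    most `limit` views within a fixed `window` of seconds (24 h here).
    A sketch only -- real platforms use more elaborate schemes."""

    def __init__(self, limit=1000, window=86400, clock=time.time):
        self.limit = limit
        self.window = window
        self.clock = clock          # injectable for testing
        self.counts = defaultdict(int)              # account -> views so far
        self.window_start = defaultdict(lambda: None)  # account -> window start time

    def allow_view(self, account):
        now = self.clock()
        start = self.window_start[account]
        # Start a fresh window on first sight or after the old one expires.
        if start is None or now - start >= self.window:
            self.window_start[account] = now
            self.counts[account] = 0
        if self.counts[account] >= self.limit:
            return False            # over the cap: deny the view
        self.counts[account] += 1
        return True
```

The point of the sketch is how little it distinguishes between readers: a journalist and a scraping bot hit the same counter, which is exactly the blunt-instrument problem discussed above.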
We should note that ethical web scraping operations adhere to a different set of technical standards, investing reasonable effort not to hinder the performance of the sites they target. They also avoid deliberately collecting personal data or data behind logins. The Ethical Web Data Collection Initiative, an industry group that unites major web scraping players, actively promotes ethical data collection guidelines, common standards, and online safety.
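Two of those technical standards can be sketched in a few lines: honoring a site's robots.txt rules and spacing requests out so the target server is never overloaded. The client class, the delay value, and the robots.txt content below are all made-up illustrations, not any particular organization's guidelines:

```python
import time
from urllib.robotparser import RobotFileParser

# Invented robots.txt for the example; a real scraper would fetch the
# target site's own /robots.txt before crawling.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

class PoliteClient:
    """Sketch of an ethical scraping client: checks robots.txt before
    each request and enforces a minimum delay between requests."""

    def __init__(self, robots_txt, min_delay=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.rules = RobotFileParser()
        self.rules.parse(robots_txt.splitlines())
        self.min_delay = min_delay
        self.clock = clock          # injectable for testing
        self.sleep = sleep
        self.last_request = None

    def may_fetch(self, url):
        # Respect the site's Disallow rules for generic crawlers ("*").
        return self.rules.can_fetch("*", url)

    def throttle(self):
        # Wait until at least min_delay seconds have passed since the
        # previous request, so we never hammer the target server.
        now = self.clock()
        if self.last_request is not None:
            wait = self.min_delay - (now - self.last_request)
            if wait > 0:
                self.sleep(wait)
        self.last_request = self.clock()
```

Injecting the clock and sleep functions keeps the politeness logic testable; the same pattern extends naturally to per-domain delays or honoring a Crawl-delay directive.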
What Is the Future of Social Media Data Collection?
To sum up, these new Twitter restrictions may well become a stumbling block for the platform's normal users and for legitimate web intelligence use cases rather than an effective measure against bots. Nevertheless, I suspect we are on the verge of seismic changes, since the growth of the AI industry has spawned a number of high-profile legal battles over data collection: Microsoft, Midjourney, OpenAI, and Google have all been hit by lawsuits over scraping data for AI training. Earlier, Meta sued Bright Data for scraping social media, and Bright Data countersued in turn. It will be interesting to see what new rules, standards, and legal precedents evolve from these lawsuits.