Despite years as a crucial practice across many industries and its role in a number of publicly beneficial activities, web scraping still evokes, for many, images of big corporations stealing data. News of lawsuits aimed at curbing scraping, especially for AI training, perpetuates this perception. Consumers who learn about web data collection only in these contexts might view it as an always harmful — and perhaps even illegal — activity.
It’s easy to miss that web scraping can also mean the collection of public data as a shared resource to fuel many legitimate use cases. Even attempts to restrict AI data scraping are really about controlling this resource and the associated monetary gains, as exemplified by Cloudflare first blocking AI data scrapers, then buying an AI data marketplace to sell AI companies the data they can no longer collect.
As this is just a symptom of a larger trend toward restricting automated data access, we need to seriously consider what would happen if that trend ran its full course.
What Are the Benefits of Web Scraping?
Web scraping provides significant public benefits: it enables search engines to index the web, helps consumers find the best travel and retail deals, and fuels financial models. Beyond business, it is a vital tool for investigative journalists, NGOs tracking disinformation, and law enforcement agencies combating underground crimes like child exploitation.
Why Do People Hate Scraping?
Web scraping, like most ubiquitous technologies, has seen its fair share of abuse. There are, however, other factors in play that turn many people against the principles of open access even as they benefit from them.
Recent lawsuits involving AI are one such factor. The main legal concerns raised by large-scale web data collection fall into two categories: on one side, lawsuits lean heavily on privacy policies and laws; on the other, they cite data ownership issues and copyright law.
1. Data Privacy Concerns
Web scraping companies face legal action over the alleged collection of personal data. As an example, in 2024, the European Data Protection Board published a report stating that OpenAI’s automated data collection from public online sources could not guarantee that personal data had not been used to train ChatGPT.
2. Intellectual Property and Data Ownership
Here, creators, authors, publishers and other copyright owners file legal claims over the unauthorized use of copyrighted material for AI training and generative output, such as Getty’s claim against an AI image generator.
Despite OpenAI’s efforts to address privacy concerns and the fact that Getty Images largely lost its lawsuit, the negative sentiment generated by such cases is lasting.
This narrative is also fueled by corporate interests, which are not always aligned with consumers’ interests. For example, open access makes it possible for AI agents to help users find the best deals when shopping online or booking trips. Platforms, however, profit from steering consumers in a particular direction, such as by promoting certain products or selling ad space. For them, data collection, portrayed as inherently bad, becomes the scapegoat that justifies blocking unbiased third-party agents from their websites.
This example shows how the fight against automated access to public data prevents consumers from using solutions that are largely already available. Now, imagine this tendency to restrict access continues, also removing the tools and services we’ve been using for years to search for information online, find the best travel options or boost business decision-making with data.
A World Without Web Scrapers
The first thing you would notice if web scraping were universally banned is that Google and other search engines would disappear with it. Google is made possible by web crawlers and fetchers collectively known as Googlebot. These crawlers index websites and enable the kind of search that brings relevant, up-to-date results to the top of the page almost instantly.
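For the technically curious, here is a minimal sketch of the crawl-and-index loop behind that kind of search, assuming Python with the requests and BeautifulSoup libraries. The seed URL, page cap and user-agent string are illustrative placeholders, and this toy is nothing like how Googlebot actually operates at scale.

```python
# Minimal breadth-first crawler sketch: fetch pages, extract text and links,
# and build a toy inverted index (word -> set of URLs), the core data
# structure behind keyword search.
from collections import deque, defaultdict
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_url: str, max_pages: int = 20) -> dict[str, set[str]]:
    index: dict[str, set[str]] = defaultdict(set)
    seen = {seed_url}
    queue = deque([seed_url])

    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10,
                                headers={"User-Agent": "toy-crawler-demo"})
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip unreachable or erroring pages

        soup = BeautifulSoup(resp.text, "html.parser")

        # Index every word on the page against this URL.
        for word in soup.get_text().lower().split():
            index[word].add(url)

        # Follow same-site links, breadth first.
        for link in soup.find_all("a", href=True):
            nxt = urljoin(url, link["href"])
            if urlparse(nxt).netloc == urlparse(seed_url).netloc and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return index

# index = crawl("https://example.com")    # hypothetical seed URL
# print(sorted(index.get("domain", [])))  # pages mentioning "domain"
```

Real search engines add politeness (robots.txt, rate limits), ranking and deduplication on top, but the fetch-parse-index cycle is the same: without automated access, there is no index, and without an index, there is no search.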
Even if search engines were granted an exception for convenience’s sake, industries that depend on them would still go under without automated web data access. Search engine optimization (SEO) depends on scraping data from search engine results pages. Similarly, its evolution for the AI era, generative engine optimization (GEO), depends on scraping AI search outputs.
Additionally, aside from not having AI agents to book your trips, you’ll also have a tough time doing it yourself. Booking services that show you a long list of flight deals, which you can filter and arrange by price, time or other parameters, can’t operate without web scraping. Nor can similar hospitality platforms. Manually collecting and continuously updating data from multiple airlines and hospitality sites in real time would be unfeasible and unprofitable at scale.
Not only finding, but also offering better deals is hard without automated public data gathering. E-commerce platforms collect pricing and product data from one another in a constant race to be the one offering customers the best value.
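To make the mechanics concrete, here is a minimal sketch of the price aggregation such comparison services perform, again assuming Python with requests and BeautifulSoup. The retailer URLs and CSS selectors are hypothetical placeholders, since every site’s markup differs.

```python
# Toy price-comparison sketch: scrape a product's price from several
# (hypothetical) retailer pages, then sort the offers cheapest-first,
# the way booking and deal-comparison services do at far larger scale.
import re

import requests
from bs4 import BeautifulSoup

# Hypothetical retailer pages and the CSS selector where each shows a price.
SOURCES = {
    "shop-a": ("https://shop-a.example/product/123", ".price"),
    "shop-b": ("https://shop-b.example/item/123", "#product-price"),
}

def fetch_price(url: str, selector: str) -> float | None:
    """Fetch one page and pull the first number out of its price element."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return None  # treat unreachable shops as having no offer
    tag = BeautifulSoup(resp.text, "html.parser").select_one(selector)
    if tag is None:
        return None
    match = re.search(r"\d+(?:\.\d+)?", tag.get_text())
    return float(match.group()) if match else None

def best_offers() -> list[tuple[str, float]]:
    offers = {shop: fetch_price(url, sel) for shop, (url, sel) in SOURCES.items()}
    # Drop shops without a readable price, then sort cheapest first.
    return sorted(((s, p) for s, p in offers.items() if p is not None),
                  key=lambda pair: pair[1])

# for shop, price in best_offers():
#     print(f"{shop}: ${price:.2f}")
```

Multiply this by thousands of products, dozens of sources and continuous refresh cycles, and it becomes clear why doing the same job by hand is not an option.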
Thus, if you have invested in booking services or online retail, you might want to call your financial advisor once web scraping is gone. Just be cautious about their advice: right now, they can base their guidance on models fed large volumes of data scraped from financial statements, social media sentiment and other sources. Without scraping, their advice might be little more than a glorified gut feeling.
A Blow to the Common Good
Of course, since it would hurt businesses, web scraping probably won’t be banned completely. Even increased restrictions on public data access are very damaging, however. And it’s not the business sector that would hurt the most.
Organizations and agencies combating underground crime, such as the distribution of child sexual abuse material, also rely on web crawling and scraping. They use it to track such content and compile the data into cases against perpetrators. Doing this work manually would be extremely inefficient, and the result would be fewer cases built and more dangerous people still free to do harm.
Similarly, investigative journalists use web scraping to uncover broader illegal activities. For example, last year the Global Investigative Journalism Network cited web scraping as a vital method for uncovering gender-based violence in central Eswatini.
Public web data collection allows journalists to improve on official government crime statistics, generating a more accurate and independent record. That’s why journalists spoke out in defense of data scraping when regulations began to curtail their work.
Tracking cyberbullying, online calls for violence and disinformation would also be much harder without automated web data collection. NGOs like the Center for Countering Digital Hate use this technology to expose such behavior on social media platforms. NGOs also use it as a cost-effective way to optimize fundraising campaigns, helping ensure that the donations they gather do the most good.
Scraping Makes the Web Run
Scientists, journalists, law enforcement officers and other professional communities are well aware of the benefits of web intelligence. Thus, they are often the first to oppose restrictions on large-scale data collection.
Of course, this does not mean that all attempts to regulate web data collection are misguided. Productive regulation, however, is only possible when all stakeholders, including regulators and the general public, understand what open access and public data truly mean. Clearly communicating the benefits of web scraping to society should be an important part of the business agenda as we navigate decisive years for AI and big data.
