The hype over artificial intelligence is starting to fade. One sobering realization was that training large language models (LLMs) heightens concerns about personal data protection. Those concerns had already been growing with the increasing use of data in business, which led to regulations such as the EU’s GDPR and California’s CCPA.

Freely using public web data to create AI-based products and services has also raised pressing copyright and data ownership issues, resulting in a number of lawsuits filed in 2023. These include a copyright infringement case against OpenAI for using 300,000 books to train its model, a lawsuit against Stability AI for using images and metadata owned by Getty, and a case against GitHub’s Copilot for republishing code, among others.

3 Legal Questions Facing AI in 2024

  1. Who owns the data that AI relies on for training?
  2. How should data creators be compensated for the use of their data?
  3. What impact will AI law in the EU and U.S. have on AI development?  

Some of the questions raised by those cases might be answered in courts as early as 2024. Lawsuits against AI firms and a web data collection provider should give us a clearer view of those areas.

The changes in regulation that such case law and public scrutiny will bring are bound to have an effect on how AI and automated data gathering develop in the future. Some trends and directions are already shaping up.


What Are the Main Legal and Ethical Questions Facing AI? 

In terms of web data collection, ongoing legal battles bring out all the main issues at hand. The first is the question of who owns the data. Social media sites will argue that because users post their information on these platforms, it’s in the users’ interest to have it protected from being collected and sold. Scraping services might answer that data made public is fair game for everyone, and that it’s not up to social media giants to restrict access to it. Rather, it is for the users themselves to decide whether they allow their data to be gathered and for what purposes.

Importantly, the previous argument is limited to data that is publicly available and accessible without logging into an account, which brings us to the second point. Creating an account and logging in means agreeing to the site’s terms and conditions, which usually ban bot activity and scraping. Thus, social media companies will argue that scrapers breached the terms they accepted, which has been among the main arguments keeping scraping at bay.

Once again, there is a question of whether social media platforms are justified in limiting access to the data in this way since they don’t own the data created by users either. Here, user opinions might differ on whether social media sites are protecting their data or appropriating it by limiting public access to it.

Finally, there is a special concern regarding the data of minors. A proposed class action lawsuit in Israel, where selling the data of minors is explicitly prohibited, accuses a data provider of doing just that.

Since the training of LLMs and AI development are dependent on scraped data, AI faces the same legal issues. It’s all about data privacy, ownership and how, if at all, data creators should be compensated for their products being used to train machine learning algorithms that go on to produce something else.



Will AI Development Halt in 2024?

Earlier this year, there was a call from 1,000 tech leaders to pause AI development for at least six months until we can better understand it. What motivated the open letter was the lack of regulation and multiple unknowns related to how AI works and develops. Now, it might seem like their plea is going to be answered to some extent due to the aforementioned legal and ethical concerns. However, it’s highly unlikely that regulation will seriously halt all AI development in the near future.

Although legal battles might temporarily delay the evolution of generative AI and tools based on machine learning, in the long run, clear regulation should give direction to the field. This can boost focused progress.

Additionally, there are other emerging techniques in the field that might offer much-needed technological breakthroughs. These include federated machine learning and causal AI, which might prove to be a step forward from only superficially intelligent generative systems. 

Federated learning is a framework that allows training machine learning algorithms without direct access to users’ personal data. Training runs where the data lives, and only model updates are shared, which addresses the pressing issues of data privacy and data silos.
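As a rough illustration of that idea, here is a minimal sketch of federated averaging in plain Python. It is a toy under stated assumptions, not a production setup: the one-parameter linear model, the function names (`local_update`, `federated_average`, `train`) and the two client datasets are all hypothetical and chosen only to keep the example small. The point it shows is that each client trains on its own data and the server only ever sees model weights and sample counts.

```python
from typing import List, Tuple

# (x, y) pairs that stay on one client's device; hypothetical toy data type
Dataset = List[Tuple[float, float]]


def local_update(w: float, data: Dataset, lr: float = 0.01, epochs: int = 5) -> float:
    """Run a few gradient-descent steps for the model y = w * x on private data."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(updates: List[Tuple[float, int]]) -> float:
    """Server step: average client weights, weighted by each client's sample count."""
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


def train(clients: List[Dataset], rounds: int = 50) -> float:
    """Alternate local training and server-side averaging for a number of rounds."""
    w = 0.0
    for _ in range(rounds):
        # Only (weight, sample count) pairs leave a client, never the raw data
        updates = [(local_update(w, data), len(data)) for data in clients]
        w = federated_average(updates)
    return w


# Two hypothetical clients whose private data both follow y = 3x
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0), (5.0, 15.0)],
]
global_w = train(clients)
print(global_w)
```

In this toy setup the shared weight moves toward 3, the slope both clients’ data follow, even though neither dataset is ever sent to the server. Real deployments typically add further protections on top, since model updates themselves can leak information.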

Causal AI, on the other hand, offers hope of solving the problem of predictive models hallucinating strange outcomes because they fail to grasp causal relations. Generative AI falls short of objectivity and accuracy because it equates correlation with causation. Causal AI functions more like the human mind, asking “what if” questions and examining the possible relationships between cause and effect. This could make AI more reliable.
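The gap between correlation and causation can be made concrete with a small simulation. The sketch below is an illustrative toy, not a causal AI system: the variables, the hidden confounder `z`, and the helper names `slope` and `simulate` are all assumptions made up for this example. In it, `y` never depends on `x`, yet the two look strongly related in observed data because both are driven by `z`. Asking the “what if” question, by setting `x` independently of `z`, makes the apparent relationship disappear.

```python
import random


def slope(xs, ys):
    """Least-squares slope of y on x (covariance divided by variance)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


def simulate(n, intervene, rng):
    """Generate n samples; if intervene is True, x is set independently of z."""
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)  # hidden confounder
        # Observed world: x follows z. Intervened world: x is assigned at random.
        x = rng.gauss(0, 1) if intervene else z + rng.gauss(0, 0.5)
        y = 2 * z + rng.gauss(0, 0.5)  # y depends on z only, never on x
        xs.append(x)
        ys.append(y)
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is repeatable
observed_slope = slope(*simulate(5000, False, rng))
intervened_slope = slope(*simulate(5000, True, rng))
print(observed_slope, intervened_slope)
```

A model that only sees the observed data would conclude that changing `x` changes `y`. The intervened run, where the slope is close to zero, shows that it does not.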

Greater reliability raises the hope of deploying such models across a broad range of use cases, for example, analyzing health data to assist in diagnosis and prevention. And it shows that AI still has a lot of room for development, even as we wait for broader regulation.


Is General Web Scraping Regulation Coming?

When faced with the challenges of personal data protection in a highly digitalized society, the EU introduced GDPR as a general framework. After working on AI legislation since 2021 and being blindsided by generative AI models in 2022, the EU appears to have agreed on the foundational rules of the first comprehensive AI law as 2023 closes.

The law will distinguish between different risk levels of AI models, banning those that are considered unacceptable and requiring transparency from tools like ChatGPT. In the U.S., President Joe Biden released an executive order outlining new standards for AI safety and innovation that could set the policy direction for federal agencies on AI.

Does this pave the way for a general web scraping regulation as well? It’s unlikely that we will see it any time soon, much less in 2024. Web scraping has many crucial applications, ranging from making Google and other search engines possible to driving investigative journalism and research across various fields, including AI.

It’s more likely that we will see more laws targeting specific illegitimate applications of web scraping infrastructure. For example, there are calls to expand the BOTS Act, which bans the use of automated tools to buy up event tickets.

Additionally, we might see broader adoption of tools and regulations like California’s Delete Act, which takes effect at the start of the new year. The act requires regulators to create a tool that lets consumers issue a single request to stop all data brokers registered in the state from storing and collecting their data.



How Will These Lawsuits Impact AI in 2024?

Some of the regulatory changes mentioned above may take effect within months, while lawsuits might take years to conclude and establish binding rules, and even that outcome isn’t a given. In most cases, we will have to wait and see what exact changes they bring to the data gathering industry, the AI that depends on it, and the users of these services.

What we can expect is that case law and clearly defined rules will weed out the abusers of web scraping or force them to change. While this might cause internal turmoil for some service providers, in the long run it will hopefully stabilize both the AI and data collection industries.
