As more companies incorporate large language models (LLMs) into their daily processes, the gains in workflow efficiency come with a potential pitfall: could their intellectual property be absorbed into LLMs without the proprietary work being attributed back to them?
4 Disadvantages of LLM Scraping
- LLM crawlers may hit your site too hard and slow it down.
- AI sites may reproduce your company’s original content without attribution.
- The content is used to train an AI tool, whether you opt in or not.
- LLM crawlers don’t always follow robots.txt instructions or safe behavior on the web.
It’s a complex situation, one that did not exist when many businesses first launched their dot-com, uploaded product descriptions and published a company blog. Can ChatGPT, for example, now incorporate that text into its general understanding of everything, and never credit back to the original author? When should brands worry about this, when shouldn’t they, and is there anything that should be done in response?
How Search Engines Use AI Content
Google's search algorithm is tuned to surface keyword-dense results. Since many of those results are now generated by AI scripts engineered specifically to rank highly, Google is fetching less original content, penalizing creators of original work and leading to lower-quality search results.
The moral of this story is not "start using AI to trick Google into finding your website." Google rolls out core updates to its search algorithm every few months, and long-term the tech giant will keep refining search results so that original, high-quality content bubbles to the top. Right now, the best original content is still human-authored.
Meanwhile, Google is also chasing OpenAI for LLM supremacy. Its Gemini AI tool is starting to provide its own LLM/AI-driven search responses, often at the top of a query. Microsoft/Bing is doing the same with Copilot, its AI tool.
For your site to appear in any search result, Google must crawl it, and that crawling can slow your site's performance while it's being indexed. However, Google takes your site's performance and robots.txt file into account (more on that in a bit), allowing site owners to tell its crawler the appropriate ways to pull content from the site.
When and How to Safeguard Against LLM Scraping
There are several reasons you may not want your site crawled by an LLM. The crawlers may simply hit your site too hard, requesting a large amount of content in a short amount of time and overwhelming your server, which degrades delivery for your real audience. From an intellectual property standpoint, you may not want your original content analyzed and potentially plagiarized by an AI platform, especially one that won't provide you with attribution. You probably didn't write your content to feed somebody else's business model. Or you might simply object to your content being used to train an AI.
If you have public-facing intellectual property and you always want it traced back to your site, consider safeguarding it from LLMs like ChatGPT, Gemini and Copilot. This can be done one of two ways: by putting the content behind a login page if you consider it sensitive, and/or via instructions in a file named robots.txt served from the root of your site. You can put technical safeguards in this file, such as instructing LLMs not to scrape your content.
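As a sketch, a robots.txt opt-out might look like the following. GPTBot is OpenAI's published crawler name, and Google-Extended is Google's token for opting out of Gemini training without affecting Google Search; check each vendor's documentation for the current list of user agents.

```text
# Block OpenAI's training crawler from the whole site
User-agent: GPTBot
Disallow: /

# Opt out of Gemini/AI training without affecting Google Search
User-agent: Google-Extended
Disallow: /

# Ordinary search engines remain welcome
User-agent: *
Allow: /
```

Note that robots.txt is advisory: well-behaved crawlers honor it, but nothing technically enforces it.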
Search engines provide attribution directly because they are built to link to your content, whereas LLMs can be asked for sources but don't provide them by default, and sometimes provide inaccurate or broken links even when asked. Some AI companies training LLMs have shown little regard for safe behavior on the web, like obeying robots.txt instructions; remember that robots.txt is a convention, not a law. You can configure your web server to look for abusive behavior and either rate-limit it (slow it down) or block it entirely.
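As one illustration of server-side rate limiting, here is a minimal nginx sketch. The zone name, limits, and the "BadBot" user agent are hypothetical placeholders; tune them to your own traffic.

```text
# Goes in the http block: track clients by IP, allow ~10 requests/second each
limit_req_zone $binary_remote_addr zone=crawlers:10m rate=10r/s;

server {
    # Block a known-abusive user agent outright (hypothetical name)
    if ($http_user_agent ~* "BadBot") {
        return 403;
    }

    location / {
        # Permit short bursts, then reject anything faster with a 503
        limit_req zone=crawlers burst=20;
    }
}
```

A content delivery network or web application firewall can apply similar rules without touching your origin server's configuration.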
Because LLMs also add traffic to your website, they carry the potential to hurt your site's performance. Depending on how often they visit and how many pages they load, a webmaster might prefer to block an LLM outright. Abusive crawling might look like an unusual amount of traffic from a specific IP address or User Agent; hundreds of requests per second, for example, is something a crawler can do that a normal human wouldn't. In the early days of the web, it wasn't uncommon for a server to get overwhelmed whenever Google crawled its site; these days Google's crawlers behave much better, but many LLM crawlers don't yet show the same etiquette.
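The detection idea above can be sketched in a few lines: bucket access-log entries by client and second, then flag any client whose per-second request count no human could plausibly produce. The threshold and IP addresses here are illustrative assumptions, not recommendations.

```python
from collections import Counter

# Hypothetical threshold: no human browser issues 50+ requests in one second
THRESHOLD = 50

def flag_heavy_clients(requests, threshold=THRESHOLD):
    """Given (ip, unix_second) pairs from an access log, return the IPs
    that exceed the per-second request threshold in any single second."""
    per_second = Counter((ip, sec) for ip, sec in requests)
    return sorted({ip for (ip, sec), n in per_second.items() if n > threshold})

# Toy data: one client hammering the server, one browsing normally
log = [("203.0.113.7", 1000)] * 200 + [("198.51.100.2", 1000)] * 3
print(flag_heavy_clients(log))  # → ['203.0.113.7']
```

In practice you would feed this from your real access logs and group by User Agent as well as IP, since crawlers often rotate addresses.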
Other best practices for robots.txt files include:
- Make the instructions specific: “Attention ChatGPT bot: please don’t crawl this site, only crawl these parts of this site, only link to these pages with attribution, here’s how fast you can ask for things, etc.”
- When staging content for review, use technical safeguards to avoid drafted content becoming public. This could mean keeping the site behind your corporate firewall, or using robots.txt to tell crawlers to ignore the site.
- Review robots.txt regularly as part of your maintenance plan.
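Putting those practices together, a more specific robots.txt might look like the sketch below. The paths are hypothetical, and Crawl-delay is a de facto extension honored by some crawlers and ignored by others (Google ignores it, for instance).

```text
# Let OpenAI's crawler see public press pages, but nothing else
User-agent: GPTBot
Allow: /press/
Disallow: /

# Ask all other well-behaved bots to wait 10 seconds between requests
User-agent: *
Crawl-delay: 10
```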
For crawlers that don’t respect the robots.txt standard, web application firewalls or content delivery networks like Cloudflare often have built-in rules that help identify malicious behavior and slow or block it entirely.
Other Strategies to Avoid LLM Scraping
Perhaps your company’s proprietary content is mostly video- or audio-based (such as a podcast). Creating this kind of content sidesteps some of the thorny issues around scraped written content, and a wide audience already prefers video to writing. LLMs are training on video and audio using speech-to-text technology, but engagement rates for those formats remain high among the users who prefer them.
Prefer writing? Write a book. The original content lives offline, it can help you crystallize your thoughts, and it can inspire those three-to-five-minute videos based on a chapter. Just remember that an ebook can end up as a web page; consider holding off on posting it publicly except as helpful excerpts.
If you’re a young brand creating awareness around a new problem, it might actually be beneficial to have everything on your site scraped by LLMs. Maybe you’re OK not getting credit for pointing out that a problem exists if you’re the only company offering a solution to it. Merely creating awareness around an issue can pay off, even if it doesn’t tie directly back to your brand, when you’re the one creating the keywords and therefore defining the market.
For some businesses, LLM scraping poses a real threat to their intellectual capital, and the best solution will vary; for others, it’s no problem at all. In either case, stay vigilant.