An LLM Is Like a Hungry Bear. Be Careful What You Feed It.

LLMs promise to democratize data science, but they also pose grave security risks if we’re not careful. Our expert offers tips for staying safe

Published on Aug. 07, 2024

Imagine we’re hiking with our families here in Wisconsin and run into a black bear. I say, “Oh, I know that bear! He’s well trained. Here, let your kids feed him some nuts and berries.” 

Would you let your kids feed the bear? Probably not.

That, roughly, is the situation enterprises face when creating analytics with generative AI. No matter how you train an LLM, it is still a wild and occasionally dangerous animal. We know what LLMs do, but we don’t control what they generate or even know how and why they work. 

If you feed the LLM your data, that might come back to bite you (pun intended). I want to discuss the main risks — data security and hallucinations — and ways to overcome them. 

Safe Data Practices for LLM Use

For Coders

  • Send a schema of a data set to an LLM and ask how it would solve an analytical question about that set.
  • Pre-test an LLM’s reliability on question types by asking it something you already know the answer to and evaluate the results.

 

For Non-Coders

  • Use an analytics platform that sits between the user and LLM and shields enterprise data from the latter. 
  • Reserve generative AI analyses for data that is already public.


LLMs Can’t Guarantee Data Security

For enterprises, the chief concern with LLMs is what they do with sensitive data. If one leaks identifiable customer information or proprietary data to malicious parties, lawsuits are sure to follow. Though LLM vendors try to mitigate this risk, they can’t (and don’t) guarantee that your enterprise data will remain private. Moreover, it’s impossible to evaluate their security claims. How would I know if a third party accessed my data unless they did something malicious with it? 

Even so, I bet employees all over the world are loading enterprise data into LLMs, unbeknownst to their IT departments. That happened with shared cloud storage soon after Dropbox and Box debuted, and it happened with so many other cloud apps that a buzzword for it emerged: rogue IT. Rather than fight an uphill battle against rogue IT, most companies have tried to mainstream and secure the tools their employees already use. 

That’s not as easy with LLMs because, again, we don’t know how they work. Moreover, the terms of service are vague. Some platforms explicitly use your data for training; others restrict that use depending on how you share the data (via a chat interface versus an API, for example). Regardless, once the LLM has that data, it may reveal it to other users. In fact, researchers have manipulated LLMs into revealing their training data, which they found to contain personal information about real people.

No wonder some companies either forbid employees from using LLMs or forbid uploading data to them. No one wants to be the guinea pig in compliance experiments with a wild bear. 


This Code Brought to You by Hallucination

Given the risks to data security, code-savvy data scientists and analysts have found workarounds to sharing data with LLMs. The most common is to use LLMs for code generation and then plug the code into a data workflow in, say, a Jupyter Notebook. Although that might be faster than writing code from scratch or copy-pasting it from sites like Stack Overflow, the risk of hallucinations remains. 
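
As a rough sketch of that workaround, the snippet below asks an LLM for pandas code without sending a single row of data, then prints the generated code for a human to read before it gets pasted into the workflow. The OpenAI client, the model name, and the churn.csv example are illustrative stand-ins; any chat-style LLM API would look similar.

```python
# Sketch: generate analysis code with an LLM without sending any rows of data.
# The OpenAI client and model name are illustrative; swap in your own vendor.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Write pandas code that loads 'churn.csv' and reports the monthly "
    "churn rate by customer segment. Return only the code."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
generated_code = response.choices[0].message.content

# Don't run the result blindly: print it, read it, and only then paste it
# into the notebook cell where the real data lives.
print(generated_code)
```

The point is that only the question travels to the LLM; the data set itself stays in the notebook.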

We need to be careful with the term hallucination, which normally refers to factually incorrect information from an LLM. Andrej Karpathy, an OpenAI co-founder, has argued that “Hallucination is all LLMs do. They are dream machines.” I don’t think he’s suggesting that LLMs are like bears on psilocybin. Rather, hallucinating is a feature, not a bug, of LLMs. 

The onus, then, is on users to inspect and validate the output code before incorporating it into a data workflow. Unfortunately, that undercuts a key value proposition of generative AI: enabling people without any knowledge of Python, SQL, or data science to perform sophisticated analyses.

 

Safety in Bear Country

There are ways to use LLMs for analysis without handing over any data. There are also ways to improve the odds of getting factually correct hallucinations rather than false ones. Some of these options are accessible to non-coders, and some aren’t.

First, you can send an LLM the schema of a data set: the column names and what each one means in your particular business. Then, ask the LLM how it would solve an analytical question about that data set, and ask it to explain how it arrived at its answer before you use the resulting code in a notebook. This approach requires some coding knowledge and experience with professional data science tools.
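
Here’s a minimal sketch of that schema-only approach, assuming pandas and a hypothetical churn table; the column definitions are placeholders you would replace with your own business glossary.

```python
# Sketch: describe the data set to the LLM without sending any rows.
import pandas as pd

df = pd.read_csv("churn.csv")  # stays local and never leaves your machine

# Business definitions for each column (illustrative placeholders).
definitions = {
    "customer_id": "unique account identifier",
    "tenure_months": "months since the account was opened",
    "monthly_spend": "average monthly revenue in USD",
    "churned": "1 if the customer cancelled last quarter, else 0",
}

schema_lines = [
    f"- {col} ({dtype}): {definitions.get(col, 'no definition provided')}"
    for col, dtype in df.dtypes.astype(str).items()
]

prompt = (
    "Here is the schema of a customer churn table:\n"
    + "\n".join(schema_lines)
    + "\n\nHow would you determine which variable best predicts churn? "
    "Explain your approach, then provide pandas code I can run locally."
)
# Only the prompt is sent to the LLM; the rows in df never are.
```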

Second, you can pre-test an LLM’s reliability on a type of question by asking it something you already know the answer to. If I have a data set on customer churn, and I already know which variable best predicts churn, I can ask the LLM to produce an analysis using the schema method discussed above. If it reaches the same conclusion, I can be more confident when I request the code for a churn analysis on a new data set that uses the same schema. Granted, the code could change each time I prompt that analysis because LLMs are non-deterministic: the same request in plain English will not map to exactly the same code every time.
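
One crude way to run that pre-test is sketched below: compute the known answer locally (here, simply the numeric column most correlated with the churn flag) and compare it with whatever conclusion the LLM reached from the schema-only prompt. The file name and the placeholder answer are illustrative.

```python
# Sketch: pre-test the LLM on a question whose answer you already know.
import pandas as pd

df = pd.read_csv("churn.csv")

# A crude stand-in for ground truth: the numeric column most correlated
# with the churn flag.
known_best = (
    df.corr(numeric_only=True)["churned"].drop("churned").abs().idxmax()
)

# Paste in the conclusion the LLM reached from the schema-only prompt above.
llm_answer = "tenure_months"  # illustrative placeholder

if known_best.lower() in llm_answer.lower():
    print(f"The LLM agrees: {known_best} looks like the strongest predictor.")
else:
    print(f"Mismatch: local analysis says {known_best}, the LLM says {llm_answer}.")
```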

For non-coders, there are at least two safe options. One is to use an analytics platform that sits between the user and the LLM and shields enterprise data from the latter. This platform should automate the process of sending the schema to the LLM, as described above, and include example questions and solutions relevant to the user’s prompt. These worked examples reduce the odds of hallucination.
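
Under the hood, such a platform might assemble its prompt along the lines of the sketch below: the schema plus a couple of worked question-and-solution pairs chosen for their relevance to the user’s question. The example bank and the word-overlap ranking are deliberately simplified placeholders for whatever a real platform does.

```python
# Sketch: how a middle layer might build a few-shot prompt from the schema
# plus worked examples relevant to the user's question.

EXAMPLES = [
    {"question": "What is the monthly churn rate by segment?",
     "solution": "df.groupby('segment')['churned'].mean()"},
    {"question": "Which customers are most at risk of churning?",
     "solution": "df.sort_values('churn_score', ascending=False).head(20)"},
]

def build_prompt(user_question: str, schema_text: str, k: int = 2) -> str:
    # Naive relevance ranking: count words shared with each stored question.
    def overlap(example: dict) -> int:
        return len(set(user_question.lower().split())
                   & set(example["question"].lower().split()))

    shots = "\n\n".join(
        f"Q: {ex['question']}\nA: {ex['solution']}"
        for ex in sorted(EXAMPLES, key=overlap, reverse=True)[:k]
    )
    return (f"Schema:\n{schema_text}\n\n"
            f"Worked examples:\n{shots}\n\n"
            f"Q: {user_question}\nA:")

print(build_prompt("How does churn differ by segment?",
                   "segment (str), churned (0/1)"))
```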

The other safe option is to reserve generative AI analyses for data that is already public. The U.S. federal government, for example, publishes open data at data.gov, while platforms like Kaggle, Tableau, and data.world list public data sets that can inform business decisions. Google’s Dataset Search is probably the most powerful tool for finding open data sets. If a non-coder suspects a hallucination when working with these data sets, they can ask the LLM to explain in plain English how it reached the answer.


Security Versus Convenience

Would it be easiest to just hand over data sets to LLMs and ask for an answer, security be damned? Yes, though token limits might prevent the LLM from analyzing a big data set anyway. Would building your own LLM or getting a private instance of an LLM address the security issues? Yes, but few companies have the know-how or budget to even entertain those options. 

It seems that LLMs have not “democratized data science” or anything close. They’ve enabled data scientists to accelerate workflows, and they’ve enabled software vendors to design analytics solutions that rely on LLMs while keeping them at a safe distance. 

So remember, no one controls LLM behavior. Be AI aware. Don’t feed one your data without taking precautions.
