What Is Retrieval Augmented Generation (RAG)?

Use RAG to customize large language model outputs.

Written by James Evans
Published on May 10, 2024

My company uses retrieval augmented generation, or RAG, when building embedded user assistance agents: software that embeds into our customers’ software (web apps, websites, desktop apps, etc.) and helps their users learn and use it. We use RAG to make the output of the large language model match the business problem.

What Is Retrieval Augmented Generation (RAG)?

RAG (retrieval augmented generation) is a technique used to customize the outputs of an LLM (large language model) for a specific domain without altering the underlying model itself.

We arrived at our RAG setup after trying many approaches; we have the benefit of scale because we serve more than 20 million end users, and partnership with them has helped us arrive at RAG as the right solution. 


RAG: What Is Retrieval Augmented Generation?

Retrieval augmented generation, or RAG, is a way of making language models better at answering domain-specific queries by augmenting the user’s query with relevant information obtained from a corpus. It doesn’t make any attempt to change the model it’s using. It’s like hiring an adult who is already trained and already has skills: all you need to do is hand them a manual and tell them to learn it.

In concrete terms, this means that whenever someone wants to ask the LLM something, some other system must first find relevant information in the corpus and feed it to the LLM. This is usually done with some type of semantic search that finds documents, or subsections of documents, that seem related to the query.


There are all kinds of clever search techniques to make this work better, but the general idea is “find the relevant information.” This is the retrieval step of RAG, and it happens before the LLM is invoked. 
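To make the retrieval step concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not a production approach: it ranks passages by word overlap (cosine similarity over word counts), whereas a real system would more likely use embedding-based semantic search. The function names and the sample corpus are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k passages most similar to the query, best first."""
    query_vec = Counter(tokenize(query))
    scored = [(cosine_similarity(query_vec, Counter(tokenize(p))), p) for p in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Passages with no overlap at all are dropped rather than padded in.
    return [passage for score, passage in scored[:top_k] if score > 0]


# Hypothetical product documentation, one passage per entry.
corpus = [
    "To reset your password, open Settings and choose Security.",
    "Invoices can be downloaded from the Billing page.",
    "Invite teammates from the Members tab in Settings.",
]

print(retrieve("How do I reset my password?", corpus, top_k=1))
```

The important property is that this runs entirely before the LLM is called, and nothing about the model changes when the corpus does.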

Next, the LLM is asked a question that includes all the relevant information retrieved in the previous step. The prompt might read:

“The user asked: <insert user question here>.

Please answer the question using the information below:”

Usually, the goal of RAG is to restrict the LLM’s answer to what the retrieved information contains, so the prompt might instead say: “Please answer the question using the information below and nothing else you know.”
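As a rough sketch of the generation step, the prompt can be assembled with plain string formatting. The wording follows the example prompts above; the function name `build_prompt` and the `strict` flag are assumptions made for illustration, not part of any library.

```python
def build_prompt(question: str, passages: list[str], strict: bool = True) -> str:
    """Combine the user's question with retrieved passages into one prompt.

    With strict=True, the prompt also tells the model to rely only on the
    supplied passages.
    """
    instruction = "Please answer the question using the information below"
    if strict:
        instruction += " and nothing else you know"
    context = "\n\n".join(passages)
    return f"The user asked: {question}\n\n{instruction}:\n\n{context}"


# Hypothetical retrieved passage for a hypothetical question.
prompt = build_prompt(
    "How do I reset my password?",
    ["To reset your password, open Settings and choose Security."],
)
print(prompt)
```

The resulting string is what gets sent to the LLM, so the model sees the question and the retrieved passages together.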

When Should You Use RAG?

Here are some specific situations in which RAG is the best tool. It is usually all you need.

RAG Adapts to New Information on the Fly

Just add the new material to your prompt and go. This matters for us: our customers’ product documentation is always changing, and we wouldn’t have time to re-fine-tune a model every time it changes.
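Here is a small Python sketch of why this works. It assumes a hypothetical in-memory `DocStore` with a naive word-overlap search. When the documentation changes, only the stored passage is replaced, and the very next query picks it up with no retraining.

```python
import re
from typing import Optional


def words(text: str) -> set[str]:
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class DocStore:
    """Hypothetical in-memory store of documentation passages, keyed by ID."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        """Add a passage, or overwrite the existing one with the same ID."""
        self._docs[doc_id] = text

    def search(self, query: str) -> Optional[str]:
        """Return the passage sharing the most words with the query, if any."""
        query_words = words(query)
        best_text, best_overlap = None, 0
        for text in self._docs.values():
            overlap = len(query_words & words(text))
            if overlap > best_overlap:
                best_text, best_overlap = text, overlap
        return best_text


store = DocStore()
store.upsert("export", "Export reports from the Reports menu.")
before = store.search("How do I export reports?")

# The product changes: update the stored passage, leave the model alone.
store.upsert("export", "Export reports with the Download button on each report.")
after = store.search("How do I export reports?")

print(before)
print(after)
```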

RAG Can Be Better at Restricting Answers to Specific Questions 

Even a model fine-tuned only on correct information can still make stuff up. It’s usually more effective to just tell a model: “Don’t mention anything that isn’t in the information below.”

With RAG, You Can Get Started Immediately

Fine-tuning requires time. Training requires even more time. RAG requires zero training and zero ahead-of-time data collection.

RAG Allows You to Handle Data Silo Situations

Sometimes, data is siloed across an organization. For example, Department A might not be able to use data from Department B. RAG completely avoids this by not including any company-specific data in the model’s training set.

RAG Can Easily Adapt to Improvements in Foundational Models

Large players like OpenAI and Google spew out improvements in foundational models all the time. Imagine you fine-tune GPT3, and then GPT4 comes out. Now you have to fine-tune again. With RAG, you can just swap in the latest model and see how it performs relative to the last one. We do this all the time, comparing how new models perform in our pipeline relative to what we currently have in production.
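One way to picture the swap is to treat the model as a plain function from prompt to answer, so a candidate can be scored on the same test cases as the current one. The sketch below is only an illustration: the two "models" are hard-coded stand-ins rather than real API calls, and the substring check is a deliberately crude scoring rule.

```python
from typing import Callable

# Assumption: a model is any function that takes a prompt and returns an answer.
Model = Callable[[str], str]


def accuracy(model: Model, cases: list[tuple[str, str]]) -> float:
    """Fraction of cases whose answer contains the expected phrase."""
    if not cases:
        return 0.0
    hits = sum(expected.lower() in model(prompt).lower() for prompt, expected in cases)
    return hits / len(cases)


def compare(models: dict[str, Model], cases: list[tuple[str, str]]) -> dict[str, float]:
    """Score every model on the same cases so the results are comparable."""
    return {name: accuracy(model, cases) for name, model in models.items()}


# Stand-ins for real LLM calls; in practice each would wrap a provider's API.
def current_model(prompt: str) -> str:
    return "Open Settings."


def candidate_model(prompt: str) -> str:
    if "password" in prompt:
        return "Open Settings, then choose Security."
    return "Open the Billing page."


# Hypothetical evaluation cases: (prompt, phrase a good answer must contain).
cases = [
    ("The user asked: How do I reset my password?", "Security"),
    ("The user asked: Where are my invoices?", "Billing"),
]

scores = compare({"current": current_model, "candidate": candidate_model}, cases)
print(scores)
```

Because retrieval and prompt construction stay the same, swapping models means changing one function.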

You can, of course, pair RAG with either foundation model training or fine-tuning. For example, we’re considering training our own model that is specifically designed to help users use software. We could then use that model instead of a generic LLM (like GPT4) in our RAG system.

I recommend doing that only if raw RAG doesn’t give you the performance you want, or if you operate at enough scale that improving performance by a few percentage points is worth the training cost.

When Should You Not Use RAG?

You saw the title of my article. You already know that I think RAG is usually the right answer. So let’s approach this question a bit differently and explore the situations in which you shouldn’t use RAG.

You Need Extreme Speed

RAG is typically slower than a raw or fine-tuned LLM because it involves a semantic retrieval step before the LLM call. This is usually fine for consumer applications, like ours at CommandBar, but there are tons of scenarios where mega speed is important. For example, let’s say you’re using an LLM to make a decision that blocks your API from returning. In that scenario, a 100ms retrieval step might not be acceptable.
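If you are unsure whether retrieval fits your latency budget, measure it. Below is a minimal sketch: `fake_retrieval` is a placeholder that simply sleeps for 20 ms, and the 100 ms budget is taken from the example above. Swap in your real retrieval call to get a meaningful number.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> tuple[T, float]:
    """Run fn and return its result with the elapsed wall-clock time in ms."""
    start = time.perf_counter()
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


def fake_retrieval() -> list[str]:
    """Placeholder for a real semantic search call; sleeps to simulate work."""
    time.sleep(0.02)
    return ["relevant passage"]


BUDGET_MS = 100.0  # assumed latency budget for the retrieval step

passages, elapsed_ms = timed(fake_retrieval)
within_budget = elapsed_ms <= BUDGET_MS
print(f"retrieval took {elapsed_ms:.1f} ms; within budget: {within_budget}")
```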

Models are slow to train and fine-tune, but once trained/fine-tuned, they are fast to run. It goes without saying that big models are typically slower to run than small models.

You Need Domain-Specific Outputs

In scenarios where you need to generate extremely specific, high-dimensional outputs, it’s unlikely a general-purpose LLM will be good enough for you (although you should always try one first before going down the path of training your own model).

For example, I’m pretty sure Suno, an incredibly cool tool for creating music from natural language prompts, has trained its own model or at least done a hefty amount of fine-tuning. 

Another example is Adept’s Fuyu model, which was built specifically for AI agents to navigate user interfaces. General-purpose large language models are particularly bad at this. We tried the basic version, just throwing a whole webpage DOM at GPT4 to see if it could use the user interface. It barely worked.

Frequently Asked Questions

Use RAG when you need language output that is constrained to some area or knowledge base, when you want some degree of control over what the system outputs, when you don’t have the time or resources to train or fine-tune a model, and when you want to take advantage of improvements in foundation models, such as those behind ChatGPT.

One alternative to RAG is model training, which means creating your own model from scratch. Model training involves gathering a big training set and running it through an optimization procedure to get a set of weights that represent the model.

Another is fine-tuning, which means taking a model that has already been trained and adapting it to your own use case by training it further.
