How to Evaluate Large Language Models

Navigating LLMs can be daunting. Here are some evaluation strategies designed specifically for tech practitioners and business leaders.

Published on Oct. 09, 2024

Large language models are powerful tools capable of handling a wide variety of language tasks. As these models are integrated into more and more applications, robust evaluation methodologies have become more critical than ever.

This article explores the inherent challenges of using LLMs and the methods available for evaluating them.

What Does It Mean to Evaluate an LLM?

Evaluating LLMs entails systematically assessing their performance and effectiveness in various tasks such as language comprehension, text generation and accuracy. This involves both quantitative metrics, like accuracy and Bilingual Evaluation Understudy (BLEU) scores, and qualitative evaluations of coherence and relevance. Some may also choose to consider broader ethical dimensions such as bias, fairness and alignment with human values, which, while important, do not currently have widely agreed upon standard metrics.
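As a concrete illustration of one quantitative metric, the sketch below computes a sentence-level BLEU score with the open-source NLTK library. The reference and candidate sentences are invented placeholders, not data from any real evaluation.

```python
# A minimal sentence-level BLEU example using NLTK (pip install nltk).
# The reference and candidate sentences are invented for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()         # human-written reference
candidate = "the cat is sitting on the mat".split()  # model output

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
smoother = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smoother)
print(f"BLEU: {score:.3f}")
```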


 

Challenges of LLM Evaluation

Evaluation is fundamental to any machine learning model development and deployment process. This necessity is amplified for LLMs because of their complex nature and wide-ranging applications.

During training and fine-tuning, ground truth (indisputable fact) examples exist, allowing the use of traditional evaluation methods not available in production. Reviewing these models in production presents unique challenges due to their scale and generality.

The more human-like the output, the more challenging the evaluation task becomes. Reasons for increased evaluation difficulty include the following.

Open-Ended

LLMs excel at tasks they were not explicitly trained for, a capability known as zero-shot or few-shot learning. This versatility, while powerful, makes evaluation complex, as the model’s potential outputs are vast and often unpredictable. Creating metrics for all conceivable tasks is impossible, so metrics must be generalized to evaluate the quality of the output rather than the resolution of any specific task.

Lack of Ground Truth

For generative tasks, there is no single correct answer to compare against. This absence of ground truth complicates traditional evaluation metrics.

For example, if a user asks a model to generate a song in the style of the user’s favorite artist, you can’t use a metric like accuracy to evaluate the model’s success. Without ground truth to compare against, traditional natural language processing methods don’t apply, and evaluation of the task’s resolution is subjective.

Overconfidence and Misinformation

LLMs can generate incorrect information with high confidence, potentially spreading misinformation if not adequately managed. They may produce factually incorrect or nonsensical outputs, a phenomenon known as hallucination.

As people increasingly view models as experts in technical domains, this becomes a significant challenge to overcome. Users stop verifying output when the technical details are outside their domain of knowledge.

Training LLMs and production LLMs present different evaluation challenges and require distinct methodologies. We will discuss both cases, including how different implementations of interactions with LLMs, such as retrieval augmented generation (RAG), increase the evaluation options.

 

Evaluation During Training

The critical difference in evaluation during the training phase is the availability of a ground truth. This allows us to measure loss and perform validation tasks. You can employ several different methods.

  • Loss functions: People frequently use cross-entropy loss, but you can also combine tasks into a loss function. This approach was notably used in developing BERT, which employed a masked language model and next-sentence prediction task.
  • Perplexity: A measure of how uncertain the model is about its predictions, calculated as the exponential of the cross-entropy loss; lower values mean the model assigns higher probability to the observed text (a worked sketch follows this list).
  • Token-level accuracy: Percentage of correctly predicted tokens.
  • Task-specific metrics: For models fine-tuned for specific tasks like machine translation, summarization or question answering, choose metrics such as BLEU, Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Metric for Evaluation of Translation with Explicit Ordering (METEOR) or F1 scores.
  • Expert feedback: Human domain experts rate and review model outputs, and you can incorporate their evaluations via feedback loops to continually refine the model during training.
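To make the loss-based metrics concrete, here is a minimal sketch that computes cross-entropy loss, perplexity and token-level accuracy on a single example. It assumes the Hugging Face transformers library and the public gpt2 checkpoint; both the model and the evaluation text are illustrative stand-ins for your own.

```python
# Minimal sketch of loss, perplexity and token-level accuracy for a causal LM.
# Assumes the Hugging Face transformers library and the public "gpt2" checkpoint;
# swap in your own model and evaluation text.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Evaluation is fundamental to any machine learning model."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss
    # over next-token predictions.
    outputs = model(**inputs, labels=inputs["input_ids"])

loss = outputs.loss.item()
perplexity = math.exp(loss)  # perplexity is the exponential of the loss

# Token-level accuracy: compare each predicted next token with the actual one.
logits = outputs.logits[:, :-1, :]    # predictions for positions 1..n
targets = inputs["input_ids"][:, 1:]  # the tokens that actually follow
accuracy = (logits.argmax(dim=-1) == targets).float().mean().item()

print(f"loss={loss:.3f}  perplexity={perplexity:.1f}  token accuracy={accuracy:.2%}")
```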

Developing and testing LLMs in controlled environments helps establish a baseline for performance, quality of output and a basic understanding of how the model will perform in similar conditions (and, with out-of-sample testing, how it will perform in novel situations). In production environments, however, the ground truth does not exist, so you need different evaluation methods.


 

Evaluation in Production

Retrieval Augmented Generation

RAG has emerged as a popular technique for enhancing LLM performance and evaluation. RAG retrieves relevant information (chunks of documents) from a large corpus and then uses it as context for the LLM to respond to a user prompt.

Theoretically, this allows models to approximate the ground truth (the provided context) and limit hallucinations. Additionally, it enables the LLM to access up-to-date information without retraining the model. Because of the additional context, it’s easier to create an evaluation framework.
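To illustrate the basic flow, here is a minimal, framework-free sketch of a RAG loop. The embed, vector_store.search and call_llm functions are hypothetical placeholders for whatever embedding model, vector database and LLM client you actually use.

```python
# A minimal retrieval augmented generation loop (illustrative only).
# embed(), vector_store.search() and call_llm() are hypothetical stand-ins
# for your embedding model, vector database and LLM API client.

def answer_with_rag(question: str, vector_store, top_k: int = 4) -> dict:
    # 1. Retrieve the document chunks most similar to the question.
    query_vector = embed(question)
    chunks = vector_store.search(query_vector, top_k=top_k)

    # 2. Build a prompt that grounds the model in the retrieved context.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Generate the answer and keep the context for later evaluation
    #    (faithfulness, context relevancy, and so on).
    answer = call_llm(prompt)
    return {"question": question, "contexts": [c.text for c in chunks], "answer": answer}
```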

The first evaluation framework to understand is RAGAS, which offers a comprehensive approach to evaluating RAG systems. The critical components of RAGAS are as follows; a usage sketch of its accompanying Python package appears after the list.

  • Accuracy/faithfulness: Did the LLM generate its answer based on the provided context?
  • Answer relevancy and context relevancy: How relevant is the response to the prompt, and how relevant is the returned context to the prompt?
  • Context precision: Are all the relevant items in the context chunks, and are they ranked higher than the irrelevant ones?
  • Context recall: How much of the ground truth answer is covered by the retrieved context?
  • Answer semantic similarity: How high is the cosine similarity between the embeddings of the response and the retrieved documents?
  • Answer correctness: What is the weighted average of semantic similarity and factual similarity of the ground truth and generated answer? You can determine the factual similarity by generating facts from the answer and seeing how many are in the ground truth.
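The RAGAS framework is available as an open-source Python package, ragas, and the sketch below shows roughly what an evaluation run looks like. Treat it as a sketch rather than a definitive recipe: exact imports and dataset column names vary across package versions, and the package calls a judge LLM (OpenAI by default), so an API key is assumed.

```python
# Rough sketch of scoring a RAG pipeline with the ragas package
# (pip install ragas datasets). Exact imports and column names vary
# by version; an LLM API key (OpenAI by default) is assumed.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

samples = {
    "question": ["What does RAG retrieve?"],
    "answer": ["RAG retrieves relevant document chunks to use as context."],
    "contexts": [[
        "RAG retrieves relevant information (chunks of documents) from a "
        "large corpus and then uses it as context for the LLM."
    ]],
    # Ground truth is only needed for metrics such as context precision/recall.
    "ground_truth": ["RAG retrieves relevant chunks of documents from a corpus."],
}

result = evaluate(
    Dataset.from_dict(samples),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores between 0 and 1
```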

Unlike traditional evaluation methods, the RAGAS framework uses LLMs to evaluate the RAG LLM; this approach has been shown to agree with human preferences roughly 80 percent of the time on chat data.

It breaks the context, generated answer and ground truth (when training) into simpler statements and questions. An example is having a model summarize the generated answer into bullet points and then asking the model whether each point can be derived from the context.
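Here is a minimal sketch of that statement-checking idea; call_llm is a hypothetical stand-in for whichever judge model you use, and the prompts are simplified for illustration.

```python
# Illustrative sketch of a RAGAS-style faithfulness check.
# call_llm() is a hypothetical stand-in for your judge LLM client.

def faithfulness_score(answer: str, context: str) -> float:
    # 1. Ask the judge LLM to break the answer into short factual statements.
    statements = call_llm(
        "Rewrite the following answer as a numbered list of short, "
        f"self-contained factual statements:\n\n{answer}"
    ).splitlines()

    # 2. Ask, for each statement, whether it can be derived from the context.
    supported = 0
    for statement in statements:
        verdict = call_llm(
            f"Context:\n{context}\n\nStatement: {statement}\n"
            "Can this statement be derived from the context alone? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            supported += 1

    # 3. Faithfulness = fraction of statements supported by the context.
    return supported / max(len(statements), 1)
```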

You can only use context precision, recall and answer correctness in training, while the other metrics are valid for production. Additionally, RAGAS includes a synthetic test generator based on evol-instruct for creating question-answer context pairs from a set of documents.

A survey of RAG evaluation frameworks has also been published, proposing a new framework that analyzes RAG systems along retrieval, generation and additional requirements.

Evaluation Without External Context

Without the retrieved documents provided to the LLM in RAG, there is no additional context against which to evaluate the response, so we must look to alternative methods. We have grouped the available methods based on a combination of the information used to perform the evaluation and the evaluator.

LLM

These are methods where you use an independent LLM to judge the model’s output.

  • Answer relevancy checks: Using the RAGAS framework to determine the answer’s relevance to the prompt.
  • Consistency testing: Using an independent LLM to generate variations of the original prompt, generating responses to each variation from the model under test and comparing the answers, using either ROUGE or an LLM, to assess the model’s stability (see the sketch after this list).
  • Contrastive evaluation: Assessing the model’s ability to distinguish between accurate and hallucinated information. For example, ask the model to generate a hallucinated response for a given task, then ask another LLM to identify which version is hallucinated and measure how often the original response is not chosen as the hallucinated one.
  • Semantic similarity analysis: Measuring the alignment between prompts and responses using techniques like BERT Score.
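As an example, here is a minimal sketch of the consistency-testing idea. The paraphrase_prompt and generate_response functions are hypothetical wrappers around your LLMs, and the comparison uses the open-source rouge-score package.

```python
# Illustrative consistency test: paraphrase the prompt, regenerate answers,
# and measure how stable the answers are with ROUGE-L.
# paraphrase_prompt() and generate_response() are hypothetical LLM wrappers;
# rouge-score is an open-source package (pip install rouge-score).
from rouge_score import rouge_scorer

def consistency_score(prompt: str, n_variants: int = 3) -> float:
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    baseline = generate_response(prompt)

    scores = []
    for _ in range(n_variants):
        variant = paraphrase_prompt(prompt)   # LLM-generated rephrasing
        answer = generate_response(variant)   # answer to the rephrased prompt
        # Compare the new answer to the baseline answer.
        scores.append(scorer.score(baseline, answer)["rougeL"].fmeasure)

    # Higher mean overlap suggests the model answers consistently.
    return sum(scores) / len(scores)
```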

Training

These methods can use traditional training metrics or collect data to fine-tune the model.

  • Task-specific metrics: For tasks like translation or summarization, employing established metrics such as BLEU, ROUGE or METEOR.
  • User feedback integration: Incorporating direct user feedback on the accuracy of responses or suggested responses to continually refine and improve model performance.

User

These methods rely on indirect feedback from users and on qualities of the system that users experience directly.

  • User interaction analysis: Using follow-up questions and user behaviors to gauge the effectiveness of model responses. This could involve monitoring the number of secondary or follow-up questions to determine if the model is appropriately answering the prompts or using the edit distance between prompts to determine if users are rewriting prompts to get a helpful response.
  • Latency and throughput: How long does the model take to start responding, and how many tokens does it generate per second? A related set of metrics is the amount of resources consumed (memory, compute, electricity, etc.). These are all measures of overall system performance (see the sketch after this list).
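The sketch below illustrates two of these signals: detecting prompt rewrites via the similarity between consecutive prompts (using Python’s standard-library difflib) and measuring time to first token and tokens per second around a hypothetical streaming stream_llm client.

```python
# Illustrative production signals: prompt-rewrite detection and latency/throughput.
# stream_llm() is a hypothetical streaming LLM client that yields tokens.
import time
from difflib import SequenceMatcher

def looks_like_rewrite(previous_prompt: str, current_prompt: str,
                       threshold: float = 0.6) -> bool:
    # High similarity between consecutive prompts suggests the user is
    # rewording a prompt because the first answer was not helpful.
    similarity = SequenceMatcher(None, previous_prompt, current_prompt).ratio()
    return similarity >= threshold

def measure_generation(prompt: str) -> dict:
    start = time.perf_counter()
    first_token_at = None
    token_count = 0

    for token in stream_llm(prompt):  # hypothetical streaming client
        if first_token_at is None:
            first_token_at = time.perf_counter()
        token_count += 1

    end = time.perf_counter()
    return {
        "time_to_first_token_s": (first_token_at or end) - start,
        "tokens_per_second": token_count / max(end - start, 1e-9),
    }
```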

 

Risk Assessment and Mitigation

A comprehensive risk assessment framework becomes crucial as LLMs are deployed in various applications. This involves identifying potential risks associated with model outputs, including misinformation, bias and inappropriate content generation.

Developing robust content filtering systems, determining the level of direct user interaction and implementing human-in-the-loop processes for sensitive applications are essential to mitigating these risks. As the risk level increases, training evaluation metrics should become stricter, and you must collect inference metrics more frequently and closer to real time.
