How to Fine-Tune LLMs

When pre-trained large language models don’t cut it for your business goals, fine-tuning may very well be the answer.

Published on Apr. 17, 2024

Large language models have revolutionized the field of artificial intelligence, enabling impressive natural language processing capabilities.

These models are trained on vast amounts of general data, which may not align with the specific requirements of a particular task or domain. Most enterprises don’t need to build foundation models; you can use pre-trained LLMs directly or as a starting point for multiple tasks.

Yet fine-tuning has emerged as an essential technique for adapting LLMs to specific tasks and improving performance.

What Is Fine-Tuning?

Fine-tuning is the process of updating a pre-trained LLM on a small data set to help it learn specific language patterns and vocabulary. Because the model starts from pre-trained weights, it needs far less data and computation time than training from scratch. Typically, you create a data set, freeze the base model and add new layers for the specific task.

New techniques, however, enable updating the full model without these constraints, allowing the model to learn patterns at multiple scales rather than adjusting only the representations in its final layers. Though many LLM developers have yet to release their full training sets, the models are known to have been trained on web-scraped data that covers a wide range of topics.



Why Should You Fine-Tune LLMs?

While LLMs are incredibly versatile and capable of handling a wide range of natural language tasks, their capacity to deliver immediate, optimal results for specific use cases is limited. Well-known LLMs are trained to be good at multiple tasks, but unique tasks are often problematic. 

We recommend fine-tuning when the task is specific to the business and you plan to use the model often. We also recommend fine-tuning if the business domain is not commonly found on the internet or the use case requires knowledge of events more recent than the model’s training data.

Additional motivation for fine-tuning could be generating a distinctive style or length of response. Scientific, legal, medical and uncommon technological fields commonly benefit from fine-tuned LLMs.


How Does Fine-Tuning Work?

It’s important to remember that models do not know anything. They are a probabilistic representation of language. Fine-tuning shifts those probabilities toward the relationships between words that matter for a specific domain or task. Assembling the data set is the first step, and it is also a challenge.

Gathering a high-quality, task-specific data set takes time and expertise. The data set must be representative of the target domain and contain examples that align with the desired task, with size depending on the complexity of the task and the available resources. It also needs to be well-structured and properly labeled to facilitate effective fine-tuning.

A potential format is instruction (the task), input (additional context) and output (desired response). With more data, fine-tuning does a better job of re-weighting the model toward the latest information. If the original model is not a base language model but an instruction-tuned model or a chatbot, it helps to know the format of its training data so the fine-tuning data can match it.
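The instruction, input and output layout can be sketched as a small Python record. The example text and the "### Instruction" prompt template below are assumptions made for illustration; a real project would match the template to whatever format the chosen model was trained on.

```python
import json

# Hypothetical example record; the three field names follow the
# instruction / input / output layout described above.
record = {
    "instruction": "Summarize the clause in plain English.",
    "input": "The lessee shall indemnify the lessor against all claims.",
    "output": "The tenant must cover the landlord's losses from any claims.",
}


def to_prompt(example: dict) -> str:
    """Render one record as a training prompt (the template is an assumption)."""
    parts = [f"### Instruction:\n{example['instruction']}"]
    if example.get("input"):  # the input field is optional extra context
        parts.append(f"### Input:\n{example['input']}")
    parts.append(f"### Response:\n{example['output']}")
    return "\n\n".join(parts)


# One JSON object per line (JSONL) is a common on-disk layout for such records.
jsonl_line = json.dumps(record)

print(to_prompt(record))
```

Records without extra context simply leave the input field empty, and the template skips that section.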


How to Create Training Data

One approach to generating training data is human-in-the-loop. Methods such as Evol-Instruct or Self-Instruct expand training data quickly and rely on humans for quality control.

The Hugging Face Transformers library abstracts most of the technical details of training, making it easier to swap models. After selecting a model, choose the appropriate class for the task, e.g., AutoModelForSequenceClassification, AutoModelForQuestionAnswering, AutoModelForSeq2SeqLM or AutoModelForCausalLM.

Hugging Face takes the processed data set and formats it for the chosen model using two classes: a tokenizer and a trainer object.
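A minimal sketch of that workflow, assuming the transformers package is installed, might look like the following. The task labels, the helper names and the training settings are hypothetical choices for illustration, and the checkpoint name is left to the reader.

```python
# Map a task label (hypothetical naming) to the matching Transformers Auto class.
TASK_TO_CLASS = {
    "classification": "AutoModelForSequenceClassification",
    "question-answering": "AutoModelForQuestionAnswering",
    "seq2seq": "AutoModelForSeq2SeqLM",
    "causal-lm": "AutoModelForCausalLM",
}


def load_model_and_tokenizer(task: str, checkpoint: str):
    """Load the model class for the task plus the tokenizer for the checkpoint."""
    import transformers  # imported lazily so the mapping is usable without it

    model_cls = getattr(transformers, TASK_TO_CLASS[task])
    model = model_cls.from_pretrained(checkpoint)
    tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint)
    return model, tokenizer


def build_trainer(model, train_dataset, eval_dataset, output_dir: str = "finetune-out"):
    """Wrap the model and already-tokenized data sets in a Trainer."""
    from transformers import Trainer, TrainingArguments

    # Illustrative settings only; tune these for the model and hardware.
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=8,
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
```

Calling `train()` on the returned trainer starts fine-tuning. Swapping models then means changing only the checkpoint name and, if the task changes, the task label.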


What Is PEFT?

By default, all model layers are trainable. For smaller models this is feasible, but for larger models with billions of parameters, this introduces compute and time constraints. To address this, layers are frozen by name or position, which reduces the requirements but limits how well the model can adapt to the new task. Parameter-Efficient Fine-Tuning (PEFT) makes it possible to adapt the full model while training only a small number of parameters.
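Freezing layers by name can be sketched as below. The helper works on any object that exposes `named_parameters()`, as PyTorch modules do; the `ToyModel` class and its parameter names are hypothetical stand-ins so the example runs without PyTorch installed.

```python
from types import SimpleNamespace


def freeze_by_prefix(model, prefixes):
    """Set requires_grad=False on parameters whose name starts with a prefix.

    Returns the names it froze and the names it left untouched.
    """
    frozen, untouched = [], []
    for name, param in model.named_parameters():
        if name.startswith(tuple(prefixes)):
            param.requires_grad = False  # the optimizer will skip this parameter
            frozen.append(name)
        else:
            untouched.append(name)
    return frozen, untouched


class ToyModel:
    """Stand-in for a real model so the sketch runs without PyTorch."""

    def __init__(self):
        self._params = {
            "embeddings.weight": SimpleNamespace(requires_grad=True),
            "encoder.layer.0.weight": SimpleNamespace(requires_grad=True),
            "classifier.weight": SimpleNamespace(requires_grad=True),
        }

    def named_parameters(self):
        return self._params.items()


frozen, trainable = freeze_by_prefix(ToyModel(), ["embeddings.", "encoder."])
print("frozen:", frozen)
print("still trainable:", trainable)
```

Here only the classifier head stays trainable, which is the classic setup the paragraph above describes.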


What Is LoRA?

One popular PEFT method, Low-Rank Adaptation (LoRA), focuses on the attention weights of the model. It keeps the original weights frozen and learns the update to each weight matrix as the product of two much smaller matrices, so the result stays compatible with the original model size. This reduces the trainable parameter set to about 1 percent of the original size, decreasing both computation and data requirements.

Additionally, it’s possible to store only the deltas of the affected parameter weights, so a single base model can serve multiple fine-tuned variants. QLoRA is a method for implementing LoRA on a quantized model, which reduces memory usage even further.
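The saving from low-rank factorization can be checked with simple arithmetic. In the sketch below, the update to one weight matrix is held in two factors of rank r; the matrix dimensions and the rank are illustrative assumptions, not figures from a specific model.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors: B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes only (assumptions, not taken from a specific model).
d_out, d_in, rank = 4096, 4096, 8

full = d_out * d_in  # weights in the frozen original matrix
lora = lora_trainable_params(d_out, d_in, rank)

print(f"full matrix:  {full:,} parameters")
print(f"LoRA factors: {lora:,} parameters ({100 * lora / full:.2f}% of full)")
```

With these sizes the two factors hold 65,536 values against 16,777,216 in the full matrix, or about 0.39 percent. The exact share in practice depends on the rank and on which matrices receive adapters.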


What Is IA3?

Infused Adapter by Inhibiting and Amplifying Inner Activations (IA3) is another adapter method available in PEFT. It uses learned rescaling vectors instead of low-rank matrices and, in some models, reduces the number of trainable parameters by an additional order of magnitude compared with LoRA.
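The rescaling idea can be shown with a toy example: each activation is multiplied elementwise by a learned factor. The activation values and the scale vector below are made up for illustration, and plain Python lists stand in for tensors.

```python
def ia3_rescale(activations, scale):
    """Multiply each activation by its learned scaling factor (elementwise)."""
    if len(activations) != len(scale):
        raise ValueError("activation and scale vectors must have equal length")
    return [a * s for a, s in zip(activations, scale)]


# Toy values: factors below 1 inhibit a feature, factors above 1 amplify it.
activations = [0.5, -1.0, 2.0, 4.0]
learned_scale = [1.0, 0.5, 2.0, 0.0]  # hypothetical learned vector

print(ia3_rescale(activations, learned_scale))  # [0.5, -0.5, 4.0, 0.0]
```

Because only one vector per adapted activation is trained, the parameter count grows with the width of the layer rather than with the size of a weight matrix.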

In addition to fine-tuning the model, some techniques focus on training a model to create soft prompts, or learnable embeddings optimized for a specific task. The model parameters are not updated. Only the representation of the instruction is updated. 

In contrast to normal or hard prompts, these embeddings are not human-readable. Methods such as prompt-tuning, prefix-tuning and p-tuning are available. 
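A toy sketch of the simplest variant, prompt-tuning, is shown below: a few trainable vectors are placed in front of the embedded input tokens, and the frozen model reads the combined sequence. The three-dimensional vectors are made-up values; prefix-tuning and p-tuning differ in where and how the learned vectors are produced.

```python
def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Return the sequence the frozen model sees: soft prompt, then input."""
    width = len(token_embeddings[0])
    if any(len(vec) != width for vec in soft_prompt):
        raise ValueError("soft prompt vectors must match the embedding width")
    return soft_prompt + token_embeddings


# Toy 3-dimensional embeddings; real models use far wider vectors.
soft_prompt = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]     # trainable, tied to no word
token_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # looked up from the frozen model

sequence = prepend_soft_prompt(soft_prompt, token_embeddings)
print(len(sequence))  # 4
```

During training, gradients would update only the soft prompt vectors, which is why they need not correspond to any readable token.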


How to Evaluate

After fine-tuning an LLM, assessing performance on the target task is crucial. Evaluation metrics and techniques vary depending on the nature of the task. You should measure them on test data with the same format and distribution as the training set but without any examples used during training. 
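One way to keep training examples out of the test data is to remove exact duplicates, shuffle once and then cut the list. The sketch below uses only the standard library; the 20 percent test fraction and the fixed seed are assumptions, and near-duplicates would need a separate check.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Deduplicate, shuffle once, then cut, so no example lands in both sets."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    unique = list(dict.fromkeys(examples))  # drops exact duplicates, keeps order
    random.Random(seed).shuffle(unique)     # seeded for a reproducible split
    n_test = max(1, int(len(unique) * test_fraction))
    return unique[n_test:], unique[:n_test]


examples = [f"example {i}" for i in range(10)]
train, test = train_test_split(examples)

print(len(train), len(test))          # 8 2
print(set(train).isdisjoint(test))    # True
```

Because both sets come from the same shuffled pool, they share the same format and distribution.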

You can quantify the performance of the fine-tuned LLM using metrics such as cross-entropy, precision, recall, F1-score, Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Metric for Evaluation of Translation with Explicit Ordering (METEOR) or Bilingual Evaluation Understudy (BLEU). Benchmarks such as GLUE or SuperGLUE help evaluate performance across multiple sets of tasks.
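For a classification-style task, precision, recall and F1-score can be computed directly from predictions and labels. The labels below are made-up values for illustration, and the sketch handles a single positive class.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1, 0, 1, 1, 0, 1]  # hypothetical gold labels
y_pred = [1, 0, 0, 1, 1, 1]  # hypothetical model predictions

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example there are three true positives, one false positive and one false negative, so all three scores come out to 0.75.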

You can also compare the performance to the pre-trained LLM or other baseline models on the same test set to determine how much improvement has been achieved by fine-tuning. This step is not straightforward and requires a critical and rigorous analysis and interpretation of the results.
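A baseline comparison can be as simple as scoring both models on the same references. The sketch below uses exact-match accuracy with made-up outputs; the metric and the example labels are assumptions, and a real comparison would use the metric suited to the task.

```python
def exact_match_accuracy(predictions, references):
    """Share of predictions that match the reference exactly (after trimming)."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equal-length, non-empty predictions and references")
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)


# Hypothetical outputs from both models on the same held-out test set.
references = ["approve", "deny", "approve", "escalate"]
baseline_outputs = ["approve", "approve", "approve", "deny"]
finetuned_outputs = ["approve", "deny", "approve", "deny"]

baseline = exact_match_accuracy(baseline_outputs, references)
finetuned = exact_match_accuracy(finetuned_outputs, references)

print(f"baseline={baseline:.2f} fine-tuned={finetuned:.2f} gain={finetuned - baseline:+.2f}")
```

Here the baseline scores 0.50 and the fine-tuned model 0.75. On a test set this small the gap says little, which is part of why interpreting results takes care.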



Don’t Be Intimidated by Fine-Tuning

Fine-tuning large language models is a powerful technique for adapting these versatile AI tools to specific tasks and domains. By using task-specific data sets and appropriate fine-tuning methods, you can tailor LLMs to deliver enhanced performance and more accurate results.

Assembling high-quality data sets and evaluating the effectiveness of fine-tuning are critical steps in the process and present a high barrier to entry. Once the data set is prepared, however, fine-tuning is relatively straightforward and you can do it using commercial tools.
