A Beginner's Guide to Language Models

Summary: Language models create probability distributions over words or word sequences to predict the next word in a text, generate text, recognize handwriting and more. While artificial intelligence lacks understanding of human language, advances in language models inch closer to human-like intelligence.

Extracting information from textual data has changed dramatically over the past decade. As the term natural language processing has overtaken text mining as the name of the field, the methodology has changed tremendously, too. One of the main drivers of this change was the emergence of language models as a basis for many applications aiming to distill valuable insights from raw text.

Language Model Definition

A language model uses machine learning to estimate a probability distribution over words. Language models learn from text and can be used for producing original text, predicting the next word in a text, speech recognition, optical character recognition and handwriting recognition.

In learning about natural language processing, I’ve been fascinated by the evolution of language models over the past years. You may have heard about GPT-3 and the potential threats it poses, but how did we get this far? How can a machine produce an article that mimics a journalist?

What Is a Language Model?

A language model is a probability distribution over words or word sequences. In practice, it gives the probability of a certain word sequence being “valid.” Validity in this context does not refer to grammatical validity. Instead, it means that the sequence resembles how people write, which is what the language model learns. This is an important point. As with other machine learning models, particularly deep neural networks, there’s no magic to a language model: it’s just a tool that condenses abundant information into a concise form that can be reused in an out-of-sample context.

What Can a Language Model Do?

The ability to estimate probabilities of words and word sequences occurring can be used for a number of tasks. Lemmatization or stemming aims to reduce a word to its most basic form, thereby dramatically decreasing the number of distinct tokens. These algorithms work better if the part-of-speech role of the word is known. A verb’s suffixes can differ from a noun’s suffixes, hence the rationale for part-of-speech tagging (or POS-tagging), a common task for a language model.

With a good language model, we can perform extractive or abstractive summarization of texts. If we have models for different languages, a machine translation system can be built easily. Less straightforward use cases include answering questions (with or without context, see the example at the end of the article). Language models can also be used for speech recognition, OCR, handwriting recognition and more.

Types of Language Models

There are two types of language models:

Probabilistic language models
Neural network-based language models

It’s important to note the difference between them.

Probabilistic Language Model

A simple probabilistic language model is constructed by calculating n-gram probabilities. An n-gram is a sequence of n words, n being an integer greater than zero. An n-gram’s probability is the conditional probability that the n-gram’s last word follows a particular (n-1)-gram (the n-gram with its last word left out). It is estimated as the proportion of occurrences of that (n-1)-gram that are followed by the last word. This rests on a Markov assumption: given the preceding n-1 words (the present), the probability of the next word (the future) does not depend on any earlier words (the past).
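The counting procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the function name `train_ngram_model` and the toy corpus are assumptions made for the example, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def train_ngram_model(tokens, n=2):
    """Estimate P(last word | preceding n-1 words) from raw counts."""
    context_counts = Counter()
    ngram_counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])  # the (n-1)-gram
        word = tokens[i + n - 1]              # the n-gram's last word
        context_counts[context] += 1
        ngram_counts[context][word] += 1
    # Turn counts into conditional probabilities for each context.
    return {
        context: {w: c / context_counts[context] for w, c in words.items()}
        for context, words in ngram_counts.items()
    }


# Toy corpus, invented for illustration only.
corpus = "the cat sat on the mat and the cat slept".split()
bigram_model = train_ngram_model(corpus, n=2)
print(bigram_model[("the",)])
```

In this toy corpus, "the" is followed by "cat" twice and "mat" once, so the model assigns "cat" a probability of two thirds after "the".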

There are evident drawbacks to this approach. Most importantly, only the preceding n-1 words affect the probability distribution of the next word. Complicated texts have deep context that may have decisive influence on the choice of the next word. Thus, what the next word is might not be evident from the previous words, not even if n is 20 or 50. A later term can also influence an earlier word choice: the word United is much more probable if it is followed by States of America. Let’s call this the context problem.

On top of that, it’s evident that this approach scales poorly. As n increases, the number of possible word sequences skyrockets, even though most of them never occur in the text. And all the occurring n-gram probabilities (or all n-gram counts) have to be calculated and stored. In addition, non-occurring n-grams create a sparsity problem: the granularity of the probability distribution can be quite low. Word probabilities take only a few different values, so most of the words have the same probability.
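To make the scaling argument concrete, the sketch below compares the number of possible n-grams (vocabulary size to the power of n) with the number that actually appear in a text. The helper name `ngram_coverage` and the toy corpus are assumptions for illustration.

```python
def ngram_coverage(tokens, n):
    """Return (possible, observed) n-gram counts for a token list."""
    vocab_size = len(set(tokens))
    possible = vocab_size ** n  # every ordered sequence of n vocabulary words
    observed = len({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})
    return possible, observed


# Toy corpus, invented for illustration only.
corpus = "the cat sat on the mat and the cat slept".split()
for n in (1, 2, 3):
    possible, observed = ngram_coverage(corpus, n)
    print(f"n={n}: {observed} observed of {possible} possible n-grams")
```

Even with a seven-word vocabulary, only 8 of the 343 possible trigrams occur. With a realistic vocabulary the gap is far larger.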

Neural Network-Based Language Models

Neural network-based language models ease the sparsity problem by the way they encode inputs. Word embedding layers map each word to a vector of arbitrary size that also captures semantic relationships. These continuous vectors create the much-needed granularity in the probability distribution of the next word. Moreover, like all neural networks, the language model is a function built from many matrix computations, so it’s not necessary to store all n-gram counts to produce the probability distribution of the next word.
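As a rough sketch of this idea, the snippet below looks up word vectors, combines them, and turns the resulting scores into a probability distribution with a softmax. Everything here is an assumption made for illustration: the five-word vocabulary, the vector size, averaging as the way to combine context vectors, and the random, untrained weights. A real model would learn these parameters from text.

```python
import math
import random

random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat"]
embedding_dim = 4

# Randomly initialised (untrained) parameters, for illustration only.
embeddings = {w: [random.gauss(0, 1) for _ in range(embedding_dim)] for w in vocab}
output_weights = [[random.gauss(0, 1) for _ in range(embedding_dim)] for _ in vocab]


def softmax(scores):
    """Convert arbitrary scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_distribution(context_words):
    """Average the context embeddings and score every vocabulary word."""
    avg = [
        sum(embeddings[w][d] for w in context_words) / len(context_words)
        for d in range(embedding_dim)
    ]
    scores = [sum(wt * x for wt, x in zip(row, avg)) for row in output_weights]
    return dict(zip(vocab, softmax(scores)))


distribution = next_word_distribution(["the", "cat"])
print(distribution)
```

Because the distribution is computed from continuous vectors, every word receives its own nonzero probability, and no table of n-gram counts is stored.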

Evolution of Language Models

Even though neural networks ease the sparsity problem, the context problem remains. The evolution of language models pursued two goals. First, models were developed to address the context problem more efficiently by bringing more context words to influence the probability distribution. Second, the aim was an architecture that lets the model learn which context words are more important than others.

The first model, outlined above, is a dense (or hidden) layer and an output layer stacked on top of a continuous bag-of-words (CBOW) Word2Vec model. A CBOW Word2Vec model is trained to guess the word from context. A Skip-Gram Word2Vec model does the opposite, guessing context from the word. In practice, a CBOW Word2Vec model requires a lot of training examples of the following structure: the inputs are the n words before and/or after a word, and that word is the output. Because the context is still a fixed window of n words, the context problem remains.
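The structure of these training examples can be sketched as follows. This only builds the input and output pairs, not the Word2Vec model itself. The function names and the `window` parameter are assumptions for illustration.

```python
def cbow_pairs(tokens, window=2):
    """Build (context words, target word) pairs for CBOW-style training."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window=2):
    """Build (target word, single context word) pairs for Skip-Gram-style training."""
    return [
        (target, context_word)
        for context, target in cbow_pairs(tokens, window)
        for context_word in context
    ]


sentence = "the cat sat on the mat".split()
print(cbow_pairs(sentence, window=2)[2])
print(skipgram_pairs(sentence, window=1)[:3])
```

Words outside the window never reach the model, which is the fixed-context limitation in concrete form.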

Recurrent Neural Networks (RNN)

Recurrent neural networks (RNNs) are an improvement in this respect. Whether built from long short-term memory (LSTM) or gated recurrent unit (GRU) cells, they take all previous words into account when choosing the next word. AllenNLP’s ELMo takes this notion a step further, using a bidirectional LSTM, which takes into account the context both before and after the word.
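The sketch below shows why a recurrent network carries all previous words forward: each step combines the current word with a hidden state that summarizes everything seen so far. Note the assumptions: this is a plain recurrent cell, not an LSTM or GRU, the weights are random and untrained, and the names `rnn_step` and `encode` are made up for the example.

```python
import math
import random

random.seed(1)
vocab = ["the", "cat", "dog", "sat"]
hidden_size = 3

# Untrained random parameters, for illustration only.
input_weights = {w: [random.uniform(-1, 1) for _ in range(hidden_size)] for w in vocab}
recurrent_weights = [
    [random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(hidden_size)
]


def rnn_step(hidden, word):
    """Combine the previous hidden state with the current word."""
    return [
        math.tanh(
            input_weights[word][j]
            + sum(recurrent_weights[j][k] * hidden[k] for k in range(hidden_size))
        )
        for j in range(hidden_size)
    ]


def encode(words):
    """Process a sequence one word at a time, carrying the state forward."""
    hidden = [0.0] * hidden_size
    for word in words:
        hidden = rnn_step(hidden, word)
    return hidden


# The two sequences differ only in their first word.
state_a = encode(["cat", "sat", "the"])
state_b = encode(["dog", "sat", "the"])
print(state_a)
print(state_b)
```

The final states differ even though the last two words are identical, so the first word still influences whatever is predicted next. The loop in `encode` also shows that each step has to wait for the previous one.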

Transformers

The main drawback of RNN-based architectures stems from their sequential nature. As a consequence, training times soar for long sequences because there is no possibility for parallelization. The solution for this problem is the transformer architecture.

The GPT models from OpenAI and Google’s BERT both use the transformer architecture. These models employ a mechanism called “attention,” which assigns a weight to each word in a sequence based on how relevant it is to a given context. Through this mechanism, the model can learn which inputs deserve more attention than others depending on the situation.
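One common form of this mechanism is scaled dot-product self-attention, sketched below in plain Python. The article does not specify which variant a given model uses, so treat this as an illustration under assumptions: the two-dimensional word vectors are hypothetical, and real transformers first pass the inputs through learned projections to obtain separate queries, keys and values, which this sketch omits.

```python
import math


def softmax(scores):
    """Convert arbitrary scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every position to this query.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-d vectors for a three-word sequence.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(vectors, vectors, vectors)
print(weights[0])
```

Each position is handled independently of the others inside the loop, which is what allows this computation to be parallelized, unlike the step-by-step processing of an RNN.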

In terms of model architecture, the main leaps were, first, neural network-based models, which eased the sparsity problem and reduced the disk space language models use, with RNNs (specifically LSTM and GRU) extending the context a model can draw on, and second, the transformer architecture, which made parallelization possible and is built around attention mechanisms. But architecture is not the only aspect a language model can excel in.

Compared to the GPT-1 architecture, GPT-3 has virtually nothing novel. But it’s huge. It has 175 billion parameters, and at its release it was trained on the largest corpus a model had been trained on, drawn largely from Common Crawl. This is partly possible because of the self-supervised training strategy of a language model: a text can be used as a training example with some words omitted, so no manual labeling is needed. The power of GPT-3 comes from the fact that it has been trained on massive volumes of text publicly available on the internet, and it can reflect most of the complexity natural language contains.
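The idea that raw text supplies its own training examples can be sketched in two ways: hiding the next word after each prefix, or hiding words at random positions. Both helpers below are illustrative assumptions, including the `<M>` placeholder, the masking rate and the function names. They are not the actual preprocessing of GPT-3 or any other model.

```python
import random


def next_word_examples(tokens):
    """Each prefix of the text becomes an input; the following word is the label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def make_masked_example(tokens, mask_rate=0.3, mask_token="<M>", seed=0):
    """Replace a random subset of words with a placeholder; the hidden words are the labels."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for token in tokens:
        if rng.random() < mask_rate:
            inputs.append(mask_token)
            targets.append(token)
        else:
            inputs.append(token)
    return inputs, targets


tokens = "language models learn from raw text".split()
print(next_word_examples(tokens)[0])
inputs, targets = make_masked_example(tokens, mask_rate=0.5)
print(inputs, targets)
```

In both cases the labels come from the text itself, which is why such large corpora can be used without human annotation.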

Trained for Multiple Purposes

Finally, I’d like to review the T5 model from Google. Previously, language models were adapted to standard NLP tasks, like part-of-speech (POS) tagging or machine translation, with slight modifications. With a little retraining, BERT can be a POS-tagger because it has learned the underlying structure of natural language.

With T5, there is no need for any modifications for NLP tasks. If it gets a text with some <M> tokens in it, it knows that those tokens are gaps to fill with the appropriate words. It can also answer questions. If it receives some context after the question, it searches the context for the answer. Otherwise, it answers from its own knowledge. Fun fact: It beat its own creators in a trivia quiz.

Future of Language Models

There’s a lot of buzz around AI, and many simple decision systems and almost any neural network are called AI, but this is mainly marketing. By definition, artificial intelligence involves human-like intelligence capabilities performed by a machine. Personally, I think the field of language models is the one closest to creating an AI. While transfer learning shines in the field of computer vision, and the notion of transfer learning is essential for an AI system, the very fact that the same model can complete a wide range of NLP tasks and infer what to do from the input is itself spectacular. It brings us one step closer to actually creating human-like intelligence systems.

Frequently Asked Questions

What is a language model?

A language model uses machine learning to assign probabilities to words, creating a probability distribution over words or word sequences. This allows language models to perform tasks like predicting the next word in a text.

What are the types of language models?

There are two main types of language models to know — probabilistic language models and neural network-based language models.

What are some real-world applications of language models?

While language models have many real-world applications, below are a few common use cases:

Speech recognition
Optical character recognition
Machine translation
Handwriting recognition
Text generation
