Covariance is one of the most frequently encountered concepts in statistics and machine learning. While most of us know that variance represents the variation of values in a single variable, we may not be sure what covariance means. Understanding covariance gives you more information for solving multivariate problems. Many methods for preprocessing and predictive analysis depend on it, including multivariate outlier detection, dimensionality reduction and regression.

What Is a Covariance Matrix?

A covariance matrix holds the covariance of each pair of variables in multivariate data. These values describe the magnitude and direction of the data’s spread in a multidimensional space, so they tell you how the data is distributed across each pair of dimensions.

I’m going to explain five things that you should know about covariance. Instead of just focusing on the definition, we will try to understand covariance from its formula. After reading this article, you will be able to answer the following questions.

  1. How is covariance calculated?
  2. What does covariance tell us?
  3. What is a strong covariance?
  4. What does the covariance matrix tell you?
  5. What do the eigenvectors and eigenvalues of the covariance matrix give us?

 

Variance and Covariance Formulas

To better understand covariance, we first need to go over variance. Variance describes how much the values of a single variable vary. It depends on how far the values are from their mean. Take a look at the formula below to see how variance is calculated.

Variance formulas according to the known and unknown population mean. | Image: Sergen Cansiz
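
Written out, the two variants are commonly expressed as follows. This is the standard textbook form, assuming μ for the population mean, x̄ for the sample mean and N for the number of values; the symbols in the original figure may differ.

```latex
% Population variance: the population mean \mu is known
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2

% Sample variance: only the sample mean \bar{x} is known
s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2
```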

In the formula, the mean of the variable is subtracted from each value. The differences are squared and summed, and the sum is divided by the number of values (N) in that variable. What happens when the variance is low or high? The image below shows both cases.

Difference between high and low variance. | Image: Sergen Cansiz
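
To make the difference concrete, here is a minimal sketch that computes the variance step by step for two made-up samples with the same mean. The data and the helper name `variance` are assumptions for illustration only.

```python
import numpy as np

# Hypothetical example data: both samples have a mean of 50.
low_spread = np.array([48.0, 49.0, 50.0, 51.0, 52.0])
high_spread = np.array([10.0, 30.0, 50.0, 70.0, 90.0])


def variance(values, ddof=0):
    """Sum of squared deviations from the mean, divided by N - ddof."""
    deviations = values - values.mean()
    return (deviations ** 2).sum() / (len(values) - ddof)


low_var = variance(low_spread)    # values close to the mean
high_var = variance(high_spread)  # values far from the mean

print(low_var)   # 2.0
print(high_var)  # 800.0
```

Passing `ddof=1` divides by N-1 instead of N, which matches the behavior of `np.var(values, ddof=1)`.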

Now, let’s look at the covariance formula. It’s as simple as the variance formula. Unlike variance, however, covariance is calculated between two different variables. Its purpose is to find a value that indicates how these two variables vary together. In the covariance formula, each variable’s difference from its own mean is computed, and the two differences are multiplied together. You can see this in the formula below.

Covariance formulas according to the known and unknown population mean. | Image: Sergen Cansiz
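
In the same notation as the variance, the two covariance variants are commonly written like this. Again, this is the standard textbook form rather than a copy of the original figure; μ and the barred symbols are assumed names for the population and sample means.

```latex
% Population covariance: the population means \mu_X and \mu_Y are known
\operatorname{cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)

% Sample covariance: only the sample means \bar{x} and \bar{y} are known
\operatorname{cov}(X, Y) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})
```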

The only difference between variance and covariance is that covariance uses the values and means of two variables instead of one.

There are two different formulas, depending on whether the population mean is known. When we work on sample data, we often don’t know the population mean. We only know the sample mean. That’s why we should use the formula with N-1. When we have the entire population, we can use N.
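
The following sketch shows the effect of dividing by N-1 versus N on a small made-up data set. The arrays and the helper `covariance` are hypothetical; `np.cov` is NumPy's own function and divides by N-1 by default.

```python
import numpy as np

# Hypothetical paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])


def covariance(a, b, sample=True):
    """Sum of products of deviations, divided by N-1 (sample) or N (population)."""
    products = (a - a.mean()) * (b - b.mean())
    divisor = len(a) - 1 if sample else len(a)
    return products.sum() / divisor


sample_cov = covariance(x, y, sample=True)       # divides by N-1
population_cov = covariance(x, y, sample=False)  # divides by N

print(sample_cov)      # 1.5
print(population_cov)  # 1.2
```

The sample value agrees with `np.cov(x, y)[0, 1]`, and the population value with `np.cov(x, y, ddof=0)[0, 1]`.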

Covariance Matrix Explained

The second thing that you should know about is the covariance matrix. Because covariance can only be calculated between two variables, a covariance matrix is used to represent the covariance of each pair of variables in multivariate data. The covariance of a variable with itself equals its variance, so the diagonal shows the variance of each variable. Suppose there are two variables, x and y, in our data set. The covariance matrix should look like the formula below.

Two- and three-dimensional covariance matrices. | Image: Sergen Cansiz
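
For reference, the two- and three-dimensional cases can be written out as below. The symbol Σ for the matrix and the third variable name z are assumptions; the layout follows directly from the rule that the diagonal holds the variances and the off-diagonal entries hold the pairwise covariances.

```latex
% Two variables: x and y
\Sigma_{2} =
\begin{bmatrix}
\operatorname{var}(x)    & \operatorname{cov}(x, y) \\
\operatorname{cov}(y, x) & \operatorname{var}(y)
\end{bmatrix}

% Three variables: x, y and z
\Sigma_{3} =
\begin{bmatrix}
\operatorname{var}(x)    & \operatorname{cov}(x, y) & \operatorname{cov}(x, z) \\
\operatorname{cov}(y, x) & \operatorname{var}(y)    & \operatorname{cov}(y, z) \\
\operatorname{cov}(z, x) & \operatorname{cov}(z, y) & \operatorname{var}(z)
\end{bmatrix}
```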

The covariance matrix is symmetric, because the covariance of x and y equals the covariance of y and x, and it shows the covariance of each pair of variables. These values describe the magnitude and direction of the spread of multivariate data in a multidimensional space. By examining them, we can gather information about how the data spreads across each pair of dimensions.
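
As a quick check of these properties, the sketch below builds a covariance matrix with `np.cov` from a small made-up table with three variables. The data is hypothetical; `rowvar=False` tells NumPy that each column, not each row, is a variable.

```python
import numpy as np

# Hypothetical data: 5 observations (rows) of 3 variables (columns).
data = np.array([
    [2.0, 8.0, 1.0],
    [4.0, 6.0, 3.0],
    [6.0, 7.0, 2.0],
    [8.0, 3.0, 5.0],
    [10.0, 1.0, 4.0],
])

# 3 x 3 sample covariance matrix (divides by N-1 by default).
cov_matrix = np.cov(data, rowvar=False)

# The diagonal holds the sample variance of each column.
variances = np.var(data, axis=0, ddof=1)

print(cov_matrix)
print(np.allclose(cov_matrix, cov_matrix.T))        # symmetric
print(np.allclose(np.diag(cov_matrix), variances))  # diagonal = variances
```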

A tutorial explaining the basics of a covariance matrix. | Video: ritvikmath

 

Positive, Negative and Zero States of the Covariance

The third thing that you should know about covariance is its positive, negative and zero states. We can go over the formula to understand them. When Xi-Xmean and Yi-Ymean are both negative or both positive, their product is positive. If the sum of these products is positive, the covariance is positive. This means variable X and variable Y vary in the same direction. In other words, if a value in variable X is high, the corresponding value in variable Y is expected to be high, too. In short, there is a positive relationship between them. A negative covariance is interpreted in the opposite way: when X is high, Y tends to be low, so there is a negative relationship between the two variables.

The covariance is exactly zero only when the sum of the products of Xi-Xmean and Yi-Ymean is zero. It can also be near zero, either because the individual products are close to zero or because the positive and negative products cancel out. In such a scenario, there is no linear relationship between the variables. To see this clearly, look at the following image.

Positive, negative and near-zero covariance. | Image: Sergen Cansiz
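
The three states can be reproduced with a few made-up values. In this sketch the arrays are hypothetical and chosen so that the products of deviations are all positive, all negative, or cancel out exactly.

```python
import numpy as np

# Hypothetical data sharing the same x values.
x = np.array([1.0, 2.0, 3.0, 4.0])
y_up = np.array([2.0, 4.0, 6.0, 8.0])    # rises with x
y_down = np.array([8.0, 6.0, 4.0, 2.0])  # falls as x rises
y_flat = np.array([3.0, 1.0, 1.0, 3.0])  # no upward or downward trend


def sample_covariance(a, b):
    """Sum of products of deviations from the means, divided by N-1."""
    return ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)


cov_up = sample_covariance(x, y_up)      # positive
cov_down = sample_covariance(x, y_down)  # negative
cov_flat = sample_covariance(x, y_flat)  # products cancel out to zero

print(cov_up, cov_down, cov_flat)
```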

Another possible scenario is a distribution like the one in the image below. This happens when the covariance is near zero and the variances of the variables are different.

Different variances and near-zero covariance. | Image: Sergen Cansiz

 

Size of a Covariance Value

Unlike correlation, covariance values are not bounded between -1 and 1. Therefore, it would be wrong to conclude that there is a strong relationship between variables just because the covariance is high. The size of a covariance value depends on the scale of the values in the variables. For instance, if the values in both variables are between 1,000 and 2,000, it’s possible to have a high covariance. However, if the values are between one and two in both variables, the covariance will be low. That doesn’t mean the relationship in the first example is stronger than in the second. Covariance indicates only how the two variables vary together and the direction of their relationship. You can see this in the image below.

High covariance values versus low covariance values. | Image: Sergen Cansiz

Although the covariance in the first figure is very large, the relationship in the second figure can be just as strong or stronger. The values in the above image are given as examples; they aren’t from any data set and aren’t true values.
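
A short sketch of this scale effect: the same made-up data is used twice, once with values between one and two and once multiplied by 1,000. The arrays are hypothetical; `np.corrcoef` is used here only to show that the strength of the relationship is unchanged.

```python
import numpy as np

# Hypothetical data with values between 1 and 2.
x_small = np.array([1.0, 1.2, 1.5, 1.7, 2.0])
y_small = np.array([1.1, 1.3, 1.4, 1.8, 1.9])

# The same data rescaled to values between 1,000 and 2,000.
x_large = x_small * 1000
y_large = y_small * 1000

cov_small = np.cov(x_small, y_small)[0, 1]
cov_large = np.cov(x_large, y_large)[0, 1]

corr_small = np.corrcoef(x_small, y_small)[0, 1]
corr_large = np.corrcoef(x_large, y_large)[0, 1]

# The covariance grows by a factor of 1,000 * 1,000,
# while the correlation stays the same.
print(cov_small, cov_large)
print(corr_small, corr_large)
```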

 

Eigenvalues and Eigenvectors of Covariance Matrix

What do eigenvalues and eigenvectors tell us? They are essential properties of the covariance matrix. Methods that use a covariance matrix to find the magnitude and direction of the spread of data points rely on its eigenvalues and eigenvectors. For example, in principal component analysis (PCA), the eigenvectors give the directions of the principal components and the eigenvalues represent the magnitude of the spread along those directions.

In the image below, the first and second plots show the distribution of points when the covariance is near zero. When the covariance is zero, the eigenvalues are equal to the variances and the eigenvectors point along the x and y axes. The third and fourth plots show the distribution of points when the covariance is different from zero. In these two cases, the eigenvalues and eigenvectors have to be calculated from the covariance matrix.

Eigenvalues and eigenvectors of covariance and their effects on direction and magnitude. | Image: Sergen Cansiz
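
The two cases can be compared numerically. This is a minimal sketch with two made-up covariance matrices that share the same variances (4 and 1); the matrix values are assumptions, not the ones used in the figure.

```python
import numpy as np

# Hypothetical covariance matrices with the same variances on the diagonal.
cov_zero = np.array([[4.0, 0.0],
                     [0.0, 1.0]])    # zero covariance
cov_tilted = np.array([[4.0, 1.5],
                       [1.5, 1.0]])  # nonzero covariance

vals_zero, vecs_zero = np.linalg.eig(cov_zero)
vals_tilted, vecs_tilted = np.linalg.eig(cov_tilted)

# Zero covariance: the eigenvalues are the variances themselves,
# and each eigenvector lies along one of the axes.
print(np.sort(vals_zero))
print(vecs_zero)

# Nonzero covariance: the eigenvalues differ from the variances,
# but they still add up to the total variance (4 + 1).
print(np.sort(vals_tilted))
print(vals_tilted.sum())
```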

As you can see in the above image, the eigenvalues represent the magnitude of the spread along each eigenvector, and the eigenvectors show the direction of that spread. Here, v is the matrix whose columns are the eigenvectors. When the covariance is positive, the angle of the spread relative to the x-axis can be found from the arccosine of the value v[0,0]. When the covariance is negative, the arccosine of v[0,0] gives the size of the angle, but the spread slopes downward instead of upward.

How do you find eigenvalues and eigenvectors from the covariance matrix? You can find both using NumPy in Python. The first thing you should do is compute the covariance matrix with the function numpy.cov(). Once you have the covariance matrix M, you can call numpy.linalg.eig(M) to get its eigenvalues and eigenvectors.
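
Put together, the two steps might look like the sketch below. The data is randomly generated for illustration, and the variable names are assumptions. One deliberate deviation: the angle is computed with `np.arctan2` on both components of the leading eigenvector rather than with an arccosine, because the sign of an eigenvector returned by NumPy is arbitrary.

```python
import numpy as np

# Hypothetical data: y depends on x, so the covariance is positive.
rng = np.random.default_rng(seed=0)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.5, size=500)

# Step 1: 2 x 2 covariance matrix of x and y.
M = np.cov(x, y)

# Step 2: eigenvalues w and eigenvectors v (one eigenvector per column).
w, v = np.linalg.eig(M)

# eig does not sort its output, so order by eigenvalue, largest first.
order = np.argsort(w)[::-1]
w, v = w[order], v[:, order]

# Direction of the largest spread, as an angle from the x-axis in [0, 180).
principal = v[:, 0]
angle_deg = np.degrees(np.arctan2(principal[1], principal[0])) % 180

print(M)
print(w)
print(angle_deg)
```

Because a covariance matrix is symmetric, `np.linalg.eigh` is a reasonable alternative; it is designed for symmetric matrices and returns the eigenvalues in ascending order.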

Why Covariance Is Important

Covariance is one of the most used measurements in data science. Understanding covariance in detail provides more opportunities to understand multivariate data. 
