When dealing with statistics and machine learning problems, one of the most frequently encountered concepts is covariance. While most of us know that variance represents the variation of values in a single variable, we may not be sure what covariance means. However, understanding covariance can provide more information to help you solve multivariate problems. Many methods for preprocessing or predictive analysis also depend on covariance, including multivariate outlier detection, dimensionality reduction and regression.
What Is a Covariance Matrix?
Covariance matrices represent the covariance values of each pair of variables in multivariate data. These values show the magnitude and direction of the spread of multivariate data in a multidimensional space, and they allow you to gather information about how the data spreads across two dimensions.
I’m going to explain five things that you should know about covariance. Instead of just focusing on the definition, we will try to understand covariance from its formula. After reading this article, you will be able to answer the following questions.
- How is covariance calculated?
- What does covariance tell us?
- What is a strong covariance?
- What does the covariance matrix tell you?
- What do the eigenvectors and eigenvalues of the covariance matrix give us?
Variance and Covariance Formulas
In order to better understand covariance, we need to first go over variance. Variance explains how the values vary within a single variable, and it depends on how far the values are from each other. Take a look at the formula below to understand how variance is calculated.
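Based on that description, the population variance of a variable x with N values can be written as:

$$\mathrm{var}(x) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$$

where $\bar{x}$ is the mean of the variable.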
In the formula, the mean of the variable is subtracted from each value. After the differences are squared, their sum is divided by the number of values (N) in that variable. What happens when the variance is low or high? A low variance means the values cluster tightly around the mean, while a high variance means they are spread widely around it, as you can see in the image below.
Now, let’s look at the covariance formula. It’s as simple as the variance formula. Unlike variance, however, covariance is calculated between two different variables. Its purpose is to find the value that indicates how these two variables vary together. In the covariance formula, the differences of each variable’s values from its mean are multiplied together. You can see this in the formula below.
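Following the same structure as the variance formula, the population covariance between two variables x and y can be written as:

$$\mathrm{cov}(x, y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})$$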
The only difference between variance and covariance is that you’re using the values and means of two variables instead of one in covariance. Now, let’s take a look at the second thing that you should know.
There are two different formulas depending on whether the population is known or unknown. When we work on sample data, we often don’t know the population mean; we only know the sample mean. That’s why we should use the formula that divides by N-1. When we have the entire population, we can use N.
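To make the distinction concrete, here is a small NumPy sketch (the data values are made up for the example). Note that np.cov divides by N-1 by default, and the bias=True argument switches it to N:

```python
import numpy as np

# Made-up values, for illustration only
x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8.0, 10.0, 12.0, 14.0])

# Sample covariance: divides by N - 1 (NumPy's default for np.cov)
sample_cov = np.cov(x, y)[0, 1]

# Population covariance: divides by N (use when you have the whole population)
population_cov = np.cov(x, y, bias=True)[0, 1]

print(sample_cov, population_cov)
```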
Covariance Matrix Explained
The second thing that you should know about is the covariance matrix. Because covariance can only be calculated between two variables, a covariance matrix is used to represent the covariance values of each pair of variables in multivariate data. Also, the covariance of a variable with itself equals its variance, so the diagonal of the matrix shows the variance of each variable. Suppose there are two variables, x and y, in our data set. The covariance matrix should look like the formula below.
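For two variables x and y, that matrix has the variances on the diagonal and the covariances off the diagonal:

$$C = \begin{bmatrix} \mathrm{var}(x) & \mathrm{cov}(x, y) \\ \mathrm{cov}(y, x) & \mathrm{var}(y) \end{bmatrix}$$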
This symmetric matrix shows the covariances of each pair of variables. The values in the covariance matrix show the magnitude and direction of the spread of multivariate data in a multidimensional space. By examining these values, we can gather information about how the data spreads across two dimensions.
Positive, Negative and Zero States of the Covariance
The third thing that you should know about covariance is its positive, negative and zero states. We can go over the formula to understand them. When $x_i - \bar{x}$ and $y_i - \bar{y}$ are both negative or both positive at the same time, their product is positive. If the sum of these products is positive, the covariance is positive. This means variable x and variable y vary in the same direction: if a value in variable x is higher, the corresponding value in variable y is expected to be higher, too. In short, there is a positive relationship between them. If the covariance is negative, this is interpreted in the opposite direction. That is, there is a negative relationship between the two variables.
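A quick NumPy sketch illustrates the two states with made-up data; the exact numbers don’t matter, only the direction in which the pairs move together:

```python
import numpy as np

# Made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.0, 4.0, 5.0, 4.0, 6.0])    # tends to rise as x rises
y_down = np.array([9.0, 7.0, 6.0, 4.0, 3.0])  # tends to fall as x rises

print(np.cov(x, y_up)[0, 1])    # positive covariance
print(np.cov(x, y_down)[0, 1])  # negative covariance
```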
The covariance can only be zero when the sum of the products of $x_i - \bar{x}$ and $y_i - \bar{y}$ is zero. Individual products are zero when one or both of the differences are zero, and the overall sum is near zero when the positive and negative products cancel each other out. In such a scenario, there isn’t any relationship between the variables. To understand it clearly, you can see the following image.
In another possible scenario, we can have a distribution like the one in the image below. This happens when the covariance is near zero and the variances of the variables are different.
Size of a Covariance Value
Unlike correlation, covariance values are not limited to the range between -1 and 1. Therefore, it could be wrong to conclude that there is a strong relationship between variables just because the covariance is high. The size of a covariance value depends on the scale of the values in the variables. For instance, if the values in both variables are between 1,000 and 2,000, it’s possible to have a high covariance. However, if the values are between one and two in both variables, it’s possible to have a low covariance. Therefore, we can’t say the relationship in the first example is stronger than in the second. The covariance stands only for the variation and the direction of the relationship between two variables. You can understand it from the image below.
Although the covariance in the first figure is very large, the relationship in the second figure can be just as strong or stronger. The values in the above image are given as examples; they aren’t from any data set and aren’t true values.
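To see why, here is a small NumPy sketch with synthetic data. Scaling both variables by 1,000 multiplies the covariance by a factor of one million, while the correlation stays exactly the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: values roughly between one and two
x_small = rng.normal(1.5, 0.2, 500)
y_small = x_small + rng.normal(0, 0.1, 500)

# The same pattern scaled up to the 1,000-2,000 range
x_large = x_small * 1000
y_large = y_small * 1000

print(np.cov(x_small, y_small)[0, 1])       # small covariance
print(np.cov(x_large, y_large)[0, 1])       # one million times larger
print(np.corrcoef(x_small, y_small)[0, 1])  # correlation is unchanged...
print(np.corrcoef(x_large, y_large)[0, 1])  # ...for both scales
```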
Eigenvalues and Eigenvectors of Covariance Matrix
What do eigenvalues and eigenvectors tell us? These are essential quantities derived from the covariance matrix. Methods that use a covariance matrix to find the magnitude and direction of the spread of the data points rely on eigenvalues and eigenvectors. For example, in principal component analysis (PCA), the eigenvalues represent the magnitude of the spread in the direction of the principal components.
In the image below, the first and second plots show the distribution of points when the covariance is near zero. When the covariance is zero, the eigenvalues are directly equal to the variance values. The third and fourth plots represent the distribution of points when the covariance is different from zero. Unlike in the first two, the eigenvalues and eigenvectors have to be calculated for these two cases.
As you can see in the above image, the eigenvalues represent the magnitude of the spread for both variables x and y, and the eigenvectors show the direction. It’s possible to find the angle of propagation from the arccosine of the value v[0,0] when the covariance is positive. If the covariance is negative, the cosine of the value v[0,0] gives the spread direction.
How do you find the eigenvalues and eigenvectors of the covariance matrix? You can find both using NumPy in Python. The first thing you should do is find the covariance matrix using the method numpy.cov(). After you’ve found the covariance matrix, you can use the method numpy.linalg.eig(M) to find the eigenvectors and eigenvalues.
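Putting those two steps together, here is a minimal sketch with synthetic two-dimensional data (the values and seed are made up for the example):

```python
import numpy as np

# Synthetic data with a positive relationship, for illustration only
rng = np.random.default_rng(42)
x = rng.normal(0, 2, 1000)
y = 0.8 * x + rng.normal(0, 1, 1000)

# Step 1: the covariance matrix of the two variables
cov_matrix = np.cov(x, y)

# Step 2: eigenvalues (magnitude of the spread) and eigenvectors (direction)
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

print(cov_matrix)
print(eigenvalues)
print(eigenvectors)

# Angle of the spread direction from the first eigenvector's first component,
# following the arccosine rule described above
angle_degrees = np.degrees(np.arccos(eigenvectors[0, 0]))
print(angle_degrees)
```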
Why Covariance Is Important
Covariance is one of the most widely used measures in data science. Understanding covariance in detail gives you more opportunities to understand multivariate data.