# How to Find Variance Using Python

If you’re wondering how to find the variance in your data set, look no further. Here’s how to calculate variance in a snap with Pandas. Eric Kleppen
Expert Columnist
June 2, 2022 Eric Kleppen
Expert Columnist
June 2, 2022

Variance is a powerful statistic used in data analysis and machine learning. It is one of the four main measures of variability along with range, interquartile range (IQR) and standard deviation. Understanding variance is important because it gives you insight into the spread of your data and can be used to compare differences in sample groups or identify important modeling features. Variance is also used in machine learning to understand changes in model performance due to using different samples of training data.

## What Is Variance?

Variance is a statistic that measures dispersion. Low variance indicates that values are generally similar and do not vary widely from the mean while high variance indicates values are more widely dispersed from the mean. You can use variance on either a sample set or the entire population as the calculation takes in all data points in the given set. Although the calculation differs slightly when you’re looking at a sample versus population, you can calculate the variance as the average of the squared differences from the mean.

Calculating variance is easy using Python. Before diving into the Python code, I’ll first explain what variance is and how you can calculate it. By the end of this tutorial you’ll have a better understanding of why variance is an important statistic, along with several methods for calculating it using Python.

More Data Science TutorialsHow to Find Outliers With IQR Using Python

## What Is Variance?

Variance is a statistic that measures dispersion. Low variance indicates that values are generally similar and do not vary widely from the mean while high variance indicates values are more widely dispersed from the mean. You can use variance on either a sample set or the entire population as the calculation takes in all data points in the given set. Although the calculation differs slightly when you’re looking at a sample versus population, you can calculate the variance as the average of the squared differences from the mean.

Since the variance is a squared value, it can be difficult to interpret compared to other measures of variability like standard deviation. Regardless, reviewing variance can be helpful; doing so can make it easier for you to decide which statistical tests to use with your data. Depending on the statistical tests, uneven variance between samples could skew or bias results.

## How to Find the Variance?

Calculating the variance for a data set can differ based on whether the set is the entire population or a sample of the population. The formula for calculating the variance of an entire population looks like this: σ² = ∑ (Xᵢ— μ)² / N

One of the popular statistical tests that applies variance is called the analysis of variance (ANOVA) test. An ANOVA test is used to gauge whether any of the group means are significantly different from one another when analyzing a categorical independent variable and a quantitative dependent variable. For example, say you want to analyze whether social media use impacts hours of sleep. You could break social media use into different categories like low use, medium use and high use, then run an ANOVA test to gauge whether there are statistical differences between the group means. The test can show whether results are explained by group differences or individual differences.

More From Built In Data ScientistsCalculating Quartiles: A Step-by-Step Explanation

## How Do You Find the Variance?

Calculating the variance for a data set can differ based on whether the set is the entire population or a sample of the population.

The formula for calculating the variance of an entire population looks like this:

`σ² = ∑ (Xᵢ— μ)² / N`

An explanation of the formula:

• `σ²` = population variance
• `Σ` = sum of…
• `Χᵢ` = each value
•  `μ` = population mean
• `Ν` = number of values in the population

Using an example range of numbers, let’s walk through the calculation step by step.

Example range of numbers: `8, 6, 12, 3, 13, 9`

Find the population mean (`μ`):

``````(8+ 6+12+ 3+ 13+ 9) / 6

= 51/6

= 8.5``````

Calculate deviations from the mean by subtracting the mean from each value.

``````8–8.5 = -0.5

6–8.5 = -2.5

12–8.5 = 3.5

3–8.5 = -5.5

13–8.5 = 4.5

9–8.5 = 0.5``````

Square each deviation to get a positive number.

``````-0.5² = 0.25

-2.5² = 6.25

3.5² = 12.25

-5.5² = 30.25

4.5² = 20.25

0.5² = 0.25``````

Sum the squared values.

``````0.25 + 6.25 + 12.25 + 30.25 + 20.25 + 0.25

= 69.5``````

Divide the sum of squares by `N` or `n-1`.

Since we’re working with the entire population, we’ll divide by `N`. If we were working with a sample of the population, we would divide by `n-1`

`69.5/6 = 11.583`

There we have it! The variance of our population is `11.583`.

### Why Use n-1 When Calculating the Sample Variance?

Applying `n-1` to the formula is called Bessel’s correction, named after Friedrich Bessel. When using samples, we need to calculate the estimated variance for the population. If we used `N` instead of `n-1` for the sample, the estimate would be biased, potentially underestimating the population variance. Using `n-1` will make the variance estimate larger, overestimating variability in samples, thus reducing biases.

Let’s recalculate the variance pretending the values are from a sample:

``````69.5 / (6–1)

= 69.5/ 5

= 13.9``````

As we can see, the variance is larger!

More From EricHow to Practice Word2Vec for NLP Using Python

## Calculating Variance Using Python

Now that we’ve done the calculation by hand, we can see that completing it for a large set of values would be very tedious. Luckily, Python can easily handle the calculation for very large data. We will explore two methods using Python:

• Write our own variance calculation function
• Use Pandas’ built-in function

### Writing a Variance Function

As we begin to write a function to calculation variance, think back to the steps we took when calculating by hand. We want the function to take in two parameters:

• `population`an array of numbers
• `is_sample`a Boolean to alter the calculation depending on whether we’re working with a sample or population

Start by defining the function that takes in the two parameters.

``def calculate_variance(population, is_sample = False):``

Next, add logic to calculate the population mean.

``````#calculate the mean
mean = (sum(population) / len(population))
``````

After calculating the mean, find the differences from the mean for each value. You can do this  in one line using a list comprehension.

``````#calculate differences
diff = [(v - mean) for v in population]
``````

Next, square the differences and sum them.

``````#Square differences and sum
sqr_diff = [d**2 for d in diff]
sum_sqr_diff = sum(sqr_diff)
``````

Lastly, calculate the variance. Using an `If/Else` statement, we can utilize the `is_sample`parameter. If `is_sample`is true, calculate variance using (`n-1`). If it is false (the default), use `N`:

``````#calculate variance
if is_sample == True:
variance = sum_sqr_diff/(len(population) - 1)
else:
variance = sum_sqr_diff/(len(population))

return variance
The complete calculate_variance function will look like this:
def calculate_variance(population, is_sample = False):
#calculate the mean
mean = (sum(population) / len(population))

#calculate differences
diff = [(v - mean) for v in population]

#Square differences and sum
sqr_diff = [d**2 for d in diff]

sum_sqr_diff = sum(sqr_diff)

#calculate variance
if is_sample == True:
variance = sum_sqr_diff/(len(population) - 1)
else:
variance = sum_sqr_diff/(len(population))

return variance``````

We can test the calculation using the range of numbers we crunched by hand:

## Finding Variance Using Pandas

Although we can write a function to calculate variance in less than 10 lines of code, there is an even easier way to find variance. You  can do it in one line of code using Pandas. Let’s load up some data and work through a real example of finding variance.

The Pandas example uses the BMW Price Challenge data set from Kaggle, which is free to download. Begin by importing the Pandas library, and then reading the CSV file into a Pandas data frame:

``````#import dependencies
import pandas as pd

``````

We can count the number of rows in the data set and display the first five rows to make sure everything loaded correctly:

``````#print row count
print(len(bmw_df))
#4843 rows

#verify the dataframe is as expected
``````

### Finding the Variance for the BMW Data

Since the BMW data set is 4843 rows, calculating that by hand would…not be fun. Instead we can simply plug in the column from the data frame into our `calculate_variance` function and return the variance. Let’s find the variance for the numeric columns `mileage`, `engine_power` and `price`.

### Using Pandas var() function

In case we forget the calculation for variance and cannot write our own function, Pandas has a built-in function to calculate variance named `var()`. By default, it assumes a sample population and uses n-1 in the calculation; however, you can adjust the calculation by passing in the `ddof=0` argument.

``bmw_df[['mileage', 'engine_power', 'price']].var()``

As we can see the `Var()` function matches the values produced by our `calculate_variance` function, and it’s only one line of code. Reviewing the results, we can see `mileage` has a high variance meaning the values tend to vary from the mean by a lot. That makes sense because many factors play into the distance a person needs to drive. By comparison, `engine_power` has a low variance which indicates the values don’t vary widely from the mean.

What Do You Want to Learn Next?How to Use Float in Python (With Sample Code!)

## The Takeaway

Understanding variance can be an important part of data analysis and machine learning because you  can use it to assess group differences. Variance also impacts which statistical tests can help us make data driven decisions. High variance means values are greatly dispersed from the mean, while low variance means numbers are not widely dispersed from the mean. If we have a small set of values, it’s possible to calculate the variance by hand in only five steps. For large data sets, we saw how simple it is to calculate variance using Python and Pandas. The `Var()` function in Pandas calculates the variance for the numerical columns in a data frame in only one line of code, which is pretty handy!

## Infinite Campus

Expert Contributors

Built In’s expert contributor network publishes thoughtful, solutions-oriented stories written by innovative tech professionals. It is the tech industry’s definitive destination for sharing compelling, first-person accounts of problem-solving on the road to innovation.