Every data set has issues: points that don't make sense. These points, referred to as outliers, can reflect either problems in the data collection process or real phenomena that aren't representative of what typically happens. Here are a few examples of outliers that I've seen in real data sets:
- Sensor Error: In one project we were monitoring how people used hot water in their homes (with their knowing consent!). The data set included data showing a home using 400 gallons of water per minute, when the maximum output from the shower head was two gallons per minute. Clearly the sensor was reporting erroneously.
- Atypical Behavior: In that same project we recorded showers lasting over an hour. I'm not talking about one or two showers; it was a daily occurrence in one of the homes. Since it was so frequent, we were confident that it was real behavior and not a sensor issue. And because the typical shower is about 10 minutes long, we knew that this person took showers far longer than everybody else's.
Including outliers in your data analysis skews your data set and negatively impacts your results. Therefore it's important to make sure your data set excludes outliers and uses only realistic data.
Let’s talk about how to do that using IQR (interquartile ranges).
How to Find Outliers With IQR
A Word of Caution
Before talking through the details of how to write Python code to remove outliers, it's important to mention that removing outliers is more of an art than a science. You need to carefully determine what is an outlier and what is not based on the context of your project. Here are a few examples using the outliers described above:
- Sensor Error: If I want to analyze the data to understand how people use hot water, then 400 gallons per minute is an outlier and should be discarded. Nobody actually used that much water, and including it would erroneously change the data set. On the other hand, if my study is about the accuracy and reliability of the sensor, then those data points accurately show that the sensor is occasionally very inaccurate. That’s important information!
- Atypical Behavior: If I want to understand how people commonly use hot water, then hour-long showers are outliers and should be discarded. This is highly unusual behavior, and including it will cause my results to misalign with typical behavior. But if I want to know how much behavior varies between people, then I need to include the atypical events, because that's the point of the entire study.
Identifying and Removing Outliers
With that word of caution in mind, one common way of identifying outliers is based on analyzing the statistical spread of the data set. In this method you identify the range of the data you want to use and exclude the rest. To do so you:
- Decide the range of data that you want to keep.
- Write the code to remove the data outside of that range.
Here’s a Python-based example using NumPy to exclude the highest and lowest five percent of all data points from a data set.
import numpy as np
np.random.seed(0)  # fix the seed so the results are reproducible
random_data = np.random.rand(100, 1)  # 100 random values between 0 and 1
p95, p5 = np.percentile(random_data, [95, 5])  # values at the 95th and 5th percentiles
print(p95, p5)
print(len(random_data))  # still the full 100 values
random_data = random_data[random_data < p95]  # drop the top five percent
random_data = random_data[random_data > p5]  # drop the bottom five percent
print(len(random_data))  # 90 values remain
That code yields the following outputs:
0.9456186092221561 0.05917358766052238
100
90
The first line in the above code imports the NumPy package for use in the analysis process. If you want to do data science with Python I recommend getting very, very familiar with NumPy.
The second line sets a seed for NumPy's random number generator, telling it to return the same pseudo-random numbers every time the code runs.
The third line creates a new variable, random_data, which is an array of 100 random values between 0 and 1.
The fourth line is where the magic starts. That line calls the NumPy percentile function to identify the values of the data at the ninety-fifth and fifth percentiles, then stores those values in the p95 and p5 variables. These values set the bounds that will later be used to limit the data set.
The next two lines are print statements showing what's happening. The first prints p95 and p5 so that you can see the values at those percentiles. They're quite close to 0.95 and 0.05, which is what we expect for data drawn uniformly from the range between 0 and 1. The second prints the length of random_data, showing that it still contains the 100 values that were originally generated.
The following two lines both reduce the data set based on the bounds specified above. The first line modifies random_data to include only the values that are less than p95, and the following line adjusts it again to include only the values that are greater than p5. In this way, the data set is reduced to only the values within the bounds set by the fifth and ninety-fifth percentiles. That keeps 90 percent of the total values, which is equivalent to treating the largest and smallest five percent as outliers.
The final line prints the length of random_data after modification, and we can see that it's now reduced to 90 data points, as expected.
Using the IQR (Interquartile Range)
In order to limit the data set based on percentiles, you must first decide what range of the data you want to keep. One common way to set that range is the IQR. The IQR is a statistical measure of spread: the distance between the first and third quartiles, which together bound the middle 50 percent of the data points. The IQR is commonly used when people want to examine what the middle group of a population is doing. For instance, we often see the IQR used to understand a school's SAT or state standardized test scores.
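To make that definition concrete, here's a minimal sketch computing the quartiles and IQR of a small, made-up set of test scores (the numbers are purely illustrative):
import numpy as np
scores = np.array([52, 61, 68, 70, 72, 75, 78, 80, 84, 95])
q3, q1 = np.percentile(scores, [75, 25])  # third and first quartiles
print(q1, q3, q3 - q1)  # 68.5 79.5 11.0 -- the middle 50 percent spans 11 points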
When using the IQR to remove outliers, you remove all points that lie outside the range from quartile_1 - 1.5 * IQR to quartile_3 + 1.5 * IQR. For example, consider the following calculations.
quartile_1 = 0.45
quartile_3 = 0.55
IQR = 0.1
lower_bound = 0.45 - 1.5 * 0.1 = 0.3
upper_bound = 0.55 + 1.5 * 0.1 = 0.7
The following code shows an example of using IQR to identify and remove outliers.
import numpy as np
np.random.seed(0)  # fix the seed so the results are reproducible
random_data = np.random.standard_normal(100000)  # 100,000 points from a standard normal distribution
q3, q1 = np.percentile(random_data, [75, 25])  # the third and first quartiles
print(q3, q1)
print(len(random_data))
IQR = q3 - q1  # the interquartile range
print(IQR)
upper_bound = q3 + 1.5 * IQR
lower_bound = q1 - 1.5 * IQR
print(upper_bound)
print(lower_bound)
random_data = random_data[random_data < upper_bound]  # drop points above the upper bound
random_data = random_data[random_data > lower_bound]  # drop points below the lower bound
print(len(random_data))  # 99,249 points remain
That code returns the following outputs:
0.6734387006684548 -0.6686416961814367
100000
1.3420803968498913
2.686559295943292
-2.6817622914562738
99249
The code rejecting outliers using the IQR differs from the prior example code in the following ways:
- It creates an array of 100,000 values using a standard normal distribution. I made this change to ensure that the data set would include some outliers as defined by the IQR.
- The percentiles have been changed from 95 and five to 75 and 25. This change calculates the values of the quartiles instead of the more extreme percentiles.
- p95 and p5 have been renamed to q3 and q1, indicating that they're no longer tracking the ninety-fifth and fifth percentiles, but are instead tracking the third and first quartiles.
- The IQR is calculated by subtracting q1 from q3, and printed so you can see the calculated IQR.
- The code calculates the upper and lower bounds as 1.5 * IQR beyond the third and first quartiles, respectively, then prints those bounds.
- The code then limits random_data to all points within the upper and lower bounds as defined using this method, instead of as defined using the ninety-fifth and fifth percentiles.
The outputs show that the code follows the same process with the new requirements. It prints that the third quartile is at approximately 0.67, and the first quartile at approximately -0.67. Given that NumPy's standard_normal function draws from a distribution with a standard deviation of 1, whose true quartiles sit at about +/-0.674, these numbers are almost exactly as expected. The code then prints that the total data set holds 100,000 points. The IQR comes out to 1.34, which leads to upper and lower bounds of 2.69 and -2.68. Filtering the data to only the values within those two thresholds yields a data set of 99,249 points, indicating that 751 were outside of that range and removed.
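If you find yourself repeating this process across data sets, you can wrap it in a helper function. Here's a minimal sketch of one way to do that; the name remove_outliers_iqr and its multiplier parameter are my own inventions for illustration, not part of NumPy:
import numpy as np

def remove_outliers_iqr(data, multiplier=1.5):
    """Return only the values within multiplier * IQR of the quartiles."""
    q3, q1 = np.percentile(data, [75, 25])
    iqr = q3 - q1
    upper_bound = q3 + multiplier * iqr
    lower_bound = q1 - multiplier * iqr
    return data[(data > lower_bound) & (data < upper_bound)]

np.random.seed(0)
random_data = np.random.standard_normal(100000)
print(len(remove_outliers_iqr(random_data)))  # 99249, matching the walkthrough above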
And that's how you do it! You can now identify outliers through both a practical and a statistical approach, write generic code to remove them, and evaluate your data set using the common interquartile range metric.