What’s Wrong With Your Statistical Model? Skewed Data.

Skewed data can wreak havoc on your statistical model. Here’s what it is and how to handle it.

Written by Rajat Sharma
Image: Shutterstock / Built In
Image: Shutterstock / Built In
Brand Studio Logo
UPDATED BY
Brennan Whitfield | Nov 08, 2022

First things first: What is skewed data? We say data is skewed when the curve appears distorted to the left or right in a statistical distribution. In a normal distribution, the graph appears symmetrical, which means there are as many data values on the left side of the median as on the right side.

What Is Skewed Data?

Skewed data is data that creates an uneven curve distribution on a graph. We know data is skewed when the statistical distribution’s curve appears distorted to the left or right.

Let’s look at this height distribution graph as an example:

skewed data
In this graph, green indicates males and yellow indicates females.

Here, you can see the green graph (males) has symmetry at about 69, and the yellow graph (females) has symmetry at about 64. So, it means that most of the males in this data set have a height near 69, and most females have a height near 64. Then there are a few males who have height near 75 and 63 and females who have height near to 68 and 58.

skewed data
Image: theschoolrun.com

In the case of normal distribution, the mean, median and mode are close together. These three are all measures of the center of data. We can determine the skewness of the data by how these quantities relate to one another.

 

Right (or Positively) Skewed Data

A right-skewed distribution has a long tail that extends to the right or positive side of the x-axis, as you can see in the below plot.

skewed data

Here you can see the positions of all three data points on the plot. So, you see:

  • The mean is greater than the mode.
  • The median is greater than the mode.
  • The mean is greater than the median.

While the mean and the median will always be greater than the mode in a right-skewed distribution, the mean may not always be greater than the median.

Let’s look at some real-world examples.

skewed data

You can see this is right-skewed data with its tail in the positive side of the distribution. Here the distribution tells us that most people have incomes around $20,000 a year and the number of people with higher incomes exponentially decreases as we move to the right.

Now take a look at the following distribution from the 2002 General Social Survey. Respondents stated how many people older than 18 lived in their household.

skewed data

Here the distribution is skewed to the right. Although the mean is generally to the right of the median in a right-skewed distribution, that isn’t the case here.

 

Left (or Negatively) Skewed Data

A left-skewed distribution has a long tail that extends to the left (or negative) side of the x-axis, as you can see in the below plot.

skewed data

Here you can see the positions of all three data points on the plot. So, you will find:

  • The mean is greater than the mode.
  • The median is greater than the mode.
  • The mean is greater than the median.

While the mean and the median will always be greater than the mode in a right-skewed distribution, the mean may not always be greater than the median.

Let’s look at another real-world example.

skewed data

Here the distribution tells us most people die at an age of 90 (mode). Average life expectancy would be around 75 to 85 (mean). In the above distribution, you can see a small peak at the very beginning, which indicates there is a small percentage of the population who die during birth or in infancy. This population is acting as an outlier in our distribution.

 

What is Skewness? | Statistics | Don’t Memorise | Video: Don’t Memorise

 

My Data Is Skewed. So What?

Real-world distributions are usually skewed as we see in the above examples. But if there’s too much skewness in the data, then many statistical models don’t work effectively. Why is that?

In skewed data, the tail region may act as an outlier for the statistical model, and we know that outliers adversely affect a model’s performance, especially regression-based models. While there are statistical models that are robust enough to handle outliers like tree-based models, you’ll be limited in what other models you can try. So what do you do? You’ll need to transform the skewed data so that it becomes a Gaussian (or normal) distribution. Removing outliers and normalizing our data will allow us to experiment with more statistical models.

More From Our Data Science ExpertsWhat Is Multiple Regression?

 

Log Transformation for Skewed Data

Log Transformation is a data transformation method in which we apply logarithmic function to the data. It replaces each value x with log(x). A log transformation can help to fit a very skewed distribution into a Gaussian one. After log transformation, we can see patterns in our data much more easily. Here’s an example:

skewed data

In the above figure, you can clearly see the patterns after applying log transformation. Before that, we had too many outliers present, which will negatively affect our model’s performance.

If we have skewed data, then it may, well, skew our results. So, in order to use skewed data, we have to apply a log transformation over the whole set of values to discover patterns in the data and make it possible to draw insights from our statistical model.

Explore Job Matches.