Data scientists often need to check whether data is normally distributed. One example is the normality check on the residuals of a linear regression, which is required to use the F-test correctly. One way to do that is the Shapiro-Wilk test, a hypothesis test applied to a sample whose null hypothesis is that the sample was generated from a normal distribution.
Shapiro-Wilk Test Explained
The Shapiro-Wilk test is a hypothesis test that evaluates whether a data set is normally distributed. It evaluates a sample under the null hypothesis that the data was generated from a normal distribution. A large p-value means we cannot reject that null hypothesis, so the data is compatible with normality; a low p-value indicates that the data isn’t normally distributed.
Let’s see how we can check the normality of a data set.
What Is Normality?
Normality means that a particular sample has been generated from a Gaussian distribution. It doesn’t necessarily have to be a standard normal distribution, with a zero mean and a variance equal to one.
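For instance, a sample drawn from a Gaussian with a non-zero mean and a non-unit variance is still normal. A minimal NumPy sketch (the mean and standard deviation below are arbitrary choices for illustration):

import numpy as np

# A Gaussian sample with mean 10 and standard deviation 3: not standard, but still normal
sample = np.random.normal(loc=10.0, scale=3.0, size=500)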
There are several situations in which data scientists may need normally distributed data:
- To compare the residuals of linear regression in the training set with the residuals in the test set using an F-test (see the sketch after this list).
- To compare the mean value of a variable across different groups using a one-way analysis of variance (ANOVA) test or a Student’s t-test.
- To assess the linear correlation between two variables using a proper test on their Pearson’s correlation coefficient.
- To check whether a feature is normally distributed conditional on the target, which justifies using a Gaussian Naive Bayes classification model.
These are all different examples that may occur frequently in a data scientist’s everyday job.
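As a quick illustration of the first kind of use case, here is a minimal sketch that fits a simple linear regression on synthetic data with NumPy and runs the Shapiro-Wilk test (introduced below) on its residuals. The variable names and the synthetic data are assumptions for illustration only:

import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)  # synthetic linear relationship with Gaussian noise

# Fit y = slope * x + intercept by least squares and compute the residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

print(shapiro(residuals))  # a large p-value is expected: the noise is Gaussian by construction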
Unfortunately, data is not always normally distributed, although we can apply certain transformations to make a distribution more symmetrical, such as a power transformation.
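For example, here is a minimal sketch of a power transformation using SciPy’s boxcox function, which requires strictly positive data; the lognormal sample is just an illustrative assumption:

import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(0)
skewed = rng.lognormal(size=300)  # right-skewed, strictly positive sample

# Box-Cox selects the power-transform parameter lambda by maximum likelihood
transformed, lam = boxcox(skewed)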
A good way to assess the normality of a data set is a Q-Q plot, which gives us a graphical visualization of normality. But we often need a quantitative result, and a chart may not be enough. That’s why we can use a hypothesis test to assess the normality of a sample.
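As a sketch, a Q-Q plot against the normal distribution can be drawn with SciPy’s probplot function and Matplotlib:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import probplot

sample = np.random.normal(size=300)
probplot(sample, dist="norm", plot=plt)  # points close to the reference line suggest normality
plt.show()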
What Is the Shapiro-Wilk Test?
The Shapiro-Wilk test is a hypothesis test that is applied to a sample with a null hypothesis that the sample has been generated from a normal distribution. If the p-value is low, we can reject that null hypothesis and say that the sample was not generated from a normal distribution.
It’s an easy-to-use statistical tool that can help us find an answer to the normality check we need, but it has one flaw: It doesn’t work well with large data sets. The maximum allowed size for a data set depends on the implementation; in Python’s SciPy implementation, a sample larger than 5,000 points only yields an approximate p-value.
However, this test is still a very powerful tool we can use. Let’s see a practical example in Python.
Shapiro-Wilk Test Example in Python
First of all, let’s import NumPy and Matplotlib.
import numpy as np
import matplotlib.pyplot as plt
Now, we have to import the function that calculates the p-value of a Shapiro-Wilk test. It’s the “shapiro” function in scipy.stats.
from scipy.stats import shapiro
Let’s now simulate two data sets, one generated from a normal distribution and the other generated from a uniform distribution.
x = np.random.normal(size=300)
y = np.random.uniform(size=300)
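The two histograms shown below can be reproduced with Matplotlib, which we already imported (the bin count is an arbitrary choice):

plt.hist(x, bins=30)
plt.title("Histogram of x (normal sample)")
plt.show()

plt.hist(y, bins=30)
plt.title("Histogram of y (uniform sample)")
plt.show()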
This is the histogram for “x”:
We can clearly see that the distribution is very similar to a normal distribution.
And this is the histogram for “y”:
As expected, the distribution is very far from a normal one.
So, we expect the Shapiro-Wilk test to return a pretty large p-value for the “x” sample and a small p-value for the “y” sample, because the latter is not normally distributed.
Let’s calculate such p-values:
shapiro(x)
# ShapiroResult(statistic=0.9944895505905151, pvalue=0.35326337814331055)
As we can see, the p-value for the “x” sample is not low enough for us to reject the null hypothesis.
If we calculate the p-value on “y,” we get a different result.
shapiro(y)
# ShapiroResult(statistic=0.9485685229301453, pvalue=9.571677672681744e-09)
The p-value is lower than 5 percent, so we can reject the null hypothesis that the data set is normally distributed.
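To make the decision rule explicit, here is a minimal sketch using the conventional 5 percent significance level:

stat, p = shapiro(y)
alpha = 0.05  # conventional significance level
if p < alpha:
    print("Reject the null hypothesis: the sample is not normally distributed")
else:
    print("Cannot reject the null hypothesis: the sample is compatible with normality")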
If we try to calculate the p-value on a sample larger than 5,000 points, we get a warning:
shapiro(np.random.uniform(size=6000))
# /usr/local/lib/python3.7/dist-packages/scipy/stats/morestats.py:1760:
# UserWarning: p-value may not be accurate for N > 5000.
# warnings.warn("p-value may not be accurate for N > 5000.")
# ShapiroResult(statistic=0.9526152014732361, pvalue=2.6791145079733313e-40)
So, that’s how we can perform the Shapiro-Wilk test for normality in Python. Just make sure to use a sample of no more than 5,000 points so that you don’t have to work with approximate p-values.
Advantages of the Shapiro-Wilk Test
The Shapiro-Wilk test for normality is a very simple-to-use statistical tool to assess the normality of a data set. I typically apply it after visualizing the data with a histogram and/or a Q-Q plot. It’s a very useful tool to ensure that a normality requirement is satisfied whenever we need it, and it must be present in every data scientist’s toolbox.