In statistics, a Q-Q (quantile-quantile) plot is a graphical tool for comparing two probability distributions by plotting their quantiles against each other. If the two distributions being compared are exactly equal, the points on the Q-Q plot lie on the straight line y = x. Yes, it’s that simple.
As a data scientist or statistician, you often need to know whether your data are approximately normally distributed. Many statistical measures assume normality, and knowing that the assumption holds lets you apply them and interpret the results with more confidence.
This is where Q-Q plots come into the picture. The most fundamental question answered by a Q-Q plot is this: Is this data set normally distributed?
Q-Q plots help identify the type of distribution a random variable follows, whether it is a Gaussian, uniform, exponential or even a Pareto distribution. With some practice, you can often tell the type of distribution just by looking at the plot. We focus on the normal distribution here because of the 68–95–99.7 rule, which holds for it: about 68 percent of the data lie within one standard deviation of the mean, about 95 percent within two, and about 99.7 percent within three. So, knowing whether a distribution is normal makes it much easier to reason about and experiment with the data. Further, normal distributions occur frequently in natural phenomena.
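As a quick check of the 68–95–99.7 rule, the following sketch computes the probability that a normal variable falls within one, two and three standard deviations of its mean, using SciPy's normal CDF. The helper name `coverage_within` is made up for this illustration.

```python
from scipy.stats import norm


def coverage_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    # By symmetry this is the CDF mass between -k and +k on the standard normal.
    return norm.cdf(k) - norm.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage_within(k):.4f}")
```

The three printed values are roughly 0.6827, 0.9545 and 0.9973, which is where the rule gets its name.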
What is a Q-Q Plot?
A Q-Q (quantile-quantile) plot compares two probability distributions graphically by plotting their quantiles against each other. If the two distributions are exactly equal, the points on the Q-Q plot lie on the straight line y = x. When one of the two is the theoretical normal distribution, the plot indicates whether a data set is approximately normally distributed.
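To make the "quantiles against quantiles" idea concrete, here is a minimal sketch that compares two simulated samples at the same probability levels. The sample sizes, the seed, the probability grid and the shift of 1.0 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two samples from the same distribution, plus a copy of the second shifted by 1.0.
sample_a = rng.normal(0, 1, 5000)
sample_b = rng.normal(0, 1, 5000)
sample_shifted = sample_b + 1.0

# Evaluate every sample at the same probability levels (5%, 10%, ..., 95%).
probs = np.linspace(0.05, 0.95, 19)
q_a = np.quantile(sample_a, probs)
q_b = np.quantile(sample_b, probs)
q_shifted = np.quantile(sample_shifted, probs)

# Plotting q_a against q_b would give points close to y = x;
# plotting q_a against q_shifted would give a line offset from y = x by about 1.
gap_same = np.max(np.abs(q_a - q_b))
gap_shifted = np.max(np.abs(q_a - q_shifted))
print(f"largest gap, same distribution:    {gap_same:.3f}")
print(f"largest gap, shifted distribution: {gap_shifted:.3f}")
```

The first gap should be small and the second close to the shift, which is the numerical counterpart of points sitting on, or away from, the line y = x.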
How Does a Q-Q Plot Work?
We plot the theoretical quantiles of the standard normal distribution (a normal distribution with a mean of zero and a standard deviation of one) on the x-axis, and the ordered values of the random variable we want to test for normality on the y-axis. If the data are normal, the plotted points form a smooth, roughly straight line.
Now, we have to focus on the ends of that line. If the points at the ends do not fall on the straight line but scatter significantly away from it, then the sample quantiles do not match the theoretical quantiles in the tails. This result signals that the ordered values we are testing are not normally distributed.
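If all the points plotted on the graph lie close to a straight line, then we can say that the distribution is approximately normal, because its quantiles line up with those of the standard normal distribution. That is the core idea of the Q-Q plot. Note that the line is y = x only when the data have been standardized; otherwise its slope and intercept reflect the standard deviation and mean of the data.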
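The construction described in this section can be sketched by hand. The helper `qq_points` and the plotting positions (i - 0.5)/n are assumptions made for this illustration; libraries use slightly different plotting positions, which mainly affects the extreme points.

```python
import numpy as np
from scipy.stats import norm


def qq_points(sample):
    """Return (theoretical normal quantiles, ordered sample values) for a normal Q-Q plot."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    n = ordered.size
    # One probability level per data point, kept strictly between 0 and 1.
    positions = (np.arange(1, n + 1) - 0.5) / n
    theoretical = norm.ppf(positions)
    return theoretical, ordered


rng = np.random.default_rng(42)
sample = rng.normal(loc=10, scale=2, size=500)

theoretical, ordered = qq_points(sample)

# For normal data the points fall near a straight line whose slope and
# intercept approximate the standard deviation and mean of the data.
slope, intercept = np.polyfit(theoretical, ordered, 1)
r = np.corrcoef(theoretical, ordered)[0, 1]
print(f"slope={slope:.2f}, intercept={intercept:.2f}, correlation={r:.4f}")
```

Passing `theoretical` and `ordered` to a scatter plot reproduces the Q-Q plot itself. Because this sample is not standardized, the points follow a line with slope near 2 and intercept near 10 rather than y = x.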
What Are Skewed Q-Q Plots?
Q-Q plots are also used to assess the skewness (a measure of asymmetry) of a distribution. When we plot the theoretical quantiles on the x-axis and the sample quantiles of the distribution we want to examine on the y-axis, skewed data produce a characteristic departure from the straight line. If the lower end of the Q-Q plot deviates from the straight line but the upper end does not, then the distribution has a longer tail to its left. Put another way, it is left-skewed, also called negatively skewed. If the upper end of the Q-Q plot deviates from the straight line while the lower end follows it, then the distribution has a longer tail to its right and it is right-skewed, also called positively skewed. In practice, strong skew tends to bend the whole pattern into a curve, with the larger departure at the end of the longer tail.
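The following sketch puts numbers on this for simulated data. It measures how far each end of a normal Q-Q plot sits from a reference line drawn through the first and third quartiles. The helper `end_deviations`, the 5 percent tail fraction and the exponential test data are assumptions chosen for illustration, not a standard diagnostic.

```python
import numpy as np
from scipy.stats import norm, skew


def end_deviations(sample, tail_fraction=0.05):
    """Mean absolute distance from the quartile reference line at the lower and upper ends."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    n = ordered.size
    theoretical = norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    # Reference line through the first and third quartiles of both axes.
    q1, q3 = np.quantile(ordered, [0.25, 0.75])
    t1, t3 = norm.ppf([0.25, 0.75])
    slope = (q3 - q1) / (t3 - t1)
    intercept = q1 - slope * t1
    residuals = ordered - (intercept + slope * theoretical)
    k = max(1, int(n * tail_fraction))
    return np.mean(np.abs(residuals[:k])), np.mean(np.abs(residuals[-k:]))


rng = np.random.default_rng(1)
right_skewed = rng.exponential(scale=1.0, size=2000)  # long right tail
left_skewed = -right_skewed                           # mirror image, long left tail

right_lower, right_upper = end_deviations(right_skewed)
left_lower, left_upper = end_deviations(left_skewed)

print(f"right-skewed: skewness={skew(right_skewed):.2f}, "
      f"lower end={right_lower:.2f}, upper end={right_upper:.2f}")
print(f"left-skewed:  skewness={skew(left_skewed):.2f}, "
      f"lower end={left_lower:.2f}, upper end={left_upper:.2f}")
```

For the right-skewed sample the upper end departs further from the line, and for the mirrored sample the lower end does. With data as strongly skewed as an exponential sample, neither end sits exactly on the line, so it is the relative size of the two departures that signals the direction of the skew.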
What Are Tailed Q-Q Plots?
Similarly, we can assess the kurtosis (a measure of “tailedness”) of a distribution by looking at its Q-Q plot. A fat-tailed distribution makes both ends of the Q-Q plot deviate from the straight line while its center follows the line: the lower end falls below the line and the upper end rises above it. By contrast, a thin-tailed distribution bends the other way, with the lower end above the line and the upper end below it. Only when the ends show little or no deviation are the tails a good match for the normal distribution.
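A small sketch shows the direction of these tail departures on simulated data. It uses the signed distance of each end from a reference line through the quartiles. The helper `end_residuals`, the Student's t distribution with 3 degrees of freedom (fat tails) and the uniform distribution (thin tails) are choices made for this illustration.

```python
import numpy as np
from scipy.stats import norm


def end_residuals(sample, tail_fraction=0.05):
    """Mean signed distance from the quartile reference line at the lower and upper ends."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    n = ordered.size
    theoretical = norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    # Reference line through the first and third quartiles of both axes.
    q1, q3 = np.quantile(ordered, [0.25, 0.75])
    t1, t3 = norm.ppf([0.25, 0.75])
    slope = (q3 - q1) / (t3 - t1)
    intercept = q1 - slope * t1
    residuals = ordered - (intercept + slope * theoretical)
    k = max(1, int(n * tail_fraction))
    return residuals[:k].mean(), residuals[-k:].mean()


rng = np.random.default_rng(3)
fat_tailed = rng.standard_t(df=3, size=2000)   # heavier tails than the normal
thin_tailed = rng.uniform(0, 1, size=2000)     # lighter tails than the normal

fat_lower, fat_upper = end_residuals(fat_tailed)
thin_lower, thin_upper = end_residuals(thin_tailed)

# Negative means the points sit below the line, positive means above.
print(f"fat tails:  lower end {fat_lower:+.2f}, upper end {fat_upper:+.2f}")
print(f"thin tails: lower end {thin_lower:+.2f}, upper end {thin_upper:+.2f}")
```

The fat-tailed sample should show a negative lower end and a positive upper end, and the thin-tailed sample the reverse.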
How Much Data Do We Need for a Q-Q Plot?
Note that when there are only a few data points, a Q-Q plot is imprecise: random scatter alone can pull the points away from the line, so the plot cannot give a conclusive answer. With a large data set, however, the pattern becomes stable enough to draw clear conclusions about the type of distribution.
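One way to see the effect of sample size is to repeat the experiment many times on data that really are normal and record how straight the Q-Q plot looks each time. The sketch below uses the correlation coefficient returned by `scipy.stats.probplot` as a rough straightness score. The helper `qq_correlations`, the sample sizes and the number of repeats are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats


def qq_correlations(n, repeats=200, seed=0):
    """Correlation of the normal Q-Q plot for `repeats` normal samples of size n."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(repeats):
        sample = rng.normal(0, 1, n)
        # Without a `plot` argument, probplot only computes the points and the fit.
        _points, (_slope, _intercept, r) = stats.probplot(sample, dist="norm")
        values.append(r)
    return np.array(values)


small = qq_correlations(n=10)
large = qq_correlations(n=1000)

print(f"n=10:   lowest r={small.min():.3f}, spread={small.std():.4f}")
print(f"n=1000: lowest r={large.min():.3f}, spread={large.std():.4f}")
```

Even though every sample is drawn from a normal distribution, the small samples produce noticeably less straight plots some of the time, which is why a Q-Q plot of a handful of points is hard to read.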
Q-Q Plot Implementation Examples in Python
Here is a simple way to draw a Q-Q plot in Python using the statsmodels library.
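import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# np.random generates different random numbers every time the code is executed.
# Draw 100 values from a standard normal distribution (mean 0, standard deviation 1).
data_points = np.random.normal(0, 1, 100)

# line="45" adds the y = x reference line, which suits data that are already standard normal.
sm.qqplot(data_points, line="45")
plt.show()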
Here is another implementation of the Q-Q plot, this time using the SciPy library.
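import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Simulate 1,000 sample proportions, each from n = 2,000 trials with success probability 0.53.
n = 2000
observation = np.random.binomial(n, 0.53, size=1000) / n

# Standardize the observations to mean 0 and standard deviation 1.
z = (observation - np.mean(observation)) / np.std(observation)

# probplot draws the ordered values against normal quantiles, plus a least-squares fit line.
stats.probplot(z, dist="norm", plot=plt)
plt.title("Normal Q-Q plot")
plt.show()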
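The same `probplot` function can compare data against distributions other than the normal through its `dist` argument, which ties back to the earlier point about identifying exponential or other distributions. The sketch below is only an illustration on simulated data: the variable name `waiting_times`, the scale of 2.0 and the seed are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
waiting_times = rng.exponential(scale=2.0, size=1000)

# Without a `plot` argument, probplot returns the points and the fitted line.
# The third fit value, r, measures how closely the points follow a straight line.
_, (_, _, r_norm) = stats.probplot(waiting_times, dist="norm")
_, (_, _, r_expon) = stats.probplot(waiting_times, dist="expon")

print(f"straightness against the normal:      r={r_norm:.3f}")
print(f"straightness against the exponential: r={r_expon:.3f}")
```

The exponential reference should give the straighter plot for this sample. Adding `plot=plt`, as in the example above, draws either comparison.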