Imagine for a moment that you live in a city with a problem. The gardeners in your city have recently taken to putting out garden gnomes and they’re mind-bogglingly ugly. The gardeners have even started competing to see who can place the biggest gnome in their garden. Obviously, this won't stand.
The mayor knows you’re an aspiring data scientist and approaches you for help. She wants somebody to track the location of all of the gnomes and develop a way to predict their locations. It’s a ridiculous task, but somebody has to do it.
As it turns out, the garden gnomes are all placed in a straight line through the city. This means you can create tools that tell you a garden gnome's north-south location if you know its east-west location (or vice versa). In other words: linear regression.
What Is Linear Regression?
Linear regression finds the straight line that best describes the relationship between two variables. In this case the relationship would be between the location of garden gnomes in the east-west dimension and the location of garden gnomes in the north-south dimension. The result is a single equation empowering you to calculate one variable if you know the other.
Note that linear regression is the simplest form of regression there is. Two characteristics make that the case. First, it can only capture linear relationships. If there's an exponential trend in your data, linear regression won't capture it. A logarithmic trend? Not that either. There are ways to transform the data so linear regression can be applied to those data sets, but it can't be done straight away.
Secondly, linear regression is only capable of handling relationships between two variables. If you have more than two variables in your data set, you need to start looking into multiple regression instead. That said, since linear regression is the simplest form of regression, it’s a good starting point.
Why Use Linear Regression?
At its core linear regression is a way to calculate the relationship between two variables. It assumes there’s a direct correlation between the two variables and that this relationship can be represented with a straight line.
Linear regression is the simplest form of regression there is.
These two variables are called the independent variable and the dependent variable, and they are given these names for fairly intuitive reasons. The independent variable is so named because the model assumes it can behave however it likes and doesn’t depend on the other variable for any reason. The dependent variable is the opposite; the model assumes it is a direct result of the independent variable and that its value is highly dependent on the independent variable.
Linear regression creates a linear mathematical relationship between two variables. It enables us to predict the dependent variable if we know the independent variable.
So let's return to the garden gnomes. We could create a regression with the east-west location of the garden gnome as the independent variable and the north-south location as the dependent variable. We could then calculate the north-south location of any gnome in the city so long as we know its east-west location.
Since it’s such a simple form of regression, the governing equation for linear regression is also quite simple:
y = B*x + A
Here y is the dependent variable, x is the independent variable, and A and B are coefficients determining the slope and intercept of the equation.
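To make the equation concrete, here is a minimal sketch of it in Python. The coefficient values A and B below are made up purely for illustration; in practice they come out of the fitting process described in the next section.

```python
# A minimal sketch of the linear regression equation y = B*x + A.
# These coefficient values are hypothetical, chosen only for illustration.
B = 0.5   # slope (made-up value)
A = 2.0   # intercept (made-up value)

def predict(x):
    """Predict the dependent variable y from the independent variable x."""
    return B * x + A

# With B = 0.5 and A = 2.0, an x of 4.0 yields 0.5 * 4.0 + 2.0 = 4.0
result = predict(4.0)
```

In the gnome example, x would be a gnome's east-west location and the returned value its predicted north-south location.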
How to Calculate Coefficients
Essentially, we calculate the coefficients A and B to minimize the error between the model’s predictions and the actual data in the training set (if you aren’t familiar with training sets see How to Choose Between Multiple Models).
We calculate the error between the data and the predictions as:
Error = Actual - Prediction
Therefore, minimizing the error between the model's predictions and the actual data means performing the following steps:

1. Use the linear regression equation, with values for A and B, to calculate a prediction for each value of x in your data set.
2. Calculate the error for each value of x by subtracting the prediction for that x from the actual, known data.
3. Sum the errors of all of the points to find the total error for a linear regression equation using those values of A and B.
Keep in mind that some errors will be positive while others will be negative. If you simply sum them, these errors cancel each other out and pull the total closer to 0, even when each individual prediction is off.
Take, for instance, two points, one with an error of 5 and the other with an error of -10. While both points together should count as 15 total points of error, the method described above treats them as -5 points of error. To overcome this problem, algorithms that fit linear regression models use the squared error instead of the raw error. In other words, the formula for calculating error takes the form:
Error = (Actual - Prediction)²
Since negative values squared will always return positive values, this prevents the errors from canceling each other out and making bad models appear accurate.
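The difference between the raw sum and the squared sum is easy to see with the two hypothetical errors from above:

```python
# Two hypothetical prediction errors: +5 and -10.
errors = [5, -10]

# Naive sum: the positive and negative errors partially cancel.
naive_total = sum(errors)                   # 5 + (-10) = -5

# Squared errors: every term is non-negative, so nothing cancels.
squared_total = sum(e**2 for e in errors)   # 25 + 100 = 125
```

The naive total suggests the model is only 5 points off, while the squared total correctly penalizes both mistakes.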
Since the linear regression model minimizes the squared error, the solution is referred to as the least squares solution. This is the name for the combination of A and B that returns the minimum squared error over the data set. Guessing and checking A and B would be extremely tedious. Using an optimization algorithm is another possibility, but would probably be time consuming.
Fortunately, mathematicians have found an algebraic solution to this problem. We can find the least squares solution using the following two equations:
B = correlation(x, y) * σ(y) / σ(x)

A = mean(y) - B * mean(x)

Here σ represents the standard deviation, mean represents the average of the values in the data set, and correlation is a value representing the strength of the correlation between the two variables.
Note: if you’re doing this work in the Python package pandas you can use the DataFrame.mean() function to find mean(y), and NumPy’s corrcoef function will find the correlation. For those who aren’t familiar with any of these terms, I recommend reading Python for Data Analysis as a way to get started.
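The two equations above translate directly into NumPy. The gnome locations below are made-up sample data, used only to show the calculation:

```python
import numpy as np

# Hypothetical east-west (x) and north-south (y) gnome locations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# B = correlation(x, y) * std(y) / std(x)
r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
B = r * y.std() / x.std()

# A = mean(y) - B * mean(x)
A = y.mean() - B * x.mean()
```

As a sanity check, these values should match the slope and intercept returned by np.polyfit(x, y, 1) for the same data.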
The fact these two equations return the least squares solution isn’t incredibly intuitive, but we can make sense of it pretty quickly. Look at the equation for A. It essentially states we need a value which returns the average value of y (the dependent variable) when given the average value of x (the independent variable). It’s trying to create a line that runs through the center of the data set.
Now look at the equation for B. It states that when the independent variable x changes by one standard deviation of x, the dependent variable y should change by one standard deviation of y multiplied by the correlation between the two variables. In other words, the stronger the correlation, the more of y's spread the line reproduces for a given change in x.
Why Does This Work?
We typically use the least squares solution because of maximum likelihood estimation (you can find a good explanation in Data Science from Scratch). Maximum likelihood estimation is based on identifying the parameter value most likely to have created a data set. Imagine a data set based around a parameter Z.
If you don’t actually know what Z is, then you could search for a value of Z that most likely yields the data set. This doesn’t mean you’ve found the right value for Z, but you have found the value of Z that makes the observed data set the most probable.
We can apply this sort of calculation across the whole data set, finding the values of A and B that make the data most probable. If you run through the math (which you can find in Data Science from Scratch) you discover that the least squares solution for A and B also maximizes the likelihood of the data set.
Again, this doesn’t prove these are the values driving the data set but does say they’re the most likely values.
How Do I Know My Model Works?
As with all models, it’s imperative that you test it to ensure it’s performing well. This means comparing the model predictions to the actual data in the training, validation and testing data sets.
The preferred methodology varies depending on the type of model, but for linear regression we typically calculate the coefficient of determination, or r² value. The coefficient of determination captures how much of the trend in the data set can be correctly predicted by the linear regression model. It’s a value ranging from 0 to 1, with lower values indicating worse fit and higher values indicating better fit.
We calculate the coefficient of determination from the sum of squared errors divided by the total squared variation of the y values from their average value. That ratio is the fraction of variation in the dependent variable the model fails to capture, so the coefficient of determination is one minus that value. Or, in mathematical terms:
r² = 1 - (Sum of squared errors) / (Total sum of squares)

(Total sum of squares) = Sum((y_i - mean(y))²)

(Sum of squared errors) = Sum((Actual - Prediction)²)
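Those three lines can be sketched directly in NumPy. The actual and predicted values below are hypothetical, standing in for your test data and your model's output:

```python
import numpy as np

# Hypothetical actual values and model predictions.
actual = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
predicted = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Sum of squared errors: squared gaps between data and predictions.
sse = np.sum((actual - predicted) ** 2)

# Total sum of squares: squared variation of y around its mean.
tss = np.sum((actual - actual.mean()) ** 2)

# Coefficient of determination.
r_squared = 1 - sse / tss
```

With predictions this close to the data, r² comes out very near 1, indicating a good fit.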
What Are the Limits of Linear Regression?
Just like all algorithms, there are limits to the performance of linear regression.
As we’ve seen, the linear regression model is only capable of returning straight lines. This makes it wholly unsuited to match data sets with any sort of curve, such as exponential or logarithmic trends.
Linear regression only works when there’s a single dependent variable and a single independent variable. If you want to include more than one of either in your model, you’ll need to use multiple regression.
Finally, don’t use a linear regression model to predict values outside the range of your training data set. There’s no way to know that the same trends hold beyond that range, and you may need a very different model to predict the behavior of the data there. Because of this uncertainty, extrapolation can lead to inaccurate predictions, and then you'll never find those gnomes.