Linear regression, while a useful tool, has significant limits. As its name implies, it can’t easily match a data set that is non-linear. Its predictions can only be trusted within the range of the training data set. And, most importantly for our purposes, simple linear regression can only be fit to data sets with a single dependent variable and a single independent variable.

This is where multiple regression comes in. While multiple regression can’t overcome all of linear regression’s weaknesses, it’s specifically designed to fit models with a single dependent variable and multiple independent variables.

## What Is Multiple Regression?

## Multiple Regression Equation

To start, let’s look at the general form of the equation for linear regression:

`y = B * x + A`

Here, `y` is the dependent variable, `x` is the independent variable, and `A` and `B` are coefficients dictating the equation. The difference between the equation for linear regression and the equation for multiple regression is that the equation for multiple regression must be able to handle several inputs, instead of only the single input of linear regression. To account for this change, the equation for multiple regression looks like this:

`y = B_1 * x_1 + B_2 * x_2 + … + B_n * x_n + A`

In this equation, the subscripts denote the different independent variables. For example, `x_1` is the value of the first independent variable, `x_2` is the value of the second independent variable, and so on. It keeps going as we add more independent variables until we finally add the last independent variable, `x_n`, to the equation.

*Note: this multiple regression model allows any number, n, of independent variables; more terms are added as needed.*

The `B` coefficients employ the same subscripts, indicating they are the coefficients linked to each independent variable. `A`, as before, is simply a constant stating the value of the dependent variable, `y`, when all of the independent variables, the `x` values, are zero.

Here’s a multiple regression example: Imagine that you’re a traffic planner in your city and need to estimate the average commute time of drivers going from the east side of the city to the west. You don’t know how long it takes on average, but you do know that it will depend on a number of factors like the distance driven, the number of stoplights on the route, and the number of other cars on the road. In that case you could create a linear multiple regression equation like the following:

`y = B_1 * Distance + B_2 * Stoplights + B_3 * Cars + A`

Here `y` is the average commute time, `Distance` is the distance between the starting and ending destinations, `Stoplights` is the number of stoplights on the route, `Cars` is the number of other cars on the road, and `A` is a constant representing other time consumers (e.g. putting on your seat belt, starting the car, maybe stopping at a coffee shop).
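As a minimal sketch, the commute time equation can be written as a small Python function. The function name and the coefficient values here are hypothetical, chosen only to show how the terms combine; they are not fitted values.

```python
def predict_commute_time(distance, stoplights, cars, coefficients, intercept):
    """Evaluate y = B_1 * Distance + B_2 * Stoplights + B_3 * Cars + A."""
    inputs = [distance, stoplights, cars]
    # Multiply each independent variable by its matching B coefficient and sum.
    weighted_sum = sum(b * x for b, x in zip(coefficients, inputs))
    return weighted_sum + intercept


# Hypothetical coefficients, for illustration only (not fitted values).
B = [1.5, 2.0, 0.01]   # minutes per mile, per stoplight, per car
A = 4.0                # constant overhead in minutes

estimate = predict_commute_time(distance=10, stoplights=6, cars=300,
                                coefficients=B, intercept=A)
print(estimate)
```

Because the inputs and coefficients are plain lists, the same pattern extends to any number of independent variables.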

Now that you have your commute time prediction model, you need to fit your model to your training data set to minimize errors.

## Fitting a Multiple Regression Model

Similarly to how we minimize the sum of squared errors to find `B` in linear regression, we minimize the sum of squared errors to find all of the `B` terms in multiple regression. The difference here is that since there are multiple terms, and an unspecified number of terms until you create the model, there isn’t a simple pair of algebraic formulas for `A` and `B` like the ones used in single-variable linear regression. (A closed-form solution does exist, but it requires matrix algebra.)

One common approach is to use stochastic gradient descent. You can find a good description of stochastic gradient descent in *Data Science from Scratch* by Joel Grus, or use tools in the Python scikit-learn package. Fortunately, we can still present the equations needed to implement this solution before reading about the details.

The first step is calculating the squared error on each point. This takes the form:

`Error_Point = (Actual - Prediction)²`

In this instance, `Error_Point` is the error in the model when predicting a person’s commute time, `Actual` is the actual value (or that person’s actual commute time), and `Prediction` is the value predicted by the model (or that person’s commute time predicted by the model). `Actual - Prediction` yields the error for a point, then squaring it yields the squared error for a point.

Remember that squaring the error is important because some errors will be positive while others will be negative. If they weren’t squared, these errors would cancel each other out, making the total error of the model look far smaller than it really is.

To find the error in the model, the error from each point must be summed across the entire data set. This essentially means that you use the model to predict the commute time for each data point that you have, subtract that value from the actual commute time in the data point to find the error, square that error, then sum all of the squared errors together. In other words, the error of the model is:

`Error_Model = sum((Actual_i - Prediction_i)²)`

Here `i` is an index iterating through all points in the data set.
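The two error equations translate almost directly into code. This is a small sketch with made-up commute times; the function names are illustrative.

```python
def squared_error(actual, prediction):
    """Error_Point = (Actual - Prediction)^2 for a single data point."""
    return (actual - prediction) ** 2


def model_error(actuals, predictions):
    """Error_Model: the squared errors summed over every point i."""
    return sum(squared_error(a, p) for a, p in zip(actuals, predictions))


# Made-up commute times in minutes, for illustration only.
actual_times = [30.0, 42.0, 25.0]
predicted_times = [28.0, 45.0, 25.0]

raw_errors = [a - p for a, p in zip(actual_times, predicted_times)]
print(sum(raw_errors))                             # raw errors partly cancel
print(model_error(actual_times, predicted_times))  # squared errors do not
```

The raw errors here are 2, -3, and 0, so their plain sum understates how far off the predictions are, while the squared sum counts every miss.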

Once the error function is determined, you need to put the model and error function through a stochastic gradient descent algorithm to minimize the error. The stochastic gradient descent algorithm will do this by repeatedly adjusting `A` and the `B` terms in the equation until the error stops shrinking.
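To make the idea concrete, here is a minimal pure-Python sketch of stochastic gradient descent for this error function. It is not the implementation from any particular book or library. The function names, learning rate, epoch count, and synthetic data are all assumptions made for illustration.

```python
import random


def predict(x, coefs, intercept):
    """y = B_1 * x_1 + ... + B_n * x_n + A for one data point."""
    return sum(b * xi for b, xi in zip(coefs, x)) + intercept


def fit_sgd(rows, targets, learning_rate=0.1, epochs=500, seed=0):
    """Fit the B terms and A by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    coefs = [0.0] * len(rows[0])
    intercept = 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)            # visit the points in a random order
        for i in order:
            error = targets[i] - predict(rows[i], coefs, intercept)
            # The gradient of (actual - prediction)^2 with respect to B_j is
            # -2 * error * x_j, so stepping downhill adds 2 * rate * error * x_j.
            coefs = [b + 2 * learning_rate * error * xj
                     for b, xj in zip(coefs, rows[i])]
            intercept += 2 * learning_rate * error
    return coefs, intercept


# Synthetic, noise-free data generated from y = 2 * x_1 + 3 * x_2 + 1.
grid = [0.0, 0.5, 1.0]
rows = [[x1, x2] for x1 in grid for x2 in grid]
targets = [2 * x1 + 3 * x2 + 1 for x1, x2 in rows]

coefs, intercept = fit_sgd(rows, targets)
print(coefs, intercept)
```

The inputs in this sketch are deliberately kept between 0 and 1. With unscaled inputs such as car counts in the hundreds, the same learning rate would likely diverge, so real data usually needs rescaling or a much smaller step.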

Once you’ve fit the model to your training data, the next step is to ensure that the model fits your full data set well.

## Is It a Good Fit?

To make sure your model fits the data, use the same `r²` value that you use for linear regression. The `r²` value (also called the coefficient of determination) states the proportion of the variation in the dependent variable that is explained by the model. The value will range from `0` to `1`, with `0` stating that the model has no ability to predict the result and `1` stating that the model predicts the result perfectly. When it is calculated on the training data of a least-squares fit, you should expect the `r²` value of any model you create to be between those two values. If it isn’t, retrace your steps because you’ve made a mistake somewhere.

You can calculate the coefficient of determination for a model using the following equations:

`r² = 1 - (Sum of squared errors) / (Total sum of squares)`

`(Total sum of squares) = sum((y_i - mean(y))²)`

`(Sum of squared errors) = sum((Actual_i - Prediction_i)²)`
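These three equations can be combined into one short function. The sketch below uses made-up actual and predicted commute times; the function name is illustrative.

```python
def r_squared(actuals, predictions):
    """r^2 = 1 - (sum of squared errors) / (total sum of squares)."""
    mean_actual = sum(actuals) / len(actuals)
    sum_squared_errors = sum((a - p) ** 2 for a, p in zip(actuals, predictions))
    total_sum_squares = sum((a - mean_actual) ** 2 for a in actuals)
    return 1 - sum_squared_errors / total_sum_squares


# Made-up commute times in minutes, for illustration only.
actual_times = [30.0, 42.0, 25.0, 35.0]
predicted_times = [28.0, 45.0, 25.0, 34.0]

print(r_squared(actual_times, predicted_times))
```

Note that the function divides by the total sum of squares, so it is undefined when every actual value is identical.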

Here’s where testing the fit of a multiple regression model gets complicated. Adding more terms to the multiple regression inherently improves the fit to the training data. Additional terms give the model more flexibility and new coefficients that can be tweaked to create a better fit. Additional terms will never yield a worse fit to the training data, and will almost always yield a slightly better one, whether the new term adds value to the model or not.

Adding new variables which don’t realistically have an impact on the dependent variable will yield a better fit to the training data, while creating an erroneous term in the model. For example, you can add a term describing the position of Saturn in the night sky to the driving time model. The regression equations will create a coefficient for that term, and it will cause the model to more closely fit the data set, but we all know that Saturn’s location doesn’t impact commute times. The Saturn location term will add noise to future predictions, leading to less accurate estimates of commute times even though it made the model more closely fit the training data set. This issue is referred to as “overfitting” the model.
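The effect is easy to reproduce on synthetic data. The sketch below assumes NumPy is available and uses its least-squares solver rather than gradient descent. The data, the random seed, and the `saturn` column are all invented for illustration, and the Saturn values are random numbers with no connection to the commute times.

```python
import numpy as np


def training_r_squared(features, y):
    """Least-squares fit with an intercept; returns r^2 on the same data."""
    X = np.column_stack([features, np.ones(len(y))])   # last column models A
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coefs
    return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)


rng = np.random.default_rng(0)
n = 50
distance = rng.uniform(1, 20, n)
stoplights = rng.integers(0, 15, n).astype(float)
# Synthetic commute times: distance and stoplights plus random noise.
commute = 1.5 * distance + 2.0 * stoplights + 4.0 + rng.normal(0, 3, n)
saturn = rng.uniform(0, 360, n)       # unrelated to commute time by construction

r2_base = training_r_squared(np.column_stack([distance, stoplights]), commute)
r2_saturn = training_r_squared(
    np.column_stack([distance, stoplights, saturn]), commute)
print(r2_base, r2_saturn)
```

The second value should never be lower than the first, even though the extra column carries no information. Evaluating both models on data held out from the fit is one way to expose the difference.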

Additional terms will always improve the fit to the training data, whether or not the new term adds significant value to the model.

This fact has important implications when developing multiple regression models. Yes, you could keep adding more terms to the equation until you either get a perfect match or run out of variables to add. But then you’d end up with a very large, complex model that’s full of terms which aren’t actually relevant to the case you’re predicting.

## Which Parameters Are Most Important?

One way to determine which parameters are most important is to calculate the standard error of each coefficient. The standard error states how confident the model is about each coefficient, with larger values (relative to the size of the coefficient itself) indicating that the model is less sure of that parameter. We can intuit this even without seeing the underlying equations. If the error associated with a term is high compared to its coefficient, the data can’t reliably distinguish that term’s effect from zero, which implies the term is not having a very strong impact on matching the model to the data set.

Calculating the standard error is an involved statistical process, and can’t be succinctly described in a short article. Fortunately there are Python packages available that you can use to do it for you. The question has been asked and answered on StackOverflow at least once. Those tools should get you started.

After calculating the standard error of each coefficient, you can use the results to identify which standard errors are highest and which are lowest relative to their coefficients. Since high values indicate that those terms add less predictive value to the model, you can know those terms are the least important to keep. At this point you can start choosing which terms in the model can be removed to reduce the number of terms in the equation without dramatically reducing the predictive power of the model.
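For a rough idea of what such a package computes, here is a sketch of the textbook ordinary least squares formula, assuming NumPy and the usual assumptions of independent errors with equal variance. The function name, the data, and the `saturn` column are invented for illustration.

```python
import numpy as np


def coefficient_standard_errors(features, y):
    """Ordinary least squares; returns (coefficients, standard errors)."""
    X = np.column_stack([features, np.ones(len(y))])   # last column models A
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coefs
    dof = len(y) - X.shape[1]                 # data points minus fitted terms
    residual_variance = residuals @ residuals / dof
    covariance = residual_variance * np.linalg.inv(X.T @ X)
    return coefs, np.sqrt(np.diag(covariance))


rng = np.random.default_rng(1)
n = 200
distance = rng.uniform(1, 20, n)
saturn = rng.uniform(0, 360, n)               # irrelevant by construction
commute = 1.5 * distance + 4.0 + rng.normal(0, 3, n)

coefs, std_errs = coefficient_standard_errors(
    np.column_stack([distance, saturn]), commute)
ratios = np.abs(coefs) / std_errs             # coefficient size vs. its uncertainty
for name, ratio in zip(["Distance", "Saturn", "A"], ratios):
    print(f"{name}: {ratio:.1f}")
```

A term whose coefficient is many times larger than its standard error is well supported by the data. A ratio near or below about 2 is a common rule of thumb for a term that may be a candidate for removal.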

Another method is to use a technique called regularization. Regularization works by adding a new penalty term to the error calculation. In the most common forms (ridge and lasso regression), the penalty is based on the size of the coefficients in the multiple regression equation, so every term that carries a large coefficient adds to it. More weight spread across more terms will inherently lead to a higher regularization error, while smaller or fewer coefficients lead to a lower regularization error. Additionally, the strength of the penalty can be increased or decreased as desired. Increasing it will lead to a higher regularization error, while decreasing it will lead to a lower regularization error.

With a regularization term added to the error equation, minimizing the error means not just minimizing the error in the model but also keeping the coefficients small, which pushes the coefficients of unhelpful terms toward zero. This will inherently lead to a model with a worse fit to the training data, but it will also lead to a model in which fewer terms carry real weight. Higher penalty values in the regularization error create more pressure on the model to shrink or drop terms.
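As one concrete example, ridge regression is a widely used form of regularization that penalizes the squared size of the coefficients rather than literally counting terms. The sketch below assumes NumPy and solves the penalized problem in closed form. The function name, the synthetic data, and the penalty values are assumptions made for illustration.

```python
import numpy as np


def fit_ridge(features, y, penalty):
    """Minimize sum of squared errors + penalty * sum(B_j^2); A is unpenalized."""
    X = np.column_stack([features, np.ones(len(y))])   # last column models A
    penalty_matrix = penalty * np.eye(X.shape[1])
    penalty_matrix[-1, -1] = 0.0              # do not penalize the constant A
    return np.linalg.solve(X.T @ X + penalty_matrix, X.T @ y)


def sum_squared_errors(features, y, coefs):
    X = np.column_stack([features, np.ones(len(y))])
    return float(np.sum((y - X @ coefs) ** 2))


rng = np.random.default_rng(2)
n = 100
features = rng.normal(0, 1, (n, 3))           # third column is pure noise
y = 3.0 * features[:, 0] + 1.0 * features[:, 1] + 5.0 + rng.normal(0, 1, n)

for penalty in [0.0, 10.0, 1000.0]:
    coefs = fit_ridge(features, y, penalty)
    print(penalty, np.round(coefs[:-1], 3),
          round(sum_squared_errors(features, y, coefs), 1))
```

As the penalty grows, the `B` coefficients shrink and the training error rises, which is the trade-off described above. Ridge shrinks coefficients without setting them exactly to zero. Lasso regression, which penalizes absolute values instead, can remove terms outright.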

## How Can I Make Sense of My Model?

The model you’ve created is not just an equation with a bunch of numbers in it. Each one of the coefficients you derived states the impact an independent variable has on the dependent variable assuming all others are held equal. For instance, our commute time example says the average commute will take `B_2` minutes longer for each stoplight in a person’s commute path. If the model development process returns `2.32` for `B_2`, that means each stoplight in a person’s path adds 2.32 minutes to the drive.
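A quick check makes this interpretation concrete. In the sketch below only `B_2 = 2.32` comes from the example above. The other coefficient values and the function name are hypothetical.

```python
def commute_time(distance, stoplights, cars):
    # Hypothetical fitted values; only B_2 = 2.32 comes from the example above.
    b_1, b_2, b_3, a = 1.5, 2.32, 0.01, 4.0
    return b_1 * distance + b_2 * stoplights + b_3 * cars + a


base = commute_time(distance=10, stoplights=6, cars=300)
one_more_light = commute_time(distance=10, stoplights=7, cars=300)

# Holding distance and cars fixed, one extra stoplight changes y by B_2.
print(one_more_light - base)
```

The difference matches `B_2` (up to floating-point rounding) no matter which values are chosen for the other inputs, because they are held equal and cancel out.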

This is another reason it’s important to keep the number of terms in the equation low. As we add more terms it gets harder to keep track of the physical significance (and justify the presence) of each term. Anybody counting on the commute time predicting model would accept a term for commute distance but will be less understanding of a term for the location of Saturn in the night sky.

## Expanding the Multiple Regression Model

Note that this model doesn’t say anything about how parameters might affect each other. In looking at the equation, there’s no way that it could. The different coefficients are all connected to only a single physical parameter. If you believe two terms are related, you could create a new term based on the combination of those two. For instance, the number of stoplights on the commute could be a function of the distance of the commute. A potential equation for that could be:

`Stoplights = C_1 * Distance + D`

In this case, `C_1` and `D` are regression coefficients similar to `B` and `A` in the commute time regression equation. This term for stoplights could then be substituted into the commute time regression equation, enabling the model to capture this relationship.

Another possible modification includes adding non-linear inputs. The multiple regression model itself is only capable of being linear in its coefficients, which is a limitation. You can, however, create non-linear terms in the model. For instance, say that one stoplight backing up can prevent traffic from passing through a prior stoplight. This could lead to a faster-than-linear impact from stoplights on the commute time. You could create a new term to capture this, and modify your commute time model accordingly. That would look something like:

`Stoplights_Squared = Stoplights²`

`y = B_1 * Distance + B_2 * Stoplights + B_3 * Cars + B_4 * Stoplights_Squared + A`

These two equations combine to create a linear regression term for your non-linear `Stoplights_Squared` input.
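In practice this means computing the squared column first and then fitting an ordinary linear model on the expanded inputs. The sketch below assumes NumPy, leaves out the `Cars` term for brevity, and uses noise-free synthetic data built from known coefficients so the result can be checked.

```python
import numpy as np

# Synthetic, noise-free data built from known coefficients (illustration only).
distances = np.array([2.0, 5.0, 8.0, 12.0])
stoplight_counts = np.array([0.0, 1.0, 3.0, 6.0])
distance, stoplights = [g.ravel()
                        for g in np.meshgrid(distances, stoplight_counts)]
commute = 1.5 * distance + 2.0 * stoplights + 0.3 * stoplights ** 2 + 4.0

stoplights_squared = stoplights ** 2          # the new non-linear input
X = np.column_stack([distance, stoplights, stoplights_squared,
                     np.ones_like(distance)])  # last column models the constant
coefs, *_ = np.linalg.lstsq(X, commute, rcond=None)
print(np.round(coefs, 3))
```

The fit is still a linear regression. The model is linear in its coefficients, and the squared column is simply treated as one more input.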

Multiple regression is an extension of linear regression that allows predictions of systems with multiple independent variables. We do this by adding more terms to the linear regression equation, with each term representing the impact of a different physical parameter. When used with care, multiple regression models can simultaneously describe the physical principles acting on a data set and provide a powerful tool to predict the impacts of changes in the system described by the data.