What Is Standardization?
Standardization comes into the picture when features of the input data set have large differences between their ranges, or simply when they are measured in different units (e.g., pounds, meters, miles, etc.).
These differences in the ranges of initial features cause trouble for many machine learning models. For example, for the models that are based on distance computation, if one of the features has a broad range of values, the distance will be governed by this particular feature.
To illustrate this with an example: Say we have a two-dimensional data set with two features, height in meters and weight in pounds, that range respectively from 1 to 2 meters and 10 to 200 pounds. No matter what distance-based model you perform on this data set, the weight feature will dominate over the height feature and will have more contribution to the distance computation, just because it has bigger values compared to the height. So, to prevent this problem, transforming features to comparable scales using standardization is the solution.
How to Standardize Data?
Z-score is one of the most popular methods to standardize data, and can be done by subtracting the mean and dividing by the standard deviation for each value of each feature.
Once the standardization is done, all the features will have a mean of zero and a standard deviation of one, and thus, the same scale.
When to Standardize Data, and Why?
As seen above, for distance-based models, standardization is performed to prevent features with wider ranges from dominating the distance metric. But the reason we standardize data is not the same for all machine learning models, and differs from one model to another.
So, when should you standardize your data, and why?
1. Before Principal Component Analysis (PCA)
In principal component analysis, features with high variances or wide ranges get more weight than those with low variances, and consequently, they end up illegitimately dominating the first principal components (components with maximum variance). I used the word “illegitimately” here because the reason these features have high variances compared to the other ones is just because they were measured in different scales.
Standardization can prevent this, by giving the same weightage to all features.
2. Before Clustering
Clustering models are distance-based algorithms. In order to measure similarities between observations and form clusters they use a distance metric. So, features with high ranges will have a bigger influence on the clustering. Therefore, standardization is required before building a clustering model.
3. Before K-Nearest Neighbors (KNN)
K-nearest neighbors is a distance-based classifier that classifies new observations based on similar measures (e.g., distance metrics) with labeled observations of the training set. Standardization makes all variables contribute equally to the similarity measures.
4. Before Support Vector Machine (SVM)
Support vector machine tries to maximize the distance between the separating plane and the support vectors. If one feature has very large values, it will dominate over other features when calculating the distance. Standardization gives all features the same influence on the distance metric.
5. Before measuring variable importance in regression models
You can measure variable importance in regression analysis by fitting a regression model using the standardized independent variables and comparing the absolute value of their standardized coefficients. But, if the independent variables are not standardized, comparing their coefficients becomes meaningless.
6. Before Lasso and Ridge Regressions
Lasso and ridge regressions place a penalty on the magnitude of the coefficients associated with each variable, and the scale of variables will affect how much of a penalty will be applied on their coefficients. Coefficients of variables with a large variance are small and thus less penalized. Therefore, standardization is required before fitting both regressions.
Cases When Standardization Is Not Needed
Logistic Regressions and Tree-based models
Logistic regressions and tree-based algorithms such as decision trees, random forests and gradient boosting are not sensitive to the magnitude of variables. So standardization is not needed before fitting these kinds of models.
Wrapping Up Data Standardization
As we saw in this post, when to standardize and when not to depends on which model you want to use and what you want to do with it. Therefore, it’s very important for a ML developer to understand the internal functioning of machine learning algorithms, to be able to know when to standardize data and to build a successful machine learning model.
N.B: The list of models and methods for when standardization is required presented in this post is not exhaustive.
- 365DataScience.com: Explaining Standardization Step-By-Step
- Listendata.com: When and Why to Standardize a Variable