Anytime we’re trying to make an important decision, we try to collect as much information as possible and reach out to experts for advice. The more information we can gather, the more we (and those around us) trust the decision-making process.
Machine learning predictions follow similar behavior. Models process the given inputs and produce an outcome. The outcome is a prediction based on the patterns the model saw during the training process.
One model is not enough in many cases, and this article sheds light on this point. When and why do we need multiple models? How do we train those models? What kind of diversity should those models provide? So, let’s jump right in.
And if you want to see an example of how to build an ensemble model, you can skip to the end!
What Are Ensemble Models?
The technical challenges of building a single estimator include:
- High variance: The model is very sensitive to the particular inputs provided for the learned features.
- Low accuracy: One model (or one algorithm) to fit the entire training data might not provide you with the nuance your project requires.
- Feature noise and bias: The model relies too heavily on only a few features when making a prediction.
Ensemble Algorithm
A single algorithm may not make the perfect prediction for a given data set. Machine learning algorithms have their limitations, and producing a model with high accuracy is challenging. If we build and combine multiple models, we have a chance to boost the overall accuracy. We then implement the combination by aggregating the output from each model, with two objectives:
- Reducing the model error
- Maintaining the model’s generalization
You can implement such aggregation using different techniques, sometimes referred to as meta-algorithms.
Ensemble Learning
When we’re building ensemble models, we’re not only focusing on the algorithm’s variance. For instance, we could build multiple C4.5 decision trees, where each model learns a specific pattern and specializes in one part of the prediction problem. The models we combine to obtain a meta-model are called weak learners. In this ensemble learning architecture, the inputs are passed to each weak learner and their predictions are collected. The combined predictions are then used to build the final ensemble model.
One important thing to mention is that weak learners can have different ways of mapping features with variant decision boundaries.
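To make this concrete, here is a minimal sketch of combining a few weak learners, using scikit-learn’s VotingClassifier as one possible meta-algorithm; the synthetic data set and the choice of learners are illustrative assumptions, not part of the Titanic example later in this article.

# Combine three weak learners with soft (probability-averaging) voting
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each weak learner maps the features with a different decision boundary
weak_learners = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('dt', DecisionTreeClassifier(max_depth=3)),
    ('nb', GaussianNB()),
]

ensemble = VotingClassifier(estimators=weak_learners, voting='soft')
print(cross_val_score(ensemble, X, y, cv=5, scoring='accuracy').mean())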
Types of Ensemble Modeling Techniques
- Bagging
- Boosting
- Stacking
- Blending
Ensemble Techniques
Bagging
The idea of bagging is based on exposing the learning process to different views of the training data. Each model is trained on a slightly different subset of the training data set, and the models’ outputs are combined. Bagging reduces variance and minimizes overfitting. One example of such a technique is the random forest algorithm.
Bootstrap Aggregation (Bagging)
This technique is based on a bootstrapping sampling technique. Bootstrapping creates multiple sets of the original training data with replacement. Replacement enables the duplication of sample instances in a set. Each subset has the same equal size and can be used to train models in parallel.
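Here is a small sketch of the bootstrapping step itself, on a toy data set assumed for illustration: each subset is the same size as the original and is drawn with replacement, so some rows repeat and others are left out.

# Draw three bootstrap subsets of a toy 10-row training set
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10).reshape(10, 1)
y = np.arange(10)

n_models = 3
for i in range(n_models):
    # Sampling with replacement: the same row can appear more than once
    idx = rng.integers(0, len(X), size=len(X))
    X_boot, y_boot = X[idx], y[idx]
    print(f"bootstrap set {i}: rows {sorted(idx.tolist())}")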
Random Forest
This technique uses a random subset of the training samples, as well as a random subset of the features, when building each split tree. Multiple decision trees are built, one per bootstrapped training set, and their predictions are combined.
Extra-Trees Ensemble
Here’s another ensemble technique where the predictions of many decision trees are combined. Similar to random forest, it combines a large number of decision trees; however, extra trees use the whole training sample for each tree while choosing the split points randomly.
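A brief sketch contrasting the two tree ensembles on a synthetic data set (the data and hyperparameters are assumptions for illustration only):

# Random forest vs. extra trees on the same synthetic problem
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Random forest: bootstrap samples plus a random subset of features per split
rf = RandomForestClassifier(n_estimators=200, max_features='sqrt', random_state=0)

# Extra trees: whole sample by default, with split thresholds chosen at random
et = ExtraTreesClassifier(n_estimators=200, max_features='sqrt', random_state=0)

for name, model in [('random forest', rf), ('extra trees', et)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())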
Boosting
Boosting trains models sequentially: each new model focuses on correcting the errors made by the models before it.
Adaptive Boosting (AdaBoost)
This is an ensemble of algorithms, where we build models on top of several weak learners. As we mentioned earlier, those learners are called weak because they are typically simple, with limited prediction capabilities. The adaptation capability of AdaBoost made this technique one of the earliest successful binary classifiers.
Sequential Decision Trees
Sequential decision trees are the core of this adaptability: each tree adjusts the sample weights based on the accuracy of the trees before it. Hence, the training in this technique is a sequential (rather than parallel) process. Training and measuring the estimation error can be repeated for a given number of iterations, or until the error rate stops changing significantly.
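The sketch below shows this sequential behavior with scikit-learn’s AdaBoostClassifier built on decision stumps; the synthetic data and parameter values are illustrative assumptions.

# AdaBoost over decision stumps; staged_score shows accuracy per boosting round
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A depth-1 tree (stump) is a deliberately weak learner
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, learning_rate=0.5)
ada.fit(X_tr, y_tr)

# Test accuracy after 10, 20, ..., 50 sequential boosting rounds
for i, score in enumerate(ada.staged_score(X_te, y_te), start=1):
    if i % 10 == 0:
        print(i, round(score, 3))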
Gradient Boosting
Gradient boosting algorithms are great techniques with high predictive performance. XGBoost, LightGBM and CatBoost are popular gradient boosting libraries you can use for regression and classification problems. Their popularity has increased significantly since they were used in several winning Kaggle solutions.
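As a quick taste (assuming the xgboost package is installed; LightGBM and CatBoost expose very similar fit/predict interfaces), a classifier can be trained and cross-validated in a few lines:

# A minimal XGBoost classification sketch on synthetic data
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                        objective='binary:logistic')
print(cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean())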
Stacking
Stacking, like boosting, aims to produce a more robust predictor. It is the process of learning how to create such a stronger model from all the weak learners’ predictions.
Note that the stacking algorithm learns from each model’s predictions, treating them as features.
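Scikit-learn wraps this whole pattern in StackingClassifier; the following sketch (with an assumed synthetic data set and an arbitrary pair of base learners) shows a final logistic regression learning from the base models’ out-of-fold predictions.

# Stacking: the final_estimator sees the base learners' predictions as features
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

base_learners = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
    ('svc', SVC(probability=True, random_state=0)),
]

stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(), cv=5)
print(cross_val_score(stack, X, y, cv=5).mean())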
Blending
Blending is similar to the stacking approach, except the final model learns from a hold-out validation set (rather than out-of-fold training predictions) along with the base models’ predictions on it. Hence, the features used are extended to include the validation set.
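A compact sketch of the blending idea, with a synthetic data set and an arbitrary choice of base models (both assumptions): the base models are fit on the training split, and the meta-model is fit on a hold-out validation split whose features are extended with the base models’ predictions.

# Blending: meta-model trained on validation features + base-model predictions
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

base_models = [RandomForestClassifier(n_estimators=100, random_state=0),
               LogisticRegression(max_iter=1000)]
for m in base_models:
    m.fit(X_train, y_train)

# Extend the validation and test features with each base model's predicted probability
val_features = np.column_stack([X_val] + [m.predict_proba(X_val)[:, 1] for m in base_models])
test_features = np.column_stack([X_test] + [m.predict_proba(X_test)[:, 1] for m in base_models])

blender = LogisticRegression(max_iter=1000).fit(val_features, y_val)
print(blender.score(test_features, y_test))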
Classification Problems
Classification is simply a categorization process. If we have multiple labels, we need to decide: Should we build a single multi-label classifier, or should we build multiple binary classifiers? If we decide to build a number of binary classifiers, we need to interpret each model’s prediction. For instance, if we want to recognize four objects, each model tells us whether the input data belongs to its category and provides a probability of membership. We can then build a final ensemble model that combines those classifiers.
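Here is a small sketch of the multiple-binary-classifiers option using scikit-learn’s OneVsRestClassifier; the four-class synthetic data set is an assumption for illustration.

# One binary logistic regression per class; each reports a membership probability
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Each column is one binary model's membership probability; argmax picks the class
probs = ovr.predict_proba(X[:3])
print(probs.round(2), probs.argmax(axis=1))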
Regression Problems
In the previous section, we determined the best-fitting membership using the resulting probability. In regression problems, we are not dealing with yes-or-no questions; instead, we need to find the best predicted numerical values, and we can then average the collected predictions.
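A brief sketch of averaging regression predictions (the data set and the particular regressors are assumptions; any models with a predict method would do):

# Average the predictions of three different regressors
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [Ridge(), RandomForestRegressor(n_estimators=100, random_state=0),
          GradientBoostingRegressor(random_state=0)]
preds = [m.fit(X_tr, y_tr).predict(X_te) for m in models]

# The ensemble prediction is simply the mean of the individual predictions
ensemble_pred = np.mean(preds, axis=0)
print(ensemble_pred[:5].round(1))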
Aggregating Predictions
When we ensemble multiple algorithms and combine multiple models into a single prediction, we need an aggregating method. We can use three main techniques, sketched in code after this list:
- Max Voting: The final prediction in this technique is made based on majority voting for classification problems.
- Averaging: This technique is typically used for regression problems, where we average the predictions. We can use probabilities as well, for instance, by averaging them to produce the final classification.
- Weighted Average: Sometimes, we need to give weights to some models/algorithms when producing the final predictions.
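The sketch below implements all three aggregation methods by hand with NumPy; the prediction arrays are toy values standing in for real model outputs.

# Max voting, averaging and weighted averaging on toy predictions
import numpy as np

# Three classifiers' class votes for five samples (max voting)
votes = np.array([[1, 0, 1, 1, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 1, 1]])
majority = (votes.sum(axis=0) >= 2).astype(int)

# Three regressors' predictions for the same samples (simple average)
preds = np.array([[10.0, 12.0, 9.5, 11.0, 13.0],
                  [11.0, 11.5, 10.0, 12.0, 12.5],
                  [9.0, 12.5, 9.0, 11.5, 13.5]])
average = preds.mean(axis=0)

# Weighted average: trust the first model twice as much as the others
weights = np.array([0.5, 0.25, 0.25])
weighted = np.average(preds, axis=0, weights=weights)

print(majority, average, weighted, sep="\n")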
How to Build an Ensemble Model: An Example
We’ll use the following example to illustrate how to build an ensemble model. We’ll use the Titanic data set and try to predict passenger survival using different techniques. A sample of the data set and the distribution of the target column across passengers’ ages are shown in Figures 6 and 7. If you want to follow along, you can find the sample code on my GitHub.
The Titanic data set is one of the classification problems that needs extensive feature engineering. Figure 8 below shows the strong correlation between some features, such as Parch (parents and children) and Family Size. We will focus only on the model building and how the ensemble model can be applied to this use case.
We will use different algorithms and techniques; therefore, we will create a model object to increase code reusability.
# Model class to be used for different ML algorithms
from sklearn.model_selection import KFold, cross_val_score

class ClassifierModel(object):
    def __init__(self, clf, params=None):
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def fit(self, x, y):
        return self.clf.fit(x, y)

    def feature_importances(self, x, y):
        return self.clf.fit(x, y).feature_importances_

    def predict(self, x):
        return self.clf.predict(x)


def trainModel(model, x_train, y_train, x_test, n_folds, seed):
    # shuffle=True is required when passing a random_state to KFold
    cv = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = cross_val_score(model.clf, x_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores
Random Forest Classifier
# Random Forest parameters
rf_params = {
'n_estimators': 400,
'max_depth': 5,
'min_samples_leaf': 3,
'max_features' : 'sqrt',
}
rfc_model = ClassifierModel(clf=RandomForestClassifier, params=rf_params)
rfc_scores = trainModel(rfc_model,x_train, y_train, x_test, 5, 0)
rfc_scores
Extra Trees Classifier
# Extra Trees Parameters
et_params = {
'n_jobs': -1,
'n_estimators':400,
'max_depth': 5,
'min_samples_leaf': 2,
}
etc_model = ClassifierModel(clf=ExtraTreesClassifier, params=et_params)
etc_scores = trainModel(etc_model, x_train, y_train, x_test, 5, 0) # Extra Trees
etc_scores
AdaBoost Classifier
# AdaBoost parameters
ada_params = {
'n_estimators': 400,
'learning_rate' : 0.65
}
ada_model = ClassifierModel(clf=AdaBoostClassifier, params=ada_params)
ada_scores = trainModel(ada_model, x_train, y_train, x_test, 5, 0) # AdaBoost
ada_scores
Gradient Boosting Classifier
# Gradient Boosting parameters
gb_params = {
'n_estimators': 400,
'max_depth': 6,
}
gbc_model = ClassifierModel(clf=GradientBoostingClassifier, params=gb_params)
gbc_scores = trainModel(gbc_model, x_train, y_train, x_test, 5, 0) # Gradient Boosting
gbc_scores
Let’s compare the cross-validation accuracy of all the models across the five folds.
Now let’s build a stacking model, where a new, stronger model learns from the predictions of all these weak learners. The label vector used to train the previous models will remain the same; the features are the predictions collected from each classifier.
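The *_train_pred arrays below are each base model’s predictions on the training rows; the full notebook on GitHub builds them (including svc_train_pred from an additional support-vector classifier not shown above). One way they could be produced, as an assumption rather than the notebook’s exact approach, is with out-of-fold predictions from cross_val_predict, so that no model predicts on rows it was trained on:

# Out-of-fold predictions for each base classifier (illustrative approach)
from sklearn.model_selection import cross_val_predict

rfc_train_pred = cross_val_predict(rfc_model.clf, x_train, y_train, cv=5)
etc_train_pred = cross_val_predict(etc_model.clf, x_train, y_train, cv=5)
ada_train_pred = cross_val_predict(ada_model.clf, x_train, y_train, cv=5)
gbc_train_pred = cross_val_predict(gbc_model.clf, x_train, y_train, cv=5)
# svc_train_pred would come from a support-vector classifier trained the same way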
x_train = np.column_stack(( etc_train_pred, rfc_train_pred, ada_train_pred, gbc_train_pred, svc_train_pred))
Now let’s see whether an XGBoost model that learns only from these predictions performs better. But first, let’s take a quick look at the correlations between the classifiers’ predictions.
We’ll now build a model to combine the predictions from multiple contributing classifiers.
# Level-1 model: an XGBoost classifier trained on the stacked predictions
import xgboost as xgb

def trainStackModel(x_train, y_train, x_test, n_folds, seed):
    cv = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    gbm = xgb.XGBClassifier(
        n_estimators=2000,
        max_depth=4,
        min_child_weight=2,
        gamma=0.9,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='binary:logistic',
        scale_pos_weight=1).fit(x_train, y_train)
    scores = cross_val_score(gbm, x_train, y_train, scoring='accuracy', cv=cv)
    return scores
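Calling the helper mirrors the earlier snippets; note that x_train here is the stacked prediction matrix built above, so the five-fold scores measure the Level-1 model on those features.

stack_scores = trainStackModel(x_train, y_train, x_test, 5, 0)
stack_scores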
The previously created base classifiers represent the Level-0 models, and the new XGBoost model represents the Level-1 model. The combination illustrates a meta-model trained on the predictions of the base models. You can see a quick comparison between the accuracy of the new stacking model and the base classifiers below:
Careful Considerations
- Noise, Bias and Variance: Combining the decisions from multiple models can help improve the overall performance. Hence, one of the key reasons to use ensemble models is to overcome noise, bias and variance. If the ensemble does not collectively improve accuracy in such a situation, carefully rethink whether it’s needed at all.
- Simplicity and Explainability: Machine learning models, especially those put into production environments, should be simple to explain. The chance that you’ll be able to explain the final model’s decision drops drastically when you use an ensemble.
- Generalization: Ensemble models are often claimed to generalize better, but reported use cases have also shown higher generalization errors. An ensemble without a careful training process can therefore quickly produce a highly overfit model.
- Inference Time: Although we might accept a longer model training time, inference time is still critical. When deploying ensemble models into production, the time needed to pass data through multiple models increases and could slow down the prediction tasks’ throughput.
Ensemble models are an excellent method for machine learning because they offer a variety of techniques for classification and regression problems. Now you know the types of ensemble models, how to build a simple ensemble model and how ensembles boost model accuracy.