I’ve written previously about random forest regression, so now it’s time to dig deeper. Let’s jump into ensemble learning and how to implement it using Python. If you’d like to follow along with the tutorial, make sure to pull up the code.

What Is a Random Forest Classifier?

A random forest classifier is an ensemble, tree-based learning algorithm. It consists of a set of decision trees, each trained on a randomly selected subset of the training set, and it aggregates the votes from the different decision trees to decide the final class of the test object.


Ensemble Algorithm 

Ensemble algorithms combine more than one algorithm, of the same or a different kind, to classify objects. For example, we might run predictions over naive Bayes, an SVM and a decision tree, and then take a vote to decide the final class of the test object.
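That voting scheme can be sketched with scikit-learn's `VotingClassifier`. The article doesn't show this code, so the data set (scikit-learn's built-in breast cancer data) and the default model settings below are stand-in choices for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data set for illustration.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hard voting: each model predicts a class, and the majority wins.
ensemble = VotingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("svm", SVC(random_state=0)),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```

With `voting="hard"` each estimator casts one vote per test object; `voting="soft"` would instead average predicted class probabilities, which requires every estimator to support `predict_proba`.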

Structure of random forest classification



Types of Random Forest Models

1. Random forest prediction for a classification problem:
f(x) = majority vote of the classes predicted by all B trees

2. Random forest prediction for a regression problem:
f(x) = sum of all B subtree predictions divided by B (i.e., their average)
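The two aggregation rules can be sketched in a few lines of NumPy. The per-tree predictions below are hypothetical values for a single test point over B = 5 trees:

```python
import numpy as np

# Hypothetical per-tree predictions for one test point, B = 5 trees.
class_votes = np.array([1, 0, 1, 1, 0])          # classification: each tree votes a class
reg_preds = np.array([2.0, 2.5, 3.0, 2.0, 2.5])  # regression: each tree predicts a value

# Classification: majority vote over the B trees.
f_class = np.bincount(class_votes).argmax()

# Regression: sum of the subtree predictions divided by B.
f_reg = reg_preds.sum() / len(reg_preds)

print(f_class, f_reg)  # → 1 2.4
```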


An Example of Random Forest Classification 

Nine different decision tree classifiers
Aggregated result for the nine decision tree classifiers

We can aggregate the nine decision tree classifiers shown above into a random forest ensemble that combines their outputs (on the right). You can think of the horizontal and vertical axes of the decision tree plots as features x1 and x2. At certain values of each feature, a decision tree outputs a classification of blue, green, red, etc.

The above results are aggregated, through model votes or averaging, into a single ensemble model that ends up outperforming any individual decision tree’s output.


Why Do We Use Random Forest Algorithm?

  • It is one of the most accurate learning algorithms available. For many data sets, it produces a highly accurate classifier.
  • It runs efficiently on large databases.
  • It can handle thousands of input variables without variable deletion.
  • It gives estimates of what variables are important in the classification.
  • It generates an internal unbiased estimate of the generalization error as the forest building progresses.
  • It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
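Two of the properties above, variable importance estimates and the internal generalization-error estimate, are exposed directly by scikit-learn. A minimal sketch, again using the built-in breast cancer data as a stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True computes the internal out-of-bag estimate of
# generalization accuracy as the forest is built.
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)

print(clf.oob_score_)                     # out-of-bag accuracy estimate
print(clf.feature_importances_.argmax())  # index of the most important feature
```

The out-of-bag estimate uses, for each training row, only the trees that did not see that row in their bootstrap sample, so no separate validation set is needed.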

Disadvantages of Random Forest

  • Random forests have been observed to overfit for some data sets with noisy classification/regression tasks.
  • For data including categorical variables with different numbers of levels, random forests are biased in favor of those attributes with more levels. Therefore, the variable importance scores from random forest are not reliable for this type of data.


Implementing Random Forest Classification on a Real-World Data Set

1. Importing Python Libraries and Loading our Data Set into a Data Frame
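The article's original code isn't reproduced here, so the sketch below uses scikit-learn's built-in breast cancer data as a stand-in for the real-world data set:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Stand-in data set: load it into a pandas data frame,
# with the class label in a "target" column.
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target
print(df.shape)  # → (569, 31)
```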



2. Splitting our Data Set Into Training Set and Test Set
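A minimal sketch of the split, assuming the same stand-in data as step one; the 25 percent test fraction and fixed `random_state` are illustrative choices, not the article's:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Stand-in data, as in step one.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the rows as a test set; random_state fixes the shuffle
# so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```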


More From Built In ExpertsHow to Get Started With Regression Trees


3. Creating a Random Forest Classification Model and Fitting It to the Training Data
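A sketch of this step with scikit-learn's `RandomForestClassifier`, built on the same stand-in data and split as the previous steps; `n_estimators=100` (the number of trees, B) is scikit-learn's default, shown explicitly here:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data and split, as in steps one and two.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Build a forest of 100 trees and fit it to the training data.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the test set
```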



4. Predicting the Test Set Results and Making the Confusion Matrix
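A sketch of the final step, continuing from the stand-in data and model above; `confusion_matrix` tabulates true labels against predicted labels:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data, split and model, as in the previous steps.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Predict the test set and tabulate true vs. predicted classes.
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)  # rows: true class, columns: predicted class
```

The diagonal of the matrix counts correct predictions; the off-diagonal entries count the two kinds of misclassification.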


There you have it! Now you know the basics of the random forest classifier and how to implement it in Python. It’s time to try it for yourself. Good luck!
