Data scientists use Artificial Intelligence (AI) for...just about anything and everything. AI can run control systems to reduce building energy consumption, provide recommendations for clothes to buy or shows to watch, help improve farming practices to increase the amount of food we can grow sustainably, and someday an AI algorithm may even drive our cars.
Fortunately, getting started with AI isn’t all that challenging for those already experienced with Python and data analysis. You can leverage the powerful scikit-learn package to do most of the hard work for you.
Scikit-learn is a Python package designed to facilitate use of machine learning and AI algorithms. This package includes algorithms for classification, regression and clustering, including random forests and gradient boosting. Scikit-learn was designed to interface easily with the common scientific packages NumPy and SciPy. Although scikit-learn wasn’t specifically designed to, it also interfaces excellently with Pandas.
What Is Scikit-Learn?
Scikit-learn includes useful tools to facilitate use of machine learning algorithms. Developing machine learning pipelines that accurately predict the behavior of a system requires splitting the data into training and testing sets, scoring the algorithms to determine how well they function and ensuring models are neither overfit nor underfit.
How Do Scikit-Learn Algorithms Work?
We can develop and test scikit-learn algorithms in three general steps.
3 Steps to Develop and Test Scikit-Learn Algorithms
 1. Train the model using an existing data set describing the phenomena you need the model to predict.
 2. Test the model on another existing data set to ensure it performs well.
 3. Use the model to predict phenomena as needed for your project.
The scikit-learn application programming interface (API) provides commands to perform each of these steps with a single function call. All scikit-learn algorithms use the same function calls for this process, so if you learn it for one you learn it for all.
The function call to train a scikit-learn algorithm is .fit(). To train each model you call the .fit() function and pass it two components of the training data set: the x data set, which describes the features, and the y data set, which describes the targets.
Note: Features and targets are essentially the machine learning terms for x and y data.
The algorithm then creates a mathematical model, as determined by the selected algorithm and the parameters of the model. The mathematical model matches the provided training data as well as possible. The algorithm then stores the fitted parameters in the model, which allows you to call the fitted model as needed for your project.
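As a minimal sketch of what fitting looks like, the example below trains a linear regression on a small synthetic data set. The data, the random seed and the equation behind it are assumptions made up for illustration; they are not part of the HPWH example used later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (an assumption for illustration):
# y = 3*x0 + 2*x1 + 5
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 2))  # features: 50 rows, 2 columns
y = 3 * x[:, 0] + 2 * x[:, 1] + 5     # targets: one value per row

model = LinearRegression()
model.fit(x, y)  # learns the coefficients and stores them on the model

# Fitted parameters live in attributes that end with an underscore
print(model.coef_)       # approximately [3. 2.]
print(model.intercept_)  # approximately 5.0
```

Because the synthetic data is exactly linear, the fitted coefficients recover the equation used to generate it. Real data will rarely be this tidy.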
The function to test the fit of the model is .score(). To use it, you again pass an x data set, which represents the features, and the corresponding y data set, which represents the targets.
It’s important that the data set you use to test the model (your testing data set) is different from the data set you use to train it. A model is quite likely to score very well on the training data because you mathematically forced it to match that data set. The real test is how well the model performs on a different data set, which is the purpose of the testing data set. When you call .score() on a regression model, scikit-learn returns the r² value, which states how well the model predicted the provided y data set from the x data set. (Classification models return mean accuracy instead.)
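Here is a small sketch of scoring a model on both the data it was trained on and data it has never seen. The noisy synthetic data and the simple slice-based split are assumptions chosen to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption): a straight line plus random noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(200, 1))
y = 4 * x[:, 0] + 1 + rng.normal(0, 2, size=200)

# Hold back the last 50 rows so the model never sees them during training
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

model = LinearRegression().fit(x_train, y_train)

train_score = model.score(x_train, y_train)  # r² on the training data
test_score = model.score(x_test, y_test)     # r² on unseen data
print(train_score, test_score)
```

The testing score is the one to trust when judging the model, since the training score only shows how well the model matches data it was fit to.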
You can predict outputs of a system given the provided inputs using scikit-learn’s .predict() function. It’s important you only do this after fitting the model. Fitting is how you adjust the model to match the data set, so an unfitted model has nothing to base a prediction on (scikit-learn raises an error if you try). Once you’ve fit the model, you can pass an x data set to the .predict() function and it will return a y data set as predicted by the model. In this way, you can predict how a system will behave in the future.
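The sketch below shows both halves of that advice: calling .predict() on an unfitted model, then fitting and predicting new points. The four-point data set is an assumption made up for illustration. NotFittedError is the exception scikit-learn raises for unfitted estimators.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression

# Tiny illustrative data set (an assumption): y = 2*x + 1
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()

# Predicting before fitting fails because no parameters exist yet
try:
    model.predict(x)
    predicted_before_fit = True
except NotFittedError:
    predicted_before_fit = False

# After fitting, the model can predict targets for new feature values
model.fit(x, y)
x_new = np.array([[4.0], [5.0]])
y_new = model.predict(x_new)
print(y_new)  # approximately [ 9. 11.]
```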
These three functions form the core of the scikit-learn API, and go a long way to applying AI to your technical problems.
How Do I Create Training and Testing Data Sets?
Creating separate training and testing data sets is a critical component of training AI models. Without this step we can’t create a model that matches the system we’re trying to predict, nor can we verify the accuracy of our predictions. Fortunately, scikit-learn provides a useful tool to facilitate this process: the train_test_split() function.
The train_test_split() function does exactly what it sounds like it does. It splits a provided data set into training and testing data sets. You can use it to create the data sets that you need to ensure your model correctly predicts the system you’re studying. You provide a data set to train_test_split() and it provides the training and testing data sets that you need.
There are a few things to be careful of when using train_test_split(). First, train_test_split() is random in nature, which means it won’t return the same training and testing data sets if you run it with the same input data multiple times. This can be good if you want to test the variability of the model’s accuracy, but it can also be bad if you want to repeatedly use the same data set on the model. To ensure you get the same result every time you can use the random_state parameter. The random state setting will force train_test_split() to use the same randomization seed every time you run it and provide the same data set splits. When using random_state it’s customary to set it to 42, probably as a humorous nod to The Hitchhiker’s Guide to the Galaxy more than for any technical reason.
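To see the effect of random_state, the sketch below splits the same small data set twice with the same seed and checks that the splits match. The arrays are placeholder values assumed for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (an assumption): 20 rows of features and 20 targets
x = np.arange(40).reshape(20, 2)
y = np.arange(20)

# By default, 25 percent of the rows go to the testing data set
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

# Splitting again with the same seed returns exactly the same rows
x_train2, x_test2, y_train2, y_test2 = train_test_split(x, y, random_state=42)

print(len(x_train), len(x_test))         # 15 5
print(np.array_equal(y_test, y_test2))   # True
```

If you leave random_state out, each call shuffles the rows differently, so the two splits would generally not match.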
How Does This All Work Together?
All combined, these tools create a streamlined interface to create and use scikit-learn tools. Let’s talk through it using the example of scikit-learn’s linear regression model.
To implement this process we must first import our tools: the scikit-learn model, the train_test_split() function, and Pandas for the data analysis process. We import the functions as follows:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
We can then read in a data set so it’s available for training and testing the model. To help people learn data science and engineering skills, I’ve created a realistic data set demonstrating the performance of heat pump water heaters (HPWHs) as a function of their operating conditions. Assuming you’ve downloaded that data set and saved it in the same folder as your script, you can open it using the following line of code. If not, you can adjust these steps as needed to practice on any data set you like.
data = pd.read_csv('COP_HPWH_f_Tamb&Tavg.csv', index_col = 0)
The next step is to split the data set into the x and y data. To do this we create new data frames specifying the columns of the data set that represent the features and the targets. In the case of HPWHs, the features are tank temperature and ambient temperature while the target is electricity consumption.
The data set contains eight columns showing the water temperature at eight different depths in the water storage tank, each named Tx (deg F), where x is a number representing the location of the measurement.
The data set also contains a column showing the measured ambient temperature in the space surrounding the water heater, named T_Amb (deg F). Finally, the data set contains a column storing electricity consumption data called P_Elec (W). In this case, it’s also important to filter our data set so that we only use data from when the system is using electricity. If we skip that step we’ll introduce nonlinearity into a linear model, which sets the model up to fail.
We can accomplish all those steps using the following code:
# Filter the data to only include points where power > 0
data = data[data['P_Elec (W)'] > 0]
# Identify X columns and create the X data set
x_columns = ['T_Amb (deg F)']
for i in range(1, 9):
    x_columns.append('T{} (deg F)'.format(i))
x = data[x_columns]
# Create the y data set
y = data['P_Elec (W)']
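If you don’t have the HPWH file on hand, the sketch below builds a stand-in data frame with the same column names so you can practice the filtering and column selection steps. Every value in it is a random placeholder assumed for illustration; it does not reproduce the real data set or its behavior.

```python
import numpy as np
import pandas as pd

# Stand-in data with the article's column names; all values are placeholders
rng = np.random.default_rng(4)
n_rows = 100
columns = {'T_Amb (deg F)': rng.uniform(50, 90, n_rows)}
for i in range(1, 9):
    columns['T{} (deg F)'.format(i)] = rng.uniform(70, 130, n_rows)

# Set power to zero for roughly half the rows to mimic the heat pump being off
power = rng.uniform(300, 600, n_rows)
power[rng.random(n_rows) < 0.5] = 0.0
columns['P_Elec (W)'] = power
data = pd.DataFrame(columns)

# Same steps as above: filter, then select the feature and target columns
data = data[data['P_Elec (W)'] > 0]
x_columns = ['T_Amb (deg F)'] + ['T{} (deg F)'.format(i) for i in range(1, 9)]
x = data[x_columns]
y = data['P_Elec (W)']
print(x.shape, y.shape)
```

The list comprehension builds the same nine column names as the for loop, just in one line.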
Now that we have x and y data sets we can split them into training and testing data sets. We do this by calling scikit-learn’s train_test_split() function as follows.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 42)
Now that we have training and testing data sets ready to go, we can create and fit the linear regression model to the data set. First, we create an instance of the model, then call the .fit() function as follows.
model = LinearRegression()
model = model.fit(x_train, y_train)
Note this implementation used the default parameters of the linear regression model. This may or may not yield a good fit to the data, and we may need to change the parameters to get a good fit. For now, let’s use the default parameters to learn these concepts.
The next step is to score the model on the testing data set to ensure it fits the data set well. You can do this by calling .score() and passing the testing data.
score = model.score(x_test, y_test)
If the model scores well on the testing data set then chances are you have a model that’s both well trained and appropriate for the data set. If the model doesn’t score well, then you need to consider gathering more data, adjusting the parameters of the model or using a different model entirely.
If the model performs well, then you can declare the model ready to use and start predicting the system’s behavior. Since we don’t have an additional data set to predict right now, we can simply predict the output on the testing data set. To do that you call the .predict() function as follows.
predict = model.predict(x_test)
The predict variable will now hold the predicted output of the system when exposed to the inputs as defined by x_test. You can then compare these outputs directly against the values in y_test, which enables you to investigate the model fit and prediction accuracy more carefully.
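One way to make that comparison is to look at the residuals, meaning the differences between measured and predicted values. The sketch below uses synthetic stand-in data rather than the HPWH file, and it summarizes the residuals with mean_absolute_error from sklearn.metrics, which this walkthrough doesn’t otherwise use. Treat both as assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): two features and a noisy linear target
rng = np.random.default_rng(2)
x = rng.uniform(40, 100, size=(300, 2))
y = 5 * x[:, 0] - 2 * x[:, 1] + 300 + rng.normal(0, 10, size=300)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)
model = LinearRegression().fit(x_train, y_train)
predict = model.predict(x_test)

# Residuals: how far each prediction lands from the measured value
residuals = y_test - predict

# Mean absolute error: the typical size of a miss, in the units of y
mae = mean_absolute_error(y_test, predict)
print(residuals[:5])
print(mae)
```

A residual pattern that grows or curves with one of the features is a hint that the model is missing part of the system’s behavior.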
How Well Does This Model Perform?
Since we calculated the score of the model and saved it to the variable score we can quickly see how well the model predicts the electricity consumption of the HPWH. In this case, the model’s score is 0.58.
R² is a metric that maxes out at one because one indicates the model perfectly explains the behavior of the system. The lower the value, the worse the fit (and, yes, it can be negative). An r² value of 0.58 indicates the model explains a bit of the observed behavior, but isn’t great.
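To make those properties concrete, the sketch below computes r² by hand as one minus the ratio of the residual sum of squares to the total sum of squares, then checks it against scikit-learn’s r2_score. The four-point data set and the r_squared helper are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """Hypothetical helper: r² = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)           # error left after the model
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # spread around the mean
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y_true, y_true)                      # 1.0: exact match
mean_only = r_squared(y_true, np.full(4, 2.5))           # 0.0: no better than the mean
bad = r_squared(y_true, np.array([4.0, 3.0, 2.0, 1.0]))  # -3.0: worse than the mean

print(perfect, mean_only, bad)
print(r2_score(y_true, np.array([4.0, 3.0, 2.0, 1.0])))  # matches the hand calculation
```

A score of zero means the model does no better than always guessing the average, which is why negative scores signal a model that is actively misleading.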
As a reminder, to make improvements we can:
Gather more data
Adjust the parameters of the model
Use a different model entirely
We certainly could gather more data or adjust the parameters of the linear regression model, but the core issue here is the relationship between heat pump power consumption and water temperature is likely nonlinear. It’s hard for a linear model to predict something that’s not linear!
We can try the same method using a model that’s designed for nonlinear systems and see if we get better results. One possible model is the random forest regressor. We can try it by adding the following code to the end of the script.
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(x_train, y_train)
score = model.score(x_test, y_test)
This method yields a very high score of 0.9999, which is suspicious in the opposite direction. There’s a reasonable chance this model is overfit to the data set and won’t actually yield realistic predictions in the future. Unfortunately, that isn’t something we can truly determine given the available data set. If you use this model to start predicting the system, you’ll need to monitor it carefully to see how it performs as more data becomes available, and keep training it. Plotting the predictions against the measured data would also provide insight into how the model behaves.
For this particular data set, I’ll say I trust this model because this data set doesn’t contain actual measured data; it’s an example data set I created by implementing a regression equation to show how a HPWH behaves in these conditions. This means the random forest regressor probably matches the data so well because it identified the equation I used to create the data set.
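You can reproduce the linear-versus-nonlinear effect on data you generate yourself. The sketch below fits both models to a synthetic quadratic relationship, using the same fit and score calls for each. The data, the seeds and the random_state passed to the forest are assumptions for illustration, so the scores will not match the HPWH results above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic nonlinear data (an assumption): y = x² plus a little noise
rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=(500, 1))
y = x[:, 0] ** 2 + rng.normal(0, 0.1, size=500)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

# Both models share the same API, so one loop can train and score them
models = [
    ("linear", LinearRegression()),
    ("forest", RandomForestRegressor(random_state=42)),
]
scores = {}
for name, model in models:
    model.fit(x_train, y_train)
    scores[name] = model.score(x_test, y_test)

print(scores)
```

On data like this, the straight line cannot follow the curve and scores poorly, while the random forest scores close to one.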
And with that, you should be in great shape to begin using scikit-learn to implement machine learning and AI! If you remember that all scikit-learn algorithms use the fit(), score() and predict() functions, and that you can create your data sets using train_test_split(), then you’re well on your way to predicting system behavior.