Automated Data Mining with Python Scripts

Working with large data sets can be a challenge. Learn how to automate your analysis using these simple Python scripts.

Written by Peter Grant
Jessica Powers | Dec 13, 2022

Scientists often find themselves with large data sets — sometimes in the form of gigabytes worth of data in a single file. Other times they’ll have hundreds of files, each containing a small amount of data. Either way, this much data is hard to manage, hard to make sense of, and even hard for your computer to process. You need a way to simplify the process, to make the data set more manageable and to help you keep track of everything.

That’s what this tutorial is all about. We’ll write Python scripts that automatically analyze all of your data and store the results with meaningful, intuitive file names, all while working through a real-world example. This way you know the skills you’re developing are practical and useful.

The introduction to the tutorial explained the concepts we’re using. If “heat pump water heater,” “coefficient of performance (COP),” and “performance map” don’t mean anything to you, check it out.

Part two introduced the companion data set, and split the data set into multiple files with user-friendly names.

(The companion data set is a valuable part of the tutorial process, as it allows you to follow along. You can write the exact same code that I’ll present, run the code, see the results and compare it to results I present.)

We now have three data files, each containing test results at a specified ambient temperature. In each of those files we see electricity consumption of the heat pump, the temperature of water in the storage tank, and the air temperature surrounding the water heater.

The next step is to process those results. We need to write code that automatically makes sense of the data, calculates the COP of the heat pump and plots the data so we can understand it visually.

So let’s get started. As with all Python programming, we need to first import packages.



Prework: What Packages Do I Import?

There are several packages that’ll be important for this step. They are:

Key Packages for Automating Data Analysis With Python

  • glob: glob finds every file in a folder that matches a specified pattern and returns the results as a list. Since Python is great at iterating through lists, we can use glob to create a list of all of our data files and have Python analyze them one at a time.
  • Pandas: Need I state the importance of this package? Pandas is the gold standard of Python's data analysis tools.
  • os: os gives us access to operating system commands, which sounds intimidating because misusing them could potentially harm your computer. Don’t worry; we’ll only be using it to check for and add folders.
  • NumPy: NumPy is a great numerical package that enables powerful data analysis tools. In this specific case we’ll be using NumPy to create regressions of our data.
  • Bokeh: Bokeh is an interactive plotting tool in Python. It enables you to write code that automatically generates plots while analyzing your data, and gives options for the user to interact with them. 

We won’t use Bokeh in the script for this portion of the tutorial; however, this script will be the basis for future parts of the tutorial. It’s worth importing Bokeh now so you don’t have to worry about it later.

For this project we’ll need to import the entirety of glob, Pandas, os and NumPy, while only importing certain functions from Bokeh. These can all be called in using the following code:

import glob
import pandas as pd
import os
import numpy as np
from bokeh.plotting import figure, save, output_file
from bokeh.layouts import gridplot

Note we imported and assigned Pandas to pd while we assigned NumPy to np. This means we can now reference these packages by writing “pd” and “np” instead of Pandas and NumPy, respectively.

Now that we’ve imported all of our packages, the next step is to create our glob list and a for loop that sequentially iterates through all of the files. Once that’s in place, we can write the code that analyzes each data file.



How Do I Iterate Through My Files?

The first step is to configure glob. To do this, you need to provide glob with a path and a file type. (The file type is optional, but all of our data is in .csv format, so we want to specify it to avoid including extraneous files.) When you call glob, it will create a list of all files of that type in the specified folder. You can do this using the following lines of code:

Path = r'C:\Users\YourName\Documents\AutomatingDataAnalysis\Files_IndividualTests'

Filenames = glob.glob(Path + '/*.csv')

The Path variable tells glob to look in the specified folder for relevant files. Note the folder specified above is the same folder you saved files to in part two of this tutorial.

The second line calls the glob function from the glob package to create a list of all the specified files. Notice that the code references the Path variable you specified to state what folder glob should search for files, followed by code stating glob should include all .csv files and nothing else.

After this code executes, glob will create a list of all .csv files in the folder. It will have the form [1.csv, 2.csv, 3.csv, …, n.csv]. For those who’ve downloaded the companion data set and are following the tutorial, it will contain the full paths for PerformanceMap_HPWH_55.csv, PerformanceMap_HPWH_70.csv and PerformanceMap_HPWH_95.csv.

Now that you have a list of files, the next step is to create a for loop that iterates through each of these files. You then need to open the files so you can perform data analysis on each of them. To do so, you need the following code:

for Filename in Filenames:

    Data = pd.read_csv(Filename)

This code automatically iterates through every entry in the Filenames list. Note: the way we’ve written this loop, the variable Filename holds the actual file name of each entry in the list. For instance, on the first pass through the for loop, Filename will contain the full path of PerformanceMap_HPWH_55.csv.

The second line uses Pandas to read the file into memory, and saves it to Data for later analysis.
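Before pointing the loop at real data, it can help to dry-run the glob-and-read pattern in a sandbox. The temporary folder and toy wattage values below are hypothetical stand-ins for the companion data set:

```python
import glob
import os
import tempfile

import pandas as pd

# Build a throwaway folder with three small CSVs named like the real files.
folder = tempfile.mkdtemp()
for temp in (55, 70, 95):
    pd.DataFrame({'P_Elec (W)': [50, 450, 500]}).to_csv(
        os.path.join(folder, 'PerformanceMap_HPWH_%d.csv' % temp), index=False)

# Same pattern as the tutorial: glob builds the list, the loop reads each file.
Filenames = glob.glob(folder + '/*.csv')
for Filename in Filenames:
    Data = pd.read_csv(Filename)
```

Running this confirms that glob returns three full paths and that each file loads into a data frame before you wire in the real analysis.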

Now that the code has located the files and is opening them sequentially, the next step is to add code to analyze each file. 


How Do I Automatically Analyze Each File?

This requires a fair amount of knowledge about the equipment we’re talking about on top of knowing Python, so bear with me.

Filtering the Data to Only Contain the Important Portion

For this process, we only care about data when the heat pump in the HPWH is active. The heat pumps in these devices typically draw 400-600 watts (W) depending on the ambient and water temperatures. Meanwhile, HPWHs have on-board electronics that consume some electricity. To filter the data to the segment we care about we need to remove all data with electricity consumption less than 300 W, which is significantly higher than the power draw of the on-board electronics but under the minimum draw of the heat pump. We can do this with the following line:

Data = Data[Data['P_Elec (W)'] > 300]

That line resets our data frame to have only the data where the device is consuming over 300 W. However, that did impact the index of the data frame, so we need to reset that to keep our data frame clean. We can use the following code:

Data = Data.reset_index(drop = True)
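The effect of the filter and the index reset is easy to see on a toy frame; the wattage values here are made up for illustration:

```python
import pandas as pd

# Toy readings: low values mimic on-board electronics, high values the heat pump.
Data = pd.DataFrame({'P_Elec (W)': [12.0, 15.0, 480.0, 520.0, 510.0, 14.0]})

Data = Data[Data['P_Elec (W)'] > 300]   # keep only rows where the heat pump is running
Data = Data.reset_index(drop=True)      # restore a clean 0..n-1 index
```

After these two steps only the three heat-pump rows remain, re-indexed 0 through 2.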

Identifying the Change in Time Between Measurements

The timestamp data in this data set is not easy to work with. Fortunately, we know from collaborating with our lab testing partner that measurements were taken every 10 seconds. We can create a new column in our data frame that states how long the test has been active using the following code:

Data['Time Since Test Start (min)'] = Data.index * 10./60.
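On a small example frame (toy values again), the index-based arithmetic yields elapsed minutes in steps of one-sixth:

```python
import pandas as pd

Data = pd.DataFrame({'P_Elec (W)': [480.0, 500.0, 510.0]})

# Each row is one 10 s sample; index * 10 s / 60 gives minutes since the start.
Data['Time Since Test Start (min)'] = Data.index * 10./60.
```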

Calculating the Change in Energy Stored in the Water

One of the key parameters impacting the COP of HPWHs is the temperature of water in the storage tank. The water won’t be mixed well enough to hold a single, constant temperature. Typically there will be cold water on the bottom of the tank, and hot water at the top. For the sake of this exercise, it’s good enough to calculate the average temperature of the tank. Since our lab tester was kind enough to inform us they used eight temperature measurements evenly spaced throughout the tank, we can calculate the average water temperature using:

Data['Average Tank Temperature (deg F)'] = (1./8.) * (Data['T1 (deg F)'] + Data['T2 (deg F)'] + Data['T3 (deg F)'] + Data['T4 (deg F)'] + Data['T5 (deg F)'] + Data['T6 (deg F)'] + Data['T7 (deg F)'] + Data['T8 (deg F)'])
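The explicit eight-term sum above is equivalent to a row-wise mean, which scales better if the sensor count ever changes. A minimal sketch with one made-up row of readings:

```python
import pandas as pd

# One hypothetical row of eight sensor readings, 71-78 deg F.
Data = pd.DataFrame({'T%d (deg F)' % i: [70.0 + i] for i in range(1, 9)})

# Row-wise mean across the eight sensor columns, same result as the explicit sum.
cols = ['T%d (deg F)' % i for i in range(1, 9)]
Data['Average Tank Temperature (deg F)'] = Data[cols].mean(axis=1)
```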

Now, what we really care about here is how much the average temperature of the tank changes from one measurement time to another. This way we can identify the change in energy stored in the tank, and thus the energy added to the water by the heat pump. We can do this using the following two lines of code:

Data['Previous Average Tank Temperature (deg F)'] = Data['Average Tank Temperature (deg F)'].shift(periods = 1)

Data.loc[0, 'Previous Average Tank Temperature (deg F)'] = 72.0

The first line uses the .shift command of a pandas data frame to create a new column in the data frame containing the ‘Average Tank Temperature (deg F)’ data, but shifted down one row in the data frame. This creates an empty cell in the first row (Index 0) which causes errors when performing calculations. 

The second line of code rectifies this by using .loc to fill this cell with 72.0. We can do this because our friendly lab tester told us that the tests started precisely at 72.0 deg F every single time.
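The shift-then-backfill pattern is worth seeing in miniature; the three tank temperatures below are invented for the sketch:

```python
import pandas as pd

Data = pd.DataFrame({'Average Tank Temperature (deg F)': [72.5, 73.0, 73.6]})

# Shift the column down one row, then fill the empty first cell with the
# known 72.0 deg F starting condition.
Data['Previous Average Tank Temperature (deg F)'] = \
    Data['Average Tank Temperature (deg F)'].shift(periods=1)
Data.loc[0, 'Previous Average Tank Temperature (deg F)'] = 72.0
```

Row 0 now pairs 72.5 with the 72.0 starting temperature, and row 1 pairs 73.0 with 72.5, exactly the neighboring-measurement pairs the energy calculation needs.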

Now we can calculate the change in energy stored in the water between every two timestamps. To do this, we need to know a few constants and equations:

  • First, the equation to identify the change in energy in water is Energy = Mass * Specific_Heat * (Temperature_Final - Temperature_Initial).

  • Second, we know that the storage tank of the HPWH holds 80 gallons (again, thanks to the communication of our friendly lab tester).

  • Third, the density of water is 8.3176 lb/gal.

  • Fourth, the specific heat of water is 0.998 Btu/lb-F.

We can put it all together and calculate the change in stored energy using the following line:

Data['Change in Stored Energy (Btu)'] =  (80 * 8.3176) * (0.998) * (Data['Average Tank Temperature (deg F)'] - Data['Previous Average Tank Temperature (deg F)'])
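A quick spot-check of the constants helps catch unit mistakes. For a hypothetical 0.5 deg F temperature rise in one timestep, the formula gives roughly 332 Btu:

```python
mass_lb = 80 * 8.3176      # 80 gal tank at 8.3176 lb/gal
specific_heat = 0.998      # Btu/lb-F
delta_t = 0.5              # hypothetical 0.5 deg F rise in one timestep

energy_btu = mass_lb * specific_heat * delta_t
```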


Calculating the COP

The next step in analyzing each data set is calculating the COP as a function of the water temperature in the tank. The goal of this tutorial is to identify the COP as a function of both water temperature and ambient temperature, and this will provide an understanding of the COP as a function of water temperature at each specified ambient temperature.

To calculate the COP of the heat pump we need to perform some unit conversions. The electricity consumption is currently expressed in W while the energy added to the water is currently expressed in Btu/timestep. To make this unit conversion we use the ratio 1 W = 3.412142 Btu/hr, then convert Btu/hr to Btu/s and multiply by the 10 seconds per timestamp. This gives the code:

Data['P_Elec (Btu/10s)'] = Data['P_Elec (W)'] * (3.412142/60/60) * 10

The COP is by definition the amount of heat added to the water divided by the amount of electricity consumed. Thus, we can calculate it with:

Data['COP (-)'] = Data['Change in Stored Energy (Btu)'] / Data['P_Elec (Btu/10s)']
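Spot-checking one timestep by hand keeps the conversion honest. The 500 W draw and 12 Btu of added heat below are hypothetical round numbers:

```python
p_elec_w = 500.0                                        # hypothetical heat pump draw
p_elec_btu_10s = p_elec_w * (3.412142 / 60 / 60) * 10   # W -> Btu per 10 s timestep

change_stored_btu = 12.0                                # hypothetical heat added that step
cop = change_stored_btu / p_elec_btu_10s
```

500 W works out to about 4.74 Btu per 10-second step, so adding 12 Btu of heat in that step implies a COP of roughly 2.5, a plausible value for a heat pump water heater.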


Generating Regressions

Now we have a table showing the COP as a function of the water temperature at each of the three specified ambient temperatures. But we can do better than that. Wouldn’t it be nice to have a function we can use, so we can just enter the water temperature and identify the COP accordingly?

NumPy provides the tools to make this easy. We can use the NumPy function “polyfit” to identify the coefficients of a regression describing the COP as a function of the water temperature. It’s a flexible function, allowing you to control the shape of the curve by specifying the order of the function at the end. Since the COP of a heat pump as a function of temperature is a parabolic curve, we need a second order regression for this example. Thus, the following line identifies the coefficients: 

Coefficients = np.polyfit(Data['Average Tank Temperature (deg F)'], Data['COP (-)'], 2)

We can use the NumPy “poly1d” function to create a regression using those coefficients with:

Regression = np.poly1d(Coefficients)

Now you can identify the COP of the heat pump at a specified water temperature using this regression. Remember the regression is only generated for a specific air temperature, so only estimate the COP using the regression for the correct air temperature. 

(Creating a 2-d performance map is the ultimate goal of this tutorial, but we aren’t there yet.)

We can identify the COP at a specified water temperature by calling the function and using the water temperature as an input. For instance, if you want to find the COP when the water temperature is 72 °F, you can enter:

COP_72 = Regression(72.0)
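One way to build confidence in the polyfit/poly1d pipeline is to fit a parabola whose coefficients you already know and confirm the regression reproduces it. The temperatures and coefficients below are synthetic, not real HPWH data:

```python
import numpy as np

# Synthetic parabolic COP curve with known coefficients.
temps = np.array([70.0, 90.0, 110.0, 130.0, 150.0])
cops = -0.0005 * temps**2 + 0.05 * temps + 2.0

# Second-order fit, then wrap the coefficients in a callable polynomial.
Coefficients = np.polyfit(temps, cops, 2)
Regression = np.poly1d(Coefficients)
```

Evaluating Regression(72.0) returns the same value as plugging 72 into the original parabola, which confirms the fit recovered the known coefficients.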



How Do I Save Python Data Analysis Results?

The data can be saved using the same techniques we’ve been using. We need to: 

  1. Ensure that a folder for analyzed results is available.

  2. Create a new file name that clearly states what the file contains.

  3. Save the data.

In this case we want to save the data to a new folder called “Analyzed.” It should sit inside the same data folder and hold the results of the analysis. We can do this with the following code:

Folder = Path + '\\Analyzed'

if not os.path.exists(Folder):
    os.mkdir(Folder)

The first line creates the path for the new folder. It adds ‘\Analyzed’ to the currently existing path and states it’s looking for a folder called “Analyzed” within the current folder. The second line determines whether or not that folder already exists. If it doesn’t, the third line creates it.

After that, we need to set the file names for both the data set and the coefficients. We can do this by combining what we already have with a subsection of the strings. We can use string indexes to identify the portion of the file name that we want to keep. 

For instance, the section of the file name for the first file that says “PerformanceMap_HPWH_55” states quite clearly what the file contains. Since we know that the last four characters of the file names are ‘.csv’ we can isolate that section of the string by using the indices [-26:-4]. In other words, we want the characters of the string running from “26th to last” to “4th to last,” not including the 4th to last.

Following that we can customize the file names a bit. Namely, we can state that we want the data file name to state it contains analyzed data, and we want the coefficient file name to state it contains coefficients. We can write the file name for both files using the following lines:

Filename_Test = Folder + '\\' + Filename[-26:-4] + '_Analyzed.csv'

Filename_Coefficients = Folder + '\\Coefficients_' +  Filename[-6:]
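The slice is easy to verify on the bare file name, and a position-independent alternative via os.path avoids hard-coded indices if the naming convention ever changes. The directory-free example name below is just for illustration:

```python
import os

Filename = 'PerformanceMap_HPWH_55.csv'   # illustrative name with no directory part

stem_by_slice = Filename[-26:-4]          # characters 26th-from-last up to '.csv'
stem_by_os = os.path.splitext(os.path.basename(Filename))[0]   # position-independent
```

Both approaches recover “PerformanceMap_HPWH_55”; the os.path version keeps working even if the stem length changes.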

Then we simply save the files. We can save the analyzed data with the Pandas .to_csv function and we can save the coefficients with the NumPy .tofile function as follows:

Data.to_csv(Filename_Test, index = False)

Coefficients.tofile(Filename_Coefficients, sep = ',')

Note that the line saving the data sets index = False. This means that the index of the data frame will not be saved when saving the table. Also note that the NumPy .tofile function requires you to specify a separator. In this case we’re using a comma, as specified with the sep = ',' code.
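A round-trip through a temporary file shows that comma-separated coefficients saved with .tofile can be reloaded later with np.fromfile. The coefficient values here are hypothetical:

```python
import os
import tempfile

import numpy as np

Coefficients = np.array([-0.0005, 0.05, 2.0])   # hypothetical regression coefficients

# Write as comma-separated text, then read the values back.
path = os.path.join(tempfile.mkdtemp(), 'Coefficients_55.csv')
Coefficients.tofile(path, sep=',')
reloaded = np.fromfile(path, sep=',')
```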


How Do I Check My Python Data Mining Work?

There are an enormous number of things that could have gone wrong by this point in the process. Maybe the lab tester made some mistakes while running the experiments. Maybe an instrument broke. Maybe there’s a typo in the code, or an incorrect unit conversion.

It’s imperative to ensure none of these problems, or any others, occurred during the process. Therefore the next step in the process is checking the data set for errors. We’ll cover this in the next step of the tutorial. 


* * * 

This multi-part tutorial will teach you all the skills you need to automate your laboratory data analysis and develop a performance map of heat pump water heaters. You can find the rest of the series here:

Need to Automate Your Data Analysis? Here’s How to Get Started.

How to Split Unmanageable Data Sets

How to Check Your Data Analysis for Errors

Don't DIY. Write Python Scripts to Check Data Quality for You. 

Generate Regressions in Python — Automatically!
