Don’t DIY. Use Python Scripts to Check Data Quality for You.

Here’s how to write Python scripts to check your data for errors (minus the tedium of doing it yourself).

Written by Peter Grant
Published on Sep. 09, 2021

We’re in the process of writing Python scripts that will automatically analyze your data for you and store it with meaningful, intuitive file names to create a performance map. We're using an example taken from actual research, so you know that the skills you’re developing are useful in a real-world scenario.

In case you missed it, part one introduced the concepts behind the tutorial. Part two introduced the companion data set and split the data into multiple files with user-friendly names.

In part three we created a script that analyzed each of the individual data files. Finally, in part four, I showed you how to create plots of the data, and how to examine those plots to ensure that the data provided by the laboratory was useful, and—perhaps most importantly—the data analysis process was correct. While this process was valuable, it was also manual. This is a tutorial teaching you to write programs to do that work for you, after all!

That’s what we'll cover in part five of this tutorial. We’ll amend our existing script to include logic that automatically identifies tests which probably contain errors. We’ll then save the file names of these tests in a .csv file, which we can use to explore which tests require further examination.

5 Steps to Checking Your Data for Errors

  1. Create a data frame file to store the information about potentially erroneous tests.
  2. Input expected values for each parameter, as well as thresholds for how much difference we’ll accept.
  3. After the program analyzes each test, create a series of if statements to see if the data falls inside or outside the accepted ranges.
  4. If the data falls outside of the accepted range, add information about that test to the data frame file describing potentially erroneous tests.
  5. Finally, save the data frame as a .csv file to make the information available for use after we complete the analysis.

If this seems abstract, don’t worry. It will all make sense after we walk through an example.
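Before diving in, here’s how the five steps fit together in miniature. Every file name, column and number in this sketch is invented for illustration; the real column headers and thresholds come later in the tutorial:

```python
import pandas as pd

# Step 1: a data frame to collect information about potentially erroneous tests
suspicious = pd.DataFrame(columns=['Filename', 'Code', 'Value'])

# Step 2: an expected value and an accepted tolerance (numbers here are invented)
expected_temp = 140.0
tolerance = 1.8

# Steps 3 and 4: after analyzing each test, flag results that fall out of range
for filename, measured_temp in [('Test_A.csv', 140.5), ('Test_B.csv', 144.0)]:
    if abs(measured_temp - expected_temp) > tolerance:
        row = pd.DataFrame([{'Filename': filename,
                             'Code': 'Water Temperature',
                             'Value': measured_temp - expected_temp}])
        suspicious = pd.concat([suspicious, row], ignore_index=True)

# Step 5: save the flagged tests for review after the analysis completes
suspicious.to_csv('SuspiciousTests.csv', index=False)
```

Here only Test_B lands in the output, because its 4 °F deviation exceeds the 1.8 °F tolerance.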



How Does the Data Frame Store Information?

We’re working with a standard Pandas data frame. Once we create the data frame, it’ll be available as a structure so we can append information to it later.

The important part of creating this data frame is to ensure we have the correct headers. We’ll need a column for each piece of information we want to include in the data frame. The columns I’ve personally found useful for this are:

  • File name: This presents the name of the file with the suspicious results. Including this information tells us which tests to review further after we complete the analysis.

  • Test parameters: This isn’t always necessary. If you include the test parameters in descriptive file names then this information will appear in the file name itself. If not, then including this column provides more information about potentially erroneous tests. I’ll include this column in the tutorial in case you want to use it in future analyses.

  • Code: This creates a column used to store a code stating the error. For instance, if a test included temperature measurements that were out of the specified range, this column could say something like “Water Temperature.”

  • Value: This column reports the value of the data that fell outside of the accepted range. You can use it to gauge whether the data was far outside of the range, or just barely beyond the specification.

With all this in mind, the code to create the data frame is as follows:

SuspiciousTests_Test = pd.DataFrame(columns = ['Filename', 'Test Parameters', 'Code', 'Value'])

Note that this is being added to the script we’ve used previously, where Pandas has already been imported as pd and NumPy as np.


What Values Do We Add?

From a big picture point of view, we need to add all expected and threshold values required to check the data quality. If we’re concerned about air temperature, we need an expected and threshold value for that. If we’re concerned about electricity consumption, we need values for that. The point is: you must think through the components that are most important for your project and provide the information necessary to check those data sets.

For example, in part four of this tutorial, we used plots to check the data visually. We checked the starting water temperature, final water temperature and average ambient temperature. (We also checked the regression quality, but we’ll leave that out for now.) We'll also check for the maximum electricity consumption.

To complete these checks automatically we need to specify the expected and threshold values for:

  • Initial water temperature

  • Final water temperature

  • Water temperature standard deviation

  • Ambient temperature

  • Ambient temperature standard deviation

  • Maximum electricity demand

We can specify the required values with the following code:

#Constants describing the tests
Temperature_Water_Initial_Setpoint = 72
Temperature_Water_Final_Setpoint = 140
#Test rejection thresholds
Threshold_Temperature_Water_Initial = 1.8
Threshold_Temperature_Water_Final = 1.8
Threshold_StDev_Temperature_Water = 0.5
Threshold_Temperature_Ambient = 2
Threshold_StDev_Temperature_Ambient = 1
Threshold_P_Elec = 1000

Notice I haven’t specified an expected value for the ambient temperature. This is because the expected value varies across tests, and the code will read it out of the data file name.


Write If Statements to Check for Errors

This is likely the most straightforward part of the process. We only need to write if statements comparing the values identified in the data to the values specified above. If the data falls out of the accepted range, then the if statement will be true and the script will identify a potential error.

Before programming if statements we need to identify the data required to process them. Some of this information comes from the file name (specifically, the ambient temperature set point) and some of it is calculated from the data.

You can program your script to read the ambient temperature set point with this code:

Temperature_Ambient_Set = float(Filename[-6:-4])

You can then calculate the ambient temperature average and standard deviation as follows:

Temperature_Ambient_Avg = Data['T_Amb (deg F)'].mean()
Temperature_Ambient_StDev = np.std(Data['T_Amb (deg F)'])

You can pull the maximum electricity consumption out of the data file with:

P_Elec_Max = Data['P_Elec (W)'].max()

Your script can read the initial and final water temperatures from the data files using:

Temperature_Water_Initial = Data.loc[0, 'Average Tank Temperature (deg F)']
Temperature_Water_Final = Data.loc[Data.index.max(), 'Average Tank Temperature (deg F)']
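If you want to sanity-check these extraction steps, you can run them against a small synthetic stand-in for one test’s data file. The column names follow the tutorial, but the file name pattern and the values are invented here:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one test's data file (column names from the tutorial)
Data = pd.DataFrame({
    'T_Amb (deg F)': [55.2, 54.8, 55.0, 55.4],
    'P_Elec (W)': [480.0, 495.0, 510.0, 505.0],
    'Average Tank Temperature (deg F)': [72.3, 95.0, 120.6, 139.1],
})
# File name assumed to end in the ambient set point, e.g. '..._55.csv'
Filename = 'PerformanceMap_HPWH_55.csv'

Temperature_Ambient_Set = float(Filename[-6:-4])
Temperature_Ambient_Avg = Data['T_Amb (deg F)'].mean()
Temperature_Ambient_StDev = np.std(Data['T_Amb (deg F)'])
P_Elec_Max = Data['P_Elec (W)'].max()
Temperature_Water_Initial = Data.loc[0, 'Average Tank Temperature (deg F)']
Temperature_Water_Final = Data.loc[Data.index.max(), 'Average Tank Temperature (deg F)']
```

With this input, the slice `Filename[-6:-4]` pulls the characters '55' out of the file name, and the first and last rows of the tank temperature column give the initial and final water temperatures.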

The standard deviation of water temperature in the tank is a bit harder. You can calculate it using the Pandas .std() function, but since that function operates on every column in a data frame, we first need a separate data frame holding only the tank temperature columns. Thus, you can calculate the standard deviation using the following lines of code:

StDev_Calculation = pd.DataFrame()
StDev_Calculation['T1 (deg F)'] = Data['T1 (deg F)']
StDev_Calculation['T2 (deg F)'] = Data['T2 (deg F)']
StDev_Calculation['T3 (deg F)'] = Data['T3 (deg F)']
StDev_Calculation['T4 (deg F)'] = Data['T4 (deg F)']
StDev_Calculation['T5 (deg F)'] = Data['T5 (deg F)']
StDev_Calculation['T6 (deg F)'] = Data['T6 (deg F)']
StDev_Calculation['T7 (deg F)'] = Data['T7 (deg F)']
StDev_Calculation['T8 (deg F)'] = Data['T8 (deg F)']
StDev_Calculation['Standard Deviation'] = StDev_Calculation.std(axis = 1)
StDev_Max = StDev_Calculation['Standard Deviation'].max()
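To convince yourself the calculation behaves as expected, you can run the same idea on a tiny synthetic frame where the answer is known in advance. The T1–T8 column names follow the tutorial; the temperature values are made up (the compact column selection below is equivalent to copying the columns one at a time):

```python
import pandas as pd

# Tiny synthetic frame: row 0 has all eight probes equal, row 1 spreads them out
Temperature_Columns = ['T{} (deg F)'.format(i) for i in range(1, 9)]
Data = pd.DataFrame([[120.0] * 8,
                     [100.0 + i for i in range(8)]], columns=Temperature_Columns)

StDev_Calculation = Data[Temperature_Columns].copy()
StDev_Calculation['Standard Deviation'] = StDev_Calculation[Temperature_Columns].std(axis=1)
StDev_Max = StDev_Calculation['Standard Deviation'].max()
```

Row 0 has zero spread, so its standard deviation is 0; row 1 holds the values 100 through 107, whose sample standard deviation is sqrt(6), and that becomes StDev_Max.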

Now that all of the values are identified and available, the next step is to insert the if statements. Remember, this is a matter of creating one if statement to check for each potential error. That way, if the script identifies a potential error it can tell which error it is and provide the most useful information in the results. You can program the if statements using:

if (abs(Temperature_Water_Initial - Temperature_Water_Initial_Setpoint) > Threshold_Temperature_Water_Initial):
if (abs(Temperature_Water_Final - Temperature_Water_Final_Setpoint) > Threshold_Temperature_Water_Final):
if (StDev_Max > Threshold_StDev_Temperature_Water):
if (abs(Temperature_Ambient_Set - Temperature_Ambient_Avg) > Threshold_Temperature_Ambient):
if (Temperature_Ambient_StDev > Threshold_StDev_Temperature_Ambient):
if (P_Elec_Max > Threshold_P_Elec):

These if statements provide the structure necessary to check for potential errors in the data. In the next section we'll add content within each of the if statements to provide useful data describing the identified problems.



Describing the Potential Errors

Each of the previous if statements needs a few lines of code describing the identified potential problem. Two pieces of information, the file name and the ambient temperature condition of the test, are the same for every error, so they can be set using the exact same lines every time, outside of the if statements.

The best way I’ve found to do this is to create an identical, temporary data frame to hold the information, then append it to the real data frame. We can create the temporary data frame, containing the file name and ambient temperature set point, using the following code:

Temp = pd.DataFrame(columns = ['Filename', 'Test Parameters', 'Code', 'Value'])
Temp.loc[0, 'Filename'] = Filename
Temp.loc[0, 'Test Parameters'] = 'T_Amb = ' + str(Temperature_Ambient_Set)

This creates a temporary data frame, containing the file name and ambient temperature set point in its first row, that we can append to SuspiciousTests.

After that it’s a matter of adding content within each if statement to state the appropriate code for that error, state the value of the data that’s outside the bounds and append the data to the SuspiciousTests data frame. We do this for the initial water temperature with the following code:

Temp.loc[0, 'Code'] = 'T_Water_Initial'
Temp.loc[0, 'Value'] = str(Temperature_Water_Initial - Temperature_Water_Initial_Setpoint)
SuspiciousTests = pd.concat([SuspiciousTests, Temp], ignore_index = True)

For the final water temperature:

Temp.loc[0, 'Code'] = 'T_Water_Final'
Temp.loc[0, 'Value'] = str(Temperature_Water_Final - Temperature_Water_Final_Setpoint)
SuspiciousTests = pd.concat([SuspiciousTests, Temp], ignore_index = True)

For the standard deviation of the water temperature:

Temp.loc[0, 'Code'] = 'T_Water_StDev'
Temp.loc[0, 'Value'] = str(StDev_Max)
SuspiciousTests = pd.concat([SuspiciousTests, Temp], ignore_index = True)

For the average ambient temperature:

Temp.loc[0, 'Code'] = 'T_Amb_Avg'
Temp.loc[0, 'Value'] = str(Temperature_Ambient_Set - Temperature_Ambient_Avg)
SuspiciousTests = pd.concat([SuspiciousTests, Temp], ignore_index = True)

For the standard deviation of ambient temperature:

Temp.loc[0, 'Code'] = 'T_Amb_StDev'
Temp.loc[0, 'Value'] = str(Temperature_Ambient_StDev)
SuspiciousTests = pd.concat([SuspiciousTests, Temp], ignore_index = True)

And for the electric power demand:

Temp.loc[0, 'Code'] = 'P_Elec_Max'
Temp.loc[0, 'Value'] = str(P_Elec_Max)
SuspiciousTests = pd.concat([SuspiciousTests, Temp], ignore_index = True)
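Taken together, one complete check looks like the condensed sketch below. The variable and threshold names mirror the tutorial, but the input values are invented, and only one of the six checks is shown:

```python
import pandas as pd

SuspiciousTests = pd.DataFrame(columns=['Filename', 'Test Parameters', 'Code', 'Value'])

# Invented results for a single test; normally these come from the data file
Filename = 'PerformanceMap_HPWH_55.csv'
Temperature_Ambient_Set = 55.0
Temperature_Water_Initial = 75.0
Temperature_Water_Initial_Setpoint = 72
Threshold_Temperature_Water_Initial = 1.8

# Shared information: file name and test parameters, set once outside the ifs
Temp = pd.DataFrame(columns=['Filename', 'Test Parameters', 'Code', 'Value'])
Temp.loc[0, 'Filename'] = Filename
Temp.loc[0, 'Test Parameters'] = 'T_Amb = ' + str(Temperature_Ambient_Set)

# One error check, with its describing lines inside the if statement
if abs(Temperature_Water_Initial - Temperature_Water_Initial_Setpoint) > Threshold_Temperature_Water_Initial:
    Temp.loc[0, 'Code'] = 'T_Water_Initial'
    Temp.loc[0, 'Value'] = str(Temperature_Water_Initial - Temperature_Water_Initial_Setpoint)
    SuspiciousTests = pd.concat([SuspiciousTests, Temp], ignore_index=True)
```

Because the initial water temperature misses the set point by 3 °F, which exceeds the 1.8 °F threshold, this test is recorded in SuspiciousTests.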


How Do I Save the File?

It’s easy to save the file using the techniques described in part two of this tutorial. In this case we want to save a single data frame as a .csv file, which will make the process much simpler. To save the file to a .csv named SuspiciousTests in the same folder we referenced as Path earlier in the tutorial, use:

SuspiciousTests.to_csv(Path + '\\SuspiciousTests.csv', index = False)
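The backslash in the line above ties the path to Windows. If you want the script to run on any operating system, os.path.join builds the same path with the correct separator (Path here is a hypothetical stand-in for the folder used earlier in the tutorial):

```python
import os
import pandas as pd

# Hypothetical stand-ins for the tutorial's Path and results data frame
Path = os.getcwd()
SuspiciousTests = pd.DataFrame(columns=['Filename', 'Test Parameters', 'Code', 'Value'])

# os.path.join inserts the correct separator for whatever OS runs the script
output_file = os.path.join(Path, 'SuspiciousTests.csv')
SuspiciousTests.to_csv(output_file, index=False)
```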


Now What?

Now when you use your script to analyze all of the experimental data and generate the regressions, the script will also identify potential problems in the tests or data analysis. This information will be available in the SuspiciousTests data frame for later review, giving you guidance on which data files you should review more thoroughly. That sounds much better than manually checking every single one!

Now that we’ve confirmed our laboratory data and data analysis are sound, we can move on to making use of the data. Since the goal of this tutorial is to create a performance map predicting the COP of a heat pump, as described in the introduction, we’ll move on to doing exactly that. Together we’ll create a two-dimensional regression stating the COP as a function of both the water and ambient temperatures. We’ll cover that topic in the sixth (and final!) part of this series.


* * *

This multi-part tutorial will teach you all the skills you need to automate your laboratory data analysis and develop a performance map of heat pump water heaters. You can find the rest of the series here:

Need to Automate Your Data Analysis? Here’s How to Get Started.

How to Split Unmanageable Data Sets

These Python Scripts Will Automate Your Data Analysis

How to Check Your Data Analysis for Errors

Generate Regressions in Python—Automatically!
