How to Speed Up Your Pandas Code by 10x

OK, actually just get out of Pandas. Try NumPy, instead.

Written by Peter Grant
Published on Dec. 16, 2021

As we all know, Pandas is a fantastic data science tool. It provides us with the DataFrame structure we need and powerful computational abilities, all in a user-friendly format. It even has quality documentation and a large support network, which makes it an easy library to learn. Pandas is all around excellent.

But Pandas isn’t particularly fast.

When you’re dealing with many computations and your processing method is slow, the program takes a long time to run. If you’re dealing with millions of computations, your total computation time stretches on and on and on.

Need To Speed Up Pandas? Try NumPy Instead.

NumPy has all of the computation capabilities of Pandas, but uses precompiled, optimized methods. This means NumPy can be significantly faster than Pandas. Converting a DataFrame from Pandas to NumPy is relatively straightforward. You can use the DataFrame's .to_numpy() method to convert it, then create a dictionary of the column names to enable accessing each cell (similar to the Pandas .loc indexer).

This is a common problem in the work I do. When I develop simulation models representing common equipment in buildings, I create functions emulating the heat transfer processes and control logic decisions in a given piece of equipment. I then pass data describing the building conditions and occupant behavior choices into those models. The models predict what the equipment will do, how well it will satisfy the occupants' needs and how much energy it will consume.

To do that, the models need to be time-based and be able to calculate what happens at a given point in the simulation. Only then can the model move on to the next set of calculations. In other words, the outputs at one particular time are the inputs for the next time. For example, imagine predicting the temperature in your oven at any point in time. Is it currently heating? How much has the temperature increased since the last time you checked? What was the temperature at that time?
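The oven example can be sketched as a short time-stepping loop. This is only an illustration of the dependence on the previous time step, not one of the actual simulation models: the function name, the heating and cooling rates, and the on/off control rule are all made-up assumptions.

```python
def simulate_oven(setpoint_c, start_temp_c, n_steps):
    """Return the oven temperature at each time step (hypothetical model)."""
    heat_gain_c = 5.0  # assumed temperature rise per step while heating
    heat_loss_c = 1.0  # assumed temperature drop per step while idle
    temps = [start_temp_c]
    for step in range(1, n_steps):
        previous = temps[step - 1]       # last step's output is this step's input
        heating = previous < setpoint_c  # simple on/off control decision
        change = heat_gain_c if heating else -heat_loss_c
        temps.append(previous + change)
    return temps


temperatures = simulate_oven(setpoint_c=180.0, start_temp_c=20.0, n_steps=50)
print(temperatures[:5])
```

Each pass through the loop needs the result of the pass before it, so the steps have to be calculated one at a time, in order.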

This dependence on the previous time leads to a problem: We can’t use vector calculations. We must use the dreaded for loops. For loops are slow. So what do we do?

Try NumPy

One solution (whether or not it’s possible to vectorize calculations) is to convert your calculations to NumPy. NumPy has all of the computation capabilities of Pandas, but performs them without carrying as much overhead information while also using precompiled, optimized methods.

According to Sofia Heisler at Upside Engineering Blog, NumPy performs a lot of its background work using precompiled C code. Because that code is already compiled and includes pre-programmed speed optimizations, NumPy can run significantly faster than Pandas. Additionally, NumPy drops a lot of the information you find in Pandas. Pandas keeps track of data types and indexes and performs error checking, all of which are very useful but also slow down the calculations. NumPy carries far less of that overhead, so it can perform the same calculations significantly faster.
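One way to see this overhead for yourself is to time the same loop with both libraries. The sketch below is a rough comparison, not a benchmark of any real model: the column name, the row count and the helper function names are arbitrary choices made for illustration, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

n_rows = 2_000
frame = pd.DataFrame({"value": np.arange(n_rows, dtype="float64")})
array = frame["value"].to_numpy()


def sum_with_pandas():
    total = 0.0
    for i in range(n_rows):
        total += frame.loc[i, "value"]  # label-based lookup on every iteration
    return total


def sum_with_numpy():
    total = 0.0
    for i in range(n_rows):
        total += array[i]  # plain positional lookup
    return total


pandas_seconds = timeit.timeit(sum_with_pandas, number=5)
numpy_seconds = timeit.timeit(sum_with_numpy, number=5)
print(f"Pandas loop: {pandas_seconds:.4f} s")
print(f"NumPy loop:  {numpy_seconds:.4f} s")
```

Both functions do the same arithmetic, so any difference in the printed times comes from how each library looks up a single value.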

There are multiple ways to convert Pandas data to NumPy.

You can convert a Series using the .values attribute. This returns the same data as a NumPy array.

Here’s an example: 

import pandas as pd
Series_Pandas = pd.Series(data=[1, 2, 3, 4, 5, 6])
Series_Numpy = Series_Pandas.values
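The Pandas documentation recommends the .to_numpy() method over .values for this conversion. A minimal sketch of the same example using it (the lowercase variable names are just for this illustration):

```python
import numpy as np
import pandas as pd

series_pandas = pd.Series(data=[1, 2, 3, 4, 5, 6])
series_numpy = series_pandas.to_numpy()  # same data, returned as a NumPy array

print(type(series_numpy))
print(series_numpy.dtype)
```

For a simple integer Series like this one, both approaches give you an array with the same values.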

You can convert a DataFrame using the .to_numpy() method. This creates a NumPy array with the same values (in the example below, an array of int64 values). Note this does not keep the column names, so you need to create a dictionary converting your Pandas column names to NumPy column numbers. You can accomplish this with the following code:

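import pandas as pd

# Build a small example DataFrame with two named columns.
Dataframe_Pandas = pd.DataFrame(data=[[0, 1], [2, 3], [4, 5]], columns=['First Column', 'Second Column'])
# Convert to a NumPy array; the column names are not carried over.
Dataframe_Numpy = Dataframe_Pandas.to_numpy()
# Map each Pandas column name to its column number in the array.
Column_Index_Dictionary = dict(zip(Dataframe_Pandas.columns, range(len(Dataframe_Pandas.columns))))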

That code converts the DataFrame to a NumPy array of int64 values and provides all of the tools you need to iterate through each row and edit values in specific columns, all in a user-friendly manner. You can access each cell in a manner similar to the Pandas .loc indexer with NumPy indexing by following the structure array[row, Dictionary['Pandas Column Name']]

For instance, if you want to set the value in the first row of Second Column to nine, you can use the following code:

Dataframe_Numpy[0, Column_Index_Dictionary['Second Column']] = 9
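The same indexing pattern works inside a for loop where each row depends on the one before it. The sketch below is a self-contained illustration using the example DataFrame. The update rule is hypothetical, and the extra .copy() is a precaution I've added because newer Pandas versions may return a read-only view from .to_numpy().

```python
import pandas as pd

dataframe_pandas = pd.DataFrame(
    data=[[0, 1], [2, 3], [4, 5]],
    columns=["First Column", "Second Column"],
)
# Copy so the array is writable and independent of the DataFrame.
dataframe_numpy = dataframe_pandas.to_numpy().copy()
column_index = {name: i for i, name in enumerate(dataframe_pandas.columns)}

first = column_index["First Column"]
second = column_index["Second Column"]

# Hypothetical rule: each row's Second Column becomes the previous row's
# Second Column plus the current row's First Column.
for row in range(1, dataframe_numpy.shape[0]):
    dataframe_numpy[row, second] = (
        dataframe_numpy[row - 1, second] + dataframe_numpy[row, first]
    )

# Put the results back into a DataFrame with the original column names.
result = pd.DataFrame(dataframe_numpy, columns=dataframe_pandas.columns)
print(result)
```

Looking up the column numbers once before the loop keeps the dictionary lookups out of the part of the code that runs for every row.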

How Much Faster Does This Make My Code?

Speed will vary from one project to the next; some scripts will see more improvement by switching to NumPy than others. It depends on the types of calculations used in your script and the percentage of all calculations converted to NumPy. That said, the results can be drastic.

For example, I recently used this process to convert one of my simulation models from a Pandas base to a NumPy base. The original, Pandas-based model required 362 seconds (about six minutes) to perform an annual simulation. That’s not bad if you’re running one simulation with the model, but what if you’re running a thousand? That’s about 100 hours. After converting the core of the model to NumPy, the same annual simulation required 32 seconds to calculate. Those same 1,000 simulations would take only about nine hours with NumPy.

That’s 9 percent as much time to do the same thing. In this instance, I saw a more than 10x increase in speed after converting my code from Pandas to NumPy. 
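If you want to check the arithmetic, it only takes a few lines. The two run times come from the simulations described above, and the variable names are just for this sketch.

```python
pandas_seconds = 362   # one annual simulation, Pandas-based model
numpy_seconds = 32     # one annual simulation, NumPy-based model
n_simulations = 1000

pandas_hours = pandas_seconds * n_simulations / 3600
numpy_hours = numpy_seconds * n_simulations / 3600
speedup = pandas_seconds / numpy_seconds
time_fraction = numpy_seconds / pandas_seconds

print(f"Pandas: {pandas_hours:.1f} hours")        # Pandas: 100.6 hours
print(f"NumPy:  {numpy_hours:.1f} hours")         # NumPy:  8.9 hours
print(f"Speedup: {speedup:.1f}x")                 # Speedup: 11.3x
print(f"Time required: {time_fraction:.1%}")      # Time required: 8.8%
```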
