As we all know, Pandas is a fantastic data science tool. It provides us with the DataFrame structure we need and powerful computational abilities, all in a user-friendly format. It even has quality documentation and a large support network, which makes it an easy library to learn. Pandas is all around excellent.
But Pandas isn’t particularly fast.
When you’re dealing with many computations and your processing method is slow, the program takes a long time to run. If you’re dealing with millions of computations, your total computation time stretches on and on.
Need To Speed Up Pandas? Try NumPy Instead.
This is a common problem in the work I do. When I develop simulation models representing common equipment in buildings, I create functions emulating the heat transfer processes and control logic decisions in a given piece of equipment. I then pass data describing the building conditions and occupant behavior choices into those models. The models predict what the equipment will do, how well it will satisfy the occupants’ needs and how much energy it will consume.
To do that, the models need to be time-based and be able to calculate what happens at a given point in the simulation. Only then can the model move on to the next set of calculations. In other words, the outputs at one particular time are the inputs for the next time. For example, imagine predicting the temperature in your oven at any point in time. Is it currently heating? How much has the temperature increased since the last time you checked? What was the temperature at that time?
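Here is a minimal sketch of that oven example. Every constant in it (setpoint, heating rate, loss coefficient) is a made-up value for illustration, not a number from one of my models. The point is only that each step reads the result of the step before it.

```python
import numpy as np

# Hypothetical oven parameters, chosen only for illustration.
SETPOINT_C = 180.0           # target temperature
AMBIENT_C = 20.0             # kitchen temperature
HEAT_RATE_C_PER_MIN = 15.0   # temperature rise per minute while the element is on
LOSS_COEFF_PER_MIN = 0.02    # share of the oven-to-room difference lost per minute

n_steps = 60  # one-minute time steps
temperature = np.empty(n_steps)
temperature[0] = AMBIENT_C

for t in range(1, n_steps):
    previous = temperature[t - 1]        # output of the last step is this step's input
    heating = previous < SETPOINT_C      # simple on/off control decision
    gain = HEAT_RATE_C_PER_MIN if heating else 0.0
    loss = LOSS_COEFF_PER_MIN * (previous - AMBIENT_C)
    temperature[t] = previous + gain - loss
```

Because the control decision at each step depends on the temperature computed one step earlier, the steps have to run in order.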
This dependence on the previous time leads to a problem: We can’t use vector calculations. We must use the dreaded for loops. For loops are slow. So what do we do?
Try NumPy
One solution (whether or not it’s possible to vectorize calculations) is to convert your calculations to NumPy. NumPy provides the core numerical capabilities that Pandas itself is built on, but performs them without carrying as much overhead information while also using precompiled, optimized methods.
According to Sofia Heisler at Upside Engineering Blog, NumPy does a lot of its background work using precompiled C code. This precompiled C code makes NumPy significantly faster than Pandas because the heavy computation has already been compiled and includes pre-programmed speed optimizations. Additionally, NumPy drops a lot of the information you find in Pandas. Pandas keeps track of index labels and per-column data types and performs extra checks on each operation, all of which are very useful but also slow down the calculations. A NumPy array carries a single data type and no labels, so it can perform the same calculations significantly faster.
There are multiple ways to convert Pandas data to NumPy.
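You can convert a series using the .values attribute. This creates a NumPy array holding the same values as the series. The .to_numpy() method does the same job, and it is the option the Pandas documentation recommends.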
Here’s an example:
import pandas as pd
Series_Pandas = pd.Series(data=[1, 2, 3, 4, 5, 6])
Series_Numpy = Series_Pandas.values
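As a quick check, the sketch below converts the same series both ways and compares the results. The variable names are my own, chosen for illustration.

```python
import numpy as np
import pandas as pd

series_pandas = pd.Series(data=[1, 2, 3, 4, 5, 6])

via_values = series_pandas.values        # attribute access
via_to_numpy = series_pandas.to_numpy()  # method call

# Both results are plain NumPy arrays with the same contents;
# the Pandas index is not carried over.
print(type(via_to_numpy))
print(np.array_equal(via_values, via_to_numpy))
```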
You can convert a DataFrame using the .to_numpy() method. This creates a NumPy array with the same values (its data type is int64 in the example below). Note this does not keep the column names, so you need to create a dictionary mapping your Pandas column names to NumPy column numbers. You can accomplish this with the following code:
import pandas as pd
import numpy as np
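Dataframe_Pandas = pd.DataFrame(data=[[0, 1], [2, 3], [4, 5]], columns=['First Column', 'Second Column'])
# copy=True returns a writable array that is independent of the DataFrame
Dataframe_Numpy = Dataframe_Pandas.to_numpy(copy=True)
# Map each Pandas column name to its column position in the NumPy array
Column_Index_Dictionary = {name: position for position, name in enumerate(Dataframe_Pandas.columns)}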
That code converts the DataFrame to a NumPy array and provides all of the tools you need to iterate through each row and edit values in specific columns, all in a user-friendly manner. You can access each cell in a manner similar to the Pandas .loc indexer by using NumPy indexing with the structure array[row, Dictionary['Pandas Column Name']].
For instance, if you want to set the value in the first row of Second Column to nine, you can use the following code:
Dataframe_Numpy[0, Column_Index_Dictionary['Second Column']] = 9
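The same indexing pattern works inside a loop where each row depends on the row before it. The sketch below is a toy illustration rather than one of my models: the update rule is invented, and the lowercase variable names are my own. It also shows one way to turn the array back into a DataFrame at the end.

```python
import pandas as pd

dataframe_pandas = pd.DataFrame(
    data=[[0, 1], [2, 3], [4, 5]],
    columns=["First Column", "Second Column"],
)

# copy=True gives a writable array that does not share memory with the DataFrame.
dataframe_numpy = dataframe_pandas.to_numpy(copy=True)
column_index = {name: i for i, name in enumerate(dataframe_pandas.columns)}

first = column_index["First Column"]
second = column_index["Second Column"]

# Invented rule: each row's Second Column becomes the previous row's
# Second Column plus the current row's First Column.
for row in range(1, dataframe_numpy.shape[0]):
    dataframe_numpy[row, second] = (
        dataframe_numpy[row - 1, second] + dataframe_numpy[row, first]
    )

# Restore the column names and index once the loop is finished.
result = pd.DataFrame(
    dataframe_numpy,
    columns=dataframe_pandas.columns,
    index=dataframe_pandas.index,
)
```

Looking up the column positions once, before the loop, keeps the dictionary lookups out of the part of the code that runs for every row.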
How Much Faster Does This Make My Code?
Speed will vary from one project to the next; some scripts will see more improvement by switching to NumPy than others. It depends on the types of calculations used in your script and the percentage of all calculations converted to NumPy. That said, the results can be drastic.
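If you want a rough feel for the difference on your own machine, a sketch like the one below times the same row-by-row update written both ways. The update rule, the column names and the row count are all assumptions made for illustration, and the timings you see will depend on your hardware and library versions.

```python
import time

import numpy as np
import pandas as pd

n_rows = 2_000
rng = np.random.default_rng(seed=0)
frame = pd.DataFrame({"input": rng.random(n_rows), "state": np.zeros(n_rows)})


def run_pandas(df):
    """Row-by-row update using Pandas .loc on a copy of the DataFrame."""
    df = df.copy()
    for row in range(1, len(df)):
        df.loc[row, "state"] = 0.9 * df.loc[row - 1, "state"] + df.loc[row, "input"]
    return df["state"].to_numpy()


def run_numpy(df):
    """The same update using NumPy indexing and a column-name dictionary."""
    values = df.to_numpy(copy=True)
    col = {name: i for i, name in enumerate(df.columns)}
    state, inp = col["state"], col["input"]
    for row in range(1, values.shape[0]):
        values[row, state] = 0.9 * values[row - 1, state] + values[row, inp]
    return values[:, state]


start = time.perf_counter()
pandas_result = run_pandas(frame)
pandas_seconds = time.perf_counter() - start

start = time.perf_counter()
numpy_result = run_numpy(frame)
numpy_seconds = time.perf_counter() - start

print(f"Pandas loop: {pandas_seconds:.4f} s")
print(f"NumPy loop:  {numpy_seconds:.4f} s")
```

Both functions should produce the same numbers, so comparing the two results is a useful sanity check before you trust the timing.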
For example, I recently used this process to convert one of my simulation models from a Pandas base to a NumPy base. The original, Pandas-based model required 362 seconds (about six minutes) to perform an annual simulation. That’s not bad if you’re running one simulation with the model, but what if you’re running a thousand? That’s about 100 hours. After converting the core of the model to NumPy, the same annual simulation required 32 seconds to calculate. Those same 1,000 simulations would take only about nine hours with NumPy.
That’s 9 percent as much time to do the same thing. In this instance, I saw a more than 10x increase in speed after converting my code from Pandas to NumPy.
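The arithmetic behind those figures is easy to reproduce. The only inputs are the two run times and the simulation count quoted above; the variable names are mine.

```python
pandas_seconds = 362   # one annual simulation, Pandas-based model
numpy_seconds = 32     # one annual simulation, NumPy-based model
n_simulations = 1000

# Total run time for the full batch, converted from seconds to hours.
pandas_hours = pandas_seconds * n_simulations / 3600
numpy_hours = numpy_seconds * n_simulations / 3600

# Share of the original run time that remains, and the speedup factor.
time_fraction = numpy_seconds / pandas_seconds
speedup = pandas_seconds / numpy_seconds

print(f"Pandas batch: {pandas_hours:.1f} hours")
print(f"NumPy batch:  {numpy_hours:.1f} hours")
print(f"Time needed:  {time_fraction:.0%} of the original")
print(f"Speedup:      {speedup:.1f}x")
```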