Pandas works with many different types of data sets such as comma-separated values (CSV) files, Excel files, extensible markup language (XML) files, JavaScript object notation (JSON) files and relational database tables. 

Data read from these sources are returned as Pandas data types known as DataFrame and Series. Dealing with the same data types across the board is convenient because it allows us to read data from one data source, manipulate it and insert it into another data source without worrying about the data’s syntax.
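
As a minimal sketch of that round trip, the snippet below reads CSV text into a DataFrame, changes one column and writes the result out as JSON. The inline CSV string and its column names are made up for illustration; with a real file you would pass a file path to read_csv() instead of a StringIO buffer.

```python
import io
import json

import pandas as pd

# Hypothetical CSV content; a file path would work the same way.
csv_text = "name,time\nByron Hammond,120.2\nLacey Austin,123.3\n"

# Read the CSV data into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Manipulate the data: convert the time from seconds to minutes.
df["time"] = df["time"] / 60

# Write the same DataFrame out in a different format (JSON records).
json_text = df.to_json(orient="records")
records = json.loads(json_text)
print(records)
```

The DataFrame in the middle is the same kind of object regardless of where the data came from or where it is going.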

Wes McKinney started developing Pandas in 2008, and it was released as an open-source project in 2009, thereby allowing anyone to contribute to its development. McKinney built Pandas on top of NumPy, another Python library that provides n-dimensional arrays, among other functionality.

How Do I Start Using Pandas?

There are several ways to get started with Pandas. You can install Python and Pandas locally or use an online Jupyter Notebook that allows you to write and execute Python in a web browser.

 

What Is Pandas Used For?

Pandas has a wide range of use cases related to data analysis. We use it in everything from financial applications to scientific studies. For example, we can use Pandas for data wrangling in order to transform data into a representation more suitable for analytics in different scenarios. Pandas offers features for data wrangling such as merging, sorting, cleaning, grouping and visualization. 
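
To make the wrangling features more concrete, here is a small sketch of merging, sorting and grouping. The two tables and the "club" column are hypothetical test data, not something from a real data set.

```python
import pandas as pd

# Hypothetical data: runners and the clubs they belong to.
runners = pd.DataFrame({
    "name": ["Byron", "Lacey", "Rebecca", "Yasir"],
    "time": [120.2, 123.3, 133.6, 145.4],
})
clubs = pd.DataFrame({
    "name": ["Byron", "Lacey", "Rebecca", "Yasir"],
    "club": ["North", "South", "North", "South"],
})

# Merge the two tables on the shared "name" column.
merged = runners.merge(clubs, on="name")

# Sort by time, slowest runner first.
slowest_first = merged.sort_values("time", ascending=False)

# Group by club and compute the mean time per club.
mean_by_club = merged.groupby("club")["time"].mean()
print(mean_by_club)
```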

Pandas also calculates descriptive statistics such as the mean, standard deviation, quartiles, minimum and maximum. We can also easily combine Pandas with other Python packages such as SciPy to calculate inferential statistics such as ANOVA or paired sample t-tests.

Pandas builds on another Python package called Matplotlib for data visualization, which helps us easily create everything from histograms and box plots to scatter plots.


 

What Are the Key Features of Pandas?

DataFrame

Two key features in Pandas are the data structures, DataFrame and Series. A DataFrame represents 2D tabular data containing labeled columns and rows with data (see figure one below). 

Pandas provides a lot of flexibility to work with DataFrame objects. For example, you can update the values of cells through various selection methods, insert and delete rows, or remove duplicate rows. DataFrame objects can also be merged to combine data from multiple data sets, and plotting a chart of the data can often be reduced to a single line of code.

Figure 1: A visualization of a Pandas DataFrame. | Image: Nicolai Berg Andersen
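
The row and cell operations mentioned above could look roughly like the following. This is only a sketch on made-up data; Pandas offers several other ways to do each step.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Byron", "Lacey", "Lacey", "Rebecca"],
    "time": [120.2, 123.3, 123.3, 133.6],
})

# Remove duplicate rows and renumber the index from zero.
df = df.drop_duplicates().reset_index(drop=True)

# Update a single cell, selected by row label and column name.
df.loc[0, "time"] = 119.8

# Insert a row by concatenating a one-row DataFrame.
new_row = pd.DataFrame({"name": ["Yasir"], "time": [145.4]})
df = pd.concat([df, new_row], ignore_index=True)

# Delete the row with index label 1.
df = df.drop(index=1)
print(df)
```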

 

Series

The Series object represents a 1D array of labeled data. The Series object is closely connected to the DataFrame object because the data of a column in a DataFrame is contained in a Series object. 

Both the Series and DataFrame objects contain, by default, a sequence of integers starting from zero and incrementing by one for each row. This is known as the Index. The Index can also be a sequence of strings or dates instead of numbers, so a Series object is similar to a Python dictionary in the sense that it has a key for each value.
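
A short sketch of the two kinds of Index described above, using made-up runner names as string labels:

```python
import pandas as pd

# Default index: integers starting from zero.
times = pd.Series([120.2, 123.3, 133.6])
print(times.index.tolist())

# Custom index: the strings act like dictionary keys.
named_times = pd.Series(
    [120.2, 123.3, 133.6],
    index=["Byron", "Lacey", "Rebecca"],
)
lacey_time = named_times["Lacey"]

# A Series can also be converted to a plain dictionary.
as_dict = named_times.to_dict()
print(as_dict)
```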


 

What Are Alternatives to Pandas?

Many analytics tools offer functionality very similar to Pandas. Whether you are looking for a traditional GUI-based tool, another Python package or a tool written in another programming language, it’s possible to find a suitable alternative.

For example, traditional GUI-based spreadsheet software such as Microsoft Excel and Google Sheets can handle tabular data, import and export CSV files, calculate aggregations such as the sum or mean, and visualize data.

Polars and Vaex are examples of Python packages similar to Pandas. Both can be faster than Pandas at some operations on larger data sets, and both offer similar functionality such as DataFrame objects, CSV import/export and aggregation methods. Both packages also support creating DataFrame objects from Pandas DataFrame objects.

If you are looking for alternatives in other programming languages, the JavaScript library Arquero, the Ruby library Rover or the programming language R might suit your needs. All three alternatives offer DataFrame object functionality to work with tabular data.


 

What Are the Advantages of Pandas?

With so many alternatives to Pandas, you might ask why you should use it over other tools, such as similar libraries or spreadsheet tools. After all, it’s possible to perform many of the same tasks with Microsoft Excel or Google Sheets. While Excel and Sheets are more closed environments available either as native software or through a web application, Pandas runs in a Python environment, which allows us to combine it with many other Python libraries and APIs.

Whether or not you’d use Pandas over similar Python packages such as Vaex or Polars may depend on the specific use case and the readability of the API. For example, Pandas has a method to read data directly from a relational database that’s not currently offered by the Vaex API. Polars, on the other hand, supports reading directly from a relational database just as Pandas does.
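
As an illustration of reading from a relational database, the sketch below uses Python’s built-in sqlite3 module with an in-memory database. The table name and its contents are hypothetical; other database engines typically need their own connection library.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with a hypothetical "runners" table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE runners (name TEXT, time REAL)")
con.executemany(
    "INSERT INTO runners VALUES (?, ?)",
    [("Byron Hammond", 120.2), ("Lacey Austin", 123.3)],
)
con.commit()

# Read the query result straight into a DataFrame.
df = pd.read_sql_query("SELECT name, time FROM runners ORDER BY time", con)
con.close()
print(df)
```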


 

Getting Started With Pandas: A Tutorial

Install Pandas Locally

Before installing Pandas locally, you have to ensure you’ve installed Python. Both Python and Pandas are supported on major operating systems such as Microsoft Windows, Apple macOS and Linux distributions such as Ubuntu. If you haven’t installed Python yet, visit the Python website and find the distribution matching your current platform. You can install Pandas with several different package managers such as pip or conda (the package manager that ships with Anaconda). Before you do anything, I recommend reading the latest information about the different installation options.

That said, to install Pandas with pip simply enter the following into a terminal application:

$ pip install pandas

 

The Basics

Start by importing Pandas with the alias pd by entering the following into a Python script:

import pandas as pd

Next, create a simple dictionary with some test data, for example the number of seconds each runner in a list spent completing a run.

runners = {
    "name": [
        "Byron Hammond",
        "Lacey Austin",
        "Rebecca Bauer",
        "Yasir Burnett",
        "Clarence Rojas"
    ],
    "time": [120.2, 123.3, 133.6, 145.4, 160.2]
}

After creating the test data, initialize a Pandas DataFrame object by calling pd.DataFrame with the data object as an argument:

df = pd.DataFrame(runners)

If you use Jupyter Notebook, you can output the DataFrame object by adding a line with the name of the variable at the bottom of the code cell. If you created a Python script whose output goes to a console, you have to print the variable with the print() function instead.

# Output the DataFrame in Jupyter Notebook
df

# Output the DataFrame in the console
print(df)

The console output of the DataFrame is very similar to the Jupyter Notebook output (see figures two and three below).

Figure 2: DataFrame output in Jupyter Notebook. | Image: Nicolai Berg Andersen
Figure 3: DataFrame output in the console. | Image: Nicolai Berg Andersen

Pandas provides a method called head() you can use to output the beginning of a DataFrame or a Series object.

df.head(2)

You can pass an integer to the method to define the number of rows you want to return. If no integer is passed, the default number of rows is automatically set to five. You can see in figure four below that the method returns the rows with indexes zero and one.

Figure 4: The first two rows of the DataFrame object df. | Image: Nicolai Berg Andersen

Pandas also provides another method called tail() you can use to output the ending of a DataFrame or a Series object.

df.tail(2)

As with the method head(), you can pass an integer to define the number of rows, and the default number is five. Figure five shows the method returns the rows with indexes three and four.

Figure 5: The last two rows of the DataFrame object df. | Image: Nicolai Berg Andersen
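
The default of five rows is easier to see on a DataFrame with more than five rows. The eight-row "lap" column below is made-up data for that purpose only.

```python
import pandas as pd

# Hypothetical DataFrame with eight rows so the default of five is visible.
df = pd.DataFrame({"lap": range(1, 9)})

first_default = df.head()   # first five rows (indexes 0 to 4)
last_default = df.tail()    # last five rows (indexes 3 to 7)
last_two = df.tail(2)       # last two rows (indexes 6 and 7)

print(len(first_default), len(last_default), len(last_two))
```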

Each column of the DataFrame object is represented as a Series object. To get a specific column, insert the name of the column between square brackets after the name of the variable.

time = df["time"]

As you can see in figure six, the Series object holds the time values, and each row has an index just like in the DataFrame object.

Figure 6: The Series object containing the runners’ completion times. | Image: Nicolai Berg Andersen

Pandas provides many different ways to get data from a DataFrame or Series object. For example, you can filter rows with a boolean expression by using the loc indexer, which takes square brackets rather than parentheses.

time2 = df.loc[(df["time"] > 130), "time"]

You can see in figure seven that the expression returns a Series object containing the times of the last three runners because their time value is greater than 130.

Figure 7: Runners with a completion time over 130, found by using loc with a boolean expression. | Image: Nicolai Berg Andersen
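
Boolean expressions can also be combined. As a sketch on the same runner data, the following selects the names of runners whose time falls between two thresholds; the thresholds 130 and 150 are arbitrary values chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Byron Hammond", "Lacey Austin", "Rebecca Bauer",
             "Yasir Burnett", "Clarence Rojas"],
    "time": [120.2, 123.3, 133.6, 145.4, 160.2],
})

# Boolean mask: True for rows where the time is between 130 and 150.
# Each condition needs its own parentheses when combined with & or |.
mask = (df["time"] > 130) & (df["time"] < 150)

# Select the matching rows and only the "name" column.
names = df.loc[mask, "name"]
print(names.tolist())
```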


 

Descriptive Statistics

To calculate descriptive statistics for a DataFrame or Series object, use the method describe().

df.describe()

You can see in figure eight that the method returns the number of runners (count), the mean, standard deviation (std), minimum and maximum, and the three quartiles (25 percent, 50 percent and 75 percent).

Figure 8: The output from the method describe(). | Image: Nicolai Berg Andersen
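
Each value that describe() reports can also be calculated on its own. A brief sketch using the runners’ times:

```python
import pandas as pd

time = pd.Series([120.2, 123.3, 133.6, 145.4, 160.2], name="time")

mean_time = time.mean()
std_time = time.std()              # sample standard deviation
median_time = time.quantile(0.5)   # the 50 percent quartile
fastest, slowest = time.min(), time.max()

# describe() bundles the same statistics into one Series.
summary = time.describe()
print(summary)
```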

You can use the boxplot() method to visualize the statistical data returned by the describe() method.

df.boxplot()

In figure nine, you can see that Pandas returns a box plot of the data.

Figure 9: A box plot of the runners’ times generated by the Pandas method boxplot(). | Image: Nicolai Berg Andersen

 

Correlation

You can also use Pandas to calculate the correlation between the columns of a DataFrame by using corr(). First, create some test data by generating a range of dates with the method date_range() and defining an object containing the prices of two different stocks.

dates = pd.date_range('1/1/2022', periods=8)
stocks = {
 "stock1": [104, 113, 114, 115, 120, 116, 123, 125],
 "stock2": [101, 104, 109, 103, 113, 116, 111, 114]
}

Next, initialize the DataFrame object and call the method corr(). Notice that the DataFrame object is initialized with both the data object and an index (instead of only the data object as in the earlier example) to specify that each row is identified by a date. The dates are not important for the method corr() but will be convenient later when plotting the two stocks’ graphs.

df = pd.DataFrame(stocks, index=dates)
df.corr()

The method corr() returns the Pearson correlation coefficient by default, but it’s possible to use the Kendall rank correlation coefficient or Spearman’s rank correlation coefficient by passing method="kendall" or method="spearman" as an argument. As you can see in figure 10, the correlation coefficient between stock1 and stock2 is roughly 0.75.

Figure 10: A correlation matrix returned by the method corr(). | Image: Nicolai Berg Andersen
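
Switching the coefficient is a matter of one argument. The sketch below rebuilds the stock data so it runs on its own, and compares the default Pearson matrix with the Spearman one; only Spearman is shown as the alternative here.

```python
import pandas as pd

dates = pd.date_range("1/1/2022", periods=8)
stocks = {
    "stock1": [104, 113, 114, 115, 120, 116, 123, 125],
    "stock2": [101, 104, 109, 103, 113, 116, 111, 114],
}
df = pd.DataFrame(stocks, index=dates)

# Pearson is the default method.
pearson = df.corr()

# Spearman's rank correlation, selected by name.
spearman = df.corr(method="spearman")

# A single coefficient can be read from the matrix by row and column label.
r = pearson.loc["stock1", "stock2"]
print(round(r, 2))
```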

Another way to visualize the result of corr() is to display a heatmap. You can do this quite easily by combining the Pandas DataFrame object with another Python package called Seaborn. Import the package as sns and call the method heatmap() with the correlation matrix as an argument.

import seaborn as sns
sns.heatmap(df.corr())

As you can see in figure 11, the method heatmap() draws a heatmap of the values in the correlation matrix.

Figure 11: A heatmap over the correlation matrix returned by corr(). | Image: Nicolai Berg Andersen

Finally, Pandas has a method called plot() that you can use to draw a simple line graph of the two stock prices.

df.plot()

You can see in the figure below that Pandas outputs a graph where the x-axis shows the DataFrame object’s index (the dates) and the y-axis shows the stocks’ prices.

Figure 12: The stock prices line graph. | Image: Nicolai Berg Andersen

As shown in the examples above, you can easily use Pandas DataFrame and Series objects to analyze many types of data sets. However, the examples only show a few of the possibilities that Pandas has to offer and you might want to look into other use cases, such as how to use Pandas’ group by functionality or how to handle missing data.
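
As a starting point for those two topics, here is a rough sketch that fills a missing value and then groups the rows. The "club" column and the choice to fill with the mean are assumptions made for illustration; the right strategy for missing data depends on your data set.

```python
import pandas as pd

# Hypothetical data with one missing time (None becomes NaN).
df = pd.DataFrame({
    "club": ["North", "South", "North", "South"],
    "time": [120.2, None, 133.6, 145.4],
})

# Count the missing values in the "time" column.
missing_count = int(df["time"].isna().sum())

# Fill the missing time with the mean of the known times.
filled = df.fillna({"time": df["time"].mean()})

# Group by club and find the fastest time in each group.
fastest_by_club = filled.groupby("club")["time"].min()
print(fastest_by_club)
```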
