How to Filter Pandas DataFrames

Pandas is a popular data analysis and manipulation library for Python. The core data structure of Pandas is DataFrame, which stores data in tabular form with labeled rows and columns.

8 Pandas Filter Methods to Know

Logical operators
Multiple logical operators
Isin
Str accessor
Tilde (~)
Query
Nlargest and nsmallest
Loc and iloc

A common operation in data analysis is to filter values based on a condition or multiple conditions. Pandas provides a variety of ways to filter data points (i.e. rows). In this article, we’ll cover eight different ways to filter a DataFrame.

A tutorial on how to filter in Pandas. | Video: Corey Schafer

How to Filter Pandas DataFrame

We start by importing the libraries.

import numpy as np
import pandas as pd

Let’s create a sample DataFrame for our examples.

df = pd.DataFrame({

'name':['Jane','John','Ashley','Mike','Emily','Jack','Catlin'],
'ctg':['A','A','C','B','B','C','B'],
'val':np.random.random(7).round(2),
'val2':np.random.randint(1,10, size=7)

})

DataFrame image. | Screenshot: Soner Yildirim

More on Pandas: A Guide to Pandas Pivot Table

1. Logical Operators

We can use the logical operators on column values to filter rows.


df[df.val > 0.5]

       name       ctg      val       val2      
-------------------------------------------               
1       John        A       0.67       1                
3       Mike        B       0.91       5                
4       Emily       B       0.99       8                
6       Catlin      B       1.00       3

We’ve now selected the rows in which the value in the “val” column is greater than 0.5.

The logical operators function also works on strings.

f[df.name > 'Jane']
        name       ctg      val       val2       
-------------------------------------------                
1       John        A       0.67       1                 
3       Mike        B       0.91       5

Only the names that come after “Jane” in alphabetical order are selected.

2. Multiple Logical Operators

Pandas allows for combining multiple logical operators. For instance, we can apply conditions on both val and val2 columns.

df[(df.val > 0.5) & (df.val2 == 1)]
        name       ctg      val       val2       
-------------------------------------------                
1       John        A       0.67       1

The “&” signs stands for “and,” the “|” stands for “or.”

df[(df.val < 0.5) | (df.val2 == 7)]
        name      ctg      val       val2      
-------------------------------------------               
0       Jane        A       0.43       1                
2     Ashley        C       0.40       7                
5       Jack        C       0.02       7

3. Isin

The isin method is another way of applying multiple conditions for filtering. For instance, we can filter the names that exist in a given list.

names = ['John','Catlin','Mike']
df[df.name.isin(names)]
        name      ctg      val       val2       
-------------------------------------------                
1       John       A       0.67       1                 
3       Mike       B       0.91       5                 
6     Catlin       B       1.00       3

4. Str accessor

Pandas is a highly efficient library on textual data as well. The functions and methods under the str accessor provide flexible ways to filter rows based on strings.

For instance, we can select the names that start with the letter “J.”

df[df.name.str.startswith('J')]
        name      ctg      val       val2       
-------------------------------------------                
0       Jane       A       0.43       1                 
1       John       A       0.67       1                 
5       Jack       C       0.02       7

The contains function under the str accessor returns the values that contain a given set of characters.

df[df.name.str.contains('y')]
         name      ctg      val       val2       
-------------------------------------------                
2      Ashley       C       0.40       7                 
4       Emily       B       0.99       8

We can pass a longer set of characters to the contains function depending on the strings in the data.

5. Tilde (~)

The tilde operator is used for “not” logic in filtering. If we add the tilde operator before the filter expression, the rows that do not fit the condition are returned.

df[~df.name.str.contains('J')]
         name      ctg      val       val2       
-------------------------------------------                
2      Ashley       C       0.40       7
3        Mike       B       0.91       5
4       Emily       B       0.99       8
6      Catlin       B       1.00       3

We get the names that do not start with the letter “J.”

6. Query

The query function offers a little more flexibility at writing the conditions for filtering. We can pass the conditions as a string.

For instance, the following code returns the rows that belong to the B category and have a value higher than 0.5 in the val column.

df.query('ctg == "B" and val > 0.5')
        name      ctg      val      val2       
-------------------------------------------                
3       Mike       B       0.91       5                 
4      Emily       B       0.99       8                 
6     Catlin       B       1.00       3

7. Nlargest or Nsmallest

In some cases, we do not have a specific range for filtering but just need the largest or smallest values. The nlargest and nsmallest functions allow you to select rows that have the largest or smallest values in a column, respectively.

df.nlargest(3, 'val')
        name      ctg      val      val2       
-------------------------------------------                
6     Catlin       B       1.00       3                 
4      Emily       B       0.99       8                 
3       Mike       B       0.91       5

We can also specify the number of largest or smallest values to be selected and the name of the column.

df.nsmallest(2, 'val2')
        name      ctg      val       val2       
-------------------------------------------                
0       Jane       A       0.43       1                 
1       John       A       0.67       1

More on Pandas: Beware the Dummy Variable Trap in Pandas

8. Loc and Iloc

The loc and iloc methods are used to select rows or columns based on index or label.

Loc: Select rows or columns using labels
Iloc: Select rows or columns using indices

Thus, they can be used for filtering. However, we can only select a particular part of the DataFrame without specifying a condition.

df.iloc[3:5, :] #rows 3 and 4, all columns
        name      ctg      val       val2       
-------------------------------------------                
3       Mike       B       0.91       5                 
4      Emily       B       0.99       8

If the DataFrame has an integer index, the indices and labels of the rows are the same. Thus, both loc and iloc accomplished the same thing on the rows.

df.loc[3:5, :] #rows 3 and 4, all columns
        name      ctg      val       val2       
-------------------------------------------                
3       Mike       B       0.91       5                 
4      Emily       B       0.99       8

Let’s update the index of the DataFrame to demonstrate the difference between loc and iloc better.

df.index = ['a','b','c','d','e','f','g']

We cannot pass integers to the loc method now because the labels of indices are letters.

df.loc['b':'d', :]
        name      ctg      val       val2       
-------------------------------------------                
b       John       A       0.67       1                 
c     Ashley       C       0.40       7                 
d       Mike       B       0.91       5

We have now covered eight different ways of filtering rows in a DataFrame. All of them are useful and come in handy in different situations.

Pandas is a powerful library for both data analysis and manipulation. It provides numerous functions and methods to handle data in tabular form. As with any other tool, the best way to learn Pandas is through practice.

Frequently Asked Questions

What is panda filtering?

Filtering in Pandas means to subset (or display) certain rows and columns in a Pandas DataFrame based on specified conditions. The dataframe.filter() function is one method of filtering a DataFrame in Pandas.

How do I filter lines in Pandas?

Lines, or rows, in a Pandas DataFrame can be filtered by using one of the following methods:

Filter by logical operators: df.values, df.name, etc.
Filter by list of values: isin()
Filter by string: str.startswith(), str.endswith() or str.contains()
Filter based on query: query()
Filter by largest or smallest value for specified column: nlargest() or nsmallest()
Filter by label or index: loc[] or iloc[]

How do I filter out a condition in Pandas?

Conditional filtering can be done in Pandas by using query(), loc[] and iloc[], filter() or apply() functions.