Pandas is a popular data analysis and manipulation library for Python. The core data structure of Pandas is DataFrame, which stores data in tabular form with labeled rows and columns.
8 Pandas Filter Methods to Know
- Logical operators
- Multiple logical operators
- Isin
- Str accessor
- Tilde (~)
- Query
- Nlargest and nsmallest
- Loc and iloc
A common operation in data analysis is to filter values based on a condition or multiple conditions. Pandas provides a variety of ways to filter data points (i.e. rows). In this article, we’ll cover eight different ways to filter a DataFrame.
How to Filter Pandas DataFrame
We start by importing the libraries.
import numpy as np
import pandas as pd
Let’s create a sample DataFrame for our examples.
df = pd.DataFrame({
'name':['Jane','John','Ashley','Mike','Emily','Jack','Catlin'],
'ctg':['A','A','C','B','B','C','B'],
'val':np.random.random(7).round(2),
'val2':np.random.randint(1,10, size=7)
})
1. Logical Operators
We can use the logical operators on column values to filter rows.
df[df.val > 0.5]
name ctg val val2
-------------------------------------------
1 John A 0.67 1
3 Mike B 0.91 5
4 Emily B 0.99 8
6 Catlin B 1.00 3
We’ve now selected the rows in which the value in the “val” column is greater than 0.5.
The logical operators function also works on strings.
f[df.name > 'Jane']
name ctg val val2
-------------------------------------------
1 John A 0.67 1
3 Mike B 0.91 5
Only the names that come after “Jane” in alphabetical order are selected.
2. Multiple Logical Operators
Pandas allows for combining multiple logical operators. For instance, we can apply conditions on both val and val2 columns.
df[(df.val > 0.5) & (df.val2 == 1)]
name ctg val val2
-------------------------------------------
1 John A 0.67 1
The “&” signs stands for “and,” the “|” stands for “or.”
df[(df.val < 0.5) | (df.val2 == 7)]
name ctg val val2
-------------------------------------------
0 Jane A 0.43 1
2 Ashley C 0.40 7
5 Jack C 0.02 7
3. Isin
The isin
method is another way of applying multiple conditions for filtering. For instance, we can filter the names that exist in a given list.
names = ['John','Catlin','Mike']
df[df.name.isin(names)]
name ctg val val2
-------------------------------------------
1 John A 0.67 1
3 Mike B 0.91 5
6 Catlin B 1.00 3
4. Str accessor
Pandas is a highly efficient library on textual data as well. The functions and methods under the str
accessor provide flexible ways to filter rows based on strings.
For instance, we can select the names that start with the letter “J.”
df[df.name.str.startswith('J')]
name ctg val val2
-------------------------------------------
0 Jane A 0.43 1
1 John A 0.67 1
5 Jack C 0.02 7
The contains
function under the str accessor returns the values that contain a given set of characters.
df[df.name.str.contains('y')]
name ctg val val2
-------------------------------------------
2 Ashley C 0.40 7
4 Emily B 0.99 8
We can pass a longer set of characters to the contains function depending on the strings in the data.
5. Tilde (~)
The tilde operator is used for “not” logic in filtering. If we add the tilde operator before the filter expression, the rows that do not fit the condition are returned.
df[~df.name.str.contains('J')]
name ctg val val2
-------------------------------------------
2 Ashley C 0.40 7
3 Mike B 0.91 5
4 Emily B 0.99 8
6 Catlin B 1.00 3
We get the names that do not start with the letter “J.”
6. Query
The query
function offers a little more flexibility at writing the conditions for filtering. We can pass the conditions as a string.
For instance, the following code returns the rows that belong to the B category and have a value higher than 0.5 in the val column.
df.query('ctg == "B" and val > 0.5')
name ctg val val2
-------------------------------------------
3 Mike B 0.91 5
4 Emily B 0.99 8
6 Catlin B 1.00 3
7. Nlargest or Nsmallest
In some cases, we do not have a specific range for filtering but just need the largest or smallest values. The nlargest
and nsmallest
functions allow you to select rows that have the largest or smallest values in a column, respectively.
df.nlargest(3, 'val')
name ctg val val2
-------------------------------------------
6 Catlin B 1.00 3
4 Emily B 0.99 8
3 Mike B 0.91 5
We can also specify the number of largest or smallest values to be selected and the name of the column.
df.nsmallest(2, 'val2')
name ctg val val2
-------------------------------------------
0 Jane A 0.43 1
1 John A 0.67 1
8. Loc and Iloc
The loc
and iloc
methods are used to select rows or columns based on index or label.
- Loc: Select rows or columns using labels
- Iloc: Select rows or columns using indices
Thus, they can be used for filtering. However, we can only select a particular part of the DataFrame without specifying a condition.
df.iloc[3:5, :] #rows 3 and 4, all columns
name ctg val val2
-------------------------------------------
3 Mike B 0.91 5
4 Emily B 0.99 8
If the DataFrame has an integer index, the indices and labels of the rows are the same. Thus, both loc
and iloc
accomplished the same thing on the rows.
df.loc[3:5, :] #rows 3 and 4, all columns
name ctg val val2
-------------------------------------------
3 Mike B 0.91 5
4 Emily B 0.99 8
Let’s update the index of the DataFrame to demonstrate the difference between loc and iloc better.
df.index = ['a','b','c','d','e','f','g']
We cannot pass integers to the loc method now because the labels of indices are letters.
df.loc['b':'d', :]
name ctg val val2
-------------------------------------------
b John A 0.67 1
c Ashley C 0.40 7
d Mike B 0.91 5
We have now covered eight different ways of filtering rows in a DataFrame. All of them are useful and come in handy in different situations.
Pandas is a powerful library for both data analysis and manipulation. It provides numerous functions and methods to handle data in tabular form. As with any other tool, the best way to learn Pandas is through practice.
Frequently Asked Questions
What is panda filtering?
Filtering in Pandas means to subset (or display) certain rows and columns in a Pandas DataFrame based on specified conditions. The dataframe.filter() function is one method of filtering a DataFrame in Pandas.
How do I filter lines in Pandas?
Lines, or rows, in a Pandas DataFrame can be filtered by using one of the following methods:
- Filter by logical operators: df.values, df.name, etc.
- Filter by list of values: isin()
- Filter by string: str.startswith(), str.endswith() or str.contains()
- Filter based on query: query()
- Filter by largest or smallest value for specified column: nlargest() or nsmallest()
- Filter by label or index: loc[] or iloc[]
How do I filter out a condition in Pandas?
Conditional filtering can be done in Pandas by using query(), loc[] and iloc[], filter() or apply() functions.