
Last updated: March 7, 2025
In this tutorial, we’ll learn how to filter a Pandas dataframe (an instance of Pandas.DataFrame), similar to filtering a table using the WHERE clause in SQL.
We’ll learn three methods of doing so: Boolean indexing, the query() method, and an SQL interface such as DuckDB.
They’re important to learn because Pandas is used in many tasks ranging from data science to training ML models.
We’ll focus on the case where the condition involves a single column, but the approaches can be adapted to work with multiple columns as well. We tested the code using Pandas 2.2.1, but it should work on other modern versions of Pandas, too.
Here’s the data frame we’ll use in our examples:
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 1, 2],
    'b': ['VAA', 'WBB', 'XCC', 'YDD', 'ZAA', 'SBC', 'TCC'],
    'c': pd.to_datetime(['2025-01-01', '2025-03-01', '2024-08-04',
                         '2024-11-08', '2023-09-02', '2025-02-05',
                         '2025-01-07'])
})
It has an integer column a, a string column b, and a DateTime column c.
We can create a Boolean mask by evaluating relational expressions involving columns.
The general syntax is:
df[col] op value
# or
value op df[col]
where col is a column name, op is a relational operator (such as <, >, or ==), and value is the scalar we compare the column against.
For each element x of df[col], Pandas internally evaluates x op value and returns a Boolean mask of True and False. We can use it as a logical index to retrieve only the rows where the mask is True:
df[df[col] op value]
Let’s check a few examples.
We can use all the relational operators in Python to compare a column to a value: <, >, <=, >=, ==, and !=. For example:
df[df['a'] < 3]
returns
a b c
0 1 VAA 2025-01-01
1 2 WBB 2025-03-01
5 1 SBC 2025-02-05
6 2 TCC 2025-01-07
If we want to get the rows where a condition isn’t fulfilled, we use ~ to flip the Boolean mask (convert True to False and False to True):
df[~(df['a'] < 3)]
We get:
a b c
2 3 XCC 2024-08-04
3 4 YDD 2024-11-08
4 5 ZAA 2023-09-02
We must put the original mask in parentheses to get correct results.
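Parentheses matter just as much when we combine several masks with the element-wise operators & (and) and | (or), because & and | bind tighter than the comparison operators in Python. A minimal sketch using our example dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 1, 2],
    'b': ['VAA', 'WBB', 'XCC', 'YDD', 'ZAA', 'SBC', 'TCC'],
})

# Each comparison must be parenthesized; without the parentheses,
# Python would try to evaluate 3 & df['b'] first and raise an error
filtered = df[(df['a'] < 3) & (df['b'] == 'TCC')]
```

Here, filtered contains only the row where a is less than 3 and b equals 'TCC'.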
If we want to test for equality with several values, we should use the isin() method:
df[df['a'].isin([1, 5])]
The result is:
a b c
0 1 VAA 2025-01-01
4 5 ZAA 2023-09-02
5 1 SBC 2025-02-05
The method accepts an iterable of values and checks whether each column element is one of them.
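The mask returned by isin() can be flipped with ~ as well, which gives us a NOT IN-style filter:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 1, 2]})

# Keep only the rows whose 'a' value is neither 1 nor 5
not_in = df[~df['a'].isin([1, 5])]
```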
We can also use these relational operators with string columns and string values.
However, we must reference the str accessor to use string-specific methods such as contains(), lower(), upper(), startswith(), endswith(), split(), and others:
# df['b'] ends with 'C'
df[df['b'].str.endswith('C')]
# df['b'] contains 'DD'
df[df['b'].str.contains('DD')]
# df['b'], transformed to lowercase, contains 'bc'
df[df['b'].str.lower().str.contains('bc')]
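Instead of chaining lower(), we can pass the case parameter to contains(). Since contains() treats its pattern as a regular expression by default, we can also pass regex=False for a literal substring search:

```python
import pandas as pd

df = pd.DataFrame({'b': ['VAA', 'WBB', 'XCC', 'YDD', 'ZAA', 'SBC', 'TCC']})

# Case-insensitive, literal (non-regex) substring search for 'bc'
filtered = df[df['b'].str.contains('bc', case=False, regex=False)]
```

Only 'SBC' matches, because it contains 'BC' regardless of case.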
Similarly, we access DateTime methods and properties using the dt accessor:
# df['c'] year is odd
df[df['c'].dt.year % 2 == 1]
# df['c'] month is September
df[df['c'].dt.month == 9]
# or
df[df['c'].dt.month_name() == 'September']
# df['c'] weekday is Monday, Tuesday, or Friday
# (dt.weekday is zero-based: Monday is 0)
df[df['c'].dt.weekday.isin([0, 1, 4])]
# or
df[df['c'].dt.day_name().isin(['Monday', 'Tuesday', 'Friday'])]
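Because datetime columns support ordering comparisons, and Pandas parses date strings in such comparisons, we can also filter by a date range directly:

```python
import pandas as pd

df = pd.DataFrame({
    'c': pd.to_datetime(['2025-01-01', '2025-03-01', '2024-08-04',
                         '2024-11-08', '2023-09-02', '2025-02-05',
                         '2025-01-07'])
})

# Rows with dates in 2025 only
in_2025 = df[(df['c'] >= '2025-01-01') & (df['c'] < '2026-01-01')]
```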
The query() method of Pandas.DataFrame can shorten the code with Boolean indexing. Instead of creating a mask and using it as a logical index, we write conditions as strings referencing column names and values. For example:
df.query('a < 3')
is equivalent to df[df['a'] < 3] but shorter.
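The query string syntax also supports the in keyword and Boolean connectives spelled out as words, so multi-condition filters stay readable:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 1, 2],
    'b': ['VAA', 'WBB', 'XCC', 'YDD', 'ZAA', 'SBC', 'TCC'],
})

# Equivalent to df[df['a'].isin([1, 5])]
by_membership = df.query('a in [1, 5]')

# 'and', 'or', and 'not' instead of &, |, and ~
combined = df.query('a < 3 and b == "TCC"')
```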
If a column name contains spaces, we enclose it with backticks:
# Create a new column
df['a new column'] = df['a'] + 1
Here’s the updated dataframe:
a b c a new column
0 1 VAA 2025-01-01 2
1 2 WBB 2025-03-01 3
2 3 XCC 2024-08-04 4
3 4 YDD 2024-11-08 5
4 5 ZAA 2023-09-02 6
5 1 SBC 2025-02-05 2
6 2 TCC 2025-01-07 3
Now, let’s filter it by the new column, whose name contains spaces:
df.query('`a new column` < 3')
The result is:
a b c a new column
0 1 VAA 2025-01-01 2
5 1 SBC 2025-02-05 2
We can use a Python variable in the query string if we prepend its name with @:
x = 3
df.query('a < @x')
This returns:
a b c a new column
0 1 VAA 2025-01-01 2
1 2 WBB 2025-03-01 3
5 1 SBC 2025-02-05 2
6 2 TCC 2025-01-07 3
We’ll get an UndefinedVariableError if the referenced variable doesn’t exist in the current scope.
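The @ prefix also works with lists, which combines nicely with the in keyword:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 1, 2]})

wanted = [1, 5]
# Equivalent to df[df['a'].isin(wanted)]
result = df.query('a in @wanted')
```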
Several SQL interfaces allow us to query a Pandas dataframe like an SQL table. We’ll show how to use DuckDB.
We can install it using pip:
pip install duckdb
Then, we treat the dataframe variable as a table name whose columns we can use in WHERE and SELECT clauses. We pass the query string to the query() function of the duckdb module:
import duckdb
result = duckdb.query('SELECT * FROM df WHERE a < 3')
The result variable holds a DuckDB relation object, whose string form is formatted as a table:
┌───────┬─────────┬─────────────────────┬──────────────┐
│ a │ b │ c │ a new column │
│ int64 │ varchar │ timestamp_ns │ int64 │
├───────┼─────────┼─────────────────────┼──────────────┤
│ 1 │ VAA │ 2025-01-01 00:00:00 │ 2 │
│ 2 │ WBB │ 2025-03-01 00:00:00 │ 3 │
│ 1 │ SBC │ 2025-02-05 00:00:00 │ 2 │
│ 2 │ TCC │ 2025-01-07 00:00:00 │ 3 │
└───────┴─────────┴─────────────────────┴──────────────┘
We can call its df() method to convert it to a Pandas.DataFrame:
df_result = duckdb.query('SELECT * FROM df WHERE a < 3').df()
Which approach is the fastest? It depends on the dataframe size, the query, and the environment. In our test, we created a dataframe with five integer columns and a million rows of random values:
import numpy as np
import pandas as pd
generator = np.random.default_rng(19)
random_data = generator.choice(100, size=(1000000, 5))
df = pd.DataFrame(random_data, columns=['a', 'b', 'c', 'd', 'e'])
We used timeit with 30 runs and automatically determined number of loops to compare run times:
| Method | Mean ± std. dev. of 30 runs | Loops in a run |
|---|---|---|
| duckdb.query('SELECT * FROM df WHERE a < 20') | 722 µs ± 298 µs per loop | 1000 |
| df[df['a'] < 20] | 9.3 ms ± 1.76 ms per loop | 100 |
| df.query('a < 20') | 24.6 ms ± 6.16 ms per loop | 10 |
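A minimal sketch of how such a comparison can be reproduced with the standard timeit module (here on a smaller dataframe of 100,000 rows to keep it quick; the exact figures will vary by machine):

```python
import timeit

import numpy as np
import pandas as pd

generator = np.random.default_rng(19)
df = pd.DataFrame(generator.choice(100, size=(100_000, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'])

# Time each filtering approach: 5 repeats of 10 loops each
for label, stmt in [
    ('Boolean indexing', lambda: df[df['a'] < 20]),
    ('query()', lambda: df.query('a < 20')),
]:
    times = timeit.repeat(stmt, repeat=5, number=10)
    print(f'{label}: best {min(times) / 10 * 1000:.2f} ms per loop')
```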
In this particular test, DuckDB was the fastest, and Boolean indexing worked faster than the query() method.
In this article, we explored how to filter a Pandas dataframe based on a single column.
We presented three methods: Boolean indexing, the query() method of Pandas.DataFrame, and the SQL interface DuckDB.
In our benchmark test, DuckDB worked the best, but that need not be the case in every particular application.