
Last updated: March 7, 2025
In this tutorial, we’ll learn how to filter a Pandas dataframe (an instance of Pandas.DataFrame), similar to filtering a table using the WHERE clause in SQL.
We’ll learn three methods of doing so: Boolean indexing, the query() method, and an SQL interface such as DuckDB.
They’re important to learn because Pandas is used in many tasks ranging from data science to training ML models.
We’ll focus on the case where the condition involves a single column, but the approaches can be adapted to work with multiple columns as well. We tested the code using Pandas 2.2.1, but it should work on other modern versions of Pandas, too.
Here’s the data frame we’ll use in our examples:
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 1, 2],
    'b': ['VAA', 'WBB', 'XCC', 'YDD', 'ZAA', 'SBC', 'TCC'],
    'c': pd.to_datetime(['2025-01-01', '2025-03-01', '2024-08-04',
                         '2024-11-08', '2023-09-02', '2025-02-05',
                         '2025-01-07'])
})
It has an integer column a, a string column b, and a DateTime column c.
We can create a Boolean mask by evaluating relational expressions involving columns.
The general syntax is:
df[col] op value
# or
value op df[col]
where col is a column name, op is a relational operator (such as <, >, or ==), and value is the scalar we compare the column against.
For each element x of df[col], Pandas internally evaluates x op value and returns a Boolean mask of True and False. We can use it as a logical index to retrieve only the rows where the mask is True:
df[df[col] op value]
Let’s check a few examples.
We can use all the relational operators in Python to compare a column to a value: <, >, <=, >=, ==, and !=. For example:
df[df['a'] < 3]
returns
a b c
0 1 VAA 2025-01-01
1 2 WBB 2025-03-01
5 1 SBC 2025-02-05
6 2 TCC 2025-01-07
If we want to get the rows where a condition isn’t fulfilled, we use ~ to flip the Boolean mask (convert True to False and False to True):
df[~(df['a'] < 3)]
We get:
a b c
2 3 XCC 2024-08-04
3 4 YDD 2024-11-08
4 5 ZAA 2023-09-02
We must put the original mask in parentheses to get correct results.
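Parentheses matter just as much when we combine several masks with the element-wise operators & (and) and | (or), because & and | bind tighter than the comparison operators in Python. A minimal sketch using our example dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 1, 2],
    'b': ['VAA', 'WBB', 'XCC', 'YDD', 'ZAA', 'SBC', 'TCC'],
})

# Each comparison must be parenthesized; without the parentheses,
# Python would try to evaluate 3 & df['b'] first and raise an error
filtered = df[(df['a'] < 3) & (df['b'] == 'TCC')]
```

Here, filtered contains only the row where a is less than 3 and b equals 'TCC'.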
If we want to test for equality with several values, we should use the isin() method:
df[df['a'].isin([1, 5])]
The result is:
a b c
0 1 VAA 2025-01-01
4 5 ZAA 2023-09-02
5 1 SBC 2025-02-05
The method accepts an iterable of values and checks whether each column element is one of them.
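The mask returned by isin() can be flipped with ~ as well, which gives us a NOT IN-style filter:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 1, 2]})

# Keep only the rows whose 'a' value is neither 1 nor 5
not_in = df[~df['a'].isin([1, 5])]
```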
We can also use these relational operators with string columns and string values.
However, we must reference the str accessor to use string-specific methods such as contains(), lower(), upper(), startswith(), endswith(), split(), and others:
# df['b'] ends with 'C'
df[df['b'].str.endswith('C')]
# df['b'] contains 'DD'
df[df['b'].str.contains('DD')]
# df['b'], transformed to lowercase, contains 'bc'
df[df['b'].str.lower().str.contains('bc')]
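Instead of chaining lower(), we can pass the case parameter to contains(). Since contains() treats its pattern as a regular expression by default, we can also pass regex=False for a literal substring search:

```python
import pandas as pd

df = pd.DataFrame({'b': ['VAA', 'WBB', 'XCC', 'YDD', 'ZAA', 'SBC', 'TCC']})

# Case-insensitive, literal (non-regex) substring search for 'bc'
filtered = df[df['b'].str.contains('bc', case=False, regex=False)]
```

Only 'SBC' matches, because it contains 'BC' regardless of case.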
Similarly, we access DateTime methods and properties using the dt accessor:
# df['c'] year is odd
df[df['c'].dt.year % 2 == 1]
# df['c'] month is September
df[df['c'].dt.month == 9]
# or
df[df['c'].dt.month_name() == 'September']
# df['c'] weekday is Monday, Tuesday, or Friday
# (dt.weekday is zero-based: Monday is 0)
df[df['c'].dt.weekday.isin([0, 1, 4])]
# or
df[df['c'].dt.day_name().isin(['Monday', 'Tuesday', 'Friday'])]
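Because datetime columns support ordering comparisons, and Pandas parses date strings in such comparisons, we can also filter by a date range directly:

```python
import pandas as pd

df = pd.DataFrame({
    'c': pd.to_datetime(['2025-01-01', '2025-03-01', '2024-08-04',
                         '2024-11-08', '2023-09-02', '2025-02-05',
                         '2025-01-07'])
})

# Rows with dates in 2025 only
in_2025 = df[(df['c'] >= '2025-01-01') & (df['c'] < '2026-01-01')]
```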
The query() method of Pandas.DataFrame can shorten the code with Boolean indexing. Instead of creating a mask and using it as a logical index, we write conditions as strings referencing column names and values. For example:
df.query('a < 3')
is equivalent to df[df['a'] < 3] but shorter.
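The query string syntax also supports the in keyword and Boolean connectives spelled out as words, so multi-condition filters stay readable:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 1, 2],
    'b': ['VAA', 'WBB', 'XCC', 'YDD', 'ZAA', 'SBC', 'TCC'],
})

# Equivalent to df[df['a'].isin([1, 5])]
by_membership = df.query('a in [1, 5]')

# 'and', 'or', and 'not' instead of &, |, and ~
combined = df.query('a < 3 and b == "TCC"')
```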
If a column name contains spaces, we enclose it with backticks:
# Create a new column
df['a new column'] = df['a'] + 1
Here’s the updated dataframe:
a b c a new column
0 1 VAA 2025-01-01 2
1 2 WBB 2025-03-01 3
2 3 XCC 2024-08-04 4
3 4 YDD 2024-11-08 5
4 5 ZAA 2023-09-02 6
5 1 SBC 2025-02-05 2
6 2 TCC 2025-01-07 3
Now, let’s filter it by the new column, whose name contains spaces:
df.query('`a new column` < 3')
The result is:
a b c a new column
0 1 VAA 2025-01-01 2
5 1 SBC 2025-02-05 2
We can use a Python variable in the query string if we prepend its name with @:
x = 3
df.query('a < @x')
This returns:
a b c a new column
0 1 VAA 2025-01-01 2
1 2 WBB 2025-03-01 3
5 1 SBC 2025-02-05 2
6 2 TCC 2025-01-07 3
We’ll get an UndefinedVariableError if the referenced variable doesn’t exist in the current scope.
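The @ prefix also works with lists, which combines nicely with the in keyword:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 1, 2]})

wanted = [1, 5]
# Equivalent to df[df['a'].isin(wanted)]
result = df.query('a in @wanted')
```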
Several SQL interfaces allow us to query a Pandas dataframe like an SQL table. We’ll show how to use DuckDB.
We can install it using pip:
pip install duckdb
Then, we treat the dataframe variable as a table name whose columns we can use in WHERE and SELECT clauses. We pass the query string to the query() function of the duckdb module:
import duckdb
result = duckdb.query('SELECT * FROM df WHERE a < 3')
The result variable holds a DuckDB relation object, whose string form is formatted as a table:
┌───────┬─────────┬─────────────────────┬──────────────┐
│ a │ b │ c │ a new column │
│ int64 │ varchar │ timestamp_ns │ int64 │
├───────┼─────────┼─────────────────────┼──────────────┤
│ 1 │ VAA │ 2025-01-01 00:00:00 │ 2 │
│ 2 │ WBB │ 2025-03-01 00:00:00 │ 3 │
│ 1 │ SBC │ 2025-02-05 00:00:00 │ 2 │
│ 2 │ TCC │ 2025-01-07 00:00:00 │ 3 │
└───────┴─────────┴─────────────────────┴──────────────┘
We can call its df() method to convert it to a Pandas.DataFrame:
df_result = duckdb.query('SELECT * FROM df WHERE a < 3').df()
Which approach is the fastest? It depends on the dataframe size, the query, and the environment. In our test, we created a dataframe with five integer columns and a million rows of random values:
import numpy as np
import pandas as pd
generator = np.random.default_rng(19)
random_data = generator.choice(100, size=(1000000, 5))
df = pd.DataFrame(random_data, columns=['a', 'b', 'c', 'd', 'e'])
We used timeit with 30 runs and automatically determined number of loops to compare run times:
| Method | Mean ± std. dev. of 30 runs | Loops in a run |
|---|---|---|
| duckdb.query('SELECT * FROM df WHERE a < 20') | 722 µs ± 298 µs per loop | 1000 |
| df[df['a'] < 20] | 9.3 ms ± 1.76 ms per loop | 100 |
| df.query('a < 20') | 24.6 ms ± 6.16 ms per loop | 10 |
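A minimal sketch of how such a comparison can be reproduced with the standard timeit module (here on a smaller dataframe of 100,000 rows to keep it quick; the exact figures will vary by machine):

```python
import timeit

import numpy as np
import pandas as pd

generator = np.random.default_rng(19)
df = pd.DataFrame(generator.choice(100, size=(100_000, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'])

# Time each filtering approach: 5 repeats of 10 loops each
for label, stmt in [
    ('Boolean indexing', lambda: df[df['a'] < 20]),
    ('query()', lambda: df.query('a < 20')),
]:
    times = timeit.repeat(stmt, repeat=5, number=10)
    print(f'{label}: best {min(times) / 10 * 1000:.2f} ms per loop')
```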
In this particular test, DuckDB was the fastest, and Boolean indexing worked faster than the query() method.
In this article, we explored how to filter a Pandas dataframe based on a single column.
We presented three methods: Boolean indexing, the query() method of Pandas.DataFrame, and the SQL interface DuckDB.
In our benchmark test, DuckDB worked the best, but that need not be the case in every particular application.