1. Overview
In this tutorial, we're going to select rows in Pandas DataFrame based on column values. Selecting rows in Pandas terminology is known as indexing.
We'll first look into boolean indexing, then indexing by label, the positional indexing, and finally the df.query() API.
We will cover methods like .loc
/ iloc
/ isin()
and some caveats related to their usage.
2. Setup
In the post, we'll use the following DataFrame, which consists of several rows and columns:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/softhints/Pandas-Tutorials/master/data/csv/extremes.csv')
DataFrame looks like:
Continent | Highest point | Elevation high | Lowest point | Elevation low |
---|---|---|---|---|
Asia | Mount Everest | 8848 | Dead Sea | −427 |
South America | Aconcagua | 6960 | Laguna del Carbón | −105 |
North America | Denali | 6198 | Death Valley | −86 |
Africa | Mount Kilimanjaro | 5895 | Lake Assal | −155 |
Europe | Mount Elbrus | 5642 | Caspian Sea | −28 |
3. Select rows by boolean indexing
Pandas and Python use the operator []
for indexing. It is used for quick access in many use cases.
Since data type isn’t known in advance, directly using standard operators has some performance limits. So it's better to use optimized Pandas data access methods.
3.1. What is boolean indexing?
Boolean indexing in Pandas helps us to select rows or columns by array of boolean values. For example suppose we have the next values: [True, False, True, False, True, False, True]
we can use it to get rows from DataFrame defined above:
selection = [True, False, True, False, True, False, True]
df[selection]
result:
Continent | Highest point | Elevation high | Lowest point | Elevation low | |
---|---|---|---|---|---|
0 | Asia | Mount Everest | 8848 | Dead Sea | −427 |
2 | North America | Denali | 6198 | Death Valley | −86 |
4 | Europe | Mount Elbrus | 5642 | Caspian Sea | −28 |
6 | Australia | Puncak Jaya | 4884 | Lake Eyre | −15 |
3.2. Select Rows by Column Value with boolean indexing
In most cases the boolean list will be generated by Pandas methods. For example, let's find all rows where the continent starts with capital A.
The first step is to get a list of values where this statement is True. We are going to use string method - str.startswith()
:
df['Continent'].str.startswith('A')
The result is Pandas Series with boolean values:
0 True
1 False
2 False
3 True
4 False
5 True
6 True
Name: Continent, dtype: bool
We can use the result (which match the DataFrame shape) for rows selection by using operator []
:
df[df['Continent'].str.startswith('A')]
only rows which match the criteria are returned:
Continent | Highest point | Elevation high | Lowest point | Elevation low | |
---|---|---|---|---|---|
0 | Asia | Mount Everest | 8848 | Dead Sea | −427 |
3 | Africa | Mount Kilimanjaro | 5895 | Lake Assal | −155 |
5 | Antarctica | Vinson Massif | 4892 | Deep Lake, Vestfold Hills | −50 |
6 | Australia | Puncak Jaya | 4884 | Lake Eyre | −15 |
3.3. Select Rows by single value
Selecting rows based on column values is done by operator: ==
:
df[df['Continent'] == 'Africa']
result:
Continent | Highest point | Elevation high | Lowest point | Elevation low | |
---|---|---|---|---|---|
3 | Africa | Mount Kilimanjaro | 5895 | Lake Assal | −155 |
3.4. Select Rows by multiple value
To select rows by list of values use the method - isin()
:
df[df['Continent'].isin(['Africa', 'Europe'])]
result:
Continent | Highest point | Elevation high | Lowest point | Elevation low | |
---|---|---|---|---|---|
3 | Africa | Mount Kilimanjaro | 5895 | Lake Assal | −155 |
4 | Europe | Mount Elbrus | 5642 | Caspian Sea | −28 |
To learn more about selection based on list of values check: How to Select Rows by List of Values in Pandas DataFrame
4. Select rows by property .loc[]
Pandas has property pandas.DataFrame.loc which is documented as:
Access a group of rows and columns by label(s) or a boolean array.
You can use .loc[]
with:
- a boolean array
- labels
The boolean indexing with .loc[]
is similar to the point 3:
df.loc[df['Continent'] == 'Africa']
Note that contrary to usual Python slices, both the start and the stop are included
4.1. .loc[]
for single value
Now let's cover the label usage for a single value - selecting the row with label 0:
df.loc[0]
result into Series:
Continent Asia
Highest point Mount Everest
Elevation high 8848
Lowest point Dead Sea
Elevation low −427
Name: 0, dtype: object
4.2. .loc[]
for multiple labels
And for selection rows by multiple labels - this time we pass a list with multiple values:
df.loc[[0,2,4]]
result:
Continent | Highest point | Elevation high | Lowest point | Elevation low | |
---|---|---|---|---|---|
0 | Asia | Mount Everest | 8848 | Dead Sea | −427 |
2 | North America | Denali | 6198 | Death Valley | −86 |
4 | Europe | Mount Elbrus | 5642 | Caspian Sea | −28 |
In the above cases the labels are auto generated. In some cases you will have column set as index:
df.set_index('Continent')
result:
Highest point | Elevation high | Lowest point | Elevation low | |
---|---|---|---|---|
Continent | ||||
Asia | Mount Everest | 8848 | Dead Sea | −427 |
South America | Aconcagua | 6960 | Laguna del Carbón | −105 |
then you can select by labels by giving the column values:
df.set_index('Continent').loc["Africa"]
result is:
Highest point Mount Kilimanjaro
Elevation high 5895
Lowest point Lake Assal
Elevation low −155
Name: Africa, dtype: object
5. Select rows by positions - iloc[]
It's possible to use the row positions for selection - positional indexing. This is available by property: pandas.DataFrame.iloc.
For the selection we are going to:
- use numpy
- build a mask for highest points which contains
Mount
- get all positions of those records
- select rows by their positions
import numpy as np
mask = df['Highest point'].str.contains('Mount')
pos = np.flatnonzero(mask)
df.iloc[pos]
Only 3 rows are selected:
Continent | Highest point | Elevation high | Lowest point | Elevation low | |
---|---|---|---|---|---|
0 | Asia | Mount Everest | 8848 | Dead Sea | −427 |
3 | Africa | Mount Kilimanjaro | 5895 | Lake Assal | −155 |
4 | Europe | Mount Elbrus | 5642 | Caspian Sea | −28 |
Note that Pandas has method: pandas.DataFrame.mask
6. Select rows by - df.query()
The final option which we will cover is method: pandas.DataFrame.query.
The syntax reminds to SQL:
df.query('Continent == "Europe"')
Only one row is returned( as a DataFrame and not Series!):
Continent | Highest point | Elevation high | Lowest point | Elevation low | |
---|---|---|---|---|---|
4 | Europe | Mount Elbrus | 5642 | Caspian Sea | −28 |
For more advanced examples please check(multiple criteria, partial match spaces): How to Select Rows by List of Values in Pandas DataFrame
6. Conclusion
Selecting rows in Pandas is available by many options and methods. Unfortunately, some of them might be slow for big DataFrames. Also, since the selection depends on the data types and criteria, choosing the optimal way is important.
My recommendation is to start with the simplest one for you and search for a better solution only in case of a need.
Check out the complete code for How to Select Rows by Column Value in Pandas on GitHub.