1. Overview

In this tutorial, we're going to select rows in Pandas DataFrame based on column values. Selecting rows in Pandas terminology is known as indexing.

We'll first look into boolean indexing, then indexing by label, the positional indexing, and finally the df.query() API.

We will cover methods like .loc / iloc / isin() and some caveats related to their usage.

2. Setup

In the post, we'll use the following DataFrame, which consists of several rows and columns:

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/softhints/Pandas-Tutorials/master/data/csv/extremes.csv')

DataFrame looks like:

Continent Highest point Elevation high Lowest point Elevation low
Asia Mount Everest 8848 Dead Sea −427
South America Aconcagua 6960 Laguna del Carbón −105
North America Denali 6198 Death Valley −86
Africa Mount Kilimanjaro 5895 Lake Assal −155
Europe Mount Elbrus 5642 Caspian Sea −28

3. Select rows by boolean indexing

Pandas and Python use the operator [] for indexing. It is used for quick access in many use cases.

Note:

Since data type isn’t known in advance, directly using standard operators has some performance limits. So it's better to use optimized Pandas data access methods.

3.1. What is boolean indexing?

Boolean indexing in Pandas helps us to select rows or columns by array of boolean values. For example suppose we have the next values: [True, False, True, False, True, False, True] we can use it to get rows from DataFrame defined above:

selection = [True, False, True, False, True, False, True]

df[selection]

result:

Continent Highest point Elevation high Lowest point Elevation low
0 Asia Mount Everest 8848 Dead Sea −427
2 North America Denali 6198 Death Valley −86
4 Europe Mount Elbrus 5642 Caspian Sea −28
6 Australia Puncak Jaya 4884 Lake Eyre −15

3.2. Select Rows by Column Value with boolean indexing

In most cases the boolean list will be generated by Pandas methods. For example, let's find all rows where the continent starts with capital A.

The first step is to get a list of values where this statement is True. We are going to use string method - str.startswith():

df['Continent'].str.startswith('A')

The result is Pandas Series with boolean values:

0     True
1    False
2    False
3     True
4    False
5     True
6     True
Name: Continent, dtype: bool

We can use the result (which match the DataFrame shape) for rows selection by using operator []:

df[df['Continent'].str.startswith('A')]

only rows which match the criteria are returned:

Continent Highest point Elevation high Lowest point Elevation low
0 Asia Mount Everest 8848 Dead Sea −427
3 Africa Mount Kilimanjaro 5895 Lake Assal −155
5 Antarctica Vinson Massif 4892 Deep Lake, Vestfold Hills −50
6 Australia Puncak Jaya 4884 Lake Eyre −15

3.3. Select Rows by single value

Selecting rows based on column values is done by operator: ==:

df[df['Continent'] == 'Africa']

result:

Continent Highest point Elevation high Lowest point Elevation low
3 Africa Mount Kilimanjaro 5895 Lake Assal −155

3.4. Select Rows by multiple value

To select rows by list of values use the method - isin():

df[df['Continent'].isin(['Africa', 'Europe'])]

result:

Continent Highest point Elevation high Lowest point Elevation low
3 Africa Mount Kilimanjaro 5895 Lake Assal −155
4 Europe Mount Elbrus 5642 Caspian Sea −28

To learn more about selection based on list of values check: How to Select Rows by List of Values in Pandas DataFrame

4. Select rows by property .loc[]

Pandas has property pandas.DataFrame.loc which is documented as:

Access a group of rows and columns by label(s) or a boolean array.

You can use .loc[] with:

  • a boolean array
  • labels

The boolean indexing with .loc[] is similar to the point 3:

df.loc[df['Continent'] == 'Africa']
Note:

Note that contrary to usual Python slices, both the start and the stop are included

4.1. .loc[] for single value

Now let's cover the label usage for a single value - selecting the row with label 0:

df.loc[0]

result into Series:

Continent                  Asia
Highest point     Mount Everest
Elevation high             8848
Lowest point           Dead Sea
Elevation low              −427
Name: 0, dtype: object

4.2. .loc[] for multiple labels

And for selection rows by multiple labels - this time we pass a list with multiple values:

df.loc[[0,2,4]]

result:

Continent Highest point Elevation high Lowest point Elevation low
0 Asia Mount Everest 8848 Dead Sea −427
2 North America Denali 6198 Death Valley −86
4 Europe Mount Elbrus 5642 Caspian Sea −28

In the above cases the labels are auto generated. In some cases you will have column set as index:

df.set_index('Continent')

result:

Highest point Elevation high Lowest point Elevation low
Continent
Asia Mount Everest 8848 Dead Sea −427
South America Aconcagua 6960 Laguna del Carbón −105

then you can select by labels by giving the column values:

df.set_index('Continent').loc["Africa"]

result is:

Highest point     Mount Kilimanjaro
Elevation high                 5895
Lowest point             Lake Assal
Elevation low                  −155
Name: Africa, dtype: object

5. Select rows by positions - iloc[]

It's possible to use the row positions for selection - positional indexing. This is available by property: pandas.DataFrame.iloc.

For the selection we are going to:

  • use numpy
  • build a mask for highest points which contains Mount
  • get all positions of those records
  • select rows by their positions
import numpy as np

mask = df['Highest point'].str.contains('Mount')
pos = np.flatnonzero(mask)
df.iloc[pos]

Only 3 rows are selected:

Continent Highest point Elevation high Lowest point Elevation low
0 Asia Mount Everest 8848 Dead Sea −427
3 Africa Mount Kilimanjaro 5895 Lake Assal −155
4 Europe Mount Elbrus 5642 Caspian Sea −28

Note that Pandas has method: pandas.DataFrame.mask

6. Select rows by - df.query()

The final option which we will cover is method: pandas.DataFrame.query.

The syntax reminds to SQL:

df.query('Continent == "Europe"')

Only one row is returned( as a DataFrame and not Series!):

Continent Highest point Elevation high Lowest point Elevation low
4 Europe Mount Elbrus 5642 Caspian Sea −28

For more advanced examples please check(multiple criteria, partial match spaces): How to Select Rows by List of Values in Pandas DataFrame

6. Conclusion

Selecting rows in Pandas is available by many options and methods. Unfortunately, some of them might be slow for big DataFrames. Also, since the selection depends on the data types and criteria, choosing the optimal way is important.

My recommendation is to start with the simplest one for you and search for a better solution only in case of a need.

Check out the complete code for How to Select Rows by Column Value in Pandas on GitHub.