How to select rows by column value in Pandas

1. Overview

In this tutorial, we're going to select rows in Pandas DataFrame based on column values. Selecting rows in Pandas terminology is known as indexing.

We'll first look into boolean indexing, then indexing by label, the positional indexing, and finally the df.query() API.

We will cover methods like .loc / iloc / isin() and some caveats related to their usage.

2. Setup

In the post, we'll use the following DataFrame, which consists of several rows and columns:

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/softhints/Pandas-Tutorials/master/data/csv/extremes.csv')

DataFrame looks like:

Continent	Highest point	Elevation high	Lowest point	Elevation low
Asia	Mount Everest	8848	Dead Sea	−427
South America	Aconcagua	6960	Laguna del Carbón	−105
North America	Denali	6198	Death Valley	−86
Africa	Mount Kilimanjaro	5895	Lake Assal	−155
Europe	Mount Elbrus	5642	Caspian Sea	−28

3. Select rows by boolean indexing

Pandas and Python use the operator [] for indexing. It is used for quick access in many use cases.

Note:

Since data type isn’t known in advance, directly using standard operators has some performance limits. So it's better to use optimized Pandas data access methods.

3.1. What is boolean indexing?

Boolean indexing in Pandas helps us to select rows or columns by array of boolean values. For example suppose we have the next values: [True, False, True, False, True, False, True] we can use it to get rows from DataFrame defined above:

selection = [True, False, True, False, True, False, True]

df[selection]

result:

	Continent	Highest point	Elevation high	Lowest point	Elevation low
0	Asia	Mount Everest	8848	Dead Sea	−427
2	North America	Denali	6198	Death Valley	−86
4	Europe	Mount Elbrus	5642	Caspian Sea	−28
6	Australia	Puncak Jaya	4884	Lake Eyre	−15

3.2. Select Rows by Column Value with boolean indexing

In most cases the boolean list will be generated by Pandas methods. For example, let's find all rows where the continent starts with capital A.

The first step is to get a list of values where this statement is True. We are going to use string method - str.startswith():

df['Continent'].str.startswith('A')

The result is Pandas Series with boolean values:

0     True
1    False
2    False
3     True
4    False
5     True
6     True
Name: Continent, dtype: bool

We can use the result (which match the DataFrame shape) for rows selection by using operator []:

df[df['Continent'].str.startswith('A')]

only rows which match the criteria are returned:

	Continent	Highest point	Elevation high	Lowest point	Elevation low
0	Asia	Mount Everest	8848	Dead Sea	−427
3	Africa	Mount Kilimanjaro	5895	Lake Assal	−155
5	Antarctica	Vinson Massif	4892	Deep Lake, Vestfold Hills	−50
6	Australia	Puncak Jaya	4884	Lake Eyre	−15

3.3. Select Rows by single value

Selecting rows based on column values is done by operator: ==:

df[df['Continent'] == 'Africa']

result:

	Continent	Highest point	Elevation high	Lowest point	Elevation low
3	Africa	Mount Kilimanjaro	5895	Lake Assal	−155

3.4. Select Rows by multiple value

To select rows by list of values use the method - isin():

df[df['Continent'].isin(['Africa', 'Europe'])]

result:

	Continent	Highest point	Elevation high	Lowest point	Elevation low
3	Africa	Mount Kilimanjaro	5895	Lake Assal	−155
4	Europe	Mount Elbrus	5642	Caspian Sea	−28

To learn more about selection based on list of values check: How to Select Rows by List of Values in Pandas DataFrame

4. Select rows by property `.loc[]`

Pandas has property pandas.DataFrame.loc which is documented as:

Access a group of rows and columns by label(s) or a boolean array.

You can use .loc[] with:

a boolean array
labels

The boolean indexing with .loc[] is similar to the point 3:

df.loc[df['Continent'] == 'Africa']

Note:

Note that contrary to usual Python slices, both the start and the stop are included

4.1. `.loc[]` for single value

Now let's cover the label usage for a single value - selecting the row with label 0:

df.loc[0]

result into Series:

Continent                  Asia
Highest point     Mount Everest
Elevation high             8848
Lowest point           Dead Sea
Elevation low              −427
Name: 0, dtype: object

4.2. `.loc[]` for multiple labels

And for selection rows by multiple labels - this time we pass a list with multiple values:

df.loc[[0,2,4]]

result:

	Continent	Highest point	Elevation high	Lowest point	Elevation low
0	Asia	Mount Everest	8848	Dead Sea	−427
2	North America	Denali	6198	Death Valley	−86
4	Europe	Mount Elbrus	5642	Caspian Sea	−28

In the above cases the labels are auto generated. In some cases you will have column set as index:

df.set_index('Continent')

result:

	Highest point	Elevation high	Lowest point	Elevation low
Continent
Asia	Mount Everest	8848	Dead Sea	−427
South America	Aconcagua	6960	Laguna del Carbón	−105

then you can select by labels by giving the column values:

df.set_index('Continent').loc["Africa"]

result is:

Highest point     Mount Kilimanjaro
Elevation high                 5895
Lowest point             Lake Assal
Elevation low                  −155
Name: Africa, dtype: object

5. Select rows by positions - `iloc[]`

It's possible to use the row positions for selection - positional indexing. This is available by property: pandas.DataFrame.iloc.

For the selection we are going to:

use numpy
build a mask for highest points which contains Mount
get all positions of those records
select rows by their positions

import numpy as np

mask = df['Highest point'].str.contains('Mount')
pos = np.flatnonzero(mask)
df.iloc[pos]

Only 3 rows are selected:

	Continent	Highest point	Elevation high	Lowest point	Elevation low
0	Asia	Mount Everest	8848	Dead Sea	−427
3	Africa	Mount Kilimanjaro	5895	Lake Assal	−155
4	Europe	Mount Elbrus	5642	Caspian Sea	−28

Note that Pandas has method: pandas.DataFrame.mask

6. Select rows by - `df.query()`

The final option which we will cover is method: pandas.DataFrame.query.

The syntax reminds to SQL:

df.query('Continent == "Europe"')

Only one row is returned( as a DataFrame and not Series!):

	Continent	Highest point	Elevation high	Lowest point	Elevation low
4	Europe	Mount Elbrus	5642	Caspian Sea	−28

For more advanced examples please check(multiple criteria, partial match spaces): How to Select Rows by List of Values in Pandas DataFrame

6. Conclusion

Selecting rows in Pandas is available by many options and methods. Unfortunately, some of them might be slow for big DataFrames. Also, since the selection depends on the data types and criteria, choosing the optimal way is important.

My recommendation is to start with the simplest one for you and search for a better solution only in case of a need.

Check out the complete code for How to Select Rows by Column Value in Pandas on GitHub.

1. Overview

2. Setup

3. Select rows by boolean indexing

3.1. What is boolean indexing?

3.2. Select Rows by Column Value with boolean indexing

3.3. Select Rows by single value

3.4. Select Rows by multiple value

4. Select rows by property .loc[]

4.1. .loc[] for single value

4.2. .loc[] for multiple labels

5. Select rows by positions - iloc[]

6. Select rows by - df.query()

6. Conclusion

4. Select rows by property `.loc[]`

4.1. `.loc[]` for single value

4.2. `.loc[]` for multiple labels

5. Select rows by positions - `iloc[]`

6. Select rows by - `df.query()`