1. Overview
In this post, we'll explore how to select rows by list of values in Pandas. We'll discuss the basic indexing, which is needed in order to perform selection by values.
Row selection is also known as indexing. There are several ways to select rows by multiple values:
isin()
- Pandas way - exact match from list of valuesdf.query()
- SQL like waydf.loc
+df.apply(lambda
- when custom function is needed to be applied; more flexible way
2. Setup
We'll use the following DataFrame, which consists of several rows and columns:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/softhints/Pandas-Tutorials/master/data/csv/extremes.csv')
DataFrame looks like:
Continent | Highest point | Elevation high | Lowest point | Elevation low |
---|---|---|---|---|
Asia | Mount Everest | 8848 | Dead Sea | −427 |
South America | Aconcagua | 6960 | Laguna del Carbón | −105 |
North America | Denali | 6198 | Death Valley | −86 |
Africa | Mount Kilimanjaro | 5895 | Lake Assal | −155 |
Europe | Mount Elbrus | 5642 | Caspian Sea | −28 |
Then let's suppose that we would like to search by different lists from single or multiple columns:
- ['America', 'Europe', 'Asia']
- ['America', 'Europe', 'Asia'] and range of numbers [5000, 8000]
3. Select rows by values with method isin()
The method pandas.DataFrame.isin is probably the most popular way for selection by exact match of list of values.
3.1. Positive selection
Let's find all rows which match exactly one of the next values in column - Continent:
- ['America', 'Europe', 'Asia']
sel_continents = ['America', 'Europe', 'Asia']
df[df['Continent'].isin(sel_continents)]
result will be:
Continent | Highest point | Elevation high | Lowest point | Elevation low | |
---|---|---|---|---|---|
0 | Asia | Mount Everest | 8848 | Dead Sea | −427 |
4 | Europe | Mount Elbrus | 5642 | Caspian Sea | −28 |
Since method `isin` works by exact match rows for America are not selected.
3.2. Negative selection
To get all rows which doesn't include list of values you can use operator: ~
:
df[~df['Continent'].isin(sel_continents)]
result:
Continent | Highest point | Elevation high | Lowest point | Elevation low | |
---|---|---|---|---|---|
1 | South America | Aconcagua | 6960 | Laguna del Carbón | −105 |
2 | North America | Denali | 6198 | Death Valley | −86 |
3 | Africa | Mount Kilimanjaro | 5895 | Lake Assal | −155 |
5 | Antarctica | Vinson Massif | 4892 | Deep Lake, Vestfold Hills | −50 |
6 | Australia | Puncak Jaya | 4884 | Lake Eyre | −15 |
4. Query rows by values - method df.query()
Let's now look at how we can do a query based on a list of values. The method: pandas.DataFrame.query gives us the opportunity to select rows in SQL fashion.
4.1. Query by list of values
To select rows by query and multiple values use the next syntax:
sel_continents = ['America', 'Europe', 'Asia']
df.query('Continent in @sel_continents')
The resulted rows are:
Continent | Highest point | Elevation high | Lowest point | Elevation low | |
---|---|---|---|---|---|
0 | Asia | Mount Everest | 8848 | Dead Sea | −427 |
4 | Europe | Mount Elbrus | 5642 | Caspian Sea | −28 |
For columns with spaces in their name, you can use backtick quoting:
df.query('B == `C C`')
which is equivalent to:
df[df.B == df['C C']]
4.2. Select rows by range of numbers
In case of range of values query
method can be used as follows:
df.query('5000 < `Elevation high` < 6000')
result:
Continent | Highest point | Elevation high | Lowest point | Elevation low | |
---|---|---|---|---|---|
0 | Asia | Mount Everest | 8848 | Dead Sea | −427 |
4 | Europe | Mount Elbrus | 5642 | Caspian Sea | −28 |
4.3. Select rows by multiple queries
Selecting by df.query()
is considered to be the fastest way to get rows by values.
So it's suitable for searching based on multiple criterias:
sel_continents = ['America', 'Europe', 'Asia']
df.query('(Continent in @sel_continents) and (`Elevation high` > 8000)')
This returns:
Continent | Highest point | Elevation high | Lowest point | Elevation low | |
---|---|---|---|---|---|
0 | Asia | Mount Everest | 8848 | Dead Sea | −427 |
5. Select rows by values - df.loc
+ df.apply(lambda
Finally let's check a slower but more flexible way of indexing by list of values in Pandas.
5.1. Select rows by function
The basic selection by df.loc
+ df.apply(lambda
does boolean indexing based on lambda function applied over the rows:
sel_continents = ['America', 'Europe', 'Asia']
df.loc[df.apply(lambda x: x['Continent'] in sel_continents, axis=1)]
result:
Continent | Highest point | Elevation high | Lowest point | Elevation low | |
---|---|---|---|---|---|
0 | Asia | Mount Everest | 8848 | Dead Sea | −427 |
4 | Europe | Mount Elbrus | 5642 | Caspian Sea | −28 |
def some_function(x):
if x in sel_continents:
return True
else:
return False
sel_continents = ['America', 'Europe', 'Asia']
df[df['Continent'].apply(some_function)]
5.2. Case-insensitive search
Now let's say that you would like to perform case-insensitive search by list of values:
sel_continents = ['America', 'Europe', 'Asia']
sel_continents = [item.lower() for item in sel_continents]
df.loc[df.apply(lambda x: x['Continent'].lower() in sel_continents, axis=1)]
output:
Continent | Highest point | Elevation high | Lowest point | Elevation low | |
---|---|---|---|---|---|
0 | Asia | Mount Everest | 8848 | Dead Sea | −427 |
4 | Europe | Mount Elbrus | 5642 | Caspian Sea | −28 |
5.3. Select rows by partial match based on list of values
Custom functions will help us to select rows by multiple values with partial match. For example getting list of rows which contains part of the next values:
- ['America', 'Europe', 'Asia']
def some_function(x):
flag = False
for continent in sel_continents:
if continent in x or continent == x:
flag = True
return flag
sel_continents = ['America', 'Europe', 'Asia']
df[df['Continent'].apply(some_function)]
in the result we have the rows for America:
Continent | Highest point | Elevation high | Lowest point | Elevation low | |
---|---|---|---|---|---|
0 | Asia | Mount Everest | 8848 | Dead Sea | −427 |
1 | South America | Aconcagua | 6960 | Laguna del Carbón | −105 |
2 | North America | Denali | 6198 | Death Valley | −86 |
4 | Europe | Mount Elbrus | 5642 | Caspian Sea | −28 |
6. Conclusion
In this article, we looked at different solutions for selection and indexing of rows based on values. We focused on the basics and followed some advanced techniques like SQL like query and selection by functions.
The code for the examples is available: on GitHub