How to Search for String in the Whole DataFrame in Pandas
To search for a string in all columns of a Pandas DataFrame we can use two different ways:
(1) Lambda and str.contains
df.apply(lambda row: row.astype(str).str.contains('data').any(), axis=1)
(2) np.column_stack + str.contains
import numpy as np
mask = np.column_stack([df[col].astype(str).str.contains("data", na=False) for col in df])
df.loc[mask.any(axis=1)]
Let's check two examples on how to use the above techniques in practice.
To start with DataFrame like:
from IPython.display import HTML
import pandas as pd
df = pd.DataFrame({
'id':[1,2,3,4],
'name':['Softhints\nLinux', 'dataplotplus', 'DataScientyst\nPandas', 'test'],
'url':['https://www.softhints.com', 'https://dataplotplus.com/', 'https://datascientyst.com', 'test\data']
})
which has this data:
id | name | url |
---|---|---|
1 | Softhints\nLinux | https://www.softhints.com |
2 | dataplotplus | https://dataplotplus.com/ |
3 | DataScientyst\nPandas | https://datascientyst.com |
4 | test | test\data |
Search whole DataFrame with lambda and str.contains
Searching with lambda and str.contains is straightforward:
df.apply(lambda row: row.astype(str).str.contains('data').any(), axis=1)
The lambda will iterate over all rows. Then we will convert the values to string - in order to avoid errors.
The converted data will be searched for a string pattern - in this case data
.
Without method any we will get the search result for each value:
id | name | url |
---|---|---|
False | False | False |
False | True | True |
False | False | True |
False | False | True |
So method any
will return True if there is at least one True value per row.
So the final output is:
0 False
1 True
2 True
3 True
dtype: bool
Search whole DataFrame with numpy and str.contains
As an alternative solution you can use the Numpy method - column_stack
to find all values in all columns. This solution is faster than the previous one.
import numpy as np
mask = np.column_stack([df[col].astype(str).str.contains("data", na=False) for col in df])
df.loc[mask.any(axis=1)]
How does it work?
So the code:
[df[col].astype(str).str.contains("data", na=False) for col in df]
will iterate over all columns and then will convert values to string. Then we perform a search for a given value.
The result would be:
[0 False
1 False
2 False
3 False
Name: id, dtype: bool,
0 False
1 True
2 False
3 False
Name: name, dtype: bool,
0 False
1 True
2 True
3 True
Name: url, dtype: bool]
Method np.column_stack
will convert the above result into array
:
array([[False, False, False],
[False, True, True],
[False, False, True],
[False, False, True]])
Again we are going to use method any
in order to return a single True or False value per row. The method loc
will return only the rows which contain the searched value.
df.loc[mask.any(axis=1)]