Use read_csv to skip rows with condition based on values in Pandas
In this tutorial, we'll look at how to read CSV files by read_csv
and skip rows with a conditional statement in Pandas.
In addition, we'll also see how to optimise the reading performance of the read_csv
method with Dask. This option is useful if you face memory issues using read_csv
.
To start let's say that we have the following CSV file:
gender | race/ethnicity | math score | reading score | writing score |
---|---|---|---|---|
female | group B | 72 | 72 | 74 |
female | group C | 69 | 90 | 88 |
female | group B | 90 | 95 | 93 |
male | group A | 47 | 57 | 44 |
male | group C | 76 | 78 | 75 |
1000 rows × 8 columns
Step 1: Read CSV file skip rows with query condition in Pandas
By default Pandas skiprows
parameter of method read_csv
is supposed to filter rows based on row number and not the row content.
So the default behavior is:
pd.read_csv(csv_file, skiprows=5)
The code above will result into:
995 rows × 8 columns
But let's say that we would like to skip rows based on the condition on their content. This can be achieved by reading the CSV file in chunks with chunksize
.
The results will be filtered by query
condition:
gen = pd.read_csv(csv_file, chunksize=10000000)
df = pd.concat((x.query("lunch == 'standard'") for x in gen), ignore_index=True)
result:
645 rows × 8 columns
The above code will filter CSV rows based on column lunch
. It will return only rows containing standard
to the output.
Step 2: Read CSV file with condition value higher than threshold
In this step we are going to compare the row value in the rows against integer value. If the value is equal or higher we will load the row in the CSV file.
Difference with the previous step is:
- the usage of
dtype
parameter which defines a schema for the columns read from the CSV file - how to use query with column which contains space -
math score
- surrounded by `
schema={
"math score": int
}
gen = pd.read_csv(csv_file, dtype=schema, chunksize=10000000)
df = pd.concat((x.query("`math score` >= 75") for x in gen), ignore_index=True)
The code above will filter all rows which contain math score
higher or equal to 75:
295 rows × 8 columns
Step 3: Read CSV file and post filter condition in Pandas
For small and medium CSV files it's fine to read the whole file and do a post filtering based on read values.
This can be achieved by:
df = pd.read_csv(csv_file)
df = df[df['lunch'] != 'standard']
result:
355 rows × 8 columns
So first we read the whole file. Next we are filtering the results based on one or multiple conditions.
Step 4: Read CSV with conditional filtering by Dask
Finally let's see how to read a CSV file with condition and optimised performance.
To install Dask use:
pip install dask
Dask offers a lazy reader which can optimize performance of read_csv
.
So first we can read the CSV file, then apply the filtering and finally to compute the results:
import dask.dataframe as dd
df = dd.read_csv(csv_file)
df = df[(df['reading score'] >= 55) & (df['reading score'] <= 75)]
df = df.compute()
result is:
496 rows × 8 columns