In this short guide, I'll show you how to split a Pandas DataFrame. You will also find out how to:

  • split a large Pandas DataFrame
  • split a DataFrame into equal chunks
  • split a DataFrame by percentage
  • split a dataset into training and testing parts

To start, here is the syntax to split a Pandas DataFrame into 5 roughly equal chunks:

import numpy as np
np.array_split(df, 5)

which returns a list of DataFrames. Let's go through all the steps in detail.

Setup

Let's create a sample DataFrame which contains 12 rows:

import pandas as pd
import numpy as np

data = np.random.randint(0,12,size=(12, 4))
df = pd.DataFrame(data, columns=list('ABCD'))

The data looks like this:

     A   B   C  D
0    8   4   7  4
1    1   9   4  5
2    1   4   1  8
3    4   9   3  7
4    8  11   5  9
5   11   9  11  3
6    5   1   6  3
7    2  11  11  7
8    5   9   4  4
9    9   0  11  0
10   4  10   3  8
11   0   0   2  1

Step 1: Split DataFrame into n chunks

The NumPy method np.array_split() can be used on a Pandas DataFrame to split it into n chunks:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,12,size=(12, 4)), columns=list('ABCD'))

chunks = np.array_split(df, 5)

for chunk in chunks:
	print(chunk.shape)
	display(chunk)

The result is 5 chunks. As you can see, the first two chunks have 3 rows each, while the remaining three have 2 rows each.
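
If you need a single piece, or want to put the pieces back together, here is a minimal sketch (the variable names are just examples):

first_chunk = chunks[0]        # the first of the 5 DataFrames
restored = pd.concat(chunks)   # concatenate the chunks back into one DataFrame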

Step 2: Split DataFrame with list comprehension

To split a DataFrame using a list comprehension we can:

  • calculate the chunk size
  • slice the rows for a given range

chunk_size = df.shape[0] // 3
chunks = [df[i:i+chunk_size].copy() for i in range(0, df.shape[0], chunk_size)]

This will divide the input DataFrame into 3 DataFrames:

[   A  B  C  D
 0  8  4  7  4
 1  1  9  4  5
 2  1  4  1  8
 3  4  9  3  7,
     A   B   C  D
 4   8  11   5  9
 5  11   9  11  3
 6   5   1   6  3
 7   2  11  11  7,
     A   B   C  D
 8   5   9   4  4
 9   9   0  11  0
 10  4  10   3  8
 11  0   0   2  1]
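
Note that with integer division the comprehension can leave one extra, smaller chunk when the number of rows is not divisible by n. A small sketch that rounds the chunk size up with math.ceil, so that at most n chunks are produced:

import math

n = 3
chunk_size = math.ceil(df.shape[0] / n)   # round up so we get at most n chunks
chunks = [df[i:i+chunk_size].copy() for i in range(0, df.shape[0], chunk_size)]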

Step 3: Split DataFrame into groups

We can use the groupby method to split a DataFrame into:

  • equal chunks
  • group by criteria

The next code will split the DataFrame into 4 equal-sized groups of 3 rows each:

groups = df.groupby(df.index % 4)

Now we can work with each group to get information like count or sum:

groups.count()
groups.sum()

The result of groups.count() is:

   A  B  C  D
0  3  3  3  3
1  3  3  3  3
2  3  3  3  3
3  3  3  3  3
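
To pull a single group out as a regular DataFrame we can use get_group - a minimal sketch:

# rows whose index modulo 4 equals 0, i.e. rows 0, 4 and 8
first_group = groups.get_group(0)
print(first_group)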

Step 4: Split DataFrame by column

We can also group by a column and then split the DataFrame, for example splitting the rows by whether column A is even or odd:

groups = df.groupby(df['A'] % 2)

To get the first row of each group we can use the method first():

groups.first()

or display all of the groups:

for group in groups:
	display(group)

result:

(0,
      A   B   C  D
 0    8   4   7  4
 3    4   9   3  7
 4    8  11   5  9
 7    2  11  11  7
 10   4  10   3  8
 11   0   0   2  1)

(1,
      A  B   C  D
 1    1  9   4  5
 2    1  4   1  8
 5   11  9  11  3
 6    5  1   6  3
 8    5  9   4  4
 9    9  0  11  0)
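
If you prefer to keep the split frames around, a small sketch that collects the groups into a dict of DataFrames keyed by the group label:

# 0 -> rows where column A is even, 1 -> rows where column A is odd
split_frames = {key: frame for key, frame in groups}
even_a = split_frames[0]
odd_a = split_frames[1]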

Step 5: Split large DataFrame

To split large DataFrames we can use the Dask library. We can:

  • convert the Pandas DataFrame to Dask
  • provide the number of partitions

To split a large DataFrame with Dask we can do the following:

Read CSV file with Dask and split to DataFrames

In case you need to read a huge CSV file and split it into several DataFrames, you can use Dask as follows:

import dask.dataframe as dd
df = dd.read_csv("data.csv").repartition(npartitions=3)

Split large DataFrame with Dask

If the DataFrame already exists, it can be split into chunks with the method repartition:

df = df.repartition(npartitions=2)
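
If the data is already in memory as a Pandas DataFrame, a minimal sketch (assuming Dask is installed) that converts it to Dask and reads one partition back:

import dask.dataframe as dd

# df is the 12-row Pandas DataFrame from the setup section
ddf = dd.from_pandas(df, npartitions=3)      # split into 3 partitions
first_part = ddf.partitions[0].compute()     # bring the first partition back as Pandas
print(first_part.shape)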

Step 6: Split DataFrame by percentage

Finally, let's say that you need to split your DataFrame into several parts based on percentages. For this purpose we can use train_test_split from sklearn, where we can specify the training and test sizes as percentages.

Split to train and testing data

This is especially useful for machine learning, when you need to split a DataFrame into 2 parts for training and testing:

from sklearn.model_selection import train_test_split

x, x_test, y, y_test = train_test_split(df, df.index, test_size=0.2, train_size=0.8)

where:

  • x - the first DataFrame (the training rows)
  • y - the indexes of the first DataFrame
    • Index([0, 3, 6, 9, 2, 8, 10, 11, 7], dtype='int64')
  • x_test - the second DataFrame (the test rows)
  • y_test - the indexes of the second one
    • Index([4, 5, 1], dtype='int64')
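
If you only need the two DataFrames and not the index arrays, a shorter variant of the same call (random_state is added here only to make the split reproducible):

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(train_df.shape, test_df.shape)   # e.g. (9, 4) (3, 4) for the 12-row sample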

Split by percentage with numpy

We can also use NumPy to split the DataFrame into multiple datasets based on percentages of the original data. This can be done as follows:

partitions = [int(.2*len(df)), int(.5*len(df)), int(.6*len(df))]
a, b, c, d = np.split(df, partitions)

The partition boundaries are calculated as:

[2, 6, 7]

which selects:

  • 2 rows for the first one
  • 4 rows for the second
  • 1 row for the 3rd one
  • the rest (5 rows) for the 4th
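
A Pandas-only alternative for a percentage split is to sample a fraction of the rows and take the remaining rows as the second part - a small sketch:

part_1 = df.sample(frac=0.8, random_state=42)   # randomly pick ~80% of the rows
part_2 = df.drop(part_1.index)                  # the remaining rows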

Conclusion

In this post we saw multiple ways to split a DataFrame into several chunks, using additional packages like NumPy, sklearn and Dask.

You now know how to easily split a DataFrame into training and testing datasets. We also covered how to read a huge CSV file and separate it into multiple DataFrames with Dask.
