1. Overview
Here are two approaches to split a column into multiple columns in Pandas:
- list column
- string column separated by a delimiter.
Below we can find both examples:
(1) Split column (list values) into multiple columns
pd.DataFrame(df["langs"].to_list(), columns=['prim_lang', 'sec_lang'])
(2) Split column by delimiter in Pandas
pd.DataFrame(df["skills"].str.split(',').fillna('[]').tolist())
In Pandas to split column we can use method .str.split(',')
- which takes a delimiter as a parameter.
Next we will see how to apply both ways into practical examples.
2. Split List Column into Multiple Columns
For the first example we will create a simple DataFrame with 1 column which stores a list of two languages. We are going to generate 10 random lists of subset of languages:
import random
langs = ['Python', 'Java' , 'JS', 'C', 'C+']
df = pd.DataFrame({"langs": [ [random.choice(langs) for i in range(0,2)] for _ in range(10)]})
Our DataFrame looks like this:
langs |
---|
[C+, C+] |
[Python, JS] |
[C+, C+] |
[Java, Java] |
[Java, C] |
In order to split this single column(which contain list values) into two columns we will use the next syntax:
pd.DataFrame(df["langs"].to_list(), columns=['prim_lang', 'sec_lang'])
The result of the split is:
prim_lang | sec_lang |
---|---|
C+ | C+ |
Python | JS |
C+ | C+ |
Java | Java |
Java | C |
How does it work? The method df["langs"].to_list()
is converting the initial column into list of lists:
[['C+', 'C+'],
['Python', 'JS'],
['C+', 'C+'],
['Java', 'Java'],
['Java', 'C'],
['Java', 'C+'],
['Python', 'C'],
['Python', 'C+'],
['C', 'Java'],
['C+', 'JS']]
Note: This method will work only if the stored values are lists. If you have string values separated by columns check Example 2.
3. Split Column by delimiter in Pandas
Now let's say that instead of storing lists like: ['C+', 'C+']
you have only the values separated by delimiter. In this case is a comma like 'C+', 'C+'
.
Lets have data like the one below:
skills | internship | location |
---|---|---|
Software Testing | Software Testing | Noida |
Java, SQL, Unix, Oracle, MS SQL Server, Hibernate (Java), Shell Scripting, Spring MVC, REST API | Technical Operations - Networking And Monitoring | Noida |
English Proficiency (Spoken), English Proficiency (Written), Hindi Proficiency (Spoken), Hindi Proficiency (Written) | Software Project Management | Work From Home |
HTML, CSS, Flask, Python, Django | Web Development | Work From Home |
HTML, CSS, JavaScript, ReactJS, Redux | Front End Development | Work From Home |
And we would like to split the column skills by delimiter into multiple columns. This time the number of elements is not fixed!
We can use Pandas string method .str.split(',')
in order to split the values into lists of lists. If you have missing data you need to ensure that you default it by empty list by .fillna('[]')
:
pd.DataFrame(df["skills"].str.split(',').fillna('[]').tolist())
This will create DataFrame like:
0 | 1 | 2 | 44 | 45 |
---|---|---|---|---|
Software Testing | None | None | None | None |
Java | SQL | Unix | None | None |
English Proficiency (Spoken) | English Proficiency (Written) | Hindi Proficiency (Spoken) | None | None |
HTML | CSS | Flask | None | None |
HTML | CSS | JavaScript | None | None |
As you can see the result DataFrame has 45 columns. Which means that one of the rows has 45 values separated by comma.
In order to find which row(s) have most values we can use syntax like - test the last column for all non null elements:
pd.DataFrame(df["skills"].str.split(',').fillna('[]').tolist())[45].dropna()
And we get as output:
1242 Gitlab
4. Conclusion and Resources
In this guide we saw how to split columns depending on the values which contain. We covered a column which contains lists and also splitting values separated by delimiter.
Below you can find some useful resources: