How to Create a Bag of Words in Pandas Python

In this short guide, I'll show you how to create a bag of words with Pandas and Python. Here is an example of a bag of words built with the sklearn library:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

text = ['The fox jumps over the lazy dog.',  'Dog and fox are lazy!']
data = {'text': text}
df = pd.DataFrame(data)

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(df['text'])

count_array = bow.toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
features = vectorizer.get_feature_names_out()
df = pd.DataFrame(data=count_array, columns=features)

Below you can find the result of the code:

   and  are  dog  fox  jumps  lazy  over  the
0    0    0    1    1      1     1     1    2
1    1    1    1    1      0     1     0    0

In the next steps I'll explain the process in more detail.

What is a Bag of Words?

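A bag of words is a way to represent text data in tabular form as numerical features: each document becomes a row, each distinct word becomes a column, and each value is the number of times the word occurs in the document. Word order is ignored.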
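
To make the definition concrete, here is a minimal sketch in plain Python. The tokenization rule (lowercase, letters only) is an assumption chosen for illustration, not something the libraries below require.

```python
import re

docs = ['The fox jumps over the lazy dog.', 'Dog and fox are lazy!']

# Assumed tokenization: lowercase the text and keep runs of letters only
tokenized = [re.findall(r'[a-z]+', doc.lower()) for doc in docs]

# The vocabulary is the sorted set of all distinct words
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Each document becomes a vector of word counts, one slot per vocabulary word
vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(vocabulary)   # ['and', 'are', 'dog', 'fox', 'jumps', 'lazy', 'over', 'the']
print(vectors[0])   # [0, 0, 1, 1, 1, 1, 1, 2]
print(vectors[1])   # [1, 1, 1, 1, 0, 1, 0, 0]
```

Note that the first vector no longer records where each word appeared, only how often.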

A bag of words can also be built quickly with only Pandas and the Python standard library:

import pandas as pd
from collections import Counter

text = ['Periods of rain', 'Mostly cloudy, a little rain', 'Mostly cloudy', 'Intervals of clouds and sun', 'Sunshine and mild']

df = pd.DataFrame({'text': text})

The input DataFrame is:

0 Periods of rain
1 Mostly cloudy, a little rain
2 Mostly cloudy
3 Intervals of clouds and sun
4 Sunshine and mild

The output bag of words, again represented as a DataFrame:

   Periods   of  rain  Mostly  cloudy,    a  little  cloudy  Intervals  clouds  and  sun  Sunshine  mild
0      1.0  1.0   1.0     NaN      NaN  NaN     NaN     NaN        NaN     NaN  NaN  NaN       NaN   NaN
1      NaN  NaN   1.0     1.0      1.0  1.0     1.0     NaN        NaN     NaN  NaN  NaN       NaN   NaN
2      NaN  NaN   NaN     1.0      NaN  NaN     NaN     1.0        NaN     NaN  NaN  NaN       NaN   NaN
3      NaN  1.0   NaN     NaN      NaN  NaN     NaN     NaN        1.0     1.0  1.0  1.0       NaN   NaN
4      NaN  NaN   NaN     NaN      NaN  NaN     NaN     NaN        NaN     NaN  1.0  NaN       1.0   1.0
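
The snippet above stops after building the input DataFrame. One way to arrive at a table like this, assuming each row is split on whitespace and counted with Counter, is sketched below. The names `counts`, `bow_df` and `bow_filled` are introduced here for illustration.

```python
import pandas as pd
from collections import Counter

text = ['Periods of rain', 'Mostly cloudy, a little rain', 'Mostly cloudy',
        'Intervals of clouds and sun', 'Sunshine and mild']
df = pd.DataFrame({'text': text})

# Count whitespace-separated tokens per document. Nothing is lowercased or
# stripped, so 'cloudy,' and 'cloudy' end up as separate columns.
counts = [dict(Counter(doc.split())) for doc in df['text']]

# One row per document, one column per token; tokens absent from a row are NaN
bow_df = pd.DataFrame(counts)

# Optional: replace NaN with 0 and store the counts as integers
bow_filled = bow_df.fillna(0).astype(int)
print(bow_filled)
```

The NaN values and the duplicate `cloudy,` / `cloudy` columns show why preprocessing matters before counting.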

A bag of words usually needs data preprocessing first: removing special characters and stop words, converting words to lowercase, stemming, and so on.
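
As a rough illustration of these steps, the sketch below lowercases the text, strips special characters and drops stop words. The `preprocess` helper and the tiny `STOP_WORDS` set are hypothetical and exist only for this example. Stemming is left out because it normally relies on an extra library.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {'a', 'of', 'and', 'the'}

def preprocess(doc):
    """Lowercase, replace special characters with spaces, drop stop words."""
    doc = doc.lower()
    doc = re.sub(r'[^a-z0-9\s]', ' ', doc)
    tokens = doc.split()
    return [token for token in tokens if token not in STOP_WORDS]

print(preprocess('Mostly cloudy, a little rain'))
# ['mostly', 'cloudy', 'little', 'rain']
```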

The diagram below shows how to create a bag of words from multiple documents. Each item in the list is treated as a separate document.


First, let's create a sample DataFrame for this example:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

text = ['Periods of rain', 'Mostly cloudy, a little rain', 'Mostly cloudy', 'Intervals of clouds and sun', 'Sunshine and mild']
data = {'text': text}
df = pd.DataFrame(data)


0 Periods of rain
1 Mostly cloudy, a little rain
2 Mostly cloudy
3 Intervals of clouds and sun
4 Sunshine and mild

We are going to import CountVectorizer from sklearn.

Step 1: Initialize the vectorizer

Next we are going to initialize the vectorizer of sklearn:

vectorizer = CountVectorizer()


At this step we can customize the vectorizer, for example by turning off lowercase conversion (it is on by default):

vectorizer = CountVectorizer(lowercase=False)

Stop words

We can also pass a custom list of stop words to exclude from the bag of words:

vectorizer = CountVectorizer(stop_words=['on', 'off'])

or use a built-in stop word list for a language ('english' is the only one scikit-learn ships):

count_vect = CountVectorizer(stop_words='english')
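
To see how these options change the vocabulary, here is a small sketch. The three sample sentences and the variable names are made up for this comparison.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample documents for illustration
docs = ['Lights on', 'Lights off', 'The lights are on']

default_vec = CountVectorizer().fit(docs)
cased_vec = CountVectorizer(lowercase=False).fit(docs)
custom_vec = CountVectorizer(stop_words=['on', 'off']).fit(docs)
english_vec = CountVectorizer(stop_words='english').fit(docs)

# vocabulary_ maps each kept word to its column index
print(sorted(default_vec.vocabulary_))  # ['are', 'lights', 'off', 'on', 'the']
print(sorted(cased_vec.vocabulary_))    # ['Lights', 'The', 'are', 'lights', 'off', 'on']
print(sorted(custom_vec.vocabulary_))   # ['are', 'lights', 'the']
print(sorted(english_vec.vocabulary_))  # ['lights']
```

With `lowercase=False`, `Lights` and `lights` become two separate features, which is rarely what you want for counting.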

max_df / min_df

The abbreviation df in max_df / min_df stands for document frequency.

  • max_df = 0.75 - ignore terms that appear in more than 75% of the documents.
  • min_df = 0.5 - ignore terms that appear in less than 50% of the documents.

A float in the range [0.0, 1.0] is read as a proportion of the documents.

An integer is read as an absolute number of documents: max_df = 10 means ignore terms that appear in more than 10 documents, and min_df = 10 means ignore terms that appear in fewer than 10 documents.

# keep only terms that appear in at most one document
count_vect = CountVectorizer(max_df=1)
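
A quick way to check how the integer thresholds behave is to fit two vectorizers on the sample text and compare the vocabularies. This is only a sketch; the names `rare_only` and `common_only` are introduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ['Periods of rain', 'Mostly cloudy, a little rain', 'Mostly cloudy',
        'Intervals of clouds and sun', 'Sunshine and mild']

# max_df=1 (integer): drop terms that occur in more than one document
rare_only = CountVectorizer(max_df=1).fit(text)

# min_df=2 (integer): drop terms that occur in fewer than two documents
common_only = CountVectorizer(min_df=2).fit(text)

print(sorted(rare_only.vocabulary_))
# ['clouds', 'intervals', 'little', 'mild', 'periods', 'sun', 'sunshine']
print(sorted(common_only.vocabulary_))
# ['and', 'cloudy', 'mostly', 'of', 'rain']
```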

Step 2: Fit and transform the text data

The next step is to fit and transform the text data to create a bag of words:

bow = vectorizer.fit_transform(df['text'])

This creates a bag of words from the DataFrame column. Printing it lists the non-zero entries (only the first part of the output is shown here):

(0, 8)    1
(0, 7)    1
(0, 9)    1
(1, 9)    1
(1, 6)    1
(1, 2)    1
(1, 4)    1
(2, 6)    1
(2, 2)    1
(3, 7)    1
(3, 3)    1
(3, 1)    1

This is a sparse matrix, where:

  • each row represents a document
  • each column represents a word
  • the values in the matrix represent the number of times that word appears in that document.

So in (0, 8) 1 we have:

  • 0 is the row index of the document
    • the first document
  • 8 is the column index of the feature
    • periods
  • 1 is the count
    • 1 occurrence
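
The mapping between column indexes and words can be checked directly. The sketch below uses the vectorizer's `vocabulary_` attribute and converts the matrix to coordinate form; `index_to_word` and `entries` are names made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ['Periods of rain', 'Mostly cloudy, a little rain', 'Mostly cloudy',
        'Intervals of clouds and sun', 'Sunshine and mild']

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(text)

# vocabulary_ maps each word to its column index
print(vectorizer.vocabulary_['periods'])   # 8

# Invert the mapping so a column index can be turned back into a word
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

# Walk the non-zero entries as (document, word, count) triples
coo = bow.tocoo()
entries = [(int(row), index_to_word[int(col)], int(value))
           for row, col, value in zip(coo.row, coo.col, coo.data)]

for entry in entries:
    print(entry)
```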

Step 3: Get feature names and counts

We can get the word list from the vectorizer by calling the method get_feature_names_out() (the older get_feature_names() was removed in scikit-learn 1.2):

# get count array and features
count_array = bow.toarray()
features = vectorizer.get_feature_names_out()

# create DataFrame as bag of words
df = pd.DataFrame(data=count_array, columns=features)

This will create a DataFrame where:

  • each column is a word
  • each row is a document
  • each value is the number of times the word occurs in the document

   and  clouds  cloudy  intervals  little  mild  mostly  of  periods  rain  sun  sunshine
0    0       0       0          0       0     0       0   1        1     1    0         0
1    0       0       1          0       1     0       1   0        0     1    0         0
2    0       0       1          0       0     0       1   0        0     0    0         0
3    1       1       0          1       0     0       0   1        0     0    1         0
4    1       0       0          0       0     1       0   0        0     0    0         1

Normalize the bag of words

Alternatively, we can use TfidfVectorizer from scikit-learn, which creates a weighted bag of words by:

  • first counting the frequency of each word in each document
  • then scaling each count by the word's inverse document frequency, so words that occur in many documents get less weight
  • finally normalizing each row to unit length (L2 norm by default).

The resulting values are the term frequency-inverse document frequency (TF-IDF) values:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

text = ['small dog', 'cute cat', 'cute dog', 'cat']
data = {'Text':text}
df = pd.DataFrame(data)

vectorizer = TfidfVectorizer()
bow = vectorizer.fit_transform(df['Text'])

The result is a sparse matrix of TF-IDF weights:


(0, 2)    0.6191302964899972
(0, 3)    0.7852882757103967
(1, 0)    0.7071067811865475
(1, 1)    0.7071067811865475
(2, 1)    0.7071067811865475
(2, 2)    0.7071067811865475
(3, 0)    1.0

Getting bag of words as a DataFrame with normalized values:

count_array = bow.toarray()
features = vectorizer.get_feature_names_out()
df = pd.DataFrame(data=count_array, columns=features)

        cat      cute       dog     small
0  0.000000  0.000000  0.619130  0.785288
1  0.707107  0.707107  0.000000  0.000000
2  0.000000  0.707107  0.707107  0.000000
3  1.000000  0.000000  0.000000  0.000000
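
To see where these numbers come from, the sketch below recomputes them by hand with NumPy and compares the result with TfidfVectorizer. It assumes the vectorizer's default settings (smoothed IDF, L2 normalization); the variable names are introduced for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

text = ['small dog', 'cute cat', 'cute dog', 'cat']

# Raw term counts, one row per document
counts = CountVectorizer().fit_transform(text).toarray()
n_docs = counts.shape[0]

# Number of documents that contain each term
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit L2 length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(text).toarray()
print(np.allclose(manual, reference))   # True
```

In the first document, `small` occurs in only one document and `dog` in two, which is why `small` gets the larger weight (0.785 against 0.619).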


To summarize, in this article, we've seen examples of bags of words. We've briefly covered what a bag of words is and how to create it with Python, Pandas and scikit-learn.

And finally, we've seen how to normalize a bag of words with scikit-learn.