How To Make a Fake Data Set in Python and Pandas

1. Overview

Do you want to create a test data set with fake data in Python and Pandas? Pandas in combination with Faker ease creation of test DataFrames with fake and safe for sharing data.

Final output will be DataFrame which can be exported to:

CSV / Excel
JSON
XML
SQL inserts
or text file

High quality test data might be crucial for the success of a given product. On the other hand, using sensitive data might cause legal issues. So the way to go is to create high quality fake data in Pandas and Python.

2. Setup

First we need install additional library - Faker:

by:

pip install Faker

This library can generate data by several providers:

standard - few examples below
- bank
- addresses
- person
community
- music
- vehicle
locales
- Locale en_US
  - faker.providers.address
  - faker.providers.automotive
- Locale es
  - faker.providers.address
  - faker.providers.person

3. Generate data with Faker

We are going to cover the most popular use case of Faker. This step will explain different techniques in detail. To generate good data you need to apply one or several from them.

3.1. Locales - country specific data

Faker uses locales in order to generate country specific data like names and addresses. Generating Italian names:

from faker import Faker
fake = Faker('it_IT')
for _ in range(5):
    print(fake.name())

result:

Liliana Vespa
Sig.ra Vanessa Malacarne
Gemma Fioravanti
Vittorio Antonello
Sandro Trebbi

You can generate data for multiple locales - Italy, US and Japan:

from faker import Faker
fake = Faker(['it_IT', 'en_US', 'ja_JP'])
for _ in range(10):
    print(fake.name())

this will result into:

高橋 聡太郎
藤井 桃子
Aurora Scaramucci
Sarah Beltran
Rosina Durante
Sig.ra Melina Morgagni
佐藤 修平
林 修平
Charles Marks
Lazzaro Ammaniti

3.2. Dynamic Providers - custom data

Faker offers an elegant way of generating data from a custom set of elements. Let say that you like to generate fake skills from next set of words:

"Python"
"Pandas"
"Linux"

Dynamic providers are the Faker way to generate custom data:

skill_provider = DynamicProvider(
     provider_name="skills",
     elements=["Python", "Pandas", "Linux", "SQL", "Data Mining"],
)


fake = Faker('en_US')
fake.add_provider(skill_provider)

fake.skills()

3.3. Numbers and ranges

Generating numbers and ranges is fine with pure Python. The below examples show how to generate random integer between: 0 and 15:

random.randint(0,15)

and random numbers in range with steps:

random.randrange(75000,150000, 5000)

3.4. Dates

For dates we can use the Faker methods - date(). It accepts format, start and end time:

fake.date(pattern="%Y-%m-%d", end_datetime=datetime.date(1995, 1,1))

3.5. Correlated Data

Generation of correlated data is very important for test quality. What does it mean correlated fake data? Generating pairs of country and city should be correlated:

US and Paris - not correlated
US and NY - correlated

The same is for:

names and email
birth date and experience
nationality and native language
experience and salary

Locales

In most cases custom solutions are required in order to get meaningful data. For example by using locales:

from faker import Faker

en_us_faker = Faker('en_US')
it_it_fake = Faker('it_IT')

print(f'{en_us_faker.city()}, USA')
print(f'{it_it_fake.city()}, Italy')

result would be:

East Jonathanbury, USA
Biagio calabro, Italy

Customization

Another option is by using customization. They are described here: Faker Customization

4. Define the output list of fields

Next we should define what fields we need for the final data set.

In this article we are going to generate personal data - employees with names, skills and salaries.

The list of fields for the final data set is:

first name
last name
birth date
email
experience
start year
salary
main skill
nationality
city

Add description for each field if needed like:

first name - Italian and American names
birth date - dates between 1990 and 2000

5. Create DataFrame with Fake Data

Finally let's check how to generate the fake data and then stored it to a DataFrame. The code below will generate 50 records of personal fake data:

import csv
import pandas as pd
from faker import Faker
import datetime
import random
from faker.providers import DynamicProvider

skill_provider = DynamicProvider(
     provider_name="skills",
     elements=["Python", "Pandas", "Linux", "SQL", "Data Mining"],
)


def fake_data_generation(records):
    fake = Faker('en_US')
    
    employee = []
    
    fake.add_provider(skill_provider)

    for i in range(records):
        first_name = fake.first_name()
        last_name = fake.last_name()



        employee.append({
                "First Name": first_name,
                "Last Name": last_name,
                "Birth Date" : fake.date(pattern="%Y-%m-%d", end_datetime=datetime.date(1995, 1,1)),
                "Email": str.lower(f"{first_name}.{last_name}@fake_domain-2.com"),
                "Hobby": fake.word(),
                "Experience" : random.randint(0,15),
                "Start Year": fake.year(),
                "Salary": random.randrange(75000,150000, 5000),
                "City" : fake.city(),
                "Nationality" : fake.country(),
                "Skill": fake.skills()
                })
        
    return employee


df = pd.DataFrame(fake_data_generation(50))

result(output is transposed for redness purpose):

	0	1
First Name	Christopher	April
Last Name	Davis	Collins
Birth Date	1977-09-14	1974-06-08
Email	christopher.davis@fake_domain-2.com	april.collins@fake_domain-2.com
Hobby	wear	family
Experience	0	6
Start Year	1991	1994
Salary	100000	85000
City	Charleschester	East Josephchester
Nationality	Panama	Cambodia
Skill	Pandas	Python

Now the DataFrame can be easily exported to CSV - to_csv etc.

The final result is visible on the image below:

6. Create big CSV file with Fake Data

As a bonus we can see how to** generate huge data sets with fake data**. This time we are going to write directly to a CSV file for performance sake:

import csv
from faker import Faker
import datetime
import random
from faker.providers import DynamicProvider

skill_provider = DynamicProvider(
     provider_name="skills",
     elements=["Python", "Pandas", "Linux", "SQL", "Data Mining"],
)


def fake_data_generation(records, headers):
    fake = Faker('en_US')
    
    fake.add_provider(skill_provider)

    
    with open("employee.csv", 'wt') as csvFile:
        writer = csv.DictWriter(csvFile, fieldnames=headers)
        writer.writeheader()
        for i in range(records):
            first_name = fake.first_name()
            last_name = fake.last_name()
            
            
            
            print({
                    "First Name": first_name,
                    "Last Name": last_name,
                    "Birth Date" : fake.date(pattern="%Y-%m-%d", end_datetime=datetime.date(1995, 1,1)),
                    "Email": str.lower(f"{first_name}.{last_name}@fake_domain-2.com"),
                    "Hobby": fake.word(),
                    "Experience" : random.randint(0,15),
                    "Start Year": fake.year(),
                    "Salary": random.randrange(75000,150000, 5000),
                    "City" : fake.city(),
                    "Nationality" : fake.country(),
                    "Skill": fake.skills()
                    })
    

number_records = 100
fields = ["First Name", "Last Name", "Birth Date", "Email", "Hobby", "Experience",
           "Start Year", "Salary", "City", "Nationality", "Skill"]

fake_data_generation(number_records, fields)

The output will be employee.csv in the current working folder. We providing the headers and the number of the records:

number_records = 100
fields = ["First Name", "Last Name", "Birth Date", "Email", "Hobby", "Experience",
           "Start Year", "Salary", "City", "Nationality", "Skill"]

7. Conclusion

So, we've taken a look into fake data generation; exploring the basics and its basic use.

We've taken a look at some details like correlated data and quality fake data.

Finally, we saw how to export the saved data to different formats and optimized the process.

The code is available on GitHub.