1. Overview

In this quick tutorial, we'll discuss how to create a DataFrame for tests, also known as a dummy DataFrame.

We'll look at several ways to create DataFrames with data suited to different kinds of tests.

2. Fake Realistic DataFrame

To create a DataFrame with realistic fake data, please refer to this article:
How To Make a Fake Data Set in Python and Pandas

3. Empty DataFrame

Creating an empty DataFrame for tests is covered in: https://datascientyst.com/create-empty-dataframe-pandas/
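
As a quick reminder, here is a minimal sketch that uses only the public DataFrame constructor. The column names are made up for illustration and are not taken from the linked article:

```python
import pandas as pd

# Hypothetical column names, chosen only for this example
df_empty = pd.DataFrame(columns=["id", "name", "score"])

print(df_empty.shape)  # number of rows and columns
print(df_empty.empty)  # True when the frame has no rows
```

Such a frame is handy for checking that a function copes with input that has the expected columns but no rows.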

4. Dummy DataFrame with Random Numbers

Creating a DataFrame with random numbers is covered in this post: https://datascientyst.com/how-to-create-a-dataframe-of-random-integers-with-pandas/
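
For reference, one possible sketch with NumPy's random generator. The seed, the shape and the value range are assumptions picked for this example, not values from the linked post:

```python
import numpy as np
import pandas as pd

# Seeded generator so the test data is reproducible between runs
rng = np.random.default_rng(seed=0)

# 10 rows x 4 columns of integers in the half-open range [0, 100)
df_rand = pd.DataFrame(rng.integers(0, 100, size=(10, 4)), columns=list("ABCD"))

print(df_rand.head())
```

Fixing the seed keeps failing tests repeatable, which is usually what we want from dummy data.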

5. Dummy DataFrame with Time Series data

It's possible to create a Series or a DataFrame with time series data for tests. Both have a date-based index and numeric values.

5.1. Series

To create a time series with dummy data, we can use the method makeTimeSeries:

import pandas as pd
from pandas.util.testing import makeTimeSeries

df = makeTimeSeries()
df.head()

result:

2000-01-03   -2.066624
2000-01-04   -0.897585
2000-01-05   -0.458669
2000-01-06    1.038565
2000-01-07   -0.058897
Freq: B, dtype: float64

By default, 30 values are created.
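
If the hidden helper is not available in your installation, a similar Series can be sketched with the public API only. This is an approximation that mimics the output above (30 business days starting on 2000-01-03, random floats); it is not the helper's actual implementation, and the seed is an arbitrary choice:

```python
import numpy as np
import pandas as pd

# Arbitrary seed, used only to make the example reproducible
rng = np.random.default_rng(seed=42)

# 30 business days starting on Monday 2000-01-03, as in the output above
index = pd.date_range("2000-01-03", periods=30, freq="B")

# Normally distributed floats as the dummy values
ts = pd.Series(rng.standard_normal(len(index)), index=index)

print(ts.head())
```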

Note:

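If you get `FutureWarning: pandas.util.testing is deprecated` or an import error, you can replace `pandas.util.testing` with `pandas._testing`. The public `pandas.testing` module only exposes the assert helpers (such as assert_frame_equal), not the make* methods. Because `pandas._testing` is private, the make* methods may change or be missing in newer Pandas versions.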

To learn more about the hidden methods in Pandas, you can search for them in the Pandas repo, for example: makeMissingDataframe
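
Since the location of these helpers depends on the Pandas version, a defensive import can keep a test suite from breaking on upgrade. The sketch below assumes the helper lives either in the private `pandas._testing` module or in the older `pandas.util.testing` module, and falls back to None when neither provides it:

```python
# The helper is private, so try the assumed locations and fall back to None
try:
    from pandas._testing import makeTimeSeries  # private module (assumed location)
except ImportError:
    try:
        from pandas.util.testing import makeTimeSeries  # older, deprecated location
    except ImportError:
        makeTimeSeries = None  # helper not available in this installation

if makeTimeSeries is not None:
    print(makeTimeSeries().head())
else:
    print("makeTimeSeries is unavailable; build the Series by hand instead")
```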

5.2. DataFrame

We can create a dummy DataFrame with time series data by using the hidden method (not in the official docs) makeTimeDataFrame, which returns a DataFrame with a datetime index:

import pandas as pd
from pandas.util.testing import makeTimeDataFrame

df = makeTimeDataFrame()
df.head()

result:

                   A         B         C         D
2000-01-03  0.328536 -0.356004  0.913654 -0.778496
2000-01-04 -0.501882 -0.476268 -1.426332  1.342977
2000-01-05 -0.816284  0.028936 -0.890490 -0.531186
2000-01-06  0.806613 -0.364285 -0.285627 -1.083140
2000-01-07 -0.250598  0.223422  0.760379  0.220205

Note:

If you get `FutureWarning: pandas.util.testing is deprecated` or an import error, you can replace `pandas.util.testing` with `pandas._testing`. The public `pandas.testing` module does not contain the make* methods.
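
A frame with the same shape as the output above (business-day index, columns A to D, random floats) can also be sketched with the public API. The seed and the number of rows are assumptions made for this example:

```python
import numpy as np
import pandas as pd

# Arbitrary seed, used only to make the example reproducible
rng = np.random.default_rng(seed=42)

# Business-day index starting on 2000-01-03; 30 rows is an assumed size
index = pd.date_range("2000-01-03", periods=30, freq="B")

# Four float columns named A, B, C and D
df_ts = pd.DataFrame(
    rng.standard_normal((len(index), 4)),
    index=index,
    columns=list("ABCD"),
)

print(df_ts.head())
```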

6. Dummy DataFrame with missing data

What about a DataFrame with missing data? Pandas has a method for that case: makeMissingDataframe.

This method generates a DataFrame with 4 columns of numeric data, with missing values at random positions:

import pandas as pd
from pandas.util.testing import makeMissingDataframe

df = makeMissingDataframe()
df.shape

This results in:

(30, 4)

And data will look like:

                   A         B         C         D
PlqBxeS6rd  0.777225       NaN  2.474689 -0.636386
SwbHSSz1EM -0.281544 -0.900471 -1.284081       NaN
d59w8ixnJO  1.205294  0.170578 -0.930505 -0.095696
D5stwhgvIN  0.037623 -1.088020  0.058592  0.371408
fdIzwVg1SY       NaN -1.294436  1.019611 -1.128139
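
The same idea can be sketched by hand with a random boolean mask. The 10% share of missing cells and the seed are assumptions for this example, and the frame uses a default integer index instead of random strings:

```python
import numpy as np
import pandas as pd

# Arbitrary seed, used only to make the example reproducible
rng = np.random.default_rng(seed=1)

# 30 rows x 4 columns of random floats
df_missing = pd.DataFrame(rng.standard_normal((30, 4)), columns=list("ABCD"))

# Boolean array that is True for roughly 10% of the cells (assumed density)
nan_mask = rng.random(df_missing.shape) < 0.1

# DataFrame.mask replaces the cells where the condition is True with NaN
df_missing = df_missing.mask(nan_mask)

# Number of missing values per column
print(df_missing.isna().sum())
```

Controlling the share of missing cells yourself makes it easier to test both sparse and heavily incomplete inputs.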

7. Dummy DataFrame with mixed data

To create a DataFrame with mixed test data (floats, strings and dates), use the following code:

import pandas as pd
from pandas.util.testing import makeMixedDataFrame

df = makeMixedDataFrame()
df.head()

result:

     A    B     C          D
0  0.0  0.0  foo1 2009-01-01
1  1.0  1.0  foo2 2009-01-02
2  2.0  0.0  foo3 2009-01-05
3  3.0  1.0  foo4 2009-01-06
4  4.0  0.0  foo5 2009-01-07
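
A frame with the values shown above can also be written out explicitly with the public API, which is useful when the hidden helper is unavailable. This is a hand-written sketch that copies the displayed output, not the helper's source code:

```python
import pandas as pd

# Same values as in the output above: two float columns,
# one string column and one column of business days
df_mixed = pd.DataFrame(
    {
        "A": [0.0, 1.0, 2.0, 3.0, 4.0],
        "B": [0.0, 1.0, 0.0, 1.0, 0.0],
        "C": ["foo1", "foo2", "foo3", "foo4", "foo5"],
        "D": pd.bdate_range("2009-01-01", periods=5),
    }
)

# One dtype per column shows how the types are mixed
print(df_mixed.dtypes)
```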

8. Dummy DataFrame with periods

Finally, let's cover the case when we need a DataFrame indexed by periods instead of timestamps. The Pandas testing method makePeriodFrame will help us:

import pandas as pd
from pandas.util.testing import makePeriodFrame

df = makePeriodFrame()
df.head()

result:

                   A         B         C         D
2000-01-03  0.962437 -0.097855 -1.304872 -1.364205
2000-01-04  0.829495 -1.404300 -0.992702  0.562428
2000-01-05 -0.012089  1.445668 -0.108887 -1.282800
2000-01-06  0.301585  0.661285 -0.250997  0.570749
2000-01-07  0.462778  1.721186 -1.315672  0.201622

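If you wonder what the difference between the two methods is:

  • makePeriodFrame() - the index is a PeriodIndex (dtype='period[B]')
  • makeTimeDataFrame() - the index is a DatetimeIndex (dtype='datetime64[ns]', freq='B')

The type of the index is the answer. The columns hold floats in both cases.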
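
The difference is easy to inspect on two small frames built with the public API. In this sketch the period index uses daily frequency ("D") instead of business-day frequency, which is an assumption made to keep the example portable across Pandas versions:

```python
import numpy as np
import pandas as pd

# Arbitrary seed; both frames share the same values
rng = np.random.default_rng(seed=7)
values = rng.standard_normal((5, 4))

# Timestamps: each label is a point in time
time_index = pd.date_range("2000-01-03", periods=5, freq="B")
df_time = pd.DataFrame(values, index=time_index, columns=list("ABCD"))

# Periods: each label is a span of time (here one day, an assumed frequency)
period_index = pd.period_range("2000-01-03", periods=5, freq="D")
df_period = pd.DataFrame(values, index=period_index, columns=list("ABCD"))

print(type(df_time.index).__name__, df_time.index.dtype)
print(type(df_period.index).__name__, df_period.index.dtype)
```

Printed side by side, the two frames look the same; only the index type reveals which one we have.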

Conclusion

So, we've taken a quick look at creating dummy DataFrames for test purposes.

We've also covered some special cases, such as missing, mixed and time-based data, which will speed up writing your tests. High-quality dummy data is essential in the early phase of Data Science projects.