1. Overview
In this quick tutorial, we'll look at how to create dummy DataFrames for tests.
We'll cover several ways to build DataFrames with data suited to different kinds of tests.
2. Fake Realistic DataFrame
To create a DataFrame with realistic fake data, please refer to this article:
How To Make a Fake Data Set in Python and Pandas
3. Empty DataFrame
Creating an empty DataFrame for tests is covered in: https://datascientyst.com/create-empty-dataframe-pandas/
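As a quick illustration, here is a minimal sketch of two common ways to build an empty DataFrame with the public API. The column names are placeholders chosen for this example, not taken from the linked article:

```python
import pandas as pd

# Empty frame with named columns but no rows
empty_df = pd.DataFrame(columns=["A", "B", "C"])

# Empty frame with an explicit dtype per column
# (the column names here are hypothetical)
typed_empty_df = pd.DataFrame(
    {
        "id": pd.Series(dtype="int64"),
        "created": pd.Series(dtype="datetime64[ns]"),
    }
)

print(empty_df.shape)
print(typed_empty_df.dtypes)
```

The typed variant is handy when the code under test checks column dtypes even for empty input.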
4. Dummy DataFrame with Random Numbers
DataFrame creation with random numbers is covered in this post: https://datascientyst.com/how-to-create-a-dataframe-of-random-integers-with-pandas/
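For a rough idea of what this looks like, the sketch below builds a small DataFrame of random integers with NumPy's generator API. The seed, value range, and shape are assumptions picked for the example:

```python
import numpy as np
import pandas as pd

# A fixed seed keeps the test data reproducible between runs
rng = np.random.default_rng(seed=42)

# 5 rows x 4 columns of integers drawn from the range [0, 100)
random_df = pd.DataFrame(
    rng.integers(low=0, high=100, size=(5, 4)),
    columns=list("ABCD"),
)
print(random_df)
```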
5. Dummy DataFrame with Time Series data
It's possible to create a Series or a DataFrame with time series data for tests. Both have a datetime index and numeric values.
5.1. Series
To create a time series with dummy data, we can use the method makeTimeSeries:
import pandas as pd
from pandas.util.testing import makeTimeSeries
df = makeTimeSeries()
df.head()
result:
2000-01-03 -2.066624
2000-01-04 -0.897585
2000-01-05 -0.458669
2000-01-06 1.038565
2000-01-07 -0.058897
Freq: B, dtype: float64
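The resulting Series contains 30 values.
If you get `FutureWarning: pandas.util.testing is deprecated` or an import error, note that the public `pandas.testing` module does not expose these helpers. In pandas versions that still ship them, they can be imported from the private `pandas._testing` module instead; newer releases have removed them.
To learn more about these hidden methods, you can search for them in the Pandas repo, for example: makeMissingDataframe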
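If the helper is not available in your pandas version, a similar Series can be built with the public API. This is a minimal sketch and not the helper's own implementation; the start date and length are taken from the output above, and the seed is an assumption:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# 30 business days starting on 2000-01-03, as in the output above
index = pd.bdate_range(start="2000-01-03", periods=30)

# Normally distributed float values on a business-day DatetimeIndex
ts = pd.Series(rng.standard_normal(len(index)), index=index)
print(ts.head())
```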
5.2. DataFrame
We can create a dummy DataFrame with time series data by using the hidden method (not in the official docs) makeTimeDataFrame:
import pandas as pd
from pandas.util.testing import makeTimeDataFrame
df = makeTimeDataFrame()
df.head()
result:
| | A | B | C | D |
|---|---|---|---|---|
| 2000-01-03 | 0.328536 | -0.356004 | 0.913654 | -0.778496 |
| 2000-01-04 | -0.501882 | -0.476268 | -1.426332 | 1.342977 |
| 2000-01-05 | -0.816284 | 0.028936 | -0.890490 | -0.531186 |
| 2000-01-06 | 0.806613 | -0.364285 | -0.285627 | -1.083140 |
| 2000-01-07 | -0.250598 | 0.223422 | 0.760379 | 0.220205 |
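A comparable DataFrame can also be put together without the hidden helper. The sketch below assumes 30 rows and four columns named A to D, matching the output shown above; the seed is arbitrary:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Business-day DatetimeIndex with 30 entries
index = pd.bdate_range(start="2000-01-03", periods=30)

# Four columns of normally distributed floats
ts_df = pd.DataFrame(
    rng.standard_normal((len(index), 4)),
    index=index,
    columns=list("ABCD"),
)
print(ts_df.head())
```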
6. Dummy DataFrame with missing data
What about a DataFrame with missing data? Pandas has a method for that case: makeMissingDataframe.
This method generates a DataFrame with 4 columns of numeric data, with missing values at random positions:
import pandas as pd
from pandas.util.testing import makeMissingDataframe
df = makeMissingDataframe()
df.shape
This results in:
(30, 4)
And the data looks like this:
| | A | B | C | D |
|---|---|---|---|---|
| PlqBxeS6rd | 0.777225 | NaN | 2.474689 | -0.636386 |
| SwbHSSz1EM | -0.281544 | -0.900471 | -1.284081 | NaN |
| d59w8ixnJO | 1.205294 | 0.170578 | -0.930505 | -0.095696 |
| D5stwhgvIN | 0.037623 | -1.088020 | 0.058592 | 0.371408 |
| fdIzwVg1SY | NaN | -1.294436 | 1.019611 | -1.128139 |
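The same idea can be sketched with NumPy and the public pandas API. This is an approximation rather than the helper's implementation: it uses a default integer index instead of random strings, and the 10% share of missing cells is an assumption chosen for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

n_rows, n_cols = 30, 4
values = rng.standard_normal((n_rows, n_cols))

# Pick 10% of the cells (12 of 120) at random, without repetition
n_missing = values.size // 10
missing_positions = rng.choice(values.size, size=n_missing, replace=False)

# Overwrite the chosen cells with NaN using flat (1-D) positions
values.flat[missing_positions] = np.nan

missing_df = pd.DataFrame(values, columns=list("ABCD"))
print(missing_df.isna().sum())
```

Choosing the positions without replacement makes the number of missing cells exact, which is convenient when a test asserts on NaN counts.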
7. Dummy DataFrame with mixed data
To create a DataFrame with mixed data types (floats, strings and dates), use the following code:
import pandas as pd
from pandas.util.testing import makeMixedDataFrame
df = makeMixedDataFrame()
df.head()
result:
| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | foo1 | 2009-01-01 |
| 1 | 1.0 | 1.0 | foo2 | 2009-01-02 |
| 2 | 2.0 | 0.0 | foo3 | 2009-01-05 |
| 3 | 3.0 | 1.0 | foo4 | 2009-01-06 |
| 4 | 4.0 | 0.0 | foo5 | 2009-01-07 |
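Since this frame is small and fixed, it is easy to reproduce by hand. The sketch below rebuilds the values shown in the table above with the public API; it is written from that output, not copied from the helper's source:

```python
import pandas as pd

mixed_df = pd.DataFrame(
    {
        "A": [0.0, 1.0, 2.0, 3.0, 4.0],                   # floats
        "B": [0.0, 1.0, 0.0, 1.0, 0.0],                   # alternating floats
        "C": ["foo1", "foo2", "foo3", "foo4", "foo5"],    # strings
        "D": pd.bdate_range(start="2009-01-01", periods=5),  # business days
    }
)
print(mixed_df.dtypes)
```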
8. Dummy DataFrame with periods
Finally, let's cover the case when we need a DataFrame indexed by periods. The Pandas testing helper makePeriodFrame will help us:
import pandas as pd
from pandas.util.testing import makePeriodFrame
df = makePeriodFrame()
df.head()
result:
| | A | B | C | D |
|---|---|---|---|---|
| 2000-01-03 | 0.962437 | -0.097855 | -1.304872 | -1.364205 |
| 2000-01-04 | 0.829495 | -1.404300 | -0.992702 | 0.562428 |
| 2000-01-05 | -0.012089 | 1.445668 | -0.108887 | -1.282800 |
| 2000-01-06 | 0.301585 | 0.661285 | -0.250997 | 0.570749 |
| 2000-01-07 | 0.462778 | 1.721186 | -1.315672 | 0.201622 |
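A period-indexed frame can be sketched with `pd.period_range` as well. Note one deliberate difference: this example assumes a daily frequency ("D") rather than the business-day frequency the helper produces, so the index will include weekends:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# 30 daily periods starting on 2000-01-03 (daily frequency is an
# assumption of this sketch; the helper uses business-day periods)
index = pd.period_range(start="2000-01-03", periods=30, freq="D")

period_df = pd.DataFrame(
    rng.standard_normal((len(index), 4)),
    index=index,
    columns=list("ABCD"),
)
print(period_df.head())
```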
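If you're wondering what the difference between the two methods is, compare the index they produce:
makePeriodFrame() - index with dtype='period[B]'
makeTimeDataFrame() - index with dtype='datetime64[ns]', freq='B'
The type of the index is the answer: periods in the first case, timestamps in the second.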
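To see the two index types side by side, the following sketch builds a small timestamp-indexed frame and converts it to periods and back. The frame contents and the daily period frequency are assumptions made for the example:

```python
import pandas as pd

# Timestamp-based index on business days
dt_index = pd.date_range(start="2000-01-03", periods=5, freq="B")
df = pd.DataFrame({"A": range(5)}, index=dt_index)

# DatetimeIndex -> PeriodIndex (daily periods)
as_periods = df.to_period("D")

# PeriodIndex -> DatetimeIndex (start of each period)
back = as_periods.to_timestamp()

print(type(df.index).__name__, df.index.dtype)
print(type(as_periods.index).__name__, as_periods.index.dtype)
```

Converting between the two is often enough when a test needs both variants of the same data.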
9. Conclusion
So, we've looked at several ways of creating dummy DataFrames for test purposes.
We've also covered some special cases, such as missing data, mixed types and period indexes, which can speed up writing tests. High-quality dummy data is essential in the early phase of Data Science projects.