Pandas random sampling: stratified and weighted

In this quick tutorial, we're going to discuss stratified sampling in Pandas and Python.

The following syntax can be used for stratified sampling in Pandas:

(1) stratified sampling - disproportionate

(df
	.groupby('continent', group_keys=False)
	.apply(lambda x: x.sample(2))
)

(2) stratified sampling - proportional

(df
	.groupby('continent', group_keys=False)
	.apply(lambda x: x.sample(frac=0.1))
)

The image below illustrates the technique called stratified sampling:

Next, you'll see the steps to do stratified sampling in practice.

Setup

First, let's create a sample DataFrame:

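import plotly.express as px
# Gapminder data for a single year: one row per country
df = px.data.gapminder().query("year == 2007")
cols = df.columns[:4]
df = df[cols].copy()
# Assumed step: the output below has a fifth column, freq, equal to the size of each row's continent
df['freq'] = df.groupby('continent')['continent'].transform('count')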

The data has the following shape:

(142, 5)

The first rows of this DataFrame:

	country	continent	year	lifeExp	freq
11	Afghanistan	Asia	2007	43.828	33
23	Albania	Europe	2007	76.423	30
35	Algeria	Africa	2007	72.301	52
47	Angola	Africa	2007	42.731	52
59	Argentina	Americas	2007	75.320	25

Separate population into strata

The whole population of this dataset is 142 countries.

Below you can find the number of countries in each stratum:

df['continent'].value_counts()

The result is:

Africa  	52
Asia    	33
Europe  	30
Americas	25
Oceania  	2
Name: continent, dtype: int64
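The counts above can also be expressed as shares of the population, which is what proportional sampling tries to preserve. The sketch below is a minimal illustration on a toy DataFrame: the country names are placeholders and the variable names toy, sizes and shares are assumptions, only the continent sizes are taken from the output above.

```python
import pandas as pd

# Toy stand-in for the tutorial data: placeholder country names,
# continent sizes copied from the value_counts() output above.
sizes = {"Africa": 52, "Asia": 33, "Europe": 30, "Americas": 25, "Oceania": 2}
toy = pd.DataFrame(
    [(f"{continent}_{i}", continent) for continent, n in sizes.items() for i in range(n)],
    columns=["country", "continent"],
)

# normalize=True returns each stratum's share of the population instead of a count
shares = toy["continent"].value_counts(normalize=True)
print(shares.round(3))
```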

Find the sample size

Next we need to decide how large the sample should be. There are different strategies for this.

These are the options available in the Pandas sample() method:

n - number of items to return; we get n random items
frac - fraction of items to return
weights - probability weighting

So we can choose the size of:

the total sample, as a percentage of the whole population
each group
- disproportionate
- proportional
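To make the three parameters concrete, here is a small sketch on a hypothetical ten-row DataFrame (the column names item and w are made up for illustration). It assumes the weights are given as a column name, which sample() accepts when called on a DataFrame.

```python
import pandas as pd

# Hypothetical data: ten items, the last one has weight 0
toy = pd.DataFrame({
    "item": list("abcdefghij"),
    "w": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
})

# n: an exact number of rows
by_n = toy.sample(n=3, random_state=0)

# frac: a fraction of the rows (0.5 of 10 rows gives 5 rows)
by_frac = toy.sample(frac=0.5, random_state=0)

# weights: relative selection probabilities; a row with weight 0 is never drawn
by_weight = toy.sample(n=5, weights="w", random_state=0)

print(len(by_n), len(by_frac), sorted(by_weight["item"]))
```

The random_state argument only makes the draw repeatable; drop it to get a different sample on every call.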

Disproportional stratified sampling

In this approach the size of each sample group is not proportional to the size of that group in the population.

We will get an equal number of items for each group:

(df
	.groupby('continent', group_keys=False)
	.apply(lambda x: x.sample(2))
)

The result contains 2 rows per stratum, no matter the size of each group:

	country	continent	year	lifeExp	freq
911	Libya	Africa	2007	73.952	52
899	Liberia	Africa	2007	45.678	52
443	Dominican Republic	Americas	2007	72.235	25
791	Jamaica	Americas	2007	72.567	25
1679	Yemen, Rep.	Asia	2007	62.698	33
1319	Saudi Arabia	Asia	2007	72.777	33
779	Italy	Europe	2007	80.546	30
1607	United Kingdom	Europe	2007	79.425	30
1103	New Zealand	Oceania	2007	80.204	2
71	Australia	Oceania	2007	81.235	2
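As an alternative sketch, the same kind of fixed-size sample per group can be drawn without apply, using the sample() method of the groupby object. This assumes Pandas 1.1 or newer; the toy DataFrame below uses placeholder country names and only mirrors the continent sizes of the tutorial data.

```python
import pandas as pd

# Toy stand-in for the tutorial data (placeholder country names)
sizes = {"Africa": 52, "Asia": 33, "Europe": 30, "Americas": 25, "Oceania": 2}
toy = pd.DataFrame(
    [(f"{continent}_{i}", continent) for continent, n in sizes.items() for i in range(n)],
    columns=["country", "continent"],
)

# Draw 2 rows from every continent; random_state makes the draw repeatable
fixed = toy.groupby("continent").sample(n=2, random_state=42)

print(fixed["continent"].value_counts())
```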

Proportional stratified sampling

Here we take a random sample from each stratum, with a size proportional to that stratum's share of the population.

We can do proportional stratified sampling in Pandas by sampling each group with x.sample(frac=0.1):

(df
	.groupby('continent', group_keys=False)
	.apply(lambda x: x.sample(frac=0.1))
)

This will give us countries proportioned to the initial population:

	country	continent	year	lifeExp	freq
491	Equatorial Guinea	Africa	2007	51.579	52
335	Congo, Dem. Rep.	Africa	2007	46.462	52
1691	Zambia	Africa	2007	42.384	52
131	Benin	Africa	2007	56.728	52
1571	Tunisia	Africa	2007	73.923	52
443	Dominican Republic	Americas	2007	72.235	25
1643	Venezuela	Americas	2007	73.747	25
815	Jordan	Asia	2007	72.535	33
1655	Vietnam	Asia	2007	74.249	33
875	Lebanon	Asia	2007	71.993	33
1607	United Kingdom	Europe	2007	79.425	30
1091	Netherlands	Europe	2007	79.762	30
539	France	Europe	2007	80.657	30

Africa is the most represented continent, while Oceania is missing from this sample: it has only 2 countries, and 10% of 2 rounds down to 0 rows.
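If every stratum must appear in the sample, one possible workaround is to compute the group size explicitly and enforce a minimum of one row. The helper sample_at_least_one below is hypothetical, not part of Pandas, and the toy DataFrame only mirrors the continent sizes of the tutorial data with placeholder country names.

```python
import pandas as pd

# Toy stand-in for the tutorial data (placeholder country names)
sizes = {"Africa": 52, "Asia": 33, "Europe": 30, "Americas": 25, "Oceania": 2}
toy = pd.DataFrame(
    [(f"{continent}_{i}", continent) for continent, n in sizes.items() for i in range(n)],
    columns=["country", "continent"],
)

def sample_at_least_one(group, frac=0.1, seed=0):
    """Sample roughly frac of the rows of group, but never fewer than one row."""
    n = max(1, round(len(group) * frac))
    return group.sample(n=n, random_state=seed)

# Sample each continent separately and glue the pieces back together
parts = [sample_at_least_one(group) for _, group in toy.groupby("continent")]
sample = pd.concat(parts)

print(sample["continent"].value_counts())
```

The trade-off is that tiny strata such as Oceania end up slightly over-represented compared with a strict 10% sample.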

Color row based on column value

If you would like to learn how to style each group in a different color, see: color Pandas DataFrame based on value

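import pandas as pd

def format_color_groups(df):
    # Return a DataFrame of CSS strings with the same shape as df,
    # alternating two background colors between consecutive continents
    colors = ['gold', 'lightblue']
    styles = pd.DataFrame('', index=df.index, columns=df.columns)
    for i, factor in enumerate(df['continent'].unique()):
        style = f'background-color: {colors[i % 2]}'
        styles.loc[df['continent'] == factor, :] = style
    return styles

# d1 is assumed to hold the sampled DataFrame from the previous step
d1.style.apply(format_color_groups, axis=None)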

The result is a table in which the rows of consecutive continents alternate between gold and light blue.

Group by sample weights in Pandas

We can also use weights to control how each stratum is represented in a sample. First we need to calculate the weight of each row from the size of its stratum:

Calc weights - groupby + .transform('count')

df['weight'] = (df
 	.groupby('continent')
 	.country
 	.transform('count')
)

As a result we have a Pandas Series which contains the weight for each row:

11  	33
23  	30
35  	52
47  	52
59  	25
    	..
1655	33
1667	33
1679	33
1691	52
1703	52
Name: country, Length: 142, dtype: int64

Calc weights - equal representation

To get equal representation of each group, we can set each row's weight to the inverse of its group size, so that the weights of every continent sum to 1:

df['weight'] = 1./(df
 	.groupby('continent')
 	.country
 	.transform('count')    
)

This gives the disproportional rates:

11  	0.030303
23  	0.033333
35  	0.019231
47  	0.019231
59  	0.040000
      	...   
1655	0.030303
1667	0.030303
1679	0.030303
1691	0.019231
1703	0.019231
Name: weight, Length: 142, dtype: float64
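A quick way to check the inverse weights is to sum them per continent and then draw a large weighted sample. The sketch below uses a toy DataFrame with placeholder country names and the continent sizes from the tutorial; sampling with replace=True and n=1000 is an assumption made only to get stable counts.

```python
import pandas as pd

# Toy stand-in for the tutorial data (placeholder country names)
sizes = {"Africa": 52, "Asia": 33, "Europe": 30, "Americas": 25, "Oceania": 2}
toy = pd.DataFrame(
    [(f"{continent}_{i}", continent) for continent, n in sizes.items() for i in range(n)],
    columns=["country", "continent"],
)

# Inverse group size: rows in small continents get larger weights
toy["weight"] = 1.0 / toy.groupby("continent")["country"].transform("count")

# Every continent's weights add up to 1, so each continent is equally likely per draw
totals = toy.groupby("continent")["weight"].sum()
print(totals)

# Many draws with replacement: the continent counts should come out roughly equal
draws = toy.sample(n=1000, replace=True, weights="weight", random_state=0)
print(draws["continent"].value_counts())
```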

Calc weights - map and value_counts

We can get the same group counts as with groupby + .transform('count') by using .map and .value_counts():

df['weight'] = (df
 	.continent
 	.map(df['continent'].value_counts())
)
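The equivalence of the two count-based approaches can be checked directly. This is a minimal sketch on a toy DataFrame with placeholder country names; the variable names via_transform and via_map are assumptions.

```python
import pandas as pd

# Toy stand-in for the tutorial data (placeholder country names)
sizes = {"Africa": 52, "Asia": 33, "Europe": 30, "Americas": 25, "Oceania": 2}
toy = pd.DataFrame(
    [(f"{continent}_{i}", continent) for continent, n in sizes.items() for i in range(n)],
    columns=["country", "continent"],
)

# Approach 1: group size broadcast back to every row
via_transform = toy.groupby("continent")["country"].transform("count")

# Approach 2: look up each row's continent in the value_counts() table
via_map = toy["continent"].map(toy["continent"].value_counts())

# Both give the same number for every row
print((via_transform.to_numpy() == via_map.to_numpy()).all())
```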

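Sampling with weights

To sample with respect to the distribution of a value in a given column, we can pass the calculated frequency to the parameter weights:

df.sample(n=10, weights=df['weight'])

The result is a weighted sample. Because every row in a large continent carries a larger weight, large continents such as Africa are favored even more strongly than in a plain random sample: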

	country	continent	year	lifeExp	freq	weight
851	Korea, Rep.	Asia	2007	78.623	33	33
491	Equatorial Guinea	Africa	2007	51.579	52	52
1043	Mozambique	Africa	2007	42.082	52	52
1175	Pakistan	Asia	2007	65.483	33	33
1223	Philippines	Asia	2007	71.688	33	33
1667	West Bank and Gaza	Asia	2007	73.422	33	33
1031	Morocco	Africa	2007	71.164	52	52
1139	Nigeria	Africa	2007	46.859	52	52
1559	Trinidad and Tobago	Americas	2007	69.819	25	25
635	Guinea-Bissau	Africa	2007	46.388	52	52
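To see what the count-based weights do, it helps to compute the chance that a single draw lands in each continent and compare it with that continent's plain share of the population. The sketch below does this on a toy DataFrame with placeholder country names; the names weighted_share and plain_share are assumptions.

```python
import pandas as pd

# Toy stand-in for the tutorial data (placeholder country names)
sizes = {"Africa": 52, "Asia": 33, "Europe": 30, "Americas": 25, "Oceania": 2}
toy = pd.DataFrame(
    [(f"{continent}_{i}", continent) for continent, n in sizes.items() for i in range(n)],
    columns=["country", "continent"],
)

# Weight of each row = size of its continent, as in the tutorial
toy["weight"] = toy["continent"].map(toy["continent"].value_counts())

# Chance that one weighted draw picks a row from each continent
weighted_share = toy.groupby("continent")["weight"].sum() / toy["weight"].sum()

# Chance under a plain (unweighted) random draw
plain_share = toy["continent"].value_counts(normalize=True)

comparison = pd.DataFrame({"plain": plain_share, "weighted": weighted_share})
print(comparison.round(3))
```

With these weights a continent's chance grows with the square of its size, so Africa rises from about 37% to about 51% of the draws, while Oceania almost disappears.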

Conclusion

In this article, we took a closer look at stratified sampling in Pandas and how to apply it in practice. We covered the two main approaches to stratified sampling: disproportionate and proportional.

Finally, we explained how to combine groupby and weights to sample from a Pandas DataFrame.