Need a progress bar for Pandas concat, merge or join operations? If so, you may use the workaround described in this article.

At the moment of writing there's no simple solution for Pandas and tqdm in order to track progress for merge or concat.

In order to get the progress during those operations we will use Dask.

Step 1: Install Dask and TQDM

Dask`tqdm` libraries can be installed by:

pip install tqdm
pip install dask

and upgraded by:

pip install tqdm -U
pip install dask -U

Step 2: Create and convert Pandas DataFrames to Dask

First we are going to create two medium sized DataFrames in Pandas with random numbers from 0 to 700.

Then we are going to convert them to Dask DataFrames.

import pandas as pd
import numpy as np
from tqdm import tqdm
import dask.dataframe as dd

n = 450000
maxa = 700

df1 = pd.DataFrame({'lkey': np.random.randint(0, maxa, n),'lvalue': np.random.randint(0,int(1e8),n)})
df2 = pd.DataFrame({'rkey': np.random.randint(0, maxa, n),'rvalue': np.random.randint(0, int(1e8),n)})

sd1 = dd.from_pandas(df1, npartitions=3)
sd2 = dd.from_pandas(df2, npartitions=3)

Step 3: Add progress bar for merge on two DataFrames

Finally we are going to use the Dask progress bar in order to track the progress on merging of two DataFrames.

The are two options available - use context with TqdmCallback(desc="compute")

from tqdm.dask import TqdmCallback
from dask.diagnostics import ProgressBar
ProgressBar().register()

with TqdmCallback(desc="compute"):
    sd1.merge(sd2, left_on='lkey', right_on='rkey').compute()

or use it globally:

# or use callback globally
cb = TqdmCallback(desc="global")
cb.register()
sd1.merge(sd2, left_on='lkey', right_on='rkey').compute()

result:

[                                        ] | 0% Completed |  0.0s
global:   0%|          | 0/31 [00:00<?, ?it/s]
[########################################] | 100% Completed |  4.3s

Step 4: Compare Pandas vs Dask merge performance

The Dask merge operation is faster than Pandas merge and it's optimized for better performance - less resources and computing time.

Below you can find comparison of both:

%timeit df1.merge(df2, left_on='lkey', right_on='rkey')

result:

10 loops, best of 3: 74.7 ms per loop
%timeit sd1.merge(sd2, left_on='lkey', right_on='rkey')

result:

10 loops, best of 3: 20.2 ms per loop

So for a small DataFrame with 45000 rows merging on 2 columns there is about 4 times difference: 74.7 ms vs 20.2 ms

Resources