Learn how to compare every value in a pandas DataFrame column with all following values efficiently.

Sample Data

import pandas as pd

val = [16, 19, 15, 19, 15]
df = pd.DataFrame({'val': val})
   val
0   16
1   19
2   15
3   19
4   15

1. Compare with Subsequent Values Using apply

Create a new column with lists of comparison results (e.g., 1 if equal, 0 otherwise) for all later rows:

# Assumes the default RangeIndex (0, 1, 2, ...), so row.name is the row position
df['match'] = df.apply(
    lambda row: [
        1 if row['val'] == df.loc[idx, 'val'] else 0
        for idx in range(row.name + 1, len(df))  # all later rows
    ],
    axis=1
)

Result:

   val         match
0   16  [0, 0, 0, 0]
1   19     [0, 1, 0]
2   15        [0, 1]
3   19           [0]
4   15            []

This approach loops over the rows in Python and performs n(n-1)/2 comparisons for n rows, so it is best suited to small and moderate-sized DataFrames.
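
The lambda above uses row.name as a row position, which only holds for the default RangeIndex. A minimal sketch of an index-agnostic variant follows; the helper name compare_with_later and the string index are illustrative assumptions, not part of the original example:

```python
import pandas as pd

def compare_with_later(values):
    """For each position i, return 1/0 flags comparing values[i] with values[i+1:]."""
    values = list(values)
    return [
        [1 if values[i] == later else 0 for later in values[i + 1:]]
        for i in range(len(values))
    ]

# Same data as above, but with a non-default (string) index
df_idx = pd.DataFrame({'val': [16, 19, 15, 19, 15]}, index=list('abcde'))

# Works on positions only, so the index labels do not matter
df_idx['match'] = compare_with_later(df_idx['val'])
print(df_idx)
```

Because the helper only looks at positions, it should give the same lists as the apply version regardless of how the DataFrame is indexed.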

2. Compare Text Values with Subsequent Values for Similarity

This example requires the python-Levenshtein library:

!pip install python-Levenshtein

The idea is to flag near matches, e.g. apple and appl, rather than only exact matches:

from Levenshtein import ratio

df_str = pd.DataFrame({'text': ['apple', 'appl', 'banana', 'apple', 'bananna']})

def is_similar(a, b, threshold=0.8):
    # ratio() returns a similarity score between 0 and 1 (1 means identical strings)
    return 1 if ratio(a, b) >= threshold else 0

df_str['similar_later'] = df_str.apply(
    lambda row: [
        is_similar(row['text'], df_str.loc[idx, 'text'])
        for idx in range(row.name + 1, len(df_str))
    ],
    axis=1
)

df_str

Result:

      text similar_later
0    apple  [1, 0, 1, 0]
1     appl     [0, 1, 0]
2   banana        [0, 1]
3    apple           [0]
4  bananna            []
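
If installing python-Levenshtein is not an option, a similar check can be sketched with the standard library's difflib. This is a swapped-in technique, not the method above: SequenceMatcher.ratio() is a different metric from Levenshtein's ratio, so scores may differ for other strings. The names is_similar_stdlib and df_alt are hypothetical:

```python
from difflib import SequenceMatcher

import pandas as pd

def is_similar_stdlib(a, b, threshold=0.8):
    """Hypothetical stand-in for is_similar, based on difflib instead of Levenshtein."""
    # SequenceMatcher.ratio() returns a score in [0, 1]; it is not the same
    # metric as Levenshtein.ratio, so results can differ for some string pairs.
    return 1 if SequenceMatcher(None, a, b).ratio() >= threshold else 0

texts = ['apple', 'appl', 'banana', 'apple', 'bananna']
df_alt = pd.DataFrame({'text': texts})

# Compare each string with every later string, by position
df_alt['similar_later'] = [
    [is_similar_stdlib(texts[i], later) for later in texts[i + 1:]]
    for i in range(len(texts))
]
print(df_alt)
```

For this sample data and a threshold of 0.8, the flags match the Levenshtein-based result shown above.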

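3. Compare Values for Large DataFrames

For larger DataFrames, NumPy broadcasting compares every pair of values in one vectorized step, and np.triu keeps only the pairs in which the second value comes later:

import numpy as np

arr = df['val'].to_numpy()
# Broadcasting compares every value with every other value (n x n matrix)
comparisons = arr[:, np.newaxis] == arr[np.newaxis, :]
# Keep only entries above the diagonal: row i compared with later positions j > i
upper_tri = np.triu(comparisons, k=1)

Result for array([16, 19, 15, 19, 15]):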

array([[False, False, False, False, False],
       [False, False, False,  True, False],
       [False, False, False, False,  True],
       [False, False, False, False, False],
       [False, False, False, False, False]])
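
The boolean matrix can be turned into more convenient forms. The sketch below rebuilds the matrix so that it runs on its own, then extracts the matching index pairs and the list-per-row format from section 1; the names pairs and match_lists are illustrative:

```python
import numpy as np

arr = np.array([16, 19, 15, 19, 15])
upper_tri = np.triu(arr[:, np.newaxis] == arr[np.newaxis, :], k=1)

# Positions (i, j) with i < j where arr[i] == arr[j]
rows, cols = np.nonzero(upper_tri)
pairs = [(int(i), int(j)) for i, j in zip(rows, cols)]
print(pairs)  # [(1, 3), (2, 4)]

# Rebuild the list-per-row format: row i keeps only the columns after i
match_lists = [upper_tri[i, i + 1:].astype(int).tolist() for i in range(len(arr))]
print(match_lists)
```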

Notes

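  • For large DataFrames, the apply-based methods in sections 1 and 2 can be slow because they run Python-level loops.
  • Customize the comparison as needed, e.g. replace == with >, or use a function such as Levenshtein similarity for strings.
  • For a fully vectorized alternative, use NumPy broadcasting (section 3) if a matrix output such as the upper triangular matrix is acceptable. The full comparison matrix holds n × n values, so its memory use grows quadratically with the number of rows.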
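
As a rough sketch of swapping the comparison in the broadcasting approach, the hypothetical helper later_matrix below accepts any operator that works element-wise on NumPy arrays, such as operator.gt:

```python
import operator

import numpy as np

def later_matrix(values, compare=operator.eq):
    """Upper-triangular boolean matrix of compare(values[i], values[j]) for i < j."""
    arr = np.asarray(values)
    # compare must work element-wise on NumPy arrays (operator.eq, operator.gt, ...)
    full = compare(arr[:, np.newaxis], arr[np.newaxis, :])
    return np.triu(full, k=1)

# 1 where a value is greater than a later value
greater = later_matrix([16, 19, 15, 19, 15], compare=operator.gt)
print(greater.astype(int))
```

String similarity functions such as Levenshtein's ratio work on one pair of strings at a time, so they would not fit this helper without an explicit loop.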
