Learn how to compare every value in a pandas DataFrame column with all following values efficiently.
Sample Data
import pandas as pd
val = [16, 19, 15, 19, 15]
df = pd.DataFrame({'val': val})
| | val |
|---|---|
| 0 | 16 |
| 1 | 19 |
| 2 | 15 |
| 3 | 19 |
| 4 | 15 |
1. Compare with Subsequent Values Using apply
Create a new column with lists of comparison results (e.g., 1 if equal, 0 otherwise) for all later rows:
df['match'] = df.apply(
    lambda row: [
        1 if row['val'] == df.loc[idx, 'val'] else 0
        for idx in range(row.name + 1, len(df))
    ],
    axis=1
)
Result:
| | val | match |
|---|---|---|
| 0 | 16 | [0, 0, 0, 0] |
| 1 | 19 | [0, 1, 0] |
| 2 | 15 | [0, 1] |
| 3 | 19 | [0] |
| 4 | 15 | [] |
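This approach works row by row and performs every pairwise comparison in Python, so the work grows quadratically with the number of rows. It suits small to moderate-sized DataFrames. It also assumes the default integer index (0, 1, 2, ...), because row.name is used as a position.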
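The comprehension above looks up later rows by label, which only lines up with row positions when the index is the default 0..n-1. A minimal sketch that works purely by position is shown here. The helper name `later_matches` and the lettered index are assumptions made for illustration, not part of the original example.

```python
import pandas as pd

# Same values as the sample data, but with a non-default index (an assumption
# made here only to show that positions, not labels, drive the comparison).
df_idx = pd.DataFrame({'val': [16, 19, 15, 19, 15]}, index=list('abcde'))

def later_matches(values):
    """Hypothetical helper: for each value, 1/0 flags against every later value."""
    vals = list(values)
    return [
        [1 if v == later else 0 for later in vals[i + 1:]]
        for i, v in enumerate(vals)
    ]

# Wrap the lists in a Series aligned to the existing index so each row
# receives its own list.
df_idx['match'] = pd.Series(later_matches(df_idx['val']), index=df_idx.index)
print(df_idx)
```

The lists are the same as in the result table above. Only the way later rows are reached changes.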
2. Compare Text Values with Subsequent Values for Similarity
Install the python-Levenshtein library first:
!pip install python-Levenshtein
The idea is to flag near-matches as well as exact matches, e.g. apple and appl:
from Levenshtein import ratio
df_str = pd.DataFrame({'text': ['apple', 'appl', 'banana', 'apple', 'bananna']})
def is_similar(a, b, threshold=0.8):
    return 1 if ratio(a, b) >= threshold else 0
df_str['similar_later'] = df_str.apply(
    lambda row: [
        is_similar(row['text'], df_str.loc[idx, 'text'])
        for idx in range(row.name + 1, len(df_str))
    ],
    axis=1
)
df_str
Result:
| | text | similar_later |
|---|---|---|
| 0 | apple | [1, 0, 1, 0] |
| 1 | appl | [0, 1, 0] |
| 2 | banana | [0, 1] |
| 3 | apple | [0] |
| 4 | bananna | [] |
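If installing an extra package is not an option, the same pattern can be sketched with the standard library's difflib.SequenceMatcher in place of Levenshtein's ratio. This swaps in a different similarity measure, so scores are not guaranteed to match Levenshtein's in general. The names `is_similar_std` and `similar_later_std` are hypothetical.

```python
from difflib import SequenceMatcher

import pandas as pd

df_str = pd.DataFrame({'text': ['apple', 'appl', 'banana', 'apple', 'bananna']})

def is_similar_std(a, b, threshold=0.8):
    # SequenceMatcher.ratio() returns a float in [0, 1]; it is a different
    # measure from Levenshtein.ratio, used here only as a stand-in.
    score = SequenceMatcher(None, a, b).ratio()
    return 1 if score >= threshold else 0

texts = df_str['text'].tolist()
# For each text, compare against every text that comes after it.
similar = [
    [is_similar_std(t, later) for later in texts[i + 1:]]
    for i, t in enumerate(texts)
]
df_str['similar_later_std'] = pd.Series(similar, index=df_str.index)
print(df_str)
```

On this small sample the flags come out the same as in the table above, but the threshold may need retuning for other data.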
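3. Compare Values for Large DataFrames
NumPy broadcasting compares every pair of values in one vectorized step. np.triu with k=1 then keeps only the pairs in which the second value comes later in the column: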
import numpy as np
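arr = df['val'].to_numpy()
# comparisons[i, j] is True when row i equals row j (full n x n matrix)
comparisons = arr[:, np.newaxis] == arr[np.newaxis, :]
# Keep only entries above the diagonal, i.e. pairs where j comes after i
upper_tri = np.triu(comparisons, k=1)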
Result (upper_tri) for array([16, 19, 15, 19, 15]):
array([[False, False, False, False, False],
       [False, False, False,  True, False],
       [False, False, False, False,  True],
       [False, False, False, False, False],
       [False, False, False, False, False]])
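The upper-triangular matrix holds the same information as the per-row lists from the apply approach. The sketch below shows one way to turn it back into that format, or into a list of matching index pairs. The names `match_lists` and `pairs` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': [16, 19, 15, 19, 15]})
arr = df['val'].to_numpy()

# Pairwise equality, restricted to pairs (i, j) with j > i.
upper_tri = np.triu(arr[:, np.newaxis] == arr[np.newaxis, :], k=1)

# Row i of the matrix, from column i + 1 onward, is the list of 1/0 flags
# for row i against all later rows.
match_lists = [
    upper_tri[i, i + 1:].astype(int).tolist()
    for i in range(len(arr))
]
df['match'] = pd.Series(match_lists, index=df.index)

# Alternatively, list only the positions that matched.
rows, cols = np.nonzero(upper_tri)
pairs = list(zip(rows.tolist(), cols.tolist()))
print(df)
print(pairs)
```

The comparison itself is vectorized, but the full matrix needs n x n booleans of memory, so very long columns may need to be processed in chunks.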
Notes
- For large DataFrames, the `apply` method can be slow due to Python loops.
- Customize the comparison (e.g., `==` to `>` or a function like Levenshtein distance for strings).
- For fully vectorized alternatives, consider NumPy broadcasting if the output format allows (e.g., upper triangular matrix).
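As a sketch of customizing the comparison, the broadcasting pattern can use `>` instead of `==`. The example below counts, for each row, how many later values are strictly greater. The variable names `later_greater` and `later_greater_counts` are assumptions chosen for illustration.

```python
import numpy as np

arr = np.array([16, 19, 15, 19, 15])

# later_greater[i, j] is True when the value at position j is greater than
# the value at position i; np.triu(k=1) keeps only pairs where j > i.
later_greater = np.triu(arr[np.newaxis, :] > arr[:, np.newaxis], k=1)

# Number of later values that exceed each value.
later_greater_counts = later_greater.sum(axis=1)
print(later_greater_counts)
```

Note that the operand order matters for a non-symmetric comparison such as `>`: the row axis holds the current value and the column axis holds the later one.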