In this short tutorial, we'll cover how to convert natural language numerics like M and K into numbers with Pandas and Python.
We will show two different ways to convert K and M to thousands and millions. We will also cover the reverse case: converting thousands and millions to K and M.
So we will cover:
- 0.1M to 100000
- 1 K to 1000
- 10000 to 10K
- forty two to 42
We will use two Python libraries:
- humanize - turning a number into a fuzzy human-readable expression
- numerizer - convert natural language numerics into ints and floats
The image below shows the examples:
Setup
Let's use the following DataFrame to demonstrate the conversion from natural language numerals to numbers:
import pandas as pd
data = {'day': [1, 2, 3, 4, 5],
        'numeric': [22, 222, '22K', '2M', '0.01 B'],
        'numbers': [110, 11000, 1000000, 33300000, 456873],
        'lang': ['one', 'five', 'twelve', 'forty two', 'one hundred and five']}
df = pd.DataFrame(data,
                  columns=['day', 'numeric', 'numbers', 'lang'])
Our data looks like:
 | day | numeric | numbers | lang |
---|---|---|---|---|
0 | 1 | 22 | 110 | one |
1 | 2 | 222 | 11000 | five |
2 | 3 | 22K | 1000000 | twelve |
3 | 4 | 2M | 33300000 | forty two |
4 | 5 | 0.01 B | 456873 | one hundred and five |
Step 1: Convert K/M to Thousand/Million
First, we will convert large number abbreviations to numbers. To do that, we map each abbreviation to a math expression.
So we will convert:
22K -> 22 * 10**3
0.1 M -> 0.1 * 10**6
mp = {'K':' * 10**3', 'M':' * 10**6', 'B':' * 10**9', 't':' * 10**12', 'q':' * 10**15', 'Q':' * 10**18'}
pd.eval(df['numeric'].replace(mp.keys(), mp.values(), regex=True))
This will give us:
array([22.0, 222.0, 22000.0, 2000000.0, 10000000.0], dtype=object)
As we can see, it works fine for columns with mixed data. It also works for numeric values with spaces, like 1 M.
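As an alternative to building and evaluating expressions, the same mapping can be sketched with a regular expression that splits each value into a number and an optional suffix. The names `MULTIPLIERS`, `PATTERN` and `parse_abbreviated` are hypothetical helpers for illustration, and the sketch assumes only the suffixes K, M and B appear:

```python
import re

# Assumed suffix table; extend it if other abbreviations are needed.
MULTIPLIERS = {'K': 10**3, 'M': 10**6, 'B': 10**9}

# Optional sign, digits with an optional decimal part, optional spaces, optional suffix.
PATTERN = re.compile(r'^\s*([-+]?\d*\.?\d+)\s*([KMB])?\s*$')

def parse_abbreviated(value):
    """Convert values such as 22, '22K' or '0.01 B' to a float."""
    match = PATTERN.match(str(value))
    if match is None:
        raise ValueError(f"cannot parse {value!r}")
    number, suffix = match.groups()
    # A missing suffix means the value is already a plain number.
    return float(number) * MULTIPLIERS.get(suffix, 1)

values = [22, 222, '22K', '2M', '0.01 B']
converted = [parse_abbreviated(v) for v in values]
```

With Pandas, the helper could be applied to the column with `df['numeric'].apply(parse_abbreviated)`. This variant avoids evaluating strings as code and raises an error on values it does not recognize.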
pd.eval limit 100 rows
There seems to be a limit in pd.eval: when a long list is passed, the returned results are truncated. For a list of 105 elements, only 101 items come back:
len(pd.eval([1 * 10**6] * 105))
result:
101
The last five results are:
[1000000,
1000000,
1000000,
1000000,
Ellipsis]
So the above code can be rewritten to evaluate each row separately, which makes it work for more than 100 rows. The values are cast to strings first because Python's eval only accepts strings:
df['col_num'] = df['col'].astype(str).replace(mp.keys(), mp.values(), regex=True)
df['col_num'] = df['col_num'].apply(eval)
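The row-by-row idea can be illustrated without Pandas. The sketch below builds the expression for each value and evaluates it with Python's built-in `eval`, using an emptied `__builtins__` so that only arithmetic is available. The helper name `to_expr` is hypothetical, and restricting builtins is a precaution rather than a full sandbox, so this is only suitable for trusted data:

```python
mp = {'K': ' * 10**3', 'M': ' * 10**6', 'B': ' * 10**9'}

def to_expr(value):
    """Turn a value like '22K' into the expression string '22 * 10**3'."""
    text = str(value)
    for abbr, expr in mp.items():
        text = text.replace(abbr, expr)
    return text

# 200 values, well above the 100-element truncation seen with pd.eval.
values = [22, '22K', '2M', '0.01 B'] * 50

# Each expression is evaluated on its own, so nothing is truncated.
numbers = [eval(to_expr(v), {"__builtins__": {}}) for v in values]
```

Because every element is evaluated independently, the length of `numbers` matches the length of the input.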
Step 2: Convert Thousand/Million to K/M
For this step we will use the Python library humanize. It will help us to convert large numbers to human readable abbreviations easily and reliably:
import humanize
df['numbers'].apply(humanize.intword)
The result contains the converted values:
0 110
1 11.0 thousand
2 1.0 million
3 33.3 million
4 456.9 thousand
Name: numbers, dtype: object
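The output above spells out "thousand" and "million". If the compact K/M form is needed instead (for example 10000 to 10K), a small helper can be sketched in plain Python. The names `SUFFIXES` and `to_short` are hypothetical and not part of humanize, and values that round up to the next unit (such as 999950) are not specially handled:

```python
# Largest unit first, so the first match is the most compact form.
SUFFIXES = [(10**9, 'B'), (10**6, 'M'), (10**3, 'K')]

def to_short(n, digits=1):
    """Format a number with a K/M/B suffix, e.g. 10000 -> '10K'."""
    for factor, suffix in SUFFIXES:
        if abs(n) >= factor:
            text = f"{n / factor:.{digits}f}"
            # Drop a trailing '.0' so 10.0 becomes 10.
            if '.' in text:
                text = text.rstrip('0').rstrip('.')
            return f"{text}{suffix}"
    # Values below one thousand are returned unchanged.
    return str(n)

short = [to_short(n) for n in [110, 11000, 1000000, 33300000, 456873]]
# short is ['110', '11K', '1M', '33.3M', '456.9K']
```

With Pandas, this could be applied as `df['numbers'].apply(to_short)`.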
The library supports multiple languages (about 25), including:
- Spanish
- Russian
- French
- Portuguese
The library can be installed with:
pip install humanize
Step 3: Convert words to number - one to 1
Finally let's cover the case when we need to convert language numerics into numbers:
- forty two -> 42
- twelve hundred -> 1200
- four hundred and sixty two -> 462
This time we will use the Python library numerizer, which can be installed with:
pip install numerizer
So the Pandas code to convert numbers is:
from numerizer import numerize
df['lang'].apply(numerize)
The output is:
0 1
1 5
2 12
3 42
4 105
Name: lang, dtype: object
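To show the idea behind this kind of conversion, here is a simplified sketch of a word-to-number parser in plain Python. It is not how numerizer is implemented; the names `UNITS`, `TENS`, `SCALES` and `words_to_int` are hypothetical, and the sketch only handles simple English cardinal numbers:

```python
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: (i + 2) * 10 for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}
SCALES = {'thousand': 10**3, 'million': 10**6, 'billion': 10**9}

def words_to_int(text):
    """Convert e.g. 'four hundred and sixty two' to 462."""
    total = 0    # groups that were already multiplied by a scale word
    current = 0  # the group currently being built
    for word in text.lower().replace('-', ' ').split():
        if word == 'and':
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == 'hundred':
            # A bare 'hundred' is treated as 'one hundred'.
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"unknown number word: {word!r}")
    return total + current

parsed = [words_to_int(t) for t in
          ['one', 'five', 'twelve', 'forty two', 'one hundred and five']]
# parsed is [1, 5, 12, 42, 105]
```

Unlike this sketch, which returns integers, the column produced above has dtype object, so an extra conversion step may be needed before doing arithmetic on it.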
Large Number Abbreviations
Below we can find a table of large number abbreviations which can be used to improve the mapping.
Abbreviation | Name | Value | Equivalent |
---|---|---|---|
K | Thousand (Kilo) | 10^3 | 1000 |
M | Million | 10^6 | 1000K |
B | Billion | 10^9 | 1000M |
t | Trillion | 10^12 | 1000B |
q | Quadrillion | 10^15 | 1000t |
Q | Quintillion | 10^18 | 1000q |
s | Sextillion | 10^21 | 1000Q |
S | Septillion | 10^24 | 1000s |
o | Octillion | 10^27 | 1000S |
n | Nonillion | 10^30 | 1000o |
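The table can be turned into the replacement mapping used in Step 1 with a short dictionary comprehension. The name `EXPONENTS` is a hypothetical helper, and the abbreviations are case sensitive (q and Q, s and S differ):

```python
# Power of ten for each abbreviation, taken from the table above.
EXPONENTS = {'K': 3, 'M': 6, 'B': 9, 't': 12, 'q': 15,
             'Q': 18, 's': 21, 'S': 24, 'o': 27, 'n': 30}

# Same shape as the mp mapping from Step 1: abbreviation -> expression text.
mp = {abbr: f' * 10**{exp}' for abbr, exp in EXPONENTS.items()}

# Sanity check: every row of the table is 1000 times the previous one.
values = [10**exp for exp in EXPONENTS.values()]
assert all(b == 1000 * a for a, b in zip(values, values[1:]))
```

Single lowercase letters such as s, o and n can also occur inside ordinary words, so with `regex=True` this extended mapping is best applied only to columns known to contain abbreviated numbers.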
Conclusion
In this post, we saw how to convert human readable and language expressions to numbers. We also covered the reverse case: converting large numbers to abbreviations.
We showed how to map values to Python mathematical expressions and how to evaluate them. Two very useful Python libraries, humanize and numerizer, were used with Pandas for numeric conversion.