Pandas vs Julia - cheat sheet and comparison
This is a Python/Pandas vs Julia cheat sheet and comparison. It shows the Julia equivalent of common Pandas operations and vice versa, along with links to the documentation and other useful Pandas/Julia resources.
The table below shows useful links for both:
| | Pandas | Julia |
---|---|---|
| | data analysis tool | high-performance language |
site | https://pandas.pydata.org/ | https://julialang.org/ |
docs | https://pandas.pydata.org/docs/ | https://docs.julialang.org/en/v1/ |
packages | https://pypi.org/ | https://juliapackages.com/ |
repo | https://github.com/pandas-dev/pandas | https://github.com/JuliaLang/julia |
Below you can find equivalent code for Pandas and Julia. Keep in mind that some examples might differ because the two use different indexing.
Setup
import pandas as pd
import numpy as np
pip install pandas
https://pypi.org/
Data Structures
s = pd.Series(['a', 'b', 'c'], index=[0 , 1, 2])
s[0]
df = pd.DataFrame(
{'col_1': [11, 12, 13],
'col_2': [21, 22, 23]},
index=[0, 1, 3])
import numpy as np
import pandas as pd
data=np.random.randint(0,10,size=(10, 3))
df = pd.DataFrame(data, columns=list('abc'))
Read
df = pd.read_csv('file.csv')
pd.read_json('file.json')
pd.read_csv('https://example.com/file.csv')
df = pd.read_fwf('delim_file.txt')
Write
df.to_csv('file.csv')
df.to_json(filename)
Inspect Data
df.head(6)
df.tail(6)
df.describe()
df.loc[:, :'a'].describe()
df['a'].mean()
Select
df.loc[1:3, :]
df.loc[[1, 2, 3], :]
df.loc[:, ['a', 'b']].copy()
df.loc[:, ['a']]
df.loc[1:3, ['b', 'a']]
df.loc[[3,1], ['b', 'a']]
df[df['a'].isna()]
df['a'].dropna()
Add rows/columns
df['new col'] = df['col'] * 100
df['new col'] = False
df.loc[-1] = [1, 2, 3]
pd.concat([df, df2], ignore_index=True)
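DataFrame.append was removed in pandas 2.0, so on current versions rows from a second frame are added with pd.concat. A minimal sketch, assuming two small made-up frames named df and df2 (the data is illustrative only):

```python
import pandas as pd

# Hypothetical frames used only for illustration
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [5], 'b': [6]})

# Stack the rows of df2 under df and renumber the index from 0
combined = pd.concat([df, df2], ignore_index=True)
print(combined)
```

pd.concat returns a new frame and leaves df and df2 unchanged.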
Drop rows/columns/nan
s.drop(1)
s.drop([1, 2])
df.drop('b' , axis=1)
df.dropna()
df.dropna(axis=1)
Sort values/index
sorted([2,3,1])
sorted([2,3,1], reverse=True)
df['a'].sort_values()
df.sort_values(['a', 'b'], ascending=[False, True])
Filter
df.loc[:, df.isna().any()]
df[df['col_1'] > 100]
df[(df['a']=='a')&(df['b']>=10)]
df[df['a'] == 'test']
df[(df['a'] == 'test') & (df['b'] == 'a2') ]
Group by
df.groupby('a')
df.groupby(['a', 'b']).c.sum()
df['a'].value_counts()
Convert
df['a'].fillna(0)
df.replace('..', np.nan)
df['col_1'].astype('int64')
pd.to_datetime(df['date'], format='%Y-%m-%d')
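To try the conversion calls above without a real file, here is a small sketch on made-up data. The column names a, col_1 and date follow the cheat sheet; the values themselves are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing number, numeric strings, ISO-formatted dates
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'col_1': ['10', '20', '30'],
    'date': ['2023-01-15', '2023-02-20', '2023-03-25'],
})

filled = df['a'].fillna(0)                              # replace NaN with 0
as_int = df['col_1'].astype('int64')                    # parse strings as integers
dates = pd.to_datetime(df['date'], format='%Y-%m-%d')   # parse strings as timestamps

print(filled.tolist(), as_int.tolist(), dates.dt.month.tolist())
```

Each call returns a new Series; assign the result back to the column if the frame itself should change.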
Install Julia Packages
To install new packages in Julia, we can use the Julia package manager:
- open a Linux terminal
- start Julia by typing julia
- type ] (right bracket) to enter the package manager; you don't have to hit Return
- the prompt will change to (@v1.8) pkg>
- type add <package_name> to add a package
- you can provide the names of several packages separated by spaces
- press Control-C to exit the package manager
Example: (@v1.8) pkg> add JSON StaticArrays
Differences: Julia and Pandas
Pandas and Julia are both popular tools for data analysis and manipulation. Some key differences between them:
Indexing
One big difference between Julia and Pandas is indexing:
- Julia - 1-based (can be configured)
  - What's the big deal? 0 vs 1 based indexing
  - Why does Julia adopt 1-based index?
- Pandas - 0-based
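On the Pandas side, the 0-based convention applies to positional access with iloc, while loc works on labels and includes the end of a slice. A short sketch on a made-up Series (in Julia, the first element of a vector v would be v[1]):

```python
import pandas as pd

# Series with the default integer index 0, 1, 2
s = pd.Series(['a', 'b', 'c'])

first = s.iloc[0]           # positional access starts at 0
by_position = s.iloc[0:2]   # positional slice excludes the end position
by_label = s.loc[0:2]       # label slice includes the end label

print(first, len(by_position), len(by_label))
```

The difference between the two slices is a common source of off-by-one errors when porting code between the two tools.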
Syntax
Personally, I prefer SQL syntax over both Julia and Pandas, though I can work fine with either. As I have more experience with Python, I would go with Pandas. Some people consider Julia to have the better syntax, since the language was designed with numerical and scientific computing in mind. An example of the syntax difference between Julia and Pandas:
# pandas
import pandas as pd
df = pd.read_csv('sales_data.csv')
totals = df.groupby('product')['sales'].sum()
# julia
using DataFrames
using CSV
df = CSV.read("sales_data.csv", DataFrame)
totals = combine(groupby(df, :product), :sales => sum)
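The Pandas snippet above depends on a sales_data.csv file. As a self-contained sketch, the same group-and-sum can be tried on a small made-up frame (the product names and numbers are illustrative only):

```python
import pandas as pd

# Hypothetical sales data standing in for sales_data.csv
sales = pd.DataFrame({
    'product': ['apple', 'pear', 'apple', 'pear'],
    'sales': [10, 5, 7, 3],
})

# Sum the sales column within each product group
totals = sales.groupby('product')['sales'].sum()
print(totals)
```

The result is a Series indexed by product, so totals['apple'] gives the total for a single product.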
Performance
In general, Julia is faster for most operations and for bigger datasets. For smaller datasets, Pandas might be close to or even faster than Julia, because Julia's compilation time dominates short runs.
To test performance, we can use a dataset with 10M rows, Game Recommendations on Steam:
# pandas
%%time
import pandas as pd
df = pd.read_csv('recommendations.csv')
df['hours'].mean()
# julia
@time begin
using CSV, DataFrames, Statistics
df = CSV.File("recommendations.csv") |> DataFrame
result = mean(df[:, "hours"])
end
The results are:
- Pandas
  - CPU times: user 5.67 s, sys: 1.85 s, total: 7.52 s
  - Wall time: 7.71 s
- Julia
  - 7.257497 seconds (1.13 k allocations: 1.349 GiB, 2.63% gc time)
For a dataset with 12M rows we get:
- Pandas
  - CPU times: user 34.8 s, sys: 3.74 s, total: 38.5 s
  - Wall time: 42.4 s
- Julia
  - 29.964544 seconds (162.12 M allocations: 9.878 GiB, 15.79% gc time)
The first Julia execution is slower because it includes compilation, so we report the second one.
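Single timings like these can vary from run to run, so it may help to repeat the measurement on the Pandas side as well. A minimal sketch, assuming a hypothetical helper time_call and synthetic data standing in for the hours column (it does not reproduce the numbers above):

```python
import time

import numpy as np
import pandas as pd


def time_call(func, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


# Synthetic stand-in for the 'hours' column of the real dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({'hours': rng.uniform(0, 100, size=1_000_000)})

elapsed = time_call(lambda: df['hours'].mean())
print(f"best of 3: {elapsed:.4f} s")
```

Taking the best of several runs plays a similar role to discarding the first Julia execution: it reduces the effect of one-off startup costs on the comparison.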
Libraries and Ecosystem
Pandas has a bigger community and ecosystem. The Python ecosystem offers a greater variety of packages in many areas:
- web scraping
- data science
- science
- etc
Language Features
I prefer Julia for distributed and parallel computing. Pandas seems to me much better for visualization and EDA.
Learning Curve
Again, it depends on personal preference. Python is considered one of the best programming languages for beginners. Julia ranked above Python among the most loved languages in recent surveys:
stackoverflow survey - Most loved, dreaded, and wanted
Note: I need to add that I'm still learning and discovering Julia, so the statements above might change in the future :)
Pandas vs Julia docs
| | Pandas | Julia |
---|---|---|
NA | NA | missing |
Boolean | False/True | false/true |
docs | https://pandas.pydata.org/docs/ | https://docs.julialang.org/en/v1/ |
packages | https://pypi.org/ | https://juliapackages.com/ |
repo | https://github.com/pandas-dev/pandas | https://github.com/JuliaLang/julia |
basics | https://pandas.pydata.org/docs/user_guide/basics.html | https://docs.julialang.org/en/v1/base/punctuation/ |
start | https://pandas.pydata.org/docs/getting_started/index.html | https://juliadatascience.io/ |
Summary
In summary, Pandas and Julia are both powerful tools for data analysis, but they have different strengths and weaknesses.
Pandas has a larger ecosystem of tools and is generally easier to learn. Julia is faster and has some unique language features that can make it more powerful for certain types of data analysis tasks.
Ultimately, the choice between Pandas and Julia depends on your specific requirements and preferences.
Resources
- Julia Comparison with the Python package Pandas
- Pandas Cheat Sheet for Data Science
- Pandas vs SQL Cheat Sheet
- julia notebook
- pandas notebook