Pandas vs SQL Cheat Sheet

With this SQL & Pandas cheat sheet, we'll have a valuable reference guide for Pandas and SQL. We can convert or run SQL code in Pandas or vice versa. Consider it as Pandas cheat sheet for people who know SQL.

The cheat sheet covers basic querying tables, filtering data, aggregating data, modifying and advanced operations. It includes the most popular operations which are used on a daily basis with SQL or Pandas.

Keep reading until the end for bonus tips and conversion between Pandas and SQL.

Select

Selecting data with Pandas vs SQL

df

SELECT * FROM tab;

query data from all columns

df[['col_1', 'col_2']]

SELECT col_1, col_2 FROM tab;

select subset of columns and get all rows

df.assign(new_col = df['col_1'] / df['col_2'])

SELECT *, col_1/col_2 as new_col FROM tab;

Aliases - add a calculated column based on other columns

df[['col 1']]

SELECT `col 1` FROM tab;

Column name with space - `col 1`

df.sort_values(by='col_1', ascending=True)

SELECT * FROM tab ORDER BY col_1 ASC;

Sort values by col_1 in ascending or descending (DESC) order

Where

Filtering in SQL vs Pandas

df[df['col_1'] == '11']

SELECT * FROM tab WHERE col_1 = '11';

Filtering on single condition

df[(df['col_1'] == 11) &
   (df['col_2'] > 5)]

SELECT * FROM tab WHERE col_1 = 11 AND col_2 > 5;

Filtering on multiple conditions

df[df['col_2'].isna()]

SELECT * FROM tab WHERE col_2 IS NULL;

NULL checking is done using the isna()

df[df['col_1'].notna()]

SELECT * FROM tab WHERE col_1 IS NOT NULL;

NOT NULL checking is done using the notna()

df[df.col_1 > df.col_2]

SELECT * FROM tab WHERE col_1 > col_2;

Where clause with 2 SQL columns

df.query('col_1 == col_2')

SELECT * FROM tab WHERE col_1 = col_2;

Filter with Pandas Query

df.query('col_1 == `col 2`')

SELECT * FROM tab WHERE col_1 = `col 2`;

Column name with space - `col 2` in where clause

Like, and, or

Operators(Text, Logical) in Pandas vs SQL

df[df['col_1'].str.contains('i', na=False)]

WHERE col_1 LIKE '%i%'

Finds values which contain `i`. Column need to be string - .astype(str)

df[df['col_1'].str.contains('sh|rd', regex=True, na=True)]

WHERE col_1 LIKE '%sh%' OR col_1 LIKE '%rd%'

Finds any values which contain `sh` or `rd`

df[df['col_1'].str.startswith('h', na=False)]

WHERE col_1 LIKE 'h%'

Finds any values that start with `h`

df[df['col_1'].astype(str).str.endswith('k', na=False)]

WHERE col_1 LIKE '%k'

Finds any values that ends with `k`

(df['col_1'] == '11') & (df['col_2'] > 5)

WHERE col_1 = '11' AND col_2 > 5;

AND = SQL - `and`, Pandas - `&`

(df['col_1'] == '11') | (df['col_2'] > 5)

WHERE col_1 = '11' OR col_2 > 5;

OR = SQL - `or`, Pandas - `|`

df[df['col_1'].isin([1,2,3])]

SELECT * FROM tab WHERE col_1 in (1,2,3);

IN operator - find values from list of values

df[df['col_1'].between(1, 5)]

SELECT * FROM tab WHERE col_1 BETWEEN 1 AND 5;

BETWEEN operator - find values in a range

Group by

Group by operations in SQL vs Pandas

df.groupby('col_1').size()

df.groupby('col_1')['col_2'].count()

df.col_1.value_counts()

SELECT col_1, count(*) FROM tab GROUP BY col_1;

count records in each group( 3 versions in Pandas )

import numpy as np
df.groupby('col_1').agg({'col_2': np.mean, 'col_1': np.size})

SELECT col_1, AVG(col_2), COUNT(*) FROM tab GROUP BY col_1;

Apply multiple statistical functions

df.groupby(['col_1', 'col_2']).agg({'col_3': [np.size, np.mean]})

SELECT col_1, col_2, COUNT(*), AVG(col_3) FROM tab GROUP BY col_1, col_2;

Grouping by multiple columns, multiple functions

#  group by
g = df.groupby('col_1') 

# having count(*) > 10
g.filter(lambda x: len(x) > 10)['col_1']

SELECT col_1, count(*) FROM tab GROUP BY col_1 HAVING count(*) > 10;

HAVING - Group by column and filtering contidion on the groups

df.groupby('col_1').col_2.nunique()

SELECT count(distinct col_2) FROM tab GROUP BY col_1;

count(distinct) - count unique elements in group

Join

Join in SQL and Pandas

pd.merge(df1, df2, on='key')

SELECT * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

Inner join of 2 table/dataframes(1)

pd.merge(df1, df2, on='key', how='left')

SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.key = t2.key;

Left Outer join

pd.merge(df1, df2, on='key', how='right')

SELECT * FROM t1 RIGHT OUTER JOIN t2 ON t1.key = t2.key;

Right Join

pd.merge(df1, df2, on='key', how='outer')

SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.key = t2.key;

Full Join (not working on MySQL)

pd.merge(df1, df2, left_on= ['col_1', 'col_2'], right_on= ['col_3', 'col_4'], how = 'right')

SELECT * FROM t1 INNER JOIN t2 ON t1.col_1 = t2.col_3 AND t1.col_2 = t2.col_4;

Join on columns with different names

m = pd.merge(df1, df2, how='left', on=['col_1', 'col_2'])

pd.merge(m, df3[['col_1', 'col_2', 'col_3']], how='left', on=['col_1', 'col_3'])

SELECT t1.col_a, t2.col_b, t3.col_c FROM t1 LEFT OUTER JOIN t2 ON t1.col_1 = t2.col_1 AND t1.col_2 = t2.col_2 LEFT OUTER JOIN t3 ON t1.col_1 = t3.col_1 AND t1.col_3 = t3.col_3

join multiple dataframes on multiple columns

Union

Union in SQL and Pandas

pd.concat([df1, df2])

SELECT * FROM df1 UNION ALL SELECT * FROM df2;

Unioun All(columns must have same number of columns)

cols= ['col_1', 'col_2']
pd.concat([df1[cols], df2[cols]])

SELECT col_1, col_2 FROM t1 UNION ALL SELECT col_1, col_2 FROM t2;

Unioun All

pd.concat([df1, df2]).drop_duplicates()

SELECT col_1, col_2 FROM t1 UNION SELECT col_1, col_2 FROM t1;

Unioun All ( remove duplicate rows)

Limit

Limit in SQL and Pandas

df.head(10)

SELECT * FROM tab LIMIT 10;

Get top rows

df.tail(10)

SELECT * FROM tab ORDER BY id DESC LIMIT 10;

Get last N rows

lim = 2
offset = 5
df.sort_values('col_1', ascending=False).iloc[offset:lim+offset]

SELECT * FROM tab ORDER BY col_1 DESC LIMIT 2 OFFSET 5;

return only top 2 records, start on record 6 (OFFSET 5)

Update

Update in SQL vs Pandas

df.loc[df['col_1'] < 2, 'col_1'] *= 2

UPDATE tab SET col_1 = col_1*2 WHERE col_1 < 2;

Update all rows for 1 column with condition

df1['name'] = np.where(df2['id']==1,df2['name'],df1['name'])

UPDATE t1, ( SELECT * FROM t2 WHERE id = 1 ) AS temp SET t1.name = temp.name WHERE t1.id = 1;

Update based on select from another table / dataframe

Delete

Delete in SQL vs Pandas

df = df.loc[df['col_1'] > 9]

DELETE FROM tab WHERE col_1 > 9;

Delete rows with condition

df1.drop(df1[(df1.id.isin(df2.id) & (df1.id==1))].index)

DELETE t1 FROM df1 as t1 JOIN df2 as t2 ON t1.id = t2.id WHERE t2.id = 1;

Delete rows with condition based on another table / dataframe

Insert

Insert in SQL vs Pandas

data = {'col_1': 1, 'col_2': '11'}

df = df.append(data, ignore_index = True)

INSERT INTO tab(col_1, col_2) VALUES (1, '11');

Add new data/rows to a table/dataframe

Now we can explore the conversion between Pandas and SQL.

The SQL syntax is based and tested on MySQL 8.

Additional resources:

Pandas equivalent of SQL

Table below shows the most used equivalents between SQL and Pandas:

Category	Pandas	SQL
Data Structure	DataFrames, Series	Tables, Row, Column
Querying	.loc , .iloc , boolean indexing	SELECT, WHERE
Filtering	.query() , boolean indexing	WHERE
Sorting	.sort_values() , .sort_index()	ORDER BY
Grouping	.groupby() , .agg()	GROUP BY, AGGREGATE
Joins	.merge() , .join(), .concat()	JOIN, UNION
Aggregations	.agg() , .apply()	AGGREGATE
Data Transformation	.apply() , .map() , .replace()	replace(`column`, 'old', 'new')
Count Unique	.nunique(), .agg(['count', 'nunique'])	count(distinct)
Data Cleaning	.dropna() , .fillna()	IS NULL; IS NOT NULL;
Data Type Conversion	.astype()	CAST

Differences between SQL and Pandas

Usually there are multiple ways to achieve something in Pandas or SQL. Pandas offers a bigger variety of options.

Often the solution depends on the performance.

There are some key differences when you work with SQL or Pandas:

Join in SQL vs Pandas

(1) In Pandas if both key columns contain rows with NULL value, those rows will be matched against each other.

This is not the case for SQL join behavior and can lead to unexpected results in Pandas

Copies vs. in place operations

Usually in Pandas operations return copies of the Series/DataFrame. To change this behavior we can use:

df.sort_values("col_1", inplace=True)

Create a new DataFrame:

sorted_df = df.sort_values("col1")

or update the original one:

df = df.sort_values("col1")

Pandas to SQL

To convert or export Pandas DataFrame to SQL we can use method: to_sql():

There are several important parameters which need to be used for this method:

name - table name in the database
con - DB connection. sqlalchemy.engine.(Engine or Connection) or sqlite3.Connection
if_exists - behavior if the table exists in the DB
index - convert DataFrame index to a table column
chunksize - number of rows in each batch to be written at a time

A very basic example is given below:

from sqlalchemy import create_engine
engine = create_engine('sqlite://', echo=False)

df = pd.DataFrame({'name' : ['User 1', 'User 2', 'User 3']})

df.to_sql('users', con=engine)

We can check results by using SQL query like:

engine.execute("SELECT * FROM users").fetchall()

result:

[(0, 'User 1'), (1, 'User 2'), (2, 'User 3')]

The full documentation is available on this link: to_sql()

Pandas from SQL

Pandas offers method read_sql() to get data from SQL or DB to a DataFrame or Series.

The method can be used to read SQL connection and fetch data:

pd.read_sql('SELECT col_1, col_2 FROM tab', conn)

where conn is SQLAlchemy connectable, str, or sqlite3 connection.

To find more you can check: pandas.read_sql()

Pandas and SQL with SQLAlchemy and PyMySQL

Alternatively we can convert SQL table to Pandas DataFrame with SQLAlchemy and PyMySQL.

To do so check: How to Convert MySQL Table to Pandas DataFrame / Python Dictionary

It includes many different step by step examples and options.

You can find how to connect Python to SQL using libraries - SQLAlchemy and PyMySQL.

Run SQL code in Pandas

To run SQL syntax in Python and Pandas we need to install Python package - pandasql by:

pip install -U pandasql

The package allows us to query Pandas DataFrames using SQL syntax.

It uses SQLite syntax. More information can be found here - pandasql.

Below we can see a basic example of running SQL syntax to Pandas DataFrame :

from pandasql import sqldf
import pandas as pd

q = "SELECT * FROM df LIMIT 5"
sqldf(q, globals())

Method sqldf() requires 2 parameters:

the SQL query
globals() or locals() function

Generate SQL statements in Pandas

SQL create table

To generate SQL statements from Pandas DataFrame We can use the method pd.io.sql.get_schema().

So the get the create SQL statement for a given DataFrame we can use:

pd.io.sql.get_schema(df.reset_index(), 'tab')

where 'tab' is the name of the DB table.

SQL insert table

Generating SQL insert statements from a DataFrame can be achieved by:

sql_insert = []
for index, row in df.iterrows():       
    sql_insert.append('INSERT INTO `tab` ('+ str(', '.join(df.columns))+ ') VALUES '+ str(tuple(row.values)))

Where:

df is the name of the source DataFrame
tab is the target table

We are iterating over all rows of the DataFrame and generating insert statements.

Conclusion

We covered basic and advanced operations with Pandas and SQL. You can use this to quickly transfer SQL to Pandas and the reverse.

If you like content like this - make sure to subscribe to get the latest updates and resources. This post is only a part from a serires related to cheat sheets related to data science.

> Basic concepts

> Installations

> Series

> DataFrame

> Create

> Data Types

> Exercise

> Cheat Sheet

> Basic concepts

> Row

> Column

> Index

> MultiIndex

> Exercise

> Basic concepts

> read_csv()

> read_excel()

> Kaggle

> Exercise

> read_xml()

> read_json()

> to_csv()

> to_dict()

> to_json()

> Basic concepts

> groupby()

> Reshape

> melt()

> Exercise

> Pivot

> merge()

> Filter

> Basic concepts

> replace()

> split()

> Regex

> Search

> Exercise

> Find

> Basic concepts

> apply()

> aggfunc

> Convert

> count()

> Other

> Exercise

> map()

> Basic concepts

> Data Validation

> Data Cleaning

> Duplicate

> Time Series

> Pandas Error

> Get

> Basic concepts

> Styling

> Table

> Display

> DataIsBeautiful

> Beginners

> Data Science Projects

> Newsletter

Select

Where

Like, and, or

Group by

Join

Union

Limit

Update

Delete

Insert

Pandas equivalent of SQL

Differences between SQL and Pandas

Join in SQL vs Pandas

Copies vs. in place operations

Pandas to SQL

Pandas from SQL

Pandas and SQL with SQLAlchemy and PyMySQL

Run SQL code in Pandas