In this short tutorial, we'll see how to extract tables from PDF files with Python and Pandas.

We will cover two cases of table extraction from PDF:

(1) Simple table with tabula-py

from tabula import read_pdf
df_temp = read_pdf('china.pdf')

(2) Table with merged cells

import pandas as pd
html_tables = pd.read_html(page)

Let's cover both examples in more detail as context is important.

Nice video on the topic: Easily extract tables from websites with pandas and python

Notebook: Scrape wiki tables with pandas and python.ipynb

1: Extract tables from PDF with Python

In this example we will extract multiple tables from remote PDF file: china.pdf.

We will use library called: tabula-py which can be installed by:

pip install tabula-py

The .pdf file contains 2 table:

  • smaller one
  • bigger one with merged cells
from tabula import read_pdf

file = ''

df_temp = read_pdf(file, stream=True)

After reading the data we can get a list of DataFrames which contain table data.

Let's check the first one:

FLA Audit Profile Unnamed: 0
0 Country China
1 Factory name 01001523B
2 IEM BVCPS (HK), Shen Zhen Office
3 Date of audit May 20-22, 2003
4 PC(s) adidas-Salomon
5 Number of workers 243
6 Product(s) Scarf, cap, gloves, beanies and headbands
7 Production processes Sewing, cutting, packing, embroidery, die-cutting

Which is the exact match of the first table from the PDF file.


While the second one is a bit weird. The reason is because of the merged cells which are extracted as NaN values:

Unnamed: 0 Unnamed: 1 Unnamed: 2 Findings Unnamed: 3
0 FLA Code/ Compliance issue Legal Reference / Country Law FLA Benchmark Monitor's Findings NaN
1 1. Code Awareness NaN NaN NaN NaN
2 2. Forced Labor NaN NaN NaN NaN
3 3. Child Labor NaN NaN NaN NaN
4 4. Harassment or Abuse NaN NaN NaN NaN


How to workaround this problem we will see in the next step.
Some cells are extracted to multiple rows as we can see from the image:

2: Extract tables from PDF - keep format

Often tables in PDF files have:

  • strange format
  • merged cells
  • strange symbols

Most libraries and software are not able to extract them in a reliable way.

To extract complex table from PDF files with Python and Pandas we will do:

  • download the file (it's possible without download)
  • convert the PDF file to HTML
  • extract the tables with Pandas

2.1 Convert PDF to HTML

First we will download the file from: china.pdf.

Then we will convert it to HTML with the library: pdftotree.

import pdftotree

page = pdftotree.parse('china.pdf', html_path=None, model_type=None, model_path=None, visualize=False)

library can be installed by:

pip install pdftotree

2.2 Extract tables with Pandas

Finally we can read all the tables from this page with Pandas:

import pandas as pd
html_tables = pd.read_html(page)

Which will give us better results in comparison to tabula-py


2.3 HTMLTableParser

As alternatively to Pandas, we can use the library: html-table-parser-python3 to parse the HTML tables to Python lists.

from html_table_parser.parser import HTMLTableParser

p = HTMLTableParser()

it convert the HTML table to Python list:

[['', ''], ['Country', 'China'], ['Factory  name', '01001523B'], ['IEM', 'BVCPS  (HK),  Shen  Zhen  Office'], ['Date  of  audit', 'May  20-22,  2003'], ['PC(s)', 'adidas-Salomon'], ['Number  of  workers', '243'], ['Product(s)', 'Scarf,  cap,  gloves,  beanies  and  headbands']]

Now we can convert the list to Pandas DataFrame:

import pandas as pd

To install this library we can do:

pip install html-table-parser-python3

There are two differences to Pandas:

  • returns list of values
  • instead of NaN values - there are empty strings

3. Python Libraries for extraction from PDF files

Finally let's find a list of useful Python libraries which can help in PDF parsing and extraction:

3.1 Python PDF parsing

3.2 Parse HTML tables

3.3 Example PDF files

Finally you can find example PDF files where you can test table extraction with Python and Pandas:

tabula test PDF files