The PDF Data Extraction Problem
PDFs are the most common business document format, but extracting structured data from them is notoriously difficult because:
- PDFs store visual layout, not data structure. A table in a PDF is just text positioned at specific coordinates.
- There's no standard way to encode tables, forms, or structured data in PDF format.
- Scanned PDFs are just images — you need OCR before extraction.
- Headers, footers, and multi-page tables add complexity.
This tutorial covers four approaches, from simple to advanced:
- 1. PyMuPDF — built-in table detection (free, fast)
- 2. Camelot/Tabula — specialized table extractors (free)
- 3. AI-based extraction — GPT-4o for complex/unstructured documents
- 4. Reformat's AI Data Extractor — upload and extract with zero code
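Before committing to an approach, it's worth checking whether a PDF even has a text layer; scanned documents need OCR before any of the methods below will work. Here is a minimal heuristic sketch (the 25-character threshold and the function name are assumptions, not a standard API) that works on per-page text, such as what PyMuPDF's page.get_text() returns:

```python
def looks_scanned(page_texts, min_chars=25):
    """Heuristic: a PDF is probably a scan (image-only) if most of its
    pages yield almost no extractable text.

    page_texts: list of strings, one per page (e.g. from page.get_text()).
    min_chars: pages with fewer stripped characters count as "empty".
    """
    if not page_texts:
        return True
    empty = sum(1 for t in page_texts if len(t.strip()) < min_chars)
    # Treat the file as scanned when over half the pages are near-empty
    return empty / len(page_texts) > 0.5

# A digital PDF yields real text; a scan yields blank pages
print(looks_scanned(["Invoice #1042 from ACME Corp, total $1,250.00",
                     "Item list continues on this page"]))  # → False
print(looks_scanned(["", " ", ""]))  # → True
```

In practice you'd run OCR (e.g. Tesseract) over any file this flags before attempting extraction.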
Approach 1 — PyMuPDF Built-in Table Detection
PyMuPDF (imported as fitz) added built-in table detection in version 1.23. It's fast and works well for simple tables:
```python
import fitz  # PyMuPDF

def extract_tables(pdf_path):
    doc = fitz.open(pdf_path)
    all_tables = []
    for page_num, page in enumerate(doc):
        # find_tables() returns a TableFinder; .tables holds the detected tables
        for table in page.find_tables().tables:
            rows = table.extract()  # list of rows, each a list of cell strings
            all_tables.append({
                "page": page_num + 1,
                "headers": rows[0] if rows else [],
                "rows": rows[1:] if len(rows) > 1 else []
            })
    doc.close()
    return all_tables

tables = extract_tables("invoice.pdf")
for t in tables:
    print(f"Page {t['page']}: {len(t['rows'])} rows")
    print(f"Headers: {t['headers']}")
```
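The rows returned by table.extract() frequently contain None for empty cell regions, plus stray whitespace. A small cleanup pass helps before you use the data; clean_table below is a hypothetical helper, not part of PyMuPDF:

```python
def clean_table(rows):
    """Normalize raw rows from table.extract(): replace None with "",
    strip whitespace, and drop rows that are entirely empty."""
    cleaned = []
    for row in rows:
        cells = [(c or "").strip() for c in row]
        if any(cells):  # keep the row only if at least one cell has content
            cleaned.append(cells)
    return cleaned

raw = [["Item ", None, " Price"], [None, None, ""], ["Widget", "2", "$9.99 "]]
print(clean_table(raw))
# → [['Item', '', 'Price'], ['Widget', '2', '$9.99']]
```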
Best for: Simple tables with clear borders. Struggles with borderless tables and complex layouts.

Approach 2 — AI-Powered Extraction with GPT-4o
For invoices, receipts, and unstructured documents, AI is the most accurate approach:
```python
from openai import OpenAI
import fitz
import json

client = OpenAI()

def ai_extract(pdf_path, extraction_type="invoice"):
    # Extract the raw text from the PDF
    doc = fitz.open(pdf_path)
    text = "\n".join(page.get_text() for page in doc)
    doc.close()

    prompts = {
        "invoice": "Extract: invoice number, date, vendor, line items (description, qty, price, total), subtotal, tax, grand total. Return JSON.",
        "receipt": "Extract: store name, date, items (name, price), subtotal, tax, total, payment method. Return JSON.",
        "resume": "Extract: name, email, phone, skills, work experience, education. Return JSON."
    }

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract structured data. Return valid JSON only."},
            {"role": "user", "content": f"{prompts[extraction_type]}\n\nDocument:\n{text[:30000]}"}
        ],
        response_format={"type": "json_object"},
        temperature=0.1
    )
    return json.loads(response.choices[0].message.content)

# Extract invoice data
data = ai_extract("invoice.pdf", "invoice")
print(json.dumps(data, indent=2))
```
Setting response_format={"type": "json_object"} guarantees the model emits syntactically valid JSON, so json.loads never fails. It does not enforce a schema, however, so the field names and types can still vary between runs.
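Since JSON mode guarantees syntax but not structure, it pays to validate the result before using it downstream. A sketch of what that could look like for invoices (the field names match the prompt above; the helper itself is hypothetical):

```python
def validate_invoice(data, required=("invoice_number", "grand_total", "line_items")):
    """Check that the model returned the fields we asked for and that
    the grand total parses as a number. Returns (is_valid, problems)."""
    problems = [f"missing field: {k}" for k in required if k not in data]
    # Tolerate currency formatting like "$1,250.00" in the numeric check
    try:
        float(str(data.get("grand_total", "")).replace("$", "").replace(",", ""))
    except ValueError:
        problems.append("grand_total is not numeric")
    return (not problems, problems)

ok, issues = validate_invoice({"invoice_number": "INV-7",
                               "grand_total": "$1,250.00",
                               "line_items": []})
print(ok, issues)  # → True []
```

Failing documents can then be routed to a retry with a stricter prompt, or to manual review.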
Converting Extracted Data to Excel or CSV
Once you have structured data, convert it to a spreadsheet:
```python
import pandas as pd

def to_excel(extracted_data, output_path):
    # Only invoice-style results with line items can be tabulated
    if "line_items" not in extracted_data:
        print("No line items found; nothing to export")
        return
    df = pd.DataFrame(extracted_data["line_items"])
    # Append a summary row carrying the grand total
    summary = pd.DataFrame([{
        "description": "TOTAL",
        "total": extracted_data.get("grand_total", "")
    }])
    df = pd.concat([df, summary], ignore_index=True)
    df.to_excel(output_path, index=False)
    print(f"Saved to {output_path}")

to_excel(data, "invoice_data.xlsx")
```
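AI extraction occasionally mis-reads a digit, so a cheap arithmetic cross-check on the line items catches many errors before the spreadsheet ships. A sketch, assuming each item carries the qty, price, and total keys requested in the prompt:

```python
def check_line_items(line_items, tolerance=0.01):
    """Flag line items where qty * price disagrees with the stated total.
    Returns a list of (index, expected_total, stated_total) tuples."""
    bad = []
    for i, item in enumerate(line_items):
        expected = float(item["qty"]) * float(item["price"])
        if abs(expected - float(item["total"])) > tolerance:
            bad.append((i, expected, float(item["total"])))
    return bad

items = [{"qty": 2, "price": 9.99, "total": 19.98},
         {"qty": 3, "price": 5.00, "total": 20.00}]  # second row is wrong
print(check_line_items(items))  # → [(1, 15.0, 20.0)]
```

Any flagged rows are good candidates for a second extraction pass or manual review.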
Or skip all the code and use Reformat's PDF to Excel or AI Data Extractor tool — upload your PDF and download a clean spreadsheet in seconds.
Batch Processing Multiple PDFs
For processing many documents at once:
```python
import os
import json
from pathlib import Path

def batch_extract(input_dir, output_dir, extraction_type="invoice"):
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    pdfs = [f for f in os.listdir(input_dir) if f.lower().endswith(".pdf")]
    results = []
    for pdf_file in pdfs:
        pdf_path = os.path.join(input_dir, pdf_file)
        try:
            data = ai_extract(pdf_path, extraction_type)
            data["source_file"] = pdf_file
            results.append(data)
            print(f"Extracted: {pdf_file}")
        except Exception as e:
            print(f"Failed: {pdf_file} — {e}")
    # Save all results as one JSON file
    with open(os.path.join(output_dir, "all_extracted.json"), "w") as f:
        json.dump(results, f, indent=2)
    print(f"Processed {len(results)}/{len(pdfs)} files")

batch_extract("./invoices/", "./extracted/", "invoice")
```
This processes all PDFs in a directory and outputs a single JSON file with all extracted data. At gpt-4o-mini prices, processing 100 invoices costs about $0.50.
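In batch runs the occasional API call will fail transiently (rate limits, timeouts), so it's worth wrapping ai_extract in a retry rather than losing the whole file on the first error. A generic sketch; the helper and its defaults are assumptions, not part of the OpenAI SDK:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff
    (base_delay, 2*base_delay, 4*base_delay, ...)."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries, re-raise the last error
            time.sleep(base_delay * (2 ** attempt))

# Example: a flaky call that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_retries(flaky, attempts=4, base_delay=0.01))  # → ok
```

Inside the batch loop you'd write data = with_retries(lambda: ai_extract(pdf_path, extraction_type)), so one transient failure no longer marks the file as failed.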