AI/ML

How to Extract Structured Data from PDFs with AI

Learn how to extract tables, invoices, receipts, and form data from PDF documents using Python and AI — with code examples for every approach.

March 16, 20263 min read

The PDF Data Extraction Problem

PDFs are the most common business document format, but extracting structured data from them is notoriously difficult because:

  • PDFs store visual layout, not data structure. A table in a PDF is just text positioned at specific coordinates.
  • There's no standard way to encode tables, forms, or structured data in PDF format.
  • Scanned PDFs are just images — you need OCR before extraction.
  • Headers, footers, and multi-page tables add complexity.

This tutorial covers four approaches, from simple to advanced:

  • 1. PyMuPDF — built-in table detection (free, fast)
  • 2. Camelot/Tabula — specialized table extractors (free)
  • 3. AI-based extraction — GPT-4o for complex/unstructured documents
  • 4. Reformat's AI Data Extractor — upload and extract with zero code

Approach 1 — PyMuPDF Built-in Table Detection

PyMuPDF (fitz) added table detection in recent versions. It's fast and works well for simple tables:

import fitz  # PyMuPDF

import json

def extract_tables(pdf_path):

doc = fitz.open(pdf_path)

all_tables = []

for page_num, page in enumerate(doc):

tables = page.find_tables()

for table in tables:

rows = table.extract()

all_tables.append({

"page": page_num + 1,

"headers": rows[0] if rows else [],

"rows": rows[1:] if len(rows) > 1 else []

})

doc.close()

return all_tables

tables = extract_tables("invoice.pdf")

for t in tables:

print(f"Page {t['page']}: {len(t['rows'])} rows")

print(f"Headers: {t['headers']}")

Best for: Simple tables with clear borders. Struggles with borderless tables and complex layouts.

Approach 2 — AI-Powered Extraction with GPT-4o

For invoices, receipts, and unstructured documents, AI is the most accurate approach:

from openai import OpenAI

import fitz

import json

client = OpenAI()

def ai_extract(pdf_path, extraction_type="invoice"):

# Extract text from PDF

doc = fitz.open(pdf_path)

text = "\n".join(page.get_text() for page in doc)

doc.close()

prompts = {

"invoice": "Extract: invoice number, date, vendor, line items (description, qty, price, total), subtotal, tax, grand total. Return JSON.",

"receipt": "Extract: store name, date, items (name, price), subtotal, tax, total, payment method. Return JSON.",

"resume": "Extract: name, email, phone, skills, work experience, education. Return JSON."

}

response = client.chat.completions.create(

model="gpt-4o-mini",

messages=[

{"role": "system", "content": "Extract structured data. Return valid JSON only."},

{"role": "user", "content": f"{prompts[extraction_type]}\n\nDocument:\n{text[:30000]}"}

],

response_format={"type": "json_object"},

temperature=0.1

)

return json.loads(response.choices[0].message.content)

# Extract invoice data

data = ai_extract("invoice.pdf", "invoice")

print(json.dumps(data, indent=2))

The response_format={"type": "json_object"} ensures the output is always valid JSON, avoiding parsing errors.

Converting Extracted Data to Excel or CSV

Once you have structured data, convert it to a spreadsheet:

import pandas as pd

def to_excel(extracted_data, output_path):

# For invoice line items

if "line_items" in extracted_data:

df = pd.DataFrame(extracted_data["line_items"])

# Add summary row

summary = pd.DataFrame([{

"description": "TOTAL",

"total": extracted_data.get("grand_total", "")

}])

df = pd.concat([df, summary], ignore_index=True)

df.to_excel(output_path, index=False)

print(f"Saved to {output_path}")

to_excel(data, "invoice_data.xlsx")

Or skip all the code and use Reformat's PDF to Excel or AI Data Extractor tool — upload your PDF and download a clean spreadsheet in seconds.

Batch Processing Multiple PDFs

For processing many documents at once:

import os

from pathlib import Path

def batch_extract(input_dir, output_dir, extraction_type="invoice"):

Path(output_dir).mkdir(exist_ok=True)

pdfs = [f for f in os.listdir(input_dir) if f.endswith(".pdf")]

results = []

for pdf_file in pdfs:

pdf_path = os.path.join(input_dir, pdf_file)

try:

data = ai_extract(pdf_path, extraction_type)

data["source_file"] = pdf_file

results.append(data)

print(f"Extracted: {pdf_file}")

except Exception as e:

print(f"Failed: {pdf_file} — {e}")

# Save all results as one JSON file

with open(os.path.join(output_dir, "all_extracted.json"), "w") as f:

json.dump(results, f, indent=2)

print(f"Processed {len(results)}/{len(pdfs)} files")

batch_extract("./invoices/", "./extracted/", "invoice")

This processes all PDFs in a directory and outputs a single JSON file with all extracted data. At gpt-4o-mini prices, processing 100 invoices costs about $0.50.

Try These Tools

Mentioned in this article — free, no sign-up required.

pdf extractiondata extractioninvoice processingaipythonautomation

Related Articles