The PDF Data Extraction Problem
PDFs are the most common business document format, but extracting structured data from them is notoriously difficult because:
- PDFs store visual layout, not data structure. A table in a PDF is just text positioned at specific coordinates.
- There's no standard way to encode tables, forms, or structured data in PDF format.
- Scanned PDFs are just images — you need OCR before extraction.
- Headers, footers, and multi-page tables add complexity.
This tutorial covers four approaches, from simple to advanced:
- 1. PyMuPDF — built-in table detection (free, fast)
- 2. Camelot/Tabula — specialized table extractors (free)
- 3. AI-based extraction — GPT-4o for complex/unstructured documents
- 4. Reformat's AI Data Extractor — upload and extract with zero code
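Before committing to an approach, it's worth checking whether a PDF even has a text layer; scanned documents need OCR before any of the methods below will work. Here is a minimal heuristic sketch (the 25-character threshold and the function name are assumptions, not a standard API) that works on per-page text, such as what PyMuPDF's page.get_text() returns:

```python
def looks_scanned(page_texts, min_chars=25):
    """Heuristic: a PDF is probably a scan (image-only) if most of its
    pages yield almost no extractable text.

    page_texts: list of strings, one per page (e.g. from page.get_text()).
    min_chars: pages with fewer stripped characters count as "empty".
    """
    if not page_texts:
        return True
    empty = sum(1 for t in page_texts if len(t.strip()) < min_chars)
    # Treat the file as scanned when over half the pages are near-empty
    return empty / len(page_texts) > 0.5

# A digital PDF yields real text; a scan yields blank pages
print(looks_scanned(["Invoice #1042 from ACME Corp, total $1,250.00",
                     "Item list continues on this page"]))  # → False
print(looks_scanned(["", " ", ""]))  # → True
```

In practice you'd run OCR (e.g. Tesseract) over any file this flags before attempting extraction.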
Approach 1 — PyMuPDF Built-in Table Detection
PyMuPDF (imported as fitz) added built-in table detection in version 1.23. It's fast and works well for simple tables:
```python
import fitz  # PyMuPDF

def extract_tables(pdf_path):
    doc = fitz.open(pdf_path)
    all_tables = []
    for page_num, page in enumerate(doc):
        # find_tables() returns a TableFinder; .tables holds the detected tables
        for table in page.find_tables().tables:
            rows = table.extract()  # list of rows, each a list of cell strings
            all_tables.append({
                "page": page_num + 1,
                "headers": rows[0] if rows else [],
                "rows": rows[1:] if len(rows) > 1 else []
            })
    doc.close()
    return all_tables

tables = extract_tables("invoice.pdf")
for t in tables:
    print(f"Page {t['page']}: {len(t['rows'])} rows")
    print(f"Headers: {t['headers']}")
```
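The rows returned by table.extract() frequently contain None for empty cell regions, plus stray whitespace. A small cleanup pass helps before you use the data; clean_table below is a hypothetical helper, not part of PyMuPDF:

```python
def clean_table(rows):
    """Normalize raw rows from table.extract(): replace None with "",
    strip whitespace, and drop rows that are entirely empty."""
    cleaned = []
    for row in rows:
        cells = [(c or "").strip() for c in row]
        if any(cells):  # keep the row only if at least one cell has content
            cleaned.append(cells)
    return cleaned

raw = [["Item ", None, " Price"], [None, None, ""], ["Widget", "2", "$9.99 "]]
print(clean_table(raw))
# → [['Item', '', 'Price'], ['Widget', '2', '$9.99']]
```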
Best for: Simple tables with clear borders. Struggles with borderless tables and complex layouts.

Approach 2 — AI-Powered Extraction with GPT-4o
For invoices, receipts, and unstructured documents, AI is the most accurate approach:
```python
from openai import OpenAI
import fitz
import json

client = OpenAI()

def ai_extract(pdf_path, extraction_type="invoice"):
    # Extract the raw text from the PDF
    doc = fitz.open(pdf_path)
    text = "\n".join(page.get_text() for page in doc)
    doc.close()

    prompts = {
        "invoice": "Extract: invoice number, date, vendor, line items (description, qty, price, total), subtotal, tax, grand total. Return JSON.",
        "receipt": "Extract: store name, date, items (name, price), subtotal, tax, total, payment method. Return JSON.",
        "resume": "Extract: name, email, phone, skills, work experience, education. Return JSON."
    }

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract structured data. Return valid JSON only."},
            {"role": "user", "content": f"{prompts[extraction_type]}\n\nDocument:\n{text[:30000]}"}
        ],
        response_format={"type": "json_object"},
        temperature=0.1
    )
    return json.loads(response.choices[0].message.content)

# Extract invoice data
data = ai_extract("invoice.pdf", "invoice")
print(json.dumps(data, indent=2))
```
Setting response_format={"type": "json_object"} guarantees the model emits syntactically valid JSON, so json.loads never fails. It does not enforce a schema, however, so the field names and types can still vary between runs.
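Since JSON mode guarantees syntax but not structure, it pays to validate the result before using it downstream. A sketch of what that could look like for invoices (the field names match the prompt above; the helper itself is hypothetical):

```python
def validate_invoice(data, required=("invoice_number", "grand_total", "line_items")):
    """Check that the model returned the fields we asked for and that
    the grand total parses as a number. Returns (is_valid, problems)."""
    problems = [f"missing field: {k}" for k in required if k not in data]
    # Tolerate currency formatting like "$1,250.00" in the numeric check
    try:
        float(str(data.get("grand_total", "")).replace("$", "").replace(",", ""))
    except ValueError:
        problems.append("grand_total is not numeric")
    return (not problems, problems)

ok, issues = validate_invoice({"invoice_number": "INV-7",
                               "grand_total": "$1,250.00",
                               "line_items": []})
print(ok, issues)  # → True []
```

Failing documents can then be routed to a retry with a stricter prompt, or to manual review.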
Converting Extracted Data to Excel or CSV
Once you have structured data, convert it to a spreadsheet:
```python
import pandas as pd

def to_excel(extracted_data, output_path):
    # Only invoice-style results with line items can be tabulated
    if "line_items" not in extracted_data:
        print("No line items found; nothing to export")
        return
    df = pd.DataFrame(extracted_data["line_items"])
    # Append a summary row carrying the grand total
    summary = pd.DataFrame([{
        "description": "TOTAL",
        "total": extracted_data.get("grand_total", "")
    }])
    df = pd.concat([df, summary], ignore_index=True)
    df.to_excel(output_path, index=False)
    print(f"Saved to {output_path}")

to_excel(data, "invoice_data.xlsx")
```
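AI extraction occasionally mis-reads a digit, so a cheap arithmetic cross-check on the line items catches many errors before the spreadsheet ships. A sketch, assuming each item carries the qty, price, and total keys requested in the prompt:

```python
def check_line_items(line_items, tolerance=0.01):
    """Flag line items where qty * price disagrees with the stated total.
    Returns a list of (index, expected_total, stated_total) tuples."""
    bad = []
    for i, item in enumerate(line_items):
        expected = float(item["qty"]) * float(item["price"])
        if abs(expected - float(item["total"])) > tolerance:
            bad.append((i, expected, float(item["total"])))
    return bad

items = [{"qty": 2, "price": 9.99, "total": 19.98},
         {"qty": 3, "price": 5.00, "total": 20.00}]  # second row is wrong
print(check_line_items(items))  # → [(1, 15.0, 20.0)]
```

Any flagged rows are good candidates for a second extraction pass or manual review.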
Or skip all the code and use Reformat's PDF to Excel or AI Data Extractor tool — upload your PDF and download a clean spreadsheet in seconds.
Batch Processing Multiple PDFs
For processing many documents at once:
```python
import os
import json
from pathlib import Path

def batch_extract(input_dir, output_dir, extraction_type="invoice"):
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    pdfs = [f for f in os.listdir(input_dir) if f.lower().endswith(".pdf")]
    results = []
    for pdf_file in pdfs:
        pdf_path = os.path.join(input_dir, pdf_file)
        try:
            data = ai_extract(pdf_path, extraction_type)
            data["source_file"] = pdf_file
            results.append(data)
            print(f"Extracted: {pdf_file}")
        except Exception as e:
            print(f"Failed: {pdf_file} — {e}")
    # Save all results as one JSON file
    with open(os.path.join(output_dir, "all_extracted.json"), "w") as f:
        json.dump(results, f, indent=2)
    print(f"Processed {len(results)}/{len(pdfs)} files")

batch_extract("./invoices/", "./extracted/", "invoice")
```
This processes all PDFs in a directory and outputs a single JSON file with all extracted data. At gpt-4o-mini prices, processing 100 invoices costs about $0.50.
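In batch runs the occasional API call will fail transiently (rate limits, timeouts), so it's worth wrapping ai_extract in a retry rather than losing the whole file on the first error. A generic sketch; the helper and its defaults are assumptions, not part of the OpenAI SDK:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff
    (base_delay, 2*base_delay, 4*base_delay, ...)."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries, re-raise the last error
            time.sleep(base_delay * (2 ** attempt))

# Example: a flaky call that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_retries(flaky, attempts=4, base_delay=0.01))  # → ok
```

Inside the batch loop you'd write data = with_retries(lambda: ai_extract(pdf_path, extraction_type)), so one transient failure no longer marks the file as failed.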