OCR in Python: Extract Text from Images and Scanned PDFs


A practical guide to extracting text from images and scanned PDF documents using Python, Tesseract OCR, and modern AI-based alternatives.

Reformat Team, March 19, 2026 (3 min read)

When Do You Need OCR?

Optical Character Recognition (OCR) converts images of text into machine-readable text. You need it when:

You have scanned PDFs that are just images wrapped in a PDF container
You need to extract text from photos of documents, receipts, or whiteboards
You're building a document processing pipeline that handles both digital and scanned files
You want to make scanned documents searchable
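For a pipeline that mixes digital and scanned files, one rough way to decide whether a PDF needs OCR is to attempt normal text extraction first and check how much text comes back. The sketch below assumes you already have per-page text from whatever PDF library you use; `needs_ocr` and its `min_chars_per_page` cutoff are illustrative names and values, not part of any library.

```python
def needs_ocr(page_texts, min_chars_per_page=25):
    """Guess whether a PDF is scanned, given the text extracted from each page.

    page_texts: list of strings, one per page, from a normal (non-OCR)
    text extractor. Scanned pages usually come back empty or near-empty.
    The cutoff of 25 characters is an arbitrary starting point.
    """
    if not page_texts:
        return True  # nothing extractable at all
    # Count only non-whitespace characters so blank padding doesn't count as text
    sparse_pages = sum(
        1 for text in page_texts
        if len("".join(text.split())) < min_chars_per_page
    )
    # Treat the file as scanned when most pages have almost no text layer
    return sparse_pages > len(page_texts) / 2


print(needs_ocr(["", "  \n", ""]))
print(needs_ocr(["Invoice 2024-001 issued to Example GmbH for services rendered."]))
```

Files that mix scanned and digital pages may need a per-page decision instead of a per-file one.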

The two main approaches in 2026 are:

1. Tesseract OCR — free, open-source, runs locally, good for clean documents
2. AI-based OCR (GPT-4o Vision, Google Cloud Vision) — better accuracy on messy handwriting, costs per request

Setting Up Tesseract on Your System

Tesseract is the most widely used open-source OCR engine. Install it first:

Ubuntu/Debian:

sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-eng

macOS:

brew install tesseract

Windows:

Download the installer from the UB-Mannheim Tesseract repository on GitHub, and make sure the install directory is on your PATH so the tesseract command can be found.

Verify the installation:

tesseract --version
# tesseract 5.x.x

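Then install the Python wrapper (pytesseract) along with Pillow and pdf2image, which the examples below use for image and PDF handling: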

pip install pytesseract Pillow pdf2image
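If `pytesseract` later reports that Tesseract is not installed, the usual cause is that the `tesseract` binary is not on your `PATH`. A minimal check using only the standard library; the helper name `find_tesseract` is made up for this example:

```python
import shutil
import subprocess


def find_tesseract():
    """Return the path to the tesseract binary, or None if it is not on PATH."""
    return shutil.which("tesseract")


path = find_tesseract()
if path is None:
    print("tesseract not found on PATH")
else:
    # Same check as running `tesseract --version` in a shell
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some builds print the version banner to stderr rather than stdout
    banner = (result.stdout or result.stderr).splitlines()
    print(banner[0] if banner else "no version output")
```

If the binary lives somewhere that is not on `PATH`, pytesseract lets you set the location explicitly, for example `pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"` (adjust the path to match your install).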

Basic Image to Text Extraction

The simplest use case — extract text from a single image:

import pytesseract
from PIL import Image

# Open image
img = Image.open("document.png")

# Extract text
text = pytesseract.image_to_string(img)
print(text)
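`image_to_string` returns only the text. When you also need to know how reliable each word is, pytesseract's `image_to_data` reports a per-word confidence. A small sketch of filtering on it; `confident_words`, `ocr_confident_text`, and the cutoff of 60 are illustrative choices rather than anything pytesseract prescribes:

```python
def confident_words(data, min_conf=60):
    """Keep words whose confidence is at least min_conf.

    data: the dict returned by
    pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT),
    which has parallel lists under "text" and "conf".
    """
    words = []
    for word, conf in zip(data["text"], data["conf"]):
        # conf may be an int, float, or string depending on the pytesseract
        # version; -1 marks layout rows that carry no word
        if word.strip() and float(conf) >= min_conf:
            words.append(word)
    return words


def ocr_confident_text(image_path, min_conf=60):
    # Imported here so confident_words can be used without the OCR stack
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return " ".join(confident_words(data, min_conf))


# Hand-written sample in the same shape as image_to_data's output
sample = {"text": ["", "Total", "4l.50", "EUR"], "conf": [-1, 96, 31, 88]}
print(confident_words(sample))  # ['Total', 'EUR']
```

Dropping low-confidence words trades recall for precision, so for receipts or forms you may prefer to flag them for review instead of discarding them.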

For better accuracy, preprocess the image first:

import pytesseract
from PIL import Image, ImageFilter, ImageEnhance

def preprocess_for_ocr(image_path):
    img = Image.open(image_path)
    
    # Convert to grayscale
    img = img.convert("L")
    
    # Increase contrast
    enhancer = ImageEnhance.Contrast(img)
    img = enhancer.enhance(2.0)
    
    # Sharpen
    img = img.filter(ImageFilter.SHARPEN)
    
    # Binarize (convert to pure black and white)
    img = img.point(lambda x: 0 if x < 128 else 255)
    
    return img

img = preprocess_for_ocr("receipt.jpg")
text = pytesseract.image_to_string(img)
print(text)

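Preprocessing typically improves accuracy by 20-40% on photos and scanned documents, though the gain depends on the input. The fixed threshold of 128 assumes fairly even lighting; on shadowed or low-contrast photos it can wipe out text, so adjust it for your images.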
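The fixed cutoff of 128 in `preprocess_for_ocr` suits evenly lit scans but not every photo. One alternative is Otsu's method, which picks the threshold from the image's own histogram. This is a plain-Python sketch rather than part of the original pipeline; `otsu_threshold` and `binarize_otsu` are names chosen here, and the only Pillow calls used are `convert`, `histogram`, and `point`:

```python
def otsu_threshold(hist):
    """Return the gray level (0-255) that best separates dark from light pixels.

    hist: 256 pixel counts, e.g. from a grayscale Pillow image's .histogram().
    """
    total = sum(hist)
    weighted_total = sum(level * count for level, count in enumerate(hist))
    best_level, best_variance = 0, -1.0
    dark_count = 0
    dark_sum = 0
    for level in range(256):
        dark_count += hist[level]
        dark_sum += level * hist[level]
        light_count = total - dark_count
        if dark_count == 0:
            continue  # no pixels at or below this level yet
        if light_count == 0:
            break  # every pixel is already in the dark class
        dark_mean = dark_sum / dark_count
        light_mean = (weighted_total - dark_sum) / light_count
        # Between-class variance (up to a constant factor)
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize_otsu(img):
    """Binarize a Pillow image using a threshold computed from its histogram."""
    gray = img.convert("L")
    level = otsu_threshold(gray.histogram())
    return gray.point(lambda x: 0 if x <= level else 255)
```

You could swap `binarize_otsu(img)` in for the fixed-threshold line in `preprocess_for_ocr`. Whether it helps depends on the image, so compare the OCR output both ways.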

Extracting Text from Scanned PDFs

Scanned PDFs are trickier — you need to convert each page to an image first, then run OCR:

from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path):
    # Convert PDF pages to images (300 DPI for good quality)
    images = convert_from_path(pdf_path, dpi=300)
    
    full_text = []
    for i, page_image in enumerate(images):
        text = pytesseract.image_to_string(page_image)
        full_text.append(f"--- Page {i + 1} ---\n{text}")
    
    return "\n\n".join(full_text)

result = ocr_pdf("scanned-contract.pdf")
print(result)
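`convert_from_path` renders every page up front, and at 300 DPI a long PDF can use a lot of memory. A possible variation is to render a few pages at a time using pdf2image's `first_page` and `last_page` arguments. The helper names and the batch size of 10 below are illustrative:

```python
def page_ranges(page_count, batch_size=10):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    for first in range(1, page_count + 1, batch_size):
        yield first, min(first + batch_size - 1, page_count)


def ocr_pdf_in_batches(pdf_path, dpi=300, batch_size=10):
    """Yield (page_number, text) one page at a time, rendering in small batches."""
    # Imported here so page_ranges stays usable without the OCR stack
    import pytesseract
    from pdf2image import convert_from_path, pdfinfo_from_path

    page_count = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_ranges(page_count, batch_size):
        images = convert_from_path(
            pdf_path, dpi=dpi, first_page=first, last_page=last
        )
        for offset, page_image in enumerate(images):
            yield first + offset, pytesseract.image_to_string(page_image)


print(list(page_ranges(23, batch_size=10)))  # [(1, 10), (11, 20), (21, 23)]
```

Because `ocr_pdf_in_batches` is a generator, you can write each page's text to disk as it arrives instead of holding the whole document in memory.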

Important: pdf2image requires poppler-utils to be installed:

# Ubuntu
sudo apt install poppler-utils

# macOS
brew install poppler

Handling Multiple Languages

Tesseract supports 100+ languages. Install the language packs you need:

# Install German and French
sudo apt install tesseract-ocr-deu tesseract-ocr-fra

# List available languages
tesseract --list-langs

Then specify the language in your code:

# Single language
text = pytesseract.image_to_string(img, lang="deu")

# Multiple languages (if document mixes languages)
text = pytesseract.image_to_string(img, lang="eng+deu+fra")
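Asking for a language whose pack is missing can cause Tesseract to fail or to run without that language. One way to guard against that is to compare the languages you want with what is installed, which pytesseract exposes through `get_languages()`. The helper `build_lang_arg` is a name invented for this sketch:

```python
def build_lang_arg(wanted, installed):
    """Join the wanted language codes that are installed into a lang string.

    wanted: codes in priority order, e.g. ["eng", "deu", "fra"].
    installed: codes reported by Tesseract, e.g. from pytesseract.get_languages().
    """
    available = [code for code in wanted if code in installed]
    if not available:
        raise ValueError(
            f"None of {wanted} are installed; found {sorted(installed)}"
        )
    return "+".join(available)


# In real use: installed = pytesseract.get_languages()
installed = ["eng", "deu", "osd"]
print(build_lang_arg(["eng", "deu", "fra"], installed))  # eng+deu
```

The result can be passed directly as the `lang` argument of `image_to_string`.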

For best results with non-Latin scripts (Arabic, Chinese, Japanese), use the trained data files from tessdata_best instead of the default tessdata_fast ones. The best models are slower but more accurate.
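To try the higher-accuracy models without replacing the system ones, one option is to download the relevant `.traineddata` files into a separate folder and point Tesseract at it with its `--tessdata-dir` option, passed through pytesseract's `config` argument. The folder path below is a placeholder and `tessdata_config` is a helper name made up for this sketch:

```python
def tessdata_config(tessdata_dir):
    """Build the config string that tells Tesseract where its language data lives."""
    # Quote the path so directories containing spaces survive argument splitting
    return f'--tessdata-dir "{tessdata_dir}"'


config = tessdata_config("/opt/tessdata_best")  # placeholder path
print(config)

# With an image loaded as in the earlier examples:
# text = pytesseract.image_to_string(img, lang="jpn", config=config)
```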

When to Use AI-Based OCR Instead

Tesseract struggles with:

Handwritten text
Low-quality photos with poor lighting
Complex layouts (tables, multi-column documents)
Curved or distorted text

For these cases, AI vision models are significantly better. Here's how to use GPT-4o-mini for OCR:

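from openai import OpenAI
import base64
import mimetypes

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ai_ocr(image_path):
    # Use the file's real MIME type (e.g. image/jpeg) in the data URL
    mime = mimetypes.guess_type(image_path)[0] or "image/png"
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},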
                {"type": "text", "text": "Extract all text from this image. Preserve formatting."}
            ]
        }],
        max_tokens=4000
    )
    return response.choices[0].message.content

text = ai_ocr("handwritten-notes.jpg")
print(text)
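Because AI OCR is billed per request, one option is to try Tesseract first and fall back to the vision model only when the local result looks unreliable. The sketch below assumes two callables you supply: `local_ocr(path)` returning `(text, mean_confidence)` and `remote_ocr(path)` returning text (for instance the `ai_ocr` function above). All names and the cutoff of 70 are illustrative.

```python
def mean_confidence(confs):
    """Average Tesseract word confidences, ignoring the -1 'no word' rows."""
    scores = [float(c) for c in confs if float(c) >= 0]
    return sum(scores) / len(scores) if scores else 0.0


def ocr_with_fallback(image_path, local_ocr, remote_ocr, min_conf=70):
    """Use the local result when it is confident enough, else call remote_ocr."""
    text, confidence = local_ocr(image_path)
    if text.strip() and confidence >= min_conf:
        return text, "local"
    return remote_ocr(image_path), "remote"


# Stand-in callables so the routing logic can be tried without any OCR engine
def fake_local(path):
    return "T0ta1 4l.5O", mean_confidence([-1, 42, 38])


def fake_remote(path):
    return "Total 41.50"


print(ocr_with_fallback("receipt.jpg", fake_local, fake_remote))
# ('Total 41.50', 'remote')
```

In practice `local_ocr` could wrap `pytesseract.image_to_data` and pass its `conf` column to `mean_confidence`. The right cutoff depends on your documents, so calibrate it on a sample you have checked by hand.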

Cost: roughly $0.001-0.005 per image with gpt-4o-mini, depending on image resolution and how much text comes back. For free OCR without any setup, try Reformat's OCR tool, which handles images and scanned PDFs with a single upload.