OCR in Python: Extract Text from Images and Scanned PDFs

A practical guide to extracting text from images and scanned PDF documents using Python, Tesseract OCR, and modern AI-based alternatives.

March 19, 2026

When Do You Need OCR?

Optical Character Recognition (OCR) converts images of text into machine-readable text. You need it when:

  • You have scanned PDFs that are just images wrapped in a PDF container
  • You need to extract text from photos of documents, receipts, or whiteboards
  • You're building a document processing pipeline that handles both digital and scanned files
  • You want to make scanned documents searchable

The two main approaches in 2026 are:

  1. Tesseract OCR: free, open-source, runs locally, good for clean documents
  2. AI-based OCR (GPT-4o Vision, Google Cloud Vision): better accuracy on messy input such as handwriting, but billed per request

Setting Up Tesseract on Your System

Tesseract is the most widely used open-source OCR engine. Install it first:

Ubuntu/Debian:
sudo apt update

sudo apt install tesseract-ocr tesseract-ocr-eng

macOS:
brew install tesseract

Windows:

Download the installer from the official UB-Mannheim repository on GitHub.

Verify the installation:

tesseract --version

# tesseract 5.x.x

Then install the Python wrapper (pytesseract) along with Pillow and pdf2image, which the examples below use:

pip install pytesseract Pillow pdf2image
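
Because pytesseract only wraps the command-line tool, it can help to confirm from Python that the tesseract binary is reachable before running any OCR code. The sketch below uses only the standard library; the helper names find_tesseract and tesseract_version are made up for illustration.

```python
import shutil
import subprocess


def find_tesseract(binary="tesseract"):
    """Return the full path to the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(binary)


def tesseract_version(binary="tesseract"):
    """Return the first line of `tesseract --version`, or None if unavailable."""
    path = find_tesseract(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Depending on the build, the version banner goes to stdout or stderr.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


print(tesseract_version() or "Tesseract not found on PATH")
```

If the binary lives somewhere that is not on PATH (common on Windows), pytesseract can be pointed at it explicitly by setting pytesseract.pytesseract.tesseract_cmd to the full path of the executable.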

Basic Image to Text Extraction

The simplest use case — extract text from a single image:

import pytesseract
from PIL import Image

# Open image
img = Image.open("document.png")

# Extract text
text = pytesseract.image_to_string(img)
print(text)

For better accuracy, preprocess the image first:

import pytesseract
from PIL import Image, ImageFilter, ImageEnhance

def preprocess_for_ocr(image_path):
    img = Image.open(image_path)

    # Convert to grayscale
    img = img.convert("L")

    # Increase contrast
    enhancer = ImageEnhance.Contrast(img)
    img = enhancer.enhance(2.0)

    # Sharpen
    img = img.filter(ImageFilter.SHARPEN)

    # Binarize (convert to pure black and white)
    img = img.point(lambda x: 0 if x < 128 else 255)

    return img

img = preprocess_for_ocr("receipt.jpg")
text = pytesseract.image_to_string(img)
print(text)

Preprocessing like this can improve accuracy noticeably on photos and scanned documents, often in the range of 20-40%, though the gain depends heavily on the source image.
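
One way to avoid hand-tuning the binarization cutoff is Otsu's method, which picks the threshold from the image's own histogram. The sketch below is a plain-Python version of that idea, swapped in for the fixed value of 128 used above. The function names are illustrative, and binarize_otsu expects a Pillow image.

```python
def otsu_threshold(histogram):
    """Pick a binarization threshold from a 256-bin grayscale histogram (Otsu's method)."""
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))

    best_threshold = 0
    best_variance = 0.0
    dark_count = 0      # number of pixels at or below the candidate threshold
    dark_weighted = 0   # sum of the gray levels of those pixels

    for level, count in enumerate(histogram):
        dark_count += count
        dark_weighted += level * count
        if dark_count == 0:
            continue  # no pixels in the dark class yet
        light_count = total - dark_count
        if light_count == 0:
            break  # every pixel is already in the dark class

        dark_mean = dark_weighted / dark_count
        light_mean = (weighted_total - dark_weighted) / light_count
        # Between-class variance: large when the two groups are well separated.
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level

    return best_threshold


def binarize_otsu(img):
    """Binarize a Pillow image using the Otsu threshold instead of a fixed cutoff."""
    gray = img.convert("L")
    threshold = otsu_threshold(gray.histogram())
    return gray.point(lambda x: 255 if x > threshold else 0)


# Usage, replacing the fixed-threshold step in preprocess_for_ocr():
# img = binarize_otsu(Image.open("receipt.jpg"))
```

A global threshold like this still struggles with uneven lighting across a page, so it is worth comparing the output against the fixed cutoff on a few sample images.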

Extracting Text from Scanned PDFs

Scanned PDFs are trickier — you need to convert each page to an image first, then run OCR:

from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path):
    # Convert PDF pages to images (300 DPI for good quality)
    images = convert_from_path(pdf_path, dpi=300)

    full_text = []
    for i, page_image in enumerate(images):
        text = pytesseract.image_to_string(page_image)
        full_text.append(f"--- Page {i + 1} ---\n{text}")

    return "\n\n".join(full_text)

result = ocr_pdf("scanned-contract.pdf")
print(result)
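
Not every PDF needs OCR. If a pipeline receives both digital and scanned files, one option is to read the embedded text layer first and fall back to OCR only when almost nothing comes out. The sketch below assumes the third-party pypdf package, which is not used elsewhere in this article. The name looks_scanned and the 20-character cutoff are illustrative choices, not established defaults.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is image-only from the text extracted per page."""
    if not page_texts:
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


def extract_text_layer(pdf_path):
    """Return the embedded text of each page (empty strings for image-only pages)."""
    # Imported here so looks_scanned() can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


# Usage (assumes ocr_pdf() from the example above is defined):
# texts = extract_text_layer("scanned-contract.pdf")
# result = ocr_pdf("scanned-contract.pdf") if looks_scanned(texts) else "\n\n".join(texts)
```

Averaging over pages is a rough heuristic: a mostly scanned file with one digital cover page could still pass the check, so a per-page decision may suit mixed documents better.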

Important: pdf2image requires poppler-utils to be installed:
# Ubuntu

sudo apt install poppler-utils

# macOS

brew install poppler
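
For long PDFs, rendering every page at 300 DPI in a single call can use a lot of memory. A possible workaround is to convert and OCR a few pages at a time using the first_page and last_page arguments of convert_from_path. The sketch below assumes that approach; the helper names and the default batch size of 10 are illustrative.

```python
def page_ranges(total_pages, batch_size):
    """Yield 1-indexed, inclusive (first_page, last_page) pairs covering the document."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_pdf_in_batches(pdf_path, batch_size=10, dpi=300):
    """OCR a PDF a few pages at a time so only one batch of images is in memory."""
    # Imported here so page_ranges() works without the OCR stack installed.
    import pytesseract
    from pdf2image import convert_from_path, pdfinfo_from_path

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    parts = []
    for first, last in page_ranges(total_pages, batch_size):
        images = convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
        for offset, page_image in enumerate(images):
            text = pytesseract.image_to_string(page_image)
            parts.append(f"--- Page {first + offset} ---\n{text}")
    return "\n\n".join(parts)


# Usage:
# result = ocr_pdf_in_batches("scanned-contract.pdf", batch_size=5)
```

Smaller batches lower peak memory at the cost of invoking poppler more often.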

Handling Multiple Languages

Tesseract supports 100+ languages. Install the language packs you need:

# Install German and French

sudo apt install tesseract-ocr-deu tesseract-ocr-fra

# List available languages

tesseract --list-langs

Then specify the language in your code:

# Single language
text = pytesseract.image_to_string(img, lang="deu")

# Multiple languages (if document mixes languages)
text = pytesseract.image_to_string(img, lang="eng+deu+fra")
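
Requesting a language whose pack is not installed makes Tesseract fail at OCR time, so it can be useful to check the lang string up front. The sketch below relies on pytesseract.get_languages, which is available in recent pytesseract releases. The helper names missing_languages and check_languages are hypothetical.

```python
def missing_languages(lang_spec, installed):
    """Return the codes in an 'eng+deu' style string that are not installed."""
    requested = [code for code in lang_spec.split("+") if code]
    available = set(installed)
    return [code for code in requested if code not in available]


def check_languages(lang_spec):
    """Raise a clear error if any requested Tesseract language pack is missing."""
    import pytesseract  # imported here so missing_languages() has no dependencies

    missing = missing_languages(lang_spec, pytesseract.get_languages(config=""))
    if missing:
        raise RuntimeError(f"Missing Tesseract language packs: {', '.join(missing)}")


# Usage:
# check_languages("eng+deu+fra")
# text = pytesseract.image_to_string(img, lang="eng+deu+fra")
```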

For best results with non-Latin scripts (Arabic, Chinese, Japanese), use the trained data files from the tessdata_best repository instead of the smaller tessdata_fast models that many packages install by default.

When to Use AI-Based OCR Instead

Tesseract struggles with:

  • Handwritten text
  • Low-quality photos with poor lighting
  • Complex layouts (tables, multi-column documents)
  • Curved or distorted text

For these cases, AI vision models are significantly better. Here's how to use GPT-4o-mini for OCR:

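from openai import OpenAI
import base64
import mimetypes

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def ai_ocr(image_path):
    # Detect the MIME type so JPEG input is not mislabeled as PNG
    mime_type = mimetypes.guess_type(image_path)[0] or "image/png"

    # Encode the image as base64 so it can be sent inline as a data URL
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{b64}"}},
                {"type": "text", "text": "Extract all text from this image. Preserve formatting."}
            ]
        }],
        max_tokens=4000
    )

    return response.choices[0].message.content

text = ai_ocr("handwritten-notes.jpg")
print(text)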

Cost: ~$0.001-0.005 per image with gpt-4o-mini. For free OCR without any setup, try Reformat's OCR tool — it handles images and scanned PDFs with a single upload.
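
Since AI OCR is billed per request, one option is to try Tesseract first and send an image to the AI model only when Tesseract seems unsure. The sketch below uses the per-word confidence values returned by pytesseract.image_to_data as that signal. The function names and the cutoff of 60 are assumptions for illustration and would need tuning on real documents.

```python
def mean_confidence(data):
    """Average word confidence (0-100) from image_to_data() output in dict form."""
    scores = []
    for conf, word in zip(data["conf"], data["text"]):
        score = float(conf)  # some pytesseract versions return strings here
        # Tesseract reports -1 for layout boxes that contain no recognized word.
        if score >= 0 and word.strip():
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0


def ocr_with_fallback(image_path, fallback, min_confidence=60.0):
    """Try Tesseract first; call fallback(image_path) if confidence is low."""
    import pytesseract  # imported here so mean_confidence() has no dependencies
    from PIL import Image

    img = Image.open(image_path)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    if mean_confidence(data) >= min_confidence:
        return pytesseract.image_to_string(img)
    return fallback(image_path)


# Usage (assumes ai_ocr() from the example above is defined):
# text = ocr_with_fallback("handwritten-notes.jpg", fallback=ai_ocr)
```

An image with no recognized words scores 0.0 here and therefore always goes to the fallback, which matches the intent for handwriting that Tesseract cannot read at all.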

Conclusion

Use Tesseract when:
  • Documents are cleanly printed
  • You need offline/local processing
  • Budget is zero

Use AI OCR when:
  • Documents have handwriting, tables, or complex layouts
  • Accuracy is critical
  • You're processing small volumes

For a quick one-off extraction without installing anything, Reformat's Image to Text (OCR) tool handles both approaches — just upload your file and get the text instantly.
