When Do You Need OCR?
Optical Character Recognition (OCR) converts images of text into machine-readable text. You need it when:
- You have scanned PDFs that are just images wrapped in a PDF container
- You need to extract text from photos of documents, receipts, or whiteboards
- You're building a document processing pipeline that handles both digital and scanned files
- You want to make scanned documents searchable
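For the mixed-pipeline case, a common first step is deciding per file whether OCR is needed at all. The sketch below assumes you have already pulled whatever text layer exists from each page with a PDF library of your choice. The function name `needs_ocr` and the `min_chars_per_page` threshold are hypothetical, chosen for illustration only.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Return True if most pages have too little embedded text.

    page_texts: one string per page, as returned by a PDF text extractor
    (assumption: an empty or near-empty string means the page is an image).
    """
    if not page_texts:
        return True  # nothing extractable at all, so treat the file as scanned
    sparse = sum(1 for t in page_texts if len(t.strip()) < min_chars_per_page)
    # More than half the pages lacking text suggests a scanned document.
    return sparse > len(page_texts) / 2
```

Files for which this returns False can skip OCR entirely, which is both faster and more accurate than re-recognizing text that is already digital.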
The two main approaches in 2026 are:
1. Tesseract OCR: free, open-source, runs locally, good for clean documents
2. AI-based OCR (GPT-4o Vision, Google Cloud Vision): better accuracy on messy handwriting, costs per request
Setting Up Tesseract on Your System
Tesseract is the most widely used open-source OCR engine. Install it first:
Ubuntu/Debian:
sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-eng
macOS:
brew install tesseract
Windows:
Download the installer from the official UB-Mannheim repository on GitHub.
Verify the installation:
tesseract --version
# tesseract 5.x.x
Then install the Python wrapper:
pip install pytesseract Pillow pdf2image
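Since pytesseract only wraps the command-line program, it can help to confirm from Python that the `tesseract` executable is actually reachable. The following is a minimal sketch using only the standard library. The helper name `tesseract_version` is made up for this example.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if not found."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None  # executable is not on PATH
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some builds print the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None
```

If this returns None on Windows even though the installer ran, the install directory is probably not on PATH. In that case pytesseract can be pointed at the executable by setting `pytesseract.pytesseract.tesseract_cmd` to its full path.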
Basic Image to Text Extraction
The simplest use case — extract text from a single image:
import pytesseract
from PIL import Image
# Open image
img = Image.open("document.png")
# Extract text
text = pytesseract.image_to_string(img)
print(text)
For better accuracy, preprocess the image first:
import pytesseract
from PIL import Image, ImageFilter, ImageEnhance
def preprocess_for_ocr(image_path):
    img = Image.open(image_path)
    # Convert to grayscale
    img = img.convert("L")
    # Increase contrast
    enhancer = ImageEnhance.Contrast(img)
    img = enhancer.enhance(2.0)
    # Sharpen
    img = img.filter(ImageFilter.SHARPEN)
    # Binarize (convert to pure black and white)
    img = img.point(lambda x: 0 if x < 128 else 255)
    return img
img = preprocess_for_ocr("receipt.jpg")
text = pytesseract.image_to_string(img)
print(text)
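Preprocessing typically improves accuracy by 20-40% on photos and scanned documents, though the gain depends on the input: the fixed threshold of 128 suits evenly lit pages and may need tuning for dark or washed-out images.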
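The binarization step above uses a fixed cutoff of 128. One alternative, swapped in here as an illustration and not part of the original recipe, is Otsu's method, which picks the cutoff from the image's own histogram. The sketch below is pure Python and assumes a 256-bin histogram such as the one Pillow's `Image.histogram()` returns for a grayscale (mode "L") image. The function name `otsu_threshold` is hypothetical.

```python
def otsu_threshold(histogram):
    """Pick the cutoff that maximizes between-class variance (Otsu's method).

    histogram: 256 pixel counts from a grayscale image.
    Pixels with value < the returned threshold form the dark class.
    """
    total = sum(histogram)
    if total == 0:
        return 128  # empty histogram, fall back to the fixed cutoff
    sum_all = sum(level * count for level, count in enumerate(histogram))
    weight_dark = 0
    sum_dark = 0
    best_threshold, best_variance = 128, -1.0
    for level, count in enumerate(histogram):
        weight_dark += count
        sum_dark += level * count
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level + 1  # values <= level are dark
    return best_threshold

# Possible use inside preprocess_for_ocr, after img.convert("L"):
# threshold = otsu_threshold(img.histogram())
# img = img.point(lambda x: 0 if x < threshold else 255)
```

This tends to help when lighting varies between images, since each image gets its own cutoff. It still applies one global threshold per image, so strongly uneven lighting within a single photo may need a different approach.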
Extracting Text from Scanned PDFs
Scanned PDFs are trickier — you need to convert each page to an image first, then run OCR:
from pdf2image import convert_from_path
import pytesseract
def ocr_pdf(pdf_path):
    # Convert PDF pages to images (300 DPI for good quality)
    images = convert_from_path(pdf_path, dpi=300)
    full_text = []
    for i, page_image in enumerate(images):
        text = pytesseract.image_to_string(page_image)
        full_text.append(f"--- Page {i + 1} ---\n{text}")
    return "\n\n".join(full_text)
result = ocr_pdf("scanned-contract.pdf")
print(result)
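Rendering every page at 300 DPI up front can use a lot of memory on long documents. A possible variation, sketched below, renders a few pages at a time using the `first_page` and `last_page` arguments of `convert_from_path` and reads the page count with `pdfinfo_from_path`. The names `page_ranges` and `ocr_pdf_in_batches` and the default batch size of 10 are assumptions for this example.

```python
def page_ranges(total_pages, batch_size=10):
    """Yield 1-based, inclusive (first_page, last_page) pairs covering the document."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_pdf_in_batches(pdf_path, batch_size=10, dpi=300):
    """OCR a PDF a few pages at a time instead of rendering all pages at once."""
    # Imported lazily so page_ranges stays usable without the OCR stack installed.
    from pdf2image import convert_from_path, pdfinfo_from_path
    import pytesseract

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    pages = []
    for first, last in page_ranges(total_pages, batch_size):
        images = convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
        for offset, image in enumerate(images):
            text = pytesseract.image_to_string(image)
            pages.append(f"--- Page {first + offset} ---\n{text}")
    return "\n\n".join(pages)
```

The output format matches `ocr_pdf`, so the two functions are interchangeable. Only one batch of page images is held in memory at any time.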
Important: pdf2image requires poppler-utils to be installed:
# Ubuntu
sudo apt install poppler-utils
# macOS
brew install poppler
Handling Multiple Languages
Tesseract supports 100+ languages. Install the language packs you need:
# Install German and French
sudo apt install tesseract-ocr-deu tesseract-ocr-fra
# List available languages
tesseract --list-langs
Then specify the language in your code:
# Single language
text = pytesseract.image_to_string(img, lang="deu")
# Multiple languages (if document mixes languages)
text = pytesseract.image_to_string(img, lang="eng+deu+fra")
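Requesting a language whose data file is not installed makes Tesseract fail at OCR time, so it can be worth validating the codes first. The helper below is a small sketch. `build_lang_arg` is a hypothetical name, and the list of installed languages is assumed to come from `pytesseract.get_languages()`.

```python
def build_lang_arg(wanted, installed):
    """Join language codes with '+' after checking each one is installed."""
    missing = [code for code in wanted if code not in installed]
    if missing:
        raise ValueError(f"Missing Tesseract language data: {', '.join(missing)}")
    return "+".join(wanted)

# Possible use:
# import pytesseract
# lang = build_lang_arg(["eng", "deu"], pytesseract.get_languages())
# text = pytesseract.image_to_string(img, lang=lang)
```

Failing early with the list of missing codes makes it clear which language pack still needs to be installed.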
For best results with non-Latin scripts (Arabic, Chinese, Japanese), use the tessdata_best trained data files instead of the tessdata_fast ones that most packages install by default.
When to Use AI-Based OCR Instead
Tesseract struggles with:
- Handwritten text
- Low-quality photos with poor lighting
- Complex layouts (tables, multi-column documents)
- Curved or distorted text
For these cases, AI vision models are significantly better. Here's how to use GPT-4o-mini for OCR:
from openai import OpenAI
import base64
import mimetypes
client = OpenAI()  # reads OPENAI_API_KEY from the environment
def ai_ocr(image_path):
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    # Match the data URL's MIME type to the file (e.g. image/jpeg for .jpg)
    mime = mimetypes.guess_type(image_path)[0] or "image/png"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
                {"type": "text", "text": "Extract all text from this image. Preserve formatting."}
            ]
        }],
        max_tokens=4000
    )
    return response.choices[0].message.content
text = ai_ocr("handwritten-notes.jpg")
print(text)
Cost: ~$0.001-0.005 per image with gpt-4o-mini. For free OCR without any setup, try Reformat's OCR tool — it handles images and scanned PDFs with a single upload.
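To put the per-image figure in context for a batch job, the arithmetic can be sketched as follows. The default rates are the rough range quoted above, not official pricing, and `estimate_ocr_cost` is a hypothetical helper.

```python
def estimate_ocr_cost(num_images, low_per_image=0.001, high_per_image=0.005):
    """Return a (low, high) cost range in US dollars for a batch of images."""
    return num_images * low_per_image, num_images * high_per_image
```

For example, 10,000 images at these rates works out to roughly $10 to $50, which is where volume starts to matter when choosing between local and AI-based OCR.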
Conclusion
Use Tesseract when:
- Documents are cleanly printed
- You need offline/local processing
- Budget is zero
Use AI-based OCR when:
- Documents have handwriting, tables, or complex layouts
- Accuracy is critical
- You're processing small volumes
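The choice does not have to be made once for a whole project. One possible approach is to run Tesseract first and fall back to an AI model only when its own confidence scores are low. The sketch below assumes the dictionary returned by `pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)`, which includes parallel `conf` and `text` lists. The function names and the cutoff of 60 are illustrative assumptions, not recommended values.

```python
def mean_word_confidence(data):
    """Average Tesseract's per-word confidence (0-100), ignoring non-word boxes."""
    scores = []
    for conf, word in zip(data["conf"], data["text"]):
        score = float(conf)  # may arrive as int, float, or str depending on version
        if score >= 0 and word.strip():  # -1 marks boxes that contain no word
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0


def choose_engine(data, min_confidence=60.0):
    """Return 'tesseract' if the local result looks trustworthy, else 'ai'."""
    return "tesseract" if mean_word_confidence(data) >= min_confidence else "ai"
```

With this kind of routing, clean printed pages stay local and free, and only the difficult images incur a per-request cost.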
For a quick one-off extraction without installing anything, Reformat's Image to Text (OCR) tool handles both approaches — just upload your file and get the text instantly.