What Is the Vision API?
OpenAI's GPT-4o model can understand images alongside text. You send an image (as a URL or base64-encoded data) and ask the model to describe, analyze, or extract information from it.
Common use cases:
- Image captioning for accessibility (alt text)
- Content moderation — detecting inappropriate content
- Data extraction — reading receipts, invoices, or charts
- Visual Q&A — asking questions about what is in an image
In this tutorial, we will focus on generating descriptive captions for images.
Setup and Authentication
Install the OpenAI Python SDK:
pip install openai
Set your API key as an environment variable:
export OPENAI_API_KEY="sk-your-key-here"
Or create a .env file and load it with python-dotenv:
pip install python-dotenv
from dotenv import load_dotenv
load_dotenv()
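A missing key otherwise only surfaces when the first request fails, so it can help to check for it up front. The sketch below is one way to do that; `require_api_key` is a hypothetical helper name, not part of the OpenAI SDK, and it reads the same `OPENAI_API_KEY` variable set above.

```python
import os


def require_api_key(env=None):
    """Return the API key from the environment, or raise a clear error if it is missing.

    `require_api_key` is an illustrative helper, not an SDK function.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it or add it to your .env file")
    return key
```

Calling `require_api_key()` once at startup, after `load_dotenv()`, fails fast with a readable message instead of an authentication error partway through a run.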
Captioning a Single Image
Here is how to caption an image from a URL:
from openai import OpenAI
client = OpenAI()
def caption_image(image_url, style="descriptive"):
    prompts = {
        "descriptive": "Describe this image in one detailed sentence suitable for an alt text attribute.",
        "concise": "Write a short, 5-10 word caption for this image.",
        "social": "Write an engaging social media caption for this image.",
    }
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompts.get(style, prompts["descriptive"])},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        max_tokens=150,
    )
    return response.choices[0].message.content
# Usage
caption = caption_image("https://example.com/sunset.jpg")
print(caption)
# Example output (the exact wording varies between runs):
# "A vibrant sunset over a calm ocean with orange and purple hues reflecting off gentle waves."
Captioning Local Files
For local images, encode them as base64:
import base64
from pathlib import Path
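def caption_local_image(file_path, style="descriptive"):
    # Same styles as caption_image, so the style argument is honored here too
    prompts = {
        "descriptive": "Describe this image in one detailed sentence.",
        "concise": "Write a short, 5-10 word caption for this image.",
        "social": "Write an engaging social media caption for this image.",
    }
    # Read the file and base64-encode it for use in a data URL
    image_data = Path(file_path).read_bytes()
    base64_image = base64.b64encode(image_data).decode("utf-8")
    # Detect MIME type from the file extension (falls back to JPEG)
    ext = Path(file_path).suffix.lower()
    mime_types = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png", ".webp": "image/webp", ".gif": "image/gif"}
    mime = mime_types.get(ext, "image/jpeg")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompts.get(style, prompts["descriptive"])},
                    {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{base64_image}"}},
                ],
            }
        ],
        max_tokens=150,
    )
    return response.choices[0].message.content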
print(caption_local_image("./photos/beach.jpg"))
Batch Processing Multiple Images
Process an entire folder of images and save captions to a JSON file:
import json
from pathlib import Path

def batch_caption(folder, output_file="captions.json", style="descriptive"):
    supported = {".jpg", ".jpeg", ".png", ".webp", ".gif"}
    results = []
    # Keep only supported image files; sort so the output order is stable across runs
    images = sorted(f for f in Path(folder).iterdir() if f.is_file() and f.suffix.lower() in supported)
    print(f"Found {len(images)} images to caption")
    for i, img_path in enumerate(images, 1):
        print(f"  [{i}/{len(images)}] {img_path.name}...", end=" ", flush=True)
        try:
            caption = caption_local_image(str(img_path), style)
            results.append({"file": img_path.name, "caption": caption})
            print("Done")
        except Exception as e:
            # Record the failure and continue with the remaining images
            results.append({"file": img_path.name, "error": str(e)})
            print(f"Error: {e}")
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
    print(f"\nSaved {len(results)} results to {output_file}")
batch_caption("./product-images")
This generates a clean JSON file you can use for alt text, image SEO, or content management.
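As one illustration of the alt text use, the sketch below reads the JSON file and builds `<img>` tags from it. It assumes the file has the shape written by `batch_caption` (a list of objects with `file` and either `caption` or `error`); `alt_text_tags` is a hypothetical helper name, and using the bare file name as `src` is a simplification.

```python
import html
import json


def alt_text_tags(captions_file="captions.json"):
    """Build <img> tags with alt text from the entries that have a caption."""
    with open(captions_file, encoding="utf-8") as f:
        entries = json.load(f)
    tags = []
    for entry in entries:
        if "caption" not in entry:
            # Entries that failed carry an "error" key instead; skip them
            continue
        # Escape quotes and angle brackets so the caption is safe inside an attribute
        src = html.escape(entry["file"], quote=True)
        alt = html.escape(entry["caption"], quote=True)
        tags.append(f'<img src="{src}" alt="{alt}">')
    return tags
```

Escaping matters here because generated captions can contain quotation marks, which would otherwise end the `alt` attribute early.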
Cost and Rate Limits
Important considerations for production use:
| Detail | Value |
|---|---|
| Model | gpt-4o |
| Cost per image | ~$0.01-0.03 (depends on resolution) |
| Rate limit | 500 RPM (tier 1) |
| Max image size | 20 MB |
| Supported formats | JPG, PNG, WebP, GIF (non-animated) |
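For a rough sense of scale, the figures in the table can be turned into a quick estimate. This is only back-of-the-envelope arithmetic using the approximate per-image cost and the tier 1 rate limit listed above; `estimate_batch` is a hypothetical helper, and actual billing depends on image resolution and current pricing.

```python
import math


def estimate_batch(num_images, cost_low=0.01, cost_high=0.03, rpm=500):
    """Return (min_cost, max_cost, min_minutes) for captioning num_images.

    Defaults mirror the approximate values in the table above.
    """
    # One request per image, so the rate limit sets a floor on total time
    min_minutes = math.ceil(num_images / rpm)
    return num_images * cost_low, num_images * cost_high, min_minutes


low, high, minutes = estimate_batch(1000)
print(f"1000 images: ${low:.2f}-${high:.2f}, at least {minutes} min at 500 RPM")
```

Under these assumptions, 1,000 images cost roughly $10 to $30 and take at least two minutes at the rate limit.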
To reduce costs:
- Use detail: "low" in the image_url object for simple captioning tasks
- Resize large images before sending (1024px width is sufficient)
- Cache results to avoid re-captioning the same images
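Two of these tips can be sketched in a few lines. `image_part` builds the image entry of a message with the `detail` field set, and `cached_caption` stores captions in a JSON file keyed by a hash of the image contents. Both names, and the `caption_cache.json` file, are hypothetical; `caption_fn` stands in for any function that takes a file path and returns a caption, such as `caption_local_image`.

```python
import hashlib
import json
from pathlib import Path


def image_part(url, detail="low"):
    """Build the image entry of a message with the detail level set explicitly."""
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


def cached_caption(file_path, caption_fn, cache_file="caption_cache.json"):
    """Return the cached caption for this file, calling caption_fn only on a cache miss."""
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text(encoding="utf-8")) if cache_path.exists() else {}
    # Key on the file contents, so a renamed copy of the same image still hits the cache
    key = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
    if key not in cache:
        cache[key] = caption_fn(file_path)
        cache_path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return cache[key]
```

This simple version rereads and rewrites the whole cache file on each miss, which is fine for a few thousand images; a larger pipeline would likely want a database or key-value store instead.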
FAQ
How accurate are the captions?
GPT-4o generally produces accurate, detailed descriptions. For accessibility alt text, the quality is often better than human-written captions because the model describes what is actually visible in the image.
Can I caption images in languages other than English?
Yes, add the target language to your prompt: "Describe this image in one sentence in Spanish."
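If the language needs to be configurable, the prompt can be built from a parameter. A minimal sketch, where `localized_prompt` is a hypothetical helper and the resulting string would be passed as the text part of the message:

```python
def localized_prompt(language=None):
    """Build the caption prompt, optionally asking for a specific output language."""
    base = "Describe this image in one sentence"
    return f"{base} in {language}." if language else f"{base}."


print(localized_prompt("Spanish"))
```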
Is there a free alternative?
Reformat offers a free AI Image Caption tool that generates captions without needing an API key. It is great for quick, one-off captioning tasks.