
How to Use OpenAI's Vision API to Caption Images Automatically

Learn to build an automatic image captioning system using OpenAI's GPT-4o Vision API. Covers single images, batch processing, custom prompts, and integrating captions into your workflow.

March 17, 2026 · 8 min read

What Is the Vision API?

OpenAI's GPT-4o model can understand images alongside text. You send an image (as a URL or base64-encoded data) and ask the model to describe, analyze, or extract information from it.

Common use cases:

  • Image captioning for accessibility (alt text)
  • Content moderation — detecting inappropriate content
  • Data extraction — reading receipts, invoices, or charts
  • Visual Q&A — asking questions about what is in an image

In this tutorial, we will focus on generating descriptive captions for images.

Setup and Authentication

Install the OpenAI Python SDK:

pip install openai

Set your API key as an environment variable:

export OPENAI_API_KEY="sk-your-key-here"

Or create a .env file and load it with python-dotenv:

pip install python-dotenv

from dotenv import load_dotenv

load_dotenv()
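A missing key otherwise only shows up as an authentication error on the first request. As a minimal sketch (the helper name `require_api_key` is hypothetical and not part of the OpenAI SDK), you could check for it up front:

```python
import os


def require_api_key(env=None):
    """Return the OpenAI API key from the environment or raise a clear error."""
    # Accept a mapping for testing; default to the real process environment
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it or add it to .env")
    return key
```

Call `require_api_key()` once after `load_dotenv()` and before creating the client.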

Captioning a Single Image

Here is how to caption an image from a URL:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def caption_image(image_url, style="descriptive"):
    # One prompt per caption style; unknown styles fall back to "descriptive"
    prompts = {
        "descriptive": "Describe this image in one detailed sentence suitable for an alt text attribute.",
        "concise": "Write a short, 5-10 word caption for this image.",
        "social": "Write an engaging social media caption for this image.",
    }

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompts.get(style, prompts["descriptive"])},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        max_tokens=150,
    )
    return response.choices[0].message.content

# Usage
caption = caption_image("https://example.com/sunset.jpg")
print(caption)
# Output: "A vibrant sunset over a calm ocean with orange and purple hues reflecting off gentle waves."

Captioning Local Files

For local images, encode them as base64:

import base64
from pathlib import Path

def caption_local_image(file_path, style="descriptive"):
    path = Path(file_path)
    base64_image = base64.b64encode(path.read_bytes()).decode("utf-8")

    # Detect the MIME type from the file extension (defaults to JPEG)
    mime_types = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png", ".webp": "image/webp", ".gif": "image/gif"}
    mime = mime_types.get(path.suffix.lower(), "image/jpeg")

    # A base64 data URL goes in the same "url" field as a web URL, so reuse
    # caption_image(); this also makes the style argument take effect
    data_url = f"data:{mime};base64,{base64_image}"
    return caption_image(data_url, style)

print(caption_local_image("./photos/beach.jpg"))
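The extension lookup can also be delegated to the standard library's `mimetypes` module. The following is a minimal sketch rather than part of the original tutorial: `to_data_url` is a hypothetical helper name, and the fallback to `image/jpeg` for unknown extensions is an assumption that mirrors the default used above.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(file_path):
    """Return a base64 data URL for a local image (hypothetical helper)."""
    path = Path(file_path)
    # guess_type() maps the file extension to a MIME type, or returns None
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        mime = "image/jpeg"  # assumed fallback for unknown extensions
    encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be passed wherever the API expects an image URL.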

Batch Processing Multiple Images

Process an entire folder of images and save captions to a JSON file:

import json
from pathlib import Path

def batch_caption(folder, output_file="captions.json", style="descriptive"):
    supported = {".jpg", ".jpeg", ".png", ".webp", ".gif"}
    results = []

    # Collect supported image files in a stable, sorted order
    images = sorted(f for f in Path(folder).iterdir() if f.suffix.lower() in supported)
    print(f"Found {len(images)} images to caption")

    for i, img_path in enumerate(images, 1):
        print(f"  [{i}/{len(images)}] {img_path.name}...", end=" ")
        try:
            caption = caption_local_image(str(img_path), style)
            results.append({"file": img_path.name, "caption": caption})
            print("Done")
        except Exception as e:
            # Record the failure and continue with the remaining images
            results.append({"file": img_path.name, "error": str(e)})
            print(f"Error: {e}")

    with open(output_file, "w") as f:
        json.dump(results, f, indent=2)

    print(f"\nSaved {len(results)} results to {output_file}")

batch_caption("./product-images")

This generates a clean JSON file you can use for alt text, image SEO, or content management.
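On a large folder, individual requests can fail transiently, for example when a rate limit is hit. A minimal sketch of retrying with exponential backoff follows; `with_retries` and its parameters are hypothetical names chosen for illustration, not part of the OpenAI SDK.

```python
import random
import time


def with_retries(fn, retry_on=(Exception,), max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on the given exception types with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) plus up to 10% jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
```

Inside `batch_caption`, the captioning call could then become `with_retries(lambda: caption_local_image(str(img_path), style))`. To retry only rate-limit failures, pass a narrower `retry_on`, such as the SDK's `openai.RateLimitError` (assuming version 1.x of the `openai` package).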

Cost and Rate Limits

Important considerations for production use:

Detail              Value
Model               gpt-4o
Cost per image      ~$0.01-0.03 (depends on resolution)
Rate limit          500 RPM (tier 1)
Max image size      20 MB
Supported formats   JPG, PNG, WebP, GIF

To reduce costs:

  • Use detail: "low" in the image_url object for simple captioning tasks
  • Resize large images before sending (1024px width is sufficient)
  • Cache results to avoid re-captioning the same images
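Two of these tips can be sketched in a few lines. The helper names `build_messages` and `cached_caption`, and the cache file name, are hypothetical. The `detail` field sits next to `url` inside the `image_url` object. The cache here is a plain JSON file keyed by a hash of the image bytes, which is one simple option among many.

```python
import hashlib
import json
from pathlib import Path


def build_messages(prompt, image_url, detail="low"):
    """Build a chat message whose image part requests low-detail processing."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url, "detail": detail}},
            ],
        }
    ]


def cached_caption(file_path, captioner, cache_file="caption_cache.json"):
    """Return a cached caption for file_path, calling captioner only on a miss."""
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    # Key on the content hash so renamed copies of an image still hit the cache
    key = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
    if key not in cache:
        cache[key] = captioner(file_path)
        cache_path.write_text(json.dumps(cache, indent=2))
    return cache[key]
```

For example, `cached_caption("./photos/beach.jpg", caption_local_image)` would call the API only the first time it sees that file's contents. The key ignores the caption style, so add the style to the key if you caption the same image in several styles.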

FAQ

How accurate are AI-generated captions?

GPT-4o generally produces accurate, detailed descriptions of what is visible in an image. It can still misread small text, miscount objects, or miss context that a human author would know, so review generated alt text before publishing it where accuracy matters.

Can I caption images in languages other than English?

Yes, add the target language to your prompt: "Describe this image in one sentence in Spanish."
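One way to wire this into the style prompts from earlier is a small prompt builder. This is a sketch only: `build_prompt` is a hypothetical helper, and the appended "Respond in ..." wording is an assumption rather than a required phrase.

```python
BASE_PROMPTS = {
    "descriptive": "Describe this image in one detailed sentence suitable for an alt text attribute.",
    "concise": "Write a short, 5-10 word caption for this image.",
    "social": "Write an engaging social media caption for this image.",
}


def build_prompt(style="descriptive", language=None):
    """Return the prompt for a style, optionally asking for a target language."""
    prompt = BASE_PROMPTS.get(style, BASE_PROMPTS["descriptive"])
    if language:
        # Append the language instruction; the exact wording is an assumption
        prompt += f" Respond in {language}."
    return prompt
```

Passing `build_prompt("concise", "Spanish")` as the text part of the message requests a short caption in Spanish.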

Is there a free alternative?

Reformat offers a free AI Image Caption tool that generates captions without needing an API key. It is great for quick, one-off captioning tasks.

Tags: openai, vision-api, image-captioning, ai, python
