Why Fine-Tune Instead of Using Prompts?
Prompt engineering works well for general tasks, but fine-tuning is better when:
- You need consistent output format (always returning JSON in a specific schema)
- You have a domain-specific style (legal writing, medical notes, brand voice)
- You want to reduce token usage — fine-tuned models need shorter prompts
- You need better accuracy on niche tasks the base model struggles with
Fine-tuning GPT-4o-mini costs $3.00 per million training tokens, which works out to roughly $0.30 for a run over 100 short examples (actual cost scales with example length and epoch count). It's surprisingly affordable for the improvement you get.
Step 1 — Prepare Your Training Data
OpenAI requires training data in JSONL format. Each line is a conversation with messages:
{"messages": [{"role": "system", "content": "You are a legal document summarizer."}, {"role": "user", "content": "Summarize this contract clause: [clause text]"}, {"role": "assistant", "content": "This clause establishes..."}]}
{"messages": [{"role": "system", "content": "You are a legal document summarizer."}, {"role": "user", "content": "Summarize this contract clause: [another clause]"}, {"role": "assistant", "content": "This provision requires..."}]}
Key guidelines:
- OpenAI enforces a minimum of 10 examples; 50-100 is recommended for a noticeable improvement
- Each example should represent the exact input/output you want
- Keep the system message consistent across all examples
- Include edge cases and diverse examples
Save this as training_data.jsonl.
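If your examples live in Python lists, a small stdlib-only script can emit the JSONL for you. A minimal sketch (the clause/summary pairs below are placeholders for your real data):

```python
import json

SYSTEM = "You are a legal document summarizer."

# Placeholder (input, output) pairs; replace with your real clauses and summaries.
examples = [
    ("Summarize this contract clause: [clause text]", "This clause establishes..."),
    ("Summarize this contract clause: [another clause]", "This provision requires..."),
]

with open("training_data.jsonl", "w") as f:
    for user_text, assistant_text in examples:
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM},  # same system message in every example
                {"role": "user", "content": user_text},
                {"role": "assistant", "content": assistant_text},
            ]
        }
        f.write(json.dumps(record) + "\n")  # one JSON object per line = JSONL
```

Generating the file programmatically makes it easy to keep the system message consistent and to append new examples when you iterate later.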
Step 2 — Validate and Upload
Validate your data format before uploading:
import json

def validate_training_data(filepath):
    with open(filepath) as f:
        lines = [line for line in f if line.strip()]  # skip blank lines (e.g. a trailing newline)
    errors = []
    for i, line in enumerate(lines):
        try:
            data = json.loads(line)
            if "messages" not in data:
                errors.append(f"Line {i+1}: missing 'messages' key")
            elif len(data["messages"]) < 2:
                errors.append(f"Line {i+1}: need at least user + assistant messages")
        except json.JSONDecodeError:
            errors.append(f"Line {i+1}: invalid JSON")
    if errors:
        for e in errors:
            print(f"ERROR: {e}")
    else:
        print(f"Valid! {len(lines)} training examples.")

validate_training_data("training_data.jsonl")
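Before uploading, you can also sanity-check the training cost quoted earlier. This sketch uses a rough heuristic of ~4 characters per token rather than OpenAI's real tokenizer, so treat the result as a ballpark estimate only:

```python
import json

PRICE_PER_MILLION = 3.00  # USD per 1M training tokens for gpt-4o-mini
CHARS_PER_TOKEN = 4       # rough heuristic; actual tokenization varies

def estimate_training_cost(filepath, n_epochs=3):
    """Approximate fine-tuning cost from the total character count of the data."""
    total_chars = 0
    with open(filepath) as f:
        for line in f:
            if not line.strip():
                continue
            data = json.loads(line)
            total_chars += sum(len(m["content"]) for m in data["messages"])
    est_tokens = total_chars / CHARS_PER_TOKEN
    # Each epoch processes the full dataset once.
    return est_tokens * n_epochs * PRICE_PER_MILLION / 1_000_000

# print(f"~${estimate_training_cost('training_data.jsonl'):.2f}")
```

For an exact count you would tokenize each message with OpenAI's tiktoken library instead of dividing by four.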
Upload to OpenAI:
from openai import OpenAI
client = OpenAI()
file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)
print(f"File ID: {file.id}")
Step 3 — Start Fine-Tuning
Create a fine-tuning job:
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,  # 2-4 epochs is usually optimal
    }
)
print(f"Job ID: {job.id}")
print(f"Status: {job.status}")
Monitor progress:
import time
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job.status}")
    if job.status in ["succeeded", "failed", "cancelled"]:
        break
    time.sleep(60)

if job.status == "succeeded":
    print(f"Model: {job.fine_tuned_model}")
Training typically takes 10-30 minutes for 50-100 examples.
Step 4 — Use Your Fine-Tuned Model
Once training completes, use your model like any other OpenAI model:
response = client.chat.completions.create(
    model=job.fine_tuned_model,  # ft:gpt-4o-mini-2024-07-18:your-org::xxx
    messages=[
        {"role": "system", "content": "You are a legal document summarizer."},
        {"role": "user", "content": "Summarize this clause: ..."}
    ]
)
print(response.choices[0].message.content)
Fine-tuned gpt-4o-mini costs $0.30 per million input tokens and $1.20 per million output tokens — about 2x the base model price, but you'll typically use fewer tokens because you need shorter prompts.
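The 2x per-token price can still come out cheaper per request. A sketch of the arithmetic (the prompt lengths below are illustrative assumptions, not measurements):

```python
# Input-cost comparison: a base model needing a long few-shot prompt vs. a
# fine-tuned model needing only a short instruction. Prompt lengths are assumed.
BASE_INPUT_PRICE = 0.15 / 1_000_000  # USD per input token, base gpt-4o-mini
FT_INPUT_PRICE = 0.30 / 1_000_000    # USD per input token, fine-tuned

base_prompt_tokens = 1500  # assumed: instructions + few-shot examples
ft_prompt_tokens = 200     # assumed: short prompt, behavior is baked in

base_cost = base_prompt_tokens * BASE_INPUT_PRICE
ft_cost = ft_prompt_tokens * FT_INPUT_PRICE

print(f"base: ${base_cost:.6f}/request, fine-tuned: ${ft_cost:.6f}/request")
# Under these assumptions the fine-tuned call is cheaper per request despite
# the 2x per-token price, because the prompt is 7.5x shorter.
```

Whether you break even depends entirely on how much prompt you can actually remove, so measure your own prompt lengths before and after.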
Tips for Better Results
After fine-tuning dozens of models, here's what actually makes a difference:
1. Quality beats quantity: 50 perfect examples beat 500 sloppy ones.
2. Be consistent: use the same system prompt and the same output format in every example.
3. Cover the hard cases: the model imitates whatever the assistant says in your data, so instead of "negative examples" show the desired response to tricky or out-of-scope inputs (e.g. a polite refusal).
4. Start with 3 epochs: more epochs can cause overfitting on small datasets.
5. Test against the base model: run the same 20 test cases on both models to measure the improvement.
6. Iterate: add examples where the fine-tuned model fails, then retrain.
Fine-tuning is not a one-shot process. The best models go through 3-5 iterations of training data refinement.
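Tip 5 above can be sketched as a tiny comparison harness. The completion functions are injected as callables so the sketch stays runnable without API calls; in practice each would wrap client.chat.completions.create for one model, and the exact-match scorer is a hypothetical stand-in for your real metric:

```python
def compare_models(test_cases, base_fn, tuned_fn, score_fn):
    """Run the same test cases through both models and tally scores.

    test_cases: list of (prompt, expected) pairs
    base_fn / tuned_fn: callables mapping a prompt string to a completion string
    score_fn: callable mapping (output, expected) to a numeric score
    """
    results = {"base": 0.0, "tuned": 0.0}
    for prompt, expected in test_cases:
        results["base"] += score_fn(base_fn(prompt), expected)
        results["tuned"] += score_fn(tuned_fn(prompt), expected)
    return results

# Stub models for illustration only; real ones would call the OpenAI API.
cases = [("2+2?", "4"), ("capital of France?", "Paris")]
base = lambda p: "4" if "2+2" in p else "Lyon"    # gets one case right
tuned = lambda p: "4" if "2+2" in p else "Paris"  # gets both right
exact = lambda out, exp: 1.0 if out == exp else 0.0

print(compare_models(cases, base, tuned, exact))
# → {'base': 1.0, 'tuned': 2.0}
```

Keeping the harness model-agnostic means the same 20 test cases can score every retraining iteration, which is exactly what the iterate-and-retrain loop needs.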