Two Approaches to Summarization
AI summarization falls into two categories:
Extractive summarization pulls the most important sentences directly from the original text. It doesn't generate new words — it selects and concatenates existing sentences. Think of it as highlighting a textbook.

Abstractive summarization generates new text that captures the meaning of the original. It can paraphrase, combine ideas, and produce phrasing that wasn't in the source. This is what GPT-4 and Claude do.

| Feature | Extractive | Abstractive |
|---|---|---|
| Accuracy | High (uses original text) | Can hallucinate |
| Readability | Choppy (stitched sentences) | Natural and fluent |
| Speed | Fast | Slower (needs LLM) |
| Cost | Free (runs locally) | Costs per request |
In practice, the best systems combine both: extract key sections first, then use an LLM to synthesize them into a coherent summary.
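As a rough sketch of that hybrid idea, the pipeline below extracts the highest-scoring sentences with a bare-bones frequency ranker, then hands them to an LLM for synthesis. The `llm_synthesize` callable is a hypothetical placeholder for whatever model call you use (the later sections build real versions of both stages):

```python
import re
from collections import Counter

def extract_key_sentences(text, k=3):
    """Bare-bones extractive pass: rank sentences by total word frequency."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))
    ranked = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r'[a-z]+', s.lower())),
        reverse=True,
    )
    keep = set(ranked[:k])
    # Emit the winners in their original order, not score order
    return [s for s in sentences if s in keep]

def hybrid_summary(text, llm_synthesize, k=3):
    """Extract first, then let an LLM (passed in as a callable) synthesize."""
    key = extract_key_sentences(text, k)
    prompt = "Synthesize these key sentences into a coherent summary:\n" + "\n".join(key)
    return llm_synthesize(prompt)
```

The extractive pass shrinks the input before the (slower, paid) LLM call, which is the main practical payoff of combining the two.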
Building an Extractive Summarizer with Python
You can build a surprisingly good extractive summarizer with just nltk and basic statistics:
```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import Counter

nltk.download('punkt_tab')
nltk.download('stopwords')

def extractive_summary(text, num_sentences=5):
    sentences = sent_tokenize(text)
    words = word_tokenize(text.lower())

    # Remove stopwords and punctuation
    stop = set(stopwords.words('english'))
    words = [w for w in words if w.isalnum() and w not in stop]

    # Score each sentence by the frequency of its words
    freq = Counter(words)
    scored = []
    for sent in sentences:
        sent_words = word_tokenize(sent.lower())
        score = sum(freq[w] for w in sent_words if w in freq)
        scored.append((score, sent))

    # Keep the top-scoring sentences, then emit them in original order
    top = sorted(scored, reverse=True)[:num_sentences]
    top_sents = {s for _, s in top}
    return ' '.join(s for s in sentences if s in top_sents)

# Usage
with open('document.txt') as f:
    text = f.read()

print(extractive_summary(text, num_sentences=3))
```
This approach works well for news articles, reports, and structured documents. It costs nothing and runs in milliseconds.
Abstractive Summarization with OpenAI
For higher quality summaries, use an LLM:
```python
from openai import OpenAI

client = OpenAI()

def summarize(text, length="medium"):
    length_map = {
        "brief": "Provide a 2-3 sentence summary.",
        "medium": "Provide a summary in 1-2 paragraphs.",
        "detailed": "Provide a detailed summary with bullet points for key takeaways."
    }
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a professional document summarizer. Be concise and accurate."},
            {"role": "user", "content": f"{length_map[length]}\n\nDocument:\n{text[:40000]}"}
        ],
        max_tokens=1000,
        temperature=0.3
    )
    return response.choices[0].message.content

summary = summarize(text, length="brief")
print(summary)
```
The `temperature=0.3` setting keeps the output focused and factual; higher temperatures produce more creative but potentially less accurate summaries.
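A note on the `text[:40000]` slice in the code above: it is a crude character cap, not a token count. English text averages roughly 4 characters per token, so 40,000 characters is on the order of 10,000 tokens, well inside gpt-4o-mini's context window. A quick budget check along those lines might look like this (the 4-chars-per-token ratio is a heuristic; use a real tokenizer such as tiktoken when you need exact counts):

```python
def approx_tokens(text, chars_per_token=4):
    """Rough token estimate: English averages ~4 characters per token.
    Heuristic only; use a real tokenizer (e.g. tiktoken) for exact counts."""
    return max(1, len(text) // chars_per_token)

def truncate_to_budget(text, max_tokens=10_000, chars_per_token=4):
    """Trim text so its estimated token count fits the budget."""
    return text[:max_tokens * chars_per_token]
```

Truncation silently drops the end of the document, which is one more reason to prefer the chunking approach in the next section for anything long.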
Handling Long Documents
LLMs have token limits. For documents longer than the context window, use a map-reduce approach:
1. Map — summarize each chunk independently
2. Reduce — combine chunk summaries into a final summary
```python
def summarize_long_document(text, chunk_size=30000):
    # Split into fixed-size character chunks
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
    if len(chunks) == 1:
        return summarize(chunks[0])

    # Map: summarize each chunk independently
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        summary = summarize(chunk, length="medium")
        chunk_summaries.append(summary)
        print(f"Summarized chunk {i+1}/{len(chunks)}")

    # Reduce: combine the chunk summaries into one final summary
    combined = "\n\n".join(chunk_summaries)
    return summarize(combined, length="detailed")
```
This handles documents of any length — books, legal contracts, research papers — by breaking them into manageable pieces.
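One refinement worth noting: the fixed-size splitter above can cut a sentence, or even a word, in half at every chunk boundary. A common fix, sketched here under the same character-budget assumptions, is to prefer sentence-ish boundaries and overlap adjacent chunks slightly so no context is lost at the seams:

```python
def chunk_with_overlap(text, chunk_size=30000, overlap=500):
    """Split text into character chunks, preferring to end each chunk
    at a sentence boundary and overlapping slightly with the next."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back up to the last sentence terminator inside the chunk, if any
            cut = text.rfind('. ', start, end)
            if cut > start:
                end = cut + 1
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # overlap, but always advance
    return chunks
```

Dropping this in for the list comprehension in `summarize_long_document` costs a few duplicated sentences per boundary but gives each chunk summary cleaner context to work with.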
Try It Without Code
If you don't want to set up a development environment, Reformat's AI Document Summarizer lets you upload any PDF, Word, or text file and get an instant summary. Choose between brief, medium, and detailed summaries — no API keys or coding required.
The tool uses GPT-4o-mini under the hood with optimized prompts for different document types, and it's free for 2 uses per day.