A comprehensive guide to AI-powered document summarization — from extractive vs abstractive approaches to building your own summarizer with Python.
AI summarization falls into two categories:
Extractive summarization pulls the most important sentences directly from the original text. It doesn't generate new words; it selects and concatenates existing sentences. Think of it as highlighting a textbook.

Abstractive summarization generates new text that captures the meaning of the original. It can paraphrase, combine ideas, and produce phrasing that wasn't in the source. This is what GPT-4 and Claude do.

| Feature | Extractive | Abstractive |
|---|---|---|
| Accuracy | High (uses original text) | Can hallucinate |
| Readability | Choppy (stitched sentences) | Natural and fluent |
| Speed | Fast | Slower (needs LLM) |
| Cost | Free (runs locally) | Costs per request |
In practice, the best systems combine both: extract key sections first, then use an LLM to synthesize them into a coherent summary.
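A minimal sketch of that hybrid pattern (pure Python so it stays self-contained; `hybrid_summary` and `llm_summarize` are hypothetical names, and the frequency scoring is a simplified stand-in for a real extractive pass):

```python
def hybrid_summary(text, llm_summarize, keep_ratio=0.3):
    """Extract the highest-scoring sentences, then hand them to an LLM.

    `llm_summarize` is any callable that turns text into a summary,
    e.g. a wrapper around an OpenAI or local-model call.
    """
    sentences = [s.strip() for s in text.split('.') if s.strip()]

    # Naive word-frequency scoring; a real system would tokenize properly
    words = text.lower().split()
    freq = {}
    for w in words:
        freq[w] = freq.get(w, 0) + 1

    # Rank sentences by the total frequency of their words
    scored = sorted(
        sentences,
        key=lambda s: sum(freq.get(w, 0) for w in s.lower().split()),
        reverse=True,
    )

    # Keep only the top fraction, then let the LLM smooth the stitched extract
    keep = max(1, int(len(sentences) * keep_ratio))
    extracted = '. '.join(scored[:keep])
    return llm_summarize(extracted)
```

Passing a stub such as `lambda t: t` in place of `llm_summarize` lets you inspect what the extractive stage would send to the model before spending any API calls.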
You can build a surprisingly good extractive summarizer with just `nltk` and basic statistics:
```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import Counter

nltk.download('punkt_tab')
nltk.download('stopwords')

def extractive_summary(text, num_sentences=5):
    sentences = sent_tokenize(text)
    words = word_tokenize(text.lower())

    # Remove stopwords and punctuation
    stop = set(stopwords.words('english'))
    words = [w for w in words if w.isalnum() and w not in stop]

    # Score sentences by word frequency
    freq = Counter(words)
    scored = []
    for sent in sentences:
        sent_words = word_tokenize(sent.lower())
        score = sum(freq[w] for w in sent_words if w in freq)
        scored.append((score, sent))

    # Return top sentences in original order
    top = sorted(scored, reverse=True)[:num_sentences]
    top_sents = {s for _, s in top}
    return ' '.join(s for s in sentences if s in top_sents)

# Usage
with open('document.txt') as f:
    text = f.read()

print(extractive_summary(text, num_sentences=3))
```
This approach works well for news articles, reports, and structured documents. It costs nothing and runs in milliseconds.
For higher quality summaries, use an LLM:
```python
from openai import OpenAI

client = OpenAI()

def summarize(text, length="medium"):
    length_map = {
        "brief": "Provide a 2-3 sentence summary.",
        "medium": "Provide a summary in 1-2 paragraphs.",
        "detailed": "Provide a detailed summary with bullet points for key takeaways."
    }
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a professional document summarizer. Be concise and accurate."},
            {"role": "user", "content": f"{length_map[length]}\n\nDocument:\n{text[:40000]}"}
        ],
        max_tokens=1000,
        temperature=0.3
    )
    return response.choices[0].message.content

summary = summarize(text, length="brief")
print(summary)
```
Setting `temperature=0.3` keeps the output focused and factual. Higher temperatures produce more creative but potentially less accurate summaries.
LLMs have token limits. For documents longer than the context window, use a map-reduce approach:
```python
def summarize_long_document(text, chunk_size=30000):
    # Split into chunks
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
    if len(chunks) == 1:
        return summarize(chunks[0])

    # Map: summarize each chunk
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        summary = summarize(chunk, length="medium")
        chunk_summaries.append(summary)
        print(f"Summarized chunk {i+1}/{len(chunks)}")

    # Reduce: combine summaries
    combined = "\n\n".join(chunk_summaries)
    return summarize(combined, length="detailed")
```
This handles documents of any length — books, legal contracts, research papers — by breaking them into manageable pieces.
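One caveat: slicing on raw character offsets can cut a sentence (or a word) in half at every chunk boundary, which slightly degrades the per-chunk summaries. A sentence-aware splitter avoids that; here is a sketch using plain string splitting (a real version might reuse `sent_tokenize` from the extractive example):

```python
def chunk_by_sentence(text, chunk_size=30000):
    """Split text into chunks of at most chunk_size characters,
    breaking only at sentence boundaries."""
    sentences = text.replace('\n', ' ').split('. ')
    chunks, current = [], ''
    for sent in sentences:
        if not sent.strip():
            continue
        piece = sent if sent.endswith('.') else sent + '.'
        # Start a new chunk rather than overflow the current one
        if current and len(current) + len(piece) + 1 > chunk_size:
            chunks.append(current.strip())
            current = ''
        current += piece + ' '
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

You could swap this in for the character slice inside `summarize_long_document` so every chunk the LLM sees is made of whole sentences.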
If you don't want to set up a development environment, Reformat's AI Document Summarizer lets you upload any PDF, Word, or text file and get an instant summary. Choose between brief, medium, and detailed summaries — no API keys or coding required.
The tool uses GPT-4o-mini under the hood with optimized prompts for different document types, and it's free for 2 uses per day.