
AI Document Summarization: Techniques, Tools, and Best Practices

A comprehensive guide to AI-powered document summarization — from extractive vs abstractive approaches to building your own summarizer with Python.

March 17, 2026 · 3 min read

Two Approaches to Summarization

AI summarization falls into two categories:

Extractive summarization pulls the most important sentences directly from the original text. It doesn't generate new words; it selects and concatenates existing sentences. Think of it as highlighting a textbook.

Abstractive summarization generates new text that captures the meaning of the original. It can paraphrase, combine ideas, and produce text that wasn't in the source. This is what GPT-4 and Claude do.

| Feature     | Extractive                  | Abstractive         |
|-------------|-----------------------------|---------------------|
| Accuracy    | High (uses original text)   | Can hallucinate     |
| Readability | Choppy (stitched sentences) | Natural and fluent  |
| Speed       | Fast                        | Slower (needs LLM)  |
| Cost        | Free (runs locally)         | Costs per request   |

In practice, the best systems combine both: extract key sections first, then use an LLM to synthesize them into a coherent summary.
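As a rough illustration of that hybrid idea, the sketch below runs a cheap extractive pass and hands only the selected sentences to an abstractive step. The names select_key_sentences, hybrid_summary, and llm_summarize are hypothetical, the sentence splitter is a naive regex rather than a real tokenizer, and the abstractive step is passed in as a plain function so any LLM client can be plugged in.

```python
import re
from collections import Counter
from typing import Callable


def select_key_sentences(text: str, max_sentences: int = 10) -> str:
    """Extractive pass: keep the highest-scoring sentences in original order."""
    # Naive split on ., ! or ? followed by whitespace (an assumption; a real
    # tokenizer such as nltk's sent_tokenize handles abbreviations better).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole document (no stopword removal here).
    freq = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    # Rank sentence indices by the summed frequency of their words.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in re.findall(r"[a-z0-9]+", sentences[i].lower())),
        reverse=True,
    )
    keep = sorted(ranked[:max_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)


def hybrid_summary(
    text: str,
    llm_summarize: Callable[[str], str],  # hypothetical: any text -> summary function
    max_sentences: int = 10,
) -> str:
    """Extract key sentences first, then let an abstractive model rewrite them."""
    condensed = select_key_sentences(text, max_sentences)
    return llm_summarize(condensed)
```

Because the LLM only sees the condensed text, the request is smaller and cheaper, at the price of possibly dropping context that the extractive pass scored poorly.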

Building an Extractive Summarizer with Python

You can build a surprisingly good extractive summarizer with just nltk and basic statistics:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import Counter

# One-time downloads of the tokenizer models and stopword list
nltk.download('punkt_tab')
nltk.download('stopwords')

def extractive_summary(text, num_sentences=5):
    sentences = sent_tokenize(text)
    words = word_tokenize(text.lower())

    # Remove stopwords and punctuation
    stop = set(stopwords.words('english'))
    words = [w for w in words if w.isalnum() and w not in stop]

    # Score sentences by word frequency
    freq = Counter(words)
    scored = []
    for i, sent in enumerate(sentences):
        sent_words = word_tokenize(sent.lower())
        score = sum(freq[w] for w in sent_words if w in freq)
        scored.append((score, i))

    # Pick the top sentences by score (earlier sentences win ties).
    # Tracking indices keeps the count exact even if a sentence repeats.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:num_sentences]
    top_idx = sorted(i for _, i in top)

    # Return top sentences in original order
    return ' '.join(sentences[i] for i in top_idx)

# Usage
with open('document.txt', encoding='utf-8') as f:
    text = f.read()

print(extractive_summary(text, num_sentences=3))

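This approach works well for news articles, reports, and other structured documents, where key terms tend to recur in the most important sentences. It needs no API calls, so it costs nothing to run, and on documents of ordinary length it finishes in a fraction of a second.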
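One caveat with summing raw word frequencies is that long sentences tend to win simply because they contain more words. A common adjustment, shown here as a minimal sketch rather than part of the summarizer above, is to divide each score by the number of scored words. The helper names tokenize and score_sentence and the toy frequency table are made up for illustration.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens (a simple stand-in for nltk)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_sentence(sentence: str, freq: Counter, normalize: bool = True) -> float:
    """Sum the frequencies of known words, optionally averaged per word."""
    words = [w for w in tokenize(sentence) if w in freq]
    if not words:
        return 0.0
    total = sum(freq[w] for w in words)
    return total / len(words) if normalize else float(total)


# Toy frequency table standing in for a real document's word counts.
freq = Counter({"solar": 3, "power": 2, "grid": 1, "storage": 1, "cost": 1})
short = "Solar power."
long_ = "Grid storage cost and grid storage cost and grid storage cost."

# Raw sums favor the long sentence: 5.0 vs 9.0
print(score_sentence(short, freq, normalize=False), score_sentence(long_, freq, normalize=False))
# Averaged scores favor the sentence with the more frequent words: 2.5 vs 1.0
print(score_sentence(short, freq), score_sentence(long_, freq))
```

Whether normalization helps depends on the material: averaging can over-reward very short sentences, so some implementations also skip sentences below a minimum length.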

Abstractive Summarization with OpenAI

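For higher-quality summaries, use an LLM. The example below uses the openai Python package, whose client reads your API key from the OPENAI_API_KEY environment variable: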

from openai import OpenAI

client = OpenAI()

def summarize(text, length="medium"):
    length_map = {
        "brief": "Provide a 2-3 sentence summary.",
        "medium": "Provide a summary in 1-2 paragraphs.",
        "detailed": "Provide a detailed summary with bullet points for key takeaways."
    }

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a professional document summarizer. Be concise and accurate."},
            {"role": "user", "content": f"{length_map[length]}\n\nDocument:\n{text[:40000]}"}
        ],
        max_tokens=1000,
        temperature=0.3
    )

    return response.choices[0].message.content

summary = summarize(text, length="brief")
print(summary)

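Setting temperature=0.3 reduces randomness in the output, which keeps summaries focused and more consistent between runs. Higher temperatures produce more varied, creative wording but raise the risk of inaccurate summaries. A low temperature does not rule out hallucination, so check important facts against the source. Note also that text[:40000] silently truncates the input to its first 40,000 characters.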

Handling Long Documents

LLMs have token limits. For documents longer than the context window, use a map-reduce approach:

  1. Map: summarize each chunk independently
  2. Reduce: combine the chunk summaries into a final summary

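def summarize_long_document(text, chunk_size=30000):
    # Split into fixed-size chunks (chunk_size is in characters, not tokens)
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

    if len(chunks) == 1:
        return summarize(chunks[0])

    # Map: summarize each chunk
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        summary = summarize(chunk, length="medium")
        chunk_summaries.append(summary)
        print(f"Summarized chunk {i+1}/{len(chunks)}")

    # Reduce: combine summaries
    combined = "\n\n".join(chunk_summaries)

    # If the combined summaries are still too long for one request,
    # run another map-reduce pass instead of letting summarize() truncate them
    if len(combined) > chunk_size:
        return summarize_long_document(combined, chunk_size)

    return summarize(combined, length="detailed")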

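This lets you process documents far longer than a single request allows, such as books, legal contracts, and research papers, by breaking them into manageable pieces. Keep in mind that chunk_size is measured in characters rather than tokens, and that fixed-size slicing can cut a chunk in the middle of a sentence.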
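Slicing at fixed character offsets can split a sentence across two chunks, which may confuse the per-chunk summaries. One possible alternative, sketched below under the assumption that a simple regex split on sentence-ending punctuation is good enough, is to pack whole sentences into each chunk. The function name chunk_by_sentences is hypothetical.

```python
import re
from typing import List


def chunk_by_sentences(text: str, max_chars: int = 30000) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters."""
    # Naive sentence split: ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Fallback: a single sentence longer than the limit is hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the current chunk
        else:
            chunks.append(current)  # close the chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

In summarize_long_document, the slicing line could then be swapped for chunks = chunk_by_sentences(text, chunk_size), leaving the map and reduce steps unchanged.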

Try It Without Code

If you don't want to set up a development environment, Reformat's AI Document Summarizer lets you upload any PDF, Word, or text file and get an instant summary. Choose between brief, medium, and detailed summaries — no API keys or coding required.

The tool uses GPT-4o-mini under the hood with optimized prompts for different document types, and it's free for 2 uses per day.
