What is RAG and Why Does It Matter?
Retrieval-Augmented Generation (RAG) is a technique that combines the power of large language models with your own data. Instead of relying solely on the LLM's training data, RAG fetches relevant documents from a knowledge base and feeds them to the model as context.
This solves three critical problems with vanilla LLMs:
- Hallucination — the model invents facts. RAG grounds answers in real documents.
- Stale knowledge — LLMs have a training cutoff. RAG uses your latest data.
- Domain specificity — your company docs, policies, and data aren't in the training set.
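Before diving into LangChain, it helps to see the retrieve-then-generate loop in miniature. The sketch below is a toy, not a real implementation: it scores documents by word overlap instead of vector embeddings, and the knowledge base, `retrieve`, and `build_prompt` are illustrative names invented for this example.

```python
# Toy sketch of the RAG loop: retrieve relevant text, then ground the
# model's answer in it. Real systems replace word overlap with embeddings.

KNOWLEDGE_BASE = [
    "The refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am to 5pm, Monday through Friday.",
    "Shipping is free on orders over 50 dollars.",
]

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the question."""
    q_words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(question: str, context: list[str]) -> str:
    """Feed the retrieved text to the LLM as context."""
    return f"Context: {' '.join(context)}\nQuestion: {question}\nAnswer:"

context = retrieve("What is the refund policy?", KNOWLEDGE_BASE)
prompt = build_prompt("What is the refund policy?", context)
print(context[0])  # the refund-policy document scores highest
```

Everything after this point is the same loop with production-grade parts: embeddings instead of word overlap, a vector database instead of a Python list, and an LLM instead of string formatting.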
In this tutorial, you'll build a complete RAG chatbot that can answer questions about any PDF or text document you upload.
Prerequisites
Before you begin, make sure you have:
- Python 3.10 or higher installed
- An OpenAI API key (sign up at platform.openai.com)
- Basic familiarity with Python and the command line
Install the required packages:
pip install langchain langchain-community langchain-openai chromadb pypdf
Set your API key as an environment variable:
export OPENAI_API_KEY="sk-your-key-here"
Step 1 — Load and Split Your Documents
First, load your PDF documents and split them into chunks. Chunking is essential because LLMs have limited context windows, and smaller chunks produce more precise retrieval.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Load a PDF
loader = PyPDFLoader("your-document.pdf")
pages = loader.load()
# Split into chunks of 1000 characters with 200-character overlap
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(pages)
print(f"Created {len(chunks)} chunks from {len(pages)} pages")
The chunk_overlap parameter ensures that sentences split across chunk boundaries are still captured. A value of 200 characters works well for most documents.
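The effect of overlap is easy to demonstrate without LangChain at all. This plain-Python sliding window is a simplified stand-in for what the splitter does: with overlap, text near a chunk boundary appears in two consecutive chunks, so a sentence cut in half by one chunk survives intact in its neighbor.

```python
# Simplified sliding-window chunker: each chunk starts (chunk_size - overlap)
# characters after the previous one, so consecutive chunks share `overlap`
# characters of text.

def sliding_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A "fact" placed so it straddles the 1000-character boundary:
text = "A" * 990 + "IMPORTANT FACT" + "B" * 900

with_overlap = sliding_chunks(text, chunk_size=1000, overlap=200)
no_overlap = sliding_chunks(text, chunk_size=1000, overlap=0)

print(any("IMPORTANT FACT" in c for c in with_overlap))  # True
print(any("IMPORTANT FACT" in c for c in no_overlap))    # False
```

Without overlap, the fact is split across two chunks and neither copy is retrievable as a whole; with a 200-character overlap, the second chunk contains it intact.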
Step 2 — Create a Vector Store with ChromaDB
Next, convert your text chunks into vector embeddings and store them in ChromaDB, a lightweight vector database that runs locally.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Create embeddings using OpenAI's text-embedding-3-small model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Store in ChromaDB (persists to disk)
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
print(f"Stored {len(chunks)} vectors in ChromaDB")
The embedding model converts each text chunk into a 1536-dimensional vector. When you ask a question, it converts your question into a vector too, then finds the closest matching chunks using cosine similarity.
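Cosine similarity itself is a few lines of arithmetic. The toy 3-dimensional vectors below are made up for illustration (real embeddings have 1536 dimensions), but the formula is the same one the vector store applies:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for illustration:
question = [0.9, 0.1, 0.0]
relevant_chunk = [0.8, 0.2, 0.1]    # points in nearly the same direction
unrelated_chunk = [0.0, 0.1, 0.9]   # points in a very different direction

print(round(cosine_similarity(question, relevant_chunk), 3))    # 0.984
print(round(cosine_similarity(question, unrelated_chunk), 3))   # 0.012
```

The retriever simply ranks all stored chunks by this score and returns the top matches.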
Step 3 — Build the RAG Chain
Now connect everything into a retrieval chain that fetches relevant context and generates an answer:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)
# Create a retriever that fetches the top 4 most relevant chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# Custom prompt template
prompt = PromptTemplate(
template="""Use the following context to answer the question. If the answer is not in the context, say "I don't have enough information to answer that."
Context: {context}
Question: {question}
Answer:""",
input_variables=["context", "question"]
)
# Build the chain
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=retriever,
chain_type_kwargs={"prompt": prompt}
)
# Ask a question
result = qa_chain.invoke({"query": "What are the key findings in this document?"})
print(result["result"])
Step 4 — Add Conversation Memory
To make it a proper chatbot that remembers previous messages, add conversation memory:
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationalRetrievalChain
memory = ConversationBufferWindowMemory(
memory_key="chat_history",
return_messages=True,
k=5 # Remember last 5 exchanges
)
chat_chain = ConversationalRetrievalChain.from_llm(
llm=llm,
retriever=retriever,
memory=memory,
)
# First question
result = chat_chain.invoke({"question": "What is this document about?"})
print(result["answer"])
# Follow-up (it remembers the context)
result = chat_chain.invoke({"question": "Can you elaborate on the second point?"})
print(result["answer"])
The ConversationBufferWindowMemory keeps the last 5 exchanges in memory, which is enough for most conversations without exceeding token limits.
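The windowing behavior is conceptually just a fixed-length queue. This sketch (a simplification, not LangChain's actual implementation) shows why old exchanges fall out of context:

```python
from collections import deque

# Sketch of windowed chat memory: keep only the last k exchanges.
# deque(maxlen=k) silently drops the oldest entry when a new one arrives.
class WindowMemory:
    def __init__(self, k: int):
        self.exchanges = deque(maxlen=k)

    def save(self, question: str, answer: str) -> None:
        self.exchanges.append((question, answer))

    def history(self) -> list[tuple[str, str]]:
        return list(self.exchanges)

memory = WindowMemory(k=5)
for i in range(8):
    memory.save(f"question {i}", f"answer {i}")

print(len(memory.history()))   # 5
print(memory.history()[0][0])  # question 3 (questions 0-2 were dropped)
```

This is the trade-off to keep in mind: a follow-up that refers to something from six exchanges ago will no longer have that context available.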
Cost Optimization Tips
RAG with OpenAI can get expensive if not managed well. Here are practical tips:
1. Use text-embedding-3-small instead of ada-002 — it's 5x cheaper and performs better.
2. Use gpt-4o-mini for most queries — it's 15x cheaper than gpt-4o and handles RAG well.
3. Cache embeddings — don't re-embed documents that haven't changed.
4. Limit retrieved chunks — 3-4 chunks is usually enough. More ≠ better.
5. Truncate long chunks — if a chunk exceeds 500 tokens, the signal-to-noise ratio drops.
With these optimizations, a typical RAG application costs $0.001-0.005 per query — roughly $1 for 200-1000 queries.
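You can sanity-check that estimate with back-of-envelope arithmetic. The per-million-token prices below are assumptions for illustration; check OpenAI's current pricing page before relying on them:

```python
# Rough per-query cost model for a RAG query. Prices are assumed values
# for illustration, in USD per 1M tokens; verify against current pricing.

PRICE_PER_1M_INPUT = 0.15    # assumed gpt-4o-mini input price
PRICE_PER_1M_OUTPUT = 0.60   # assumed gpt-4o-mini output price
PRICE_PER_1M_EMBED = 0.02    # assumed text-embedding-3-small price

def cost_per_query(context_tokens: int, question_tokens: int,
                   answer_tokens: int) -> float:
    embed = question_tokens / 1e6 * PRICE_PER_1M_EMBED
    prompt = (context_tokens + question_tokens) / 1e6 * PRICE_PER_1M_INPUT
    completion = answer_tokens / 1e6 * PRICE_PER_1M_OUTPUT
    return embed + prompt + completion

# 4 retrieved chunks of ~500 tokens each, a short question, a medium answer:
cost = cost_per_query(context_tokens=2000, question_tokens=30, answer_tokens=300)
print(f"${cost:.5f} per query")
```

Under these assumptions a query lands around half a tenth of a cent, consistent with the sub-cent-per-query range above; larger contexts or longer answers push the number up proportionally.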
Conclusion
You've built a complete RAG chatbot that can answer questions from any document. The key components are:
- Document loading and chunking — split your docs into searchable pieces
- Vector embeddings — convert text to numbers for similarity search
- Retrieval — find the most relevant chunks for each question
- Generation — use an LLM to synthesize a natural language answer
To try document Q&A without building anything, use Reformat's Chat with Document tool — upload any PDF and start asking questions instantly.
Next steps: Add a web UI with Streamlit or Gradio, support multiple file formats, or deploy as an API with FastAPI.