Learn how to build a Retrieval-Augmented Generation chatbot that answers questions from your own documents using LangChain, OpenAI, and a vector database.
Retrieval-Augmented Generation (RAG) is a technique that combines the power of large language models with your own data. Instead of relying solely on the LLM's training data, RAG fetches relevant documents from a knowledge base and feeds them to the model as context.
This solves three critical problems with vanilla LLMs: hallucinated answers on topics outside the training data, a knowledge cutoff that leaves recent information out of reach, and no awareness of your private or domain-specific documents.
In this tutorial, you'll build a complete RAG chatbot that can answer questions about any PDF or text document you upload.
Before you begin, make sure you have a recent version of Python installed, an OpenAI API key, and basic familiarity with Python.
Install the required packages:
pip install langchain langchain-community langchain-openai chromadb pypdf
Set your API key as an environment variable:
export OPENAI_API_KEY="sk-your-key-here"
First, load your PDF documents and split them into chunks. Chunking is essential because LLMs have limited context windows, and smaller chunks produce more precise retrieval.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load a PDF
loader = PyPDFLoader("your-document.pdf")
pages = loader.load()
# Split into ~1,000-character chunks with 200 characters of overlap
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(pages)
print(f"Created {len(chunks)} chunks from {len(pages)} pages")
The chunk_overlap parameter ensures that sentences split across chunk boundaries are still captured. A value of 200 characters works well for most documents.
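You can see the overlap on your own document by comparing the tail of one chunk with the head of the next (the exact amount of shared text depends on where the separators fall):

# Compare neighbouring chunks: the end of one chunk reappears
# at the start of the next because of chunk_overlap
print(chunks[0].page_content[-100:])
print("---")
print(chunks[1].page_content[:100])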
Next, convert your text chunks into vector embeddings and store them in ChromaDB, a lightweight vector database that runs locally.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Create embeddings using OpenAI's text-embedding-3-small model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Store in ChromaDB (persists to disk)
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
print(f"Stored {len(chunks)} vectors in ChromaDB")
The embedding model converts each text chunk into a 1536-dimensional vector. When you ask a question, it converts your question into a vector too, then finds the closest matching chunks using cosine similarity.
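You can sanity-check retrieval with a direct similarity search before wiring up the full chain (the query below is just a placeholder):

# Embed a test question and pull back the two closest chunks
docs = vectorstore.similarity_search("What is the main topic?", k=2)
for doc in docs:
    print(doc.page_content[:200])

# On a later run, reload the persisted index instead of re-embedding everything:
# vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)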
Now connect everything into a retrieval chain that fetches relevant context and generates an answer:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)
# Create a retriever that fetches the top 4 most relevant chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# Custom prompt template
prompt = PromptTemplate(
template="""Use the following context to answer the question. If the answer is not in the context, say "I don't have enough information to answer that."
Context: {context}
Question: {question}
Answer:""",
input_variables=["context", "question"]
)
# Build the chain
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=retriever,
chain_type_kwargs={"prompt": prompt}
)
# Ask a question
result = qa_chain.invoke("What are the key findings in this document?")
print(result["result"])
To make it a proper chatbot that remembers previous messages, add conversation memory:
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationalRetrievalChain
memory = ConversationBufferWindowMemory(
memory_key="chat_history",
return_messages=True,
k=5 # Remember last 5 exchanges
)
chat_chain = ConversationalRetrievalChain.from_llm(
llm=llm,
retriever=retriever,
memory=memory,
)
# First question
result = chat_chain.invoke({"question": "What is this document about?"})
print(result["answer"])
# Follow-up (it remembers the context)
result = chat_chain.invoke({"question": "Can you elaborate on the second point?"})
print(result["answer"])
The ConversationBufferWindowMemory keeps the last 5 exchanges in memory, which is enough for most conversations without exceeding token limits.
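If you want to see exactly what history the model receives, you can inspect the buffer at any point (a quick sanity check, not required for the chain to work):

# Print the messages currently held by the window memory
history = memory.load_memory_variables({})["chat_history"]
for message in history:
    print(f"{message.type}: {message.content}")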
RAG with OpenAI can get expensive if not managed well. Practical ways to keep costs down: keep the retriever's k small (the chain above fetches only four chunks per query), use a compact chat model such as gpt-4o-mini, persist the ChromaDB index so documents are embedded only once, and keep chunks modest in size so each query sends less context to the model. With these optimizations, a typical RAG application costs $0.001 to $0.005 per query, roughly $1 for every 200 to 1,000 queries.
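To measure what your own queries actually cost, LangChain's get_openai_callback context manager reports token usage and an estimated dollar cost for OpenAI calls made inside it (exact reporting can vary by LangChain version):

from langchain_community.callbacks import get_openai_callback

# Track token usage and estimated cost for a single query
with get_openai_callback() as cb:
    result = qa_chain.invoke("Summarize the introduction.")

print(f"Total tokens: {cb.total_tokens}")
print(f"Estimated cost: ${cb.total_cost:.4f}")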
You've built a complete RAG chatbot that can answer questions from any document. The key components are a document loader and text splitter for chunking, an OpenAI embedding model paired with a ChromaDB vector store, a retrieval chain that grounds the LLM's answers in your data, and conversation memory for follow-up questions.
To try document Q&A without building anything, use Reformat's Chat with Document tool — upload any PDF and start asking questions instantly.
Next steps: Add a web UI with Streamlit or Gradio, support multiple file formats, or deploy as an API with FastAPI.
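As a starting point for the web UI, here is a minimal Streamlit sketch. It assumes the chain-building code above is saved in a separate module; the rag_chain name is just a placeholder:

# app.py -- minimal Streamlit front end around the conversational chain
import streamlit as st
from rag_chain import chat_chain  # hypothetical module containing the code above

st.title("Chat with your document")

question = st.text_input("Ask a question about your document")
if question:
    result = chat_chain.invoke({"question": question})
    st.write(result["answer"])

Run it with streamlit run app.py and you have a simple chat interface over your documents.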