AI/ML

How to Build a RAG Chatbot with LangChain and OpenAI in 2026

Learn how to build a Retrieval-Augmented Generation chatbot that answers questions from your own documents using LangChain, OpenAI, and a vector database.

March 20, 2026 · 4 min read

What is RAG and Why Does It Matter?

Retrieval-Augmented Generation (RAG) is a technique that combines the power of large language models with your own data. Instead of relying solely on the LLM's training data, RAG fetches relevant documents from a knowledge base and feeds them to the model as context.

This solves three critical problems with vanilla LLMs:

  • Hallucination — the model invents facts. RAG grounds answers in real documents.
  • Stale knowledge — LLMs have a training cutoff. RAG uses your latest data.
  • Domain specificity — your company docs, policies, and data aren't in the training set.

In this tutorial, you'll build a complete RAG chatbot that can answer questions about any PDF or text document you upload.

Prerequisites

Before you begin, make sure you have:

  • Python 3.10 or higher installed
  • An OpenAI API key (sign up at platform.openai.com)
  • Basic familiarity with Python and the command line

Install the required packages:

pip install langchain langchain-community langchain-openai chromadb pypdf

Set your API key as an environment variable:

export OPENAI_API_KEY="sk-your-key-here"

Step 1 — Load and Split Your Documents

First, load your PDF documents and split them into chunks. Chunking is essential because LLMs have limited context windows, and smaller chunks produce more precise retrieval.

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load a PDF
loader = PyPDFLoader("your-document.pdf")
pages = loader.load()

# Split into chunks of ~1000 characters with 200-character overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(pages)

print(f"Created {len(chunks)} chunks from {len(pages)} pages")

The chunk_overlap parameter ensures that sentences split across chunk boundaries are still captured. A value of 200 characters works well for most documents.
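To see what overlap actually does, here is a minimal, hypothetical character-based chunker. This is not LangChain's real splitter logic (which recursively tries the separators in order); it only illustrates how each chunk repeats the tail of the previous one:

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Naive fixed-size chunking: each chunk starts chunk_size - overlap
    characters after the previous one, so boundary text appears twice."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text, chunk_size=1000, overlap=200)

print(len(chunks))                              # → 4
print(chunks[0][-200:] == chunks[1][:200])      # → True: overlap is shared
```

Because the last 200 characters of one chunk are the first 200 of the next, a sentence cut at a boundary is still seen whole in at least one chunk.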

Step 2 — Create a Vector Store with ChromaDB

Next, convert your text chunks into vector embeddings and store them in ChromaDB, a lightweight vector database that runs locally.

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Create embeddings using OpenAI's text-embedding-3-small model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Store in ChromaDB (persists to disk)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

print(f"Stored {len(chunks)} vectors in ChromaDB")

The embedding model converts each text chunk into a 1536-dimensional vector. When you ask a question, it converts your question into a vector too, then finds the closest matching chunks using cosine similarity.
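Under the hood, "closest matching" is plain vector math. Here is a sketch of cosine similarity using tiny 3-dimensional toy vectors in place of real 1536-dimensional embeddings (the vector values are made up for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

question = [0.9, 0.1, 0.0]
on_topic_chunk = [0.8, 0.2, 0.1]
off_topic_chunk = [0.0, 0.1, 0.9]

print(cosine_similarity(question, on_topic_chunk))   # close to 1.0
print(cosine_similarity(question, off_topic_chunk))  # close to 0.0
```

The retriever simply ranks all stored chunk vectors by this score against the question vector and returns the top k.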

Step 3 — Build the RAG Chain

Now connect everything into a retrieval chain that fetches relevant context and generates an answer:

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)

# Create a retriever that fetches the top 4 most relevant chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Custom prompt template
prompt = PromptTemplate(
    template="""Use the following context to answer the question. If the answer is not in the context, say "I don't have enough information to answer that."

Context: {context}

Question: {question}

Answer:""",
    input_variables=["context", "question"]
)

# Build the chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt}
)

# Ask a question
result = qa_chain.invoke("What are the key findings in this document?")
print(result["result"])

Step 4 — Add Conversation Memory

To make it a proper chatbot that remembers previous messages, add conversation memory:

from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationalRetrievalChain

memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    return_messages=True,
    k=5  # Remember last 5 exchanges
)

chat_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
)

# First question
result = chat_chain.invoke({"question": "What is this document about?"})
print(result["answer"])

# Follow-up (it remembers the context)
result = chat_chain.invoke({"question": "Can you elaborate on the second point?"})
print(result["answer"])

The ConversationBufferWindowMemory keeps the last 5 exchanges in memory, which is enough for most conversations without exceeding token limits.
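The windowing behavior itself is easy to picture. Here is a hypothetical sketch using a bounded deque (not LangChain's internal implementation): old exchanges simply fall off the end once the window is full.

```python
from collections import deque

class WindowMemory:
    """Keeps only the last k (question, answer) exchanges."""
    def __init__(self, k: int):
        self.exchanges = deque(maxlen=k)  # oldest items are evicted automatically

    def save(self, question: str, answer: str) -> None:
        self.exchanges.append((question, answer))

    def history(self) -> list[tuple[str, str]]:
        return list(self.exchanges)

memory = WindowMemory(k=5)
for i in range(8):
    memory.save(f"question {i}", f"answer {i}")

print(len(memory.history()))   # → 5: only the last five exchanges survive
print(memory.history()[0][0])  # → "question 3"
```

Because the prompt only ever carries k exchanges, token usage stays bounded no matter how long the conversation runs.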

Cost Optimization Tips

RAG with OpenAI can get expensive if not managed well. Here are practical tips:

  1. Use text-embedding-3-small instead of ada-002 — it's 5x cheaper and performs better.
  2. Use gpt-4o-mini for most queries — it's 15x cheaper than gpt-4o and handles RAG well.
  3. Cache embeddings — don't re-embed documents that haven't changed.
  4. Limit retrieved chunks — 3-4 chunks is usually enough. More ≠ better.
  5. Truncate long chunks — if a chunk exceeds 500 tokens, the signal-to-noise ratio drops.

With these optimizations, a typical RAG application costs $0.001-0.005 per query — roughly $1 for 200-1000 queries.
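As a rough back-of-envelope check, here is a per-query cost estimate. All prices and token counts below are illustrative assumptions, not quoted figures; check OpenAI's current pricing page before budgeting:

```python
# Illustrative per-query cost for gpt-4o-mini + text-embedding-3-small.
# Prices are assumed example values in $ per 1M tokens.
EMBED_PRICE_PER_1M = 0.02    # embedding the question
INPUT_PRICE_PER_1M = 0.15    # gpt-4o-mini input tokens
OUTPUT_PRICE_PER_1M = 0.60   # gpt-4o-mini output tokens

question_tokens = 30
context_tokens = 4 * 400     # 4 retrieved chunks of ~400 tokens each
answer_tokens = 250

cost = (
    question_tokens * EMBED_PRICE_PER_1M / 1e6
    + (question_tokens + context_tokens) * INPUT_PRICE_PER_1M / 1e6
    + answer_tokens * OUTPUT_PRICE_PER_1M / 1e6
)
print(f"~${cost:.5f} per query")
```

With these assumptions a query lands well under a cent; longer chat history, more chunks, or a larger model push it toward the upper end of the range above.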

Conclusion

You've built a complete RAG chatbot that can answer questions from any document. The key components are:

  • Document loading and chunking — split your docs into searchable pieces
  • Vector embeddings — convert text to numbers for similarity search
  • Retrieval — find the most relevant chunks for each question
  • Generation — use an LLM to synthesize a natural language answer

To try document Q&A without building anything, use Reformat's Chat with Document tool — upload any PDF and start asking questions instantly.

Next steps: Add a web UI with Streamlit or Gradio, support multiple file formats, or deploy as an API with FastAPI.
