This guide shows you how to make a chatbot answer from your own documents in under thirty minutes, using nothing but the openai SDK and numpy. By the end you will have a runnable Python script that reads your files, finds the passages relevant to any question, and feeds them to the model so it stops guessing and starts citing your facts.
The technique is called RAG, short for Retrieval-Augmented Generation. In plain terms: before the model writes an answer, your code retrieves the few passages from your documents that best match the question, then the model generates its reply using that supplied text. The model is never trained on your data; you simply hand it the right reference material at the moment it answers, the way you might slide an open manual across the desk before asking a colleague a question.
This is one of the guides under Custom AI Chatbot Development. If you have not built a basic bot yet, start there, then come back to make it answer from your knowledge base.
Why RAG instead of just asking the model?
A general model knows nothing about your return policy, your pricing, or last week's release notes, and it will confidently invent an answer rather than admit the gap. You could paste your entire handbook into every prompt, but that is slow, expensive, and eventually too large for the model to read. RAG is the middle path: store your documents once, and at question time attach only the handful of passages that actually matter.
The matching happens through embeddings — numeric fingerprints of text where similar meanings land close together in mathematical space. The question "how long do I have to return something?" and the sentence "Returns are accepted within 30 days" produce vectors pointing in nearly the same direction, even though they share almost no words. That is the whole trick: embeddings match on meaning, not keywords, so customers do not have to phrase questions exactly the way your documents are written.
Prerequisites
You need Python 3.10 or newer (python3 --version to check) and an OpenAI API key. This guide assumes you have already completed the setup in the parent guide; if not, Custom AI Chatbot Development covers installing the SDK and storing your key.
Work inside a virtual environment and install the two libraries this guide adds on top of the SDK:
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install "openai>=1.40" "httpx>=0.27" python-dotenv numpy
Store your key and model names in a .env file so they never end up in your code:
# .env
OPENAI_API_KEY=sk-your-real-key-here
CHAT_MODEL=gpt-4o-mini
EMBED_MODEL=text-embedding-3-small
Add .env to your .gitignore immediately so the key is never committed:
echo ".env" >> .gitignore
If a request later fails with an authentication error, Fix the 401 Unauthorized Error in OpenAI Python covers every cause.
Step 1: Split your documents into chunks
You cannot embed a whole 50-page handbook as one vector — the meaning would blur into mush and every search would return the entire document. Instead you chunk it: split the text into passages of a few hundred words each, with a small overlap so a sentence split across a boundary still appears whole in one chunk.
The function below splits on words and slides a window forward, leaving an overlap between neighbours. It works on any plain string, so you can feed it the contents of a .txt or .md file.
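def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks."""
    if not 0 <= overlap < chunk_size:
        # otherwise the window would never move forward
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # this window reached the end; skip a redundant tail chunk
        start += chunk_size - overlap  # step forward, keeping an overlap
    return chunks

with open("handbook.txt", encoding="utf-8") as f:
    sample = f.read()
chunks = chunk_text(sample)
print(f"Split into {len(chunks)} chunks")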
The overlap matters: without it, a fact that lands on a chunk boundary gets cut in half and may never match cleanly. Fifty words is a safe default. If your documents are already short and self-contained — like FAQ entries or product blurbs — skip chunking and treat each entry as one chunk.
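To see the overlap at work, it helps to run the sliding window on a handful of numbered words. The sketch below uses a hypothetical `chunk_words` helper, a simplified stand-in for chunk_text that returns word lists instead of joined strings, with tiny sizes chosen purely for illustration.

```python
def chunk_words(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size words, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]


words = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10".split()
for chunk in chunk_words(words, chunk_size=4, overlap=1):
    print(chunk)
# ['w0', 'w1', 'w2', 'w3']
# ['w3', 'w4', 'w5', 'w6']
# ['w6', 'w7', 'w8', 'w9']
# ['w9', 'w10']
```

Each chunk starts with the last word of the one before it, so a phrase that straddles a boundary still appears whole in at least one chunk, provided it is no longer than the overlap plus one word.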
Step 2: Create embeddings for each chunk
Now turn every chunk into a vector. You send your chunks to the embeddings endpoint and get back one list of numbers per chunk. You do this once at startup (or whenever your documents change) and keep the result in memory; embedding the knowledge base is the slow, paid part, so you never want to repeat it per question. Only the question itself is embedded at query time.
import os

import numpy as np
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI()
EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-3-small")

def embed(texts: list[str]) -> np.ndarray:
    """Return a 2-D array: one embedding row per input text."""
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([item.embedding for item in resp.data])

chunk_vectors = embed(chunks)  # embed the whole knowledge base once
print(chunk_vectors.shape)  # e.g. (42, 1536) -> 42 chunks, 1536 numbers each
The shape tells you everything: one row per chunk, and a fixed number of columns (1536 for text-embedding-3-small) that is the same for every piece of text. Because the width is fixed, you can compare questions and documents in the same space directly. The endpoint accepts a list of texts, so you can embed many chunks in a single call, which is far faster than looping one at a time. Each request is capped in both the number of inputs and the total tokens, so a very large knowledge base has to be sent in several batches.
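If the knowledge base outgrows what a single request accepts, the chunks can be sent in slices. The sketch below is a hypothetical helper, `embed_in_batches`, that takes the embedding function as a parameter so it can be tried without an API key. The batch size of 256 is an arbitrary assumption, not a documented limit, and `fake_embed` exists only for the demonstration.

```python
from collections.abc import Callable, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 256,
) -> list[list[float]]:
    """Call embed_fn on slices of at most batch_size texts; keep rows in input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    rows: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        rows.extend(embed_fn(batch))  # one row per text in this slice
    return rows


# Stand-in embedder for demonstration only: maps each text to a 2-number "vector".
def fake_embed(batch: list[str]) -> list[list[float]]:
    return [[float(len(t)), 1.0] for t in batch]


vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(len(vectors))  # 5
```

Passing the embed function from this step and wrapping the result in np.array(...) should give the same 2-D shape as a single call, since extending the list with a 2-D array adds one row at a time.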
Step 3: Search the vectors for the question
To answer a question, embed it the same way, then measure which chunk vectors point most nearly in the same direction as the question vector. The standard measure is cosine similarity: it scores two vectors from -1 (opposite) to 1 (identical direction) and ignores their magnitude, so only direction counts. You keep the highest-scoring top_k chunks.
def retrieve(question: str, chunks: list[str], chunk_vectors: np.ndarray,
             top_k: int = 3) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = embed([question])[0]
    # cosine similarity = dot product of length-normalised vectors
    scores = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    best = scores.argsort()[::-1][:top_k]  # indices of the highest scores
    return [chunks[i] for i in best]

hits = retrieve("How long do I have to return an item?", chunks, chunk_vectors)
for h in hits:
    print("-", h[:80], "...")
scores.argsort()[::-1] sorts the indices from lowest to highest score and then reverses them, so the best matches come first; the [:top_k] slice keeps only as many as you asked for. This is a brute-force search that compares the question against every chunk. That sounds expensive, but for a few hundred or even a few thousand chunks numpy does it in a blink. Reach for a dedicated vector store only when your collection grows into the tens of thousands.
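One optional refinement is to scale the chunk vectors to unit length once, so that each query needs only a dot product. The sketch below illustrates that idea on toy 2-D vectors; `normalise_rows` and `top_k_indices` are hypothetical names, and the numbers are invented so the ranking can be checked by hand.

```python
import numpy as np


def normalise_rows(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)  # guard against all-zero rows


def top_k_indices(unit_chunks: np.ndarray, question_vec: np.ndarray,
                  top_k: int = 3) -> list[int]:
    """Return indices of the top_k rows most similar to question_vec, best first."""
    q = question_vec / max(float(np.linalg.norm(question_vec)), 1e-12)
    scores = unit_chunks @ q
    return [int(i) for i in np.argsort(scores)[::-1][:top_k]]


# Toy 2-D vectors invented for illustration; real embeddings have far more dimensions.
toy_chunks = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
unit = normalise_rows(toy_chunks)
print(top_k_indices(unit, np.array([2.0, 0.1]), top_k=2))  # [0, 2]
```

The second row is twice as long as the first, yet it ranks last because it points away from the question vector: only direction affects the score.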
Step 4: Inject the top-k chunks into the prompt
The final step joins the retrieved chunks into a context block and pastes it into the system prompt, with a strict instruction to answer only from that text. This is what turns "the model's best guess" into "an answer grounded in your documents."
def answer(question: str, chunks: list[str], chunk_vectors: np.ndarray) -> str:
    context = "\n\n".join(retrieve(question, chunks, chunk_vectors))
    response = client.chat.completions.create(
        model=os.getenv("CHAT_MODEL", "gpt-4o-mini"),
        messages=[
            {"role": "system", "content": (
                "You answer questions using ONLY the context below. "
                "If the answer is not in the context, say you don't know.\n\n"
                f"Context:\n{context}"
            )},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(answer("How long do I have to return an item?", chunks, chunk_vectors))
Two details carry the reliability. First, temperature=0 makes the model pick its most likely wording each time, so answers are as repeatable as the API allows and stay close to the supplied text instead of drifting into creative paraphrase; it does not by itself guarantee correctness. Second, the "answer ONLY from the context" instruction is the single line that stops most invented answers; without it the model will happily blend your context with whatever it half-remembers. Building strict, format-controlling system prompts is a skill in itself, covered in Write System Prompts that Control Output Format.
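If you want the bot to point at the passage it used, one possible variation is to number the passages before they go into the prompt. The sketch below shows that idea with two hypothetical helpers, `build_context` and `build_system_prompt`. The citation wording is an assumption, and the model may not always follow it.

```python
def build_context(passages: list[str]) -> str:
    """Join passages into one block, labelling each as [1], [2], ... for citation."""
    return "\n\n".join(
        f"[{n}] {text.strip()}" for n, text in enumerate(passages, start=1)
    )


def build_system_prompt(passages: list[str]) -> str:
    """Wrap the labelled passages in the strict answer-only-from-context instruction."""
    return (
        "You answer questions using ONLY the context below. "
        "Cite the passage number you used, like [1]. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{build_context(passages)}"
    )


print(build_system_prompt([
    "Returns are accepted within 30 days of purchase with a valid receipt.",
    "Refunds are issued to the original payment method within 5 business days.",
]))
```

The returned string would take the place of the system message content in answer(), with the list from retrieve() passed in as passages.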
Parameter quick reference
These three knobs control retrieval quality. Tune chunk_size and top_k first; the embedding model rarely needs changing.
| Parameter | Typical value | Effect |
|---|---|---|
| chunk_size | 200-500 words | Smaller chunks give sharper, more precise matches; larger chunks keep related sentences together but dilute relevance. |
| top_k | 3-5 chunks | How many passages to inject. More gives the model fuller context but costs tokens and can bury the key fact; fewer is cheaper but risks missing the answer. |
| EMBED_MODEL | text-embedding-3-small | The cheap, fast default fits almost every use. Switch to text-embedding-3-large only if matches are noticeably weak and the extra cost is justified. |
Troubleshooting
- The bot answers "I don't know" when the answer clearly exists: retrieval missed the right chunk. Cause: chunks too large so the relevant fact was diluted, or top_k too low. Fix: shrink chunk_size toward 200 words and raise top_k to 5, then print the retrieved chunks to confirm the fact is in them.
- openai.BadRequestError about input length on embeddings.create: a single chunk is too long for the embedding model. Cause: a document with no whitespace or an overly large chunk_size. Fix: lower chunk_size, and confirm chunk_text actually split the text rather than returning one giant chunk.
- All similarity scores look almost identical: your chunks are too similar or too generic to tell apart. Cause: boilerplate text repeated across passages, or chunks so large every one touches every topic. Fix: chunk more finely and strip repeated headers or footers before embedding.
- BadRequestError: maximum context length when answering: the injected context plus the question is too long for the chat model. Cause: top_k too high or chunks too large. Fix: lower top_k, shrink chunk_size, or read Fix the Context-Length-Exceeded Error in Python.
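Several of these fixes come down to looking at what retrieval actually scored. The sketch below is a hypothetical diagnostic helper, `rank_chunks`, that returns every chunk with its cosine score instead of only the top few. It runs here on invented toy vectors; in the guide's code the inputs would come from embed().

```python
import numpy as np


def rank_chunks(question_vec: np.ndarray, chunk_vectors: np.ndarray,
                chunks: list[str]) -> list[tuple[float, str]]:
    """Return (cosine score, chunk) pairs for every chunk, best match first."""
    scores = chunk_vectors @ question_vec / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(question_vec)
    )
    order = np.argsort(scores)[::-1]
    return [(float(scores[i]), chunks[i]) for i in order]


# Toy data invented for illustration only.
toy_texts = ["returns policy", "delivery terms", "helmet colours"]
toy_vectors = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
for score, text in rank_chunks(np.array([1.0, 0.0]), toy_vectors, toy_texts):
    print(f"{score:.3f}  {text[:60]}")
```

A wide gap between the first score and the rest suggests retrieval is discriminating well, while a flat list of near-equal scores matches the "almost identical" symptom above.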
Worked example: a runnable RAG bot
This script ties all four steps into one program. It chunks a small in-line knowledge base, embeds it once, then answers questions from the terminal grounded in those chunks. Save it as rag_bot.py, make sure your .env is in place, and run python rag_bot.py.
import os
import numpy as np
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(timeout=20.0)
CHAT_MODEL = os.getenv("CHAT_MODEL", "gpt-4o-mini")
EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-3-small")

# A tiny knowledge base. In a real app, load these from your files.
DOCUMENTS = """
Returns are accepted within 30 days of purchase with a valid receipt.
Refunds are issued to the original payment method within 5 business days.
Kids' helmets are available in red, blue, and matte black, sizes XS to L.
Free local delivery applies to all orders over $75 within the city.
Our workshop offers free safety checks every Saturday from 9am to noon.
"""

def chunk_text(text: str, chunk_size: int = 60, overlap: int = 10) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([item.embedding for item in resp.data])

CHUNKS = chunk_text(DOCUMENTS)  # split the knowledge base
CHUNK_VECTORS = embed(CHUNKS)  # embed it once at startup

def retrieve(question: str, top_k: int = 3) -> str:
    q = embed([question])[0]
    scores = CHUNK_VECTORS @ q / (np.linalg.norm(CHUNK_VECTORS, axis=1) * np.linalg.norm(q))
    best = scores.argsort()[::-1][:top_k]
    return "\n\n".join(CHUNKS[i] for i in best)

def answer(question: str) -> str:
    context = retrieve(question)
    response = client.chat.completions.create(
        model=CHAT_MODEL,
        messages=[
            {"role": "system", "content": "Answer using ONLY the context below. "
             "If it is not in the context, say you don't know.\n\nContext:\n" + context},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print("Docs bot ready. Type 'quit' to exit.")
    while True:
        msg = input("You: ").strip()
        if not msg:
            continue  # ignore blank input instead of embedding an empty question
        if msg.lower() in {"quit", "exit"}:
            break
        print("Bot:", answer(msg))
That is a complete RAG chatbot in well under sixty lines, with no vector database and no framework. Swap the in-line DOCUMENTS string for the contents of your real files and you have a bot that answers from your knowledge base.
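Swapping in real files could look something like the sketch below. It assumes your knowledge base is a folder of UTF-8 .txt and .md files; `load_documents` is a hypothetical helper and "docs" is a placeholder folder name.

```python
from pathlib import Path


def load_documents(folder: str,
                   suffixes: tuple[str, ...] = (".txt", ".md")) -> dict[str, str]:
    """Read every matching file under folder and return {relative path: text}."""
    root = Path(folder)
    docs: dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            docs[str(path.relative_to(root))] = path.read_text(encoding="utf-8")
    return docs


if __name__ == "__main__":
    # "docs" is a hypothetical folder name; point this at your own files.
    if Path("docs").is_dir():
        for name, text in load_documents("docs").items():
            print(name, len(text.split()), "words")
```

Each file's text can then go through chunk_text on its own, which also lets you keep the file name next to every chunk so an answer can be traced back to its source.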
When to use this vs. alternatives
RAG is one of three ways to make a model speak with your knowledge. Pick by what you actually need to change:
- Use RAG when the model needs your facts — policies, prices, product details, anything that changes or that the model could not have memorised. It is cheap, updates instantly when you re-embed, and lets the model cite the exact passage it used. This is the right default for documentation and support bots.
- Use fine-tuning when you need to change style or format, not facts — a consistent brand voice, a strict output shape, or a behaviour the model resists. Fine-tuning bakes that pattern in, but it requires a training run, is awkward to update, and is a poor way to teach facts that change.
- Use long context (paste everything) only for small, one-off documents — if your whole knowledge fits comfortably in one prompt and rarely changes, skip retrieval and paste it directly. It is the simplest option, but it gets slow and expensive fast and breaks once your documents outgrow the model's context window.
In short: facts that change → RAG; behaviour that persists → fine-tuning; a small fixed document → long context. Most business bots want RAG.
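Because RAG updates only when you re-embed, it can be worth skipping that step when nothing has changed. The sketch below shows one possible approach, caching vectors on disk under a hash of the chunk text and model name. `cached_embeddings` and the `.embed_cache` folder are hypothetical, and the embedding function is passed in so the idea can be tried without an API key.

```python
import hashlib
from collections.abc import Callable
from pathlib import Path

import numpy as np


def cached_embeddings(chunks: list[str],
                      embed_fn: Callable[[list[str]], np.ndarray],
                      model_name: str,
                      cache_dir: str = ".embed_cache") -> np.ndarray:
    """Return embeddings for chunks, reusing a saved array if nothing has changed."""
    # The key covers the model name and every chunk, so any edit gives a new file.
    payload = "\x00".join([model_name, *chunks]).encode("utf-8")
    path = Path(cache_dir) / f"{hashlib.sha256(payload).hexdigest()}.npy"
    if path.exists():
        return np.load(path)  # unchanged text and model: skip the paid call
    vectors = np.asarray(embed_fn(chunks))
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, vectors)
    return vectors
```

With the guide's functions, a call such as cached_embeddings(CHUNKS, embed, EMBED_MODEL) would replace the direct embed(CHUNKS) at startup. Old cache files are never deleted in this sketch, so a real project would want some cleanup.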
Next steps
Now that your bot answers from your data, deepen the surrounding chatbot. Make replies appear word by word with Stream Chatbot Responses with Python, and give it durable, per-user history with Add Memory to a Python Chatbot. To wrap retrieval, memory, and routing in a framework, see Build a Customer Support Chatbot with LangChain.