What is RAG in simple terms?

RAG stands for Retrieval-Augmented Generation. Before the model answers, your code searches your own documents for the few passages most relevant to the question and pastes them into the prompt. The model then answers from that supplied text instead of guessing from general knowledge.

Do I need a vector database to use RAG?

No. For a few hundred passages a plain numpy array of embeddings and a cosine-similarity search is fast and simple. You only need a dedicated vector database once you have thousands of documents or need to update them while the app is running.

Why does the chatbot still make things up after I add RAG?

Usually the retrieved context did not actually contain the answer, or the system prompt did not forbid outside knowledge. Confirm your search returns relevant chunks and instruct the model to answer only from the context and to say it does not know otherwise.

How big should each document chunk be?

Aim for roughly 200 to 500 words per chunk with a small overlap between neighbours. Smaller chunks give sharper matches but more fragments to manage; larger chunks keep context together but dilute the match. Start around 300 words and adjust if answers feel incomplete.

Is RAG cheaper than fine-tuning a model?

For most teams, yes. RAG needs no training run, and you update knowledge by editing files and re-embedding them, which for a small knowledge base typically costs fractions of a cent. Fine-tuning changes the model's style or format but is poorly suited to teaching it new facts that change often.

Connect a Chatbot to Your Docs with RAG

This guide shows you how to make a chatbot answer from your own documents in under thirty minutes, using nothing but the openai SDK and numpy. By the end you will have a runnable Python script that reads your files, finds the passages relevant to any question, and feeds them to the model so it stops guessing and starts citing your facts.

The technique is called RAG, short for Retrieval-Augmented Generation. In plain terms: before the model writes an answer, your code retrieves the few passages from your documents that best match the question, then the model generates its reply using that supplied text. The model is never trained on your data; you simply hand it the right reference material at the moment it answers, the way you might slide an open manual across the desk before asking a colleague a question.

This is one of the guides under Custom AI Chatbot Development. If you have not built a basic bot yet, start there, then come back to make it answer from your knowledge base.

Why RAG instead of just asking the model?

A general model knows nothing about your return policy, your pricing, or last week's release notes, and it will confidently invent an answer rather than admit the gap. You could paste your entire handbook into every prompt, but that is slow, expensive, and eventually too large for the model to read. RAG is the middle path: store your documents once, and at question time attach only the handful of passages that actually matter.

The matching happens through embeddings — numeric fingerprints of text where similar meanings land close together in mathematical space. The question "how long do I have to return something?" and the sentence "Returns are accepted within 30 days" produce vectors pointing in nearly the same direction, even though they share almost no words. That is the whole trick: embeddings match on meaning, not keywords, so customers do not have to phrase questions exactly the way your documents are written.
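
To make "pointing in nearly the same direction" concrete, here is a minimal sketch with made-up three-number vectors. The values are invented for illustration; real embeddings come from the embeddings endpoint and have far more dimensions.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: near 1.0 means same direction, near 0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hand-made stand-ins for embeddings (assumed values, not real model output).
question = np.array([0.9, 0.1, 0.2])   # "how long do I have to return something?"
returns = np.array([0.8, 0.2, 0.1])    # "Returns are accepted within 30 days"
helmets = np.array([0.1, 0.9, 0.3])    # "Kids' helmets are available in red, ..."

print("question vs returns:", round(cosine(question, returns), 2))   # close to 1
print("question vs helmets:", round(cosine(question, helmets), 2))   # much lower
```

The question scores high against the returns sentence and low against the helmets sentence even though no keyword matching is involved; the same comparison is what the search step later runs against every chunk.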

RAG in one picture: embed the question, search your embedded chunks, hand the best matches plus the question to the model.

Prerequisites

You need Python 3.10 or newer (python3 --version to check) and an OpenAI API key. This guide assumes you have already completed the setup in the parent section; if not, Custom AI Chatbot Development covers installing the SDK and storing your key.

Work inside a virtual environment and install the SDK, with its HTTP client httpx pinned, plus the two libraries this guide adds, python-dotenv and numpy:

python3 -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate
pip install "openai>=1.40" "httpx>=0.27" python-dotenv numpy

Store your key and model names in a .env file so they never end up in your code:

# .env
OPENAI_API_KEY=sk-your-real-key-here
CHAT_MODEL=gpt-4o-mini
EMBED_MODEL=text-embedding-3-small

Add .env to your .gitignore immediately so the key is never committed:

echo ".env" >> .gitignore

If a request later fails with an authentication error, Fix the 401 Unauthorized Error in OpenAI Python covers every cause.
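
A missing or blank key is easier to diagnose before the first API call than after it. The sketch below is one possible pre-flight check; the helper name missing_settings is hypothetical, and in the real script you would call it with no argument right after load_dotenv().

```python
import os
from collections.abc import Mapping


def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required settings that are absent or blank."""
    required = ["OPENAI_API_KEY"]   # the model names have defaults, so only the key is required
    return [name for name in required if not env.get(name, "").strip()]


# Demonstration with a plain dict standing in for the environment.
problems = missing_settings({"OPENAI_API_KEY": "", "CHAT_MODEL": "gpt-4o-mini"})
if problems:
    print("Missing settings:", ", ".join(problems))
```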

Step 1: Split your documents into chunks

You cannot embed a whole 50-page handbook as one vector — the meaning would blur into mush and every search would return the entire document. Instead you chunk it: split the text into passages of a few hundred words each, with a small overlap so a sentence split across a boundary still appears whole in one chunk.

The function below splits on words and slides a window forward, leaving an overlap between neighbours. It works on any plain string, so you can feed it the contents of a .txt or .md file.

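def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks."""
    if not 0 <= overlap < chunk_size:
        # otherwise the window would never move forward
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break                       # window reached the end; skip a redundant tail chunk
        start += chunk_size - overlap   # step forward, keeping an overlap
    return chunks


with open("handbook.txt", encoding="utf-8") as f:
    sample = f.read()
chunks = chunk_text(sample)
print(f"Split into {len(chunks)} chunks")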

The overlap matters: without it, a fact that lands on a chunk boundary gets cut in half and may never match cleanly. Fifty words is a safe default. If your documents are already short and self-contained — like FAQ entries or product blurbs — skip chunking and treat each entry as one chunk.
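
For the short, self-contained case, one entry per chunk can be as simple as splitting on blank lines. This is a minimal sketch that assumes entries are separated by an empty line; the helper name split_entries and the sample FAQ text are made up for illustration.

```python
import re


def split_entries(text: str) -> list[str]:
    """Treat each blank-line-separated block as one chunk."""
    blocks = re.split(r"\n\s*\n", text)
    # Collapse internal line breaks to single spaces and drop empty blocks.
    return [" ".join(block.split()) for block in blocks if block.strip()]


faq = """Q: What is the return window?
A: 30 days with a valid receipt.

Q: Do you deliver?
A: Free local delivery on orders over $75.
"""

entries = split_entries(faq)
print(len(entries))
for entry in entries:
    print("-", entry)
```

Each question stays together with its answer, so a match on the question brings the answer along in the same chunk.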

Step 2: Create embeddings for each chunk

Now turn every chunk into a vector. You send your chunks to the embeddings endpoint and get back one list of numbers per chunk. You do this once at startup (or whenever your documents change) and keep the result in memory; embedding the whole knowledge base takes time and is billed per token, so you never want to repeat it per question.

import os
import numpy as np
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI()
EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-3-small")


def embed(texts: list[str]) -> np.ndarray:
    """Return a 2-D array: one embedding row per input text."""
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([item.embedding for item in resp.data])


chunk_vectors = embed(chunks)        # embed the whole knowledge base once
print(chunk_vectors.shape)           # e.g. (42, 1536) -> 42 chunks, 1536 numbers each

The shape tells you everything: one row per chunk, and a fixed number of columns (1536 for text-embedding-3-small) that is the same for every piece of text. Because the width is fixed, you can compare questions and documents in the same space directly. The endpoint accepts a list of inputs, so you can embed many chunks in a single call, far faster than looping one at a time; requests are capped in both the number of inputs and the total tokens, so split a large knowledge base into batches.
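
If the knowledge base outgrows a single request, batching might look like the sketch below. The batch size of 256 is an arbitrary assumption, not an API limit, and fake_embed is a stand-in so the example runs offline; in the real script you would pass the embed function from above instead.

```python
from typing import Callable, Iterator

import numpy as np


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], np.ndarray],
    batch_size: int = 256,   # assumed value; tune to your chunk sizes
) -> np.ndarray:
    """Embed texts batch by batch and stack the rows in the original order."""
    if not texts:
        return np.empty((0, 0))
    return np.vstack([embed_fn(batch) for batch in batched(texts, batch_size)])


def fake_embed(texts: list[str]) -> np.ndarray:
    """Offline stand-in for the real embed(): two made-up numbers per text."""
    return np.array([[float(len(t)), 1.0] for t in texts])


vectors = embed_in_batches([f"chunk {i}" for i in range(10)], fake_embed, batch_size=4)
print(vectors.shape)   # (10, 2)
```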

Step 3: Search the vectors for the question

To answer a question, embed it the same way, then measure which chunk vectors point most nearly the same direction as the question vector. The standard measure is cosine similarity: it scores two vectors from -1 (opposite) to 1 (identical direction) and ignores their magnitude, so only the direction, which carries the meaning, affects the ranking. You keep the highest-scoring top_k chunks.

def retrieve(question: str, chunks: list[str], chunk_vectors: np.ndarray,
             top_k: int = 3) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = embed([question])[0]
    # cosine similarity = dot product of length-normalised vectors
    scores = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    best = scores.argsort()[::-1][:top_k]   # indices of the highest scores
    return [chunks[i] for i in best]


hits = retrieve("How long do I have to return an item?", chunks, chunk_vectors)
for h in hits:
    print("-", h[:80], "...")

scores.argsort()[::-1] sorts the indices from lowest to highest score and then reverses them, so the best matches come first; the [:top_k] slice keeps only as many as you asked for. This is a brute-force search that compares the question against every chunk. That sounds expensive, but for a few hundred or even a few thousand chunks numpy does it in a blink. Reach for a dedicated vector store only when your collection grows into the tens of thousands.
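
The retrieve function above recomputes every chunk's norm on each question. One optional refinement, sketched here under the assumption that no chunk vector is all zeros, is to normalise the rows once at startup so each search is a single matrix-vector product. The helper names and toy vectors are illustrative only.

```python
import numpy as np


def normalise_rows(matrix: np.ndarray) -> np.ndarray:
    """Scale every row to length 1 (assumes no all-zero rows)."""
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)


def top_k_indices(unit_chunks: np.ndarray, q: np.ndarray, top_k: int = 3) -> np.ndarray:
    """Indices of the top_k rows most similar to q, best first."""
    scores = unit_chunks @ (q / np.linalg.norm(q))   # cosine, since rows are unit length
    return scores.argsort()[::-1][:top_k]


toy_vectors = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
unit = normalise_rows(toy_vectors)                   # do this once at startup
print(top_k_indices(unit, np.array([1.0, 0.2]), top_k=2))   # [0 2]
```

Row 0 points almost exactly where the query does, row 2 sits at a diagonal, and row 1 is nearly perpendicular, so the top two are rows 0 and 2 regardless of how long the original vectors were.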

Step 4: Inject the top-k chunks into the prompt

The final step joins the retrieved chunks into a context block and pastes it into the system prompt, with a strict instruction to answer only from that text. This is what turns "the model's best guess" into "an answer grounded in your documents."

def answer(question: str, chunks: list[str], chunk_vectors: np.ndarray) -> str:
    context = "\n\n".join(retrieve(question, chunks, chunk_vectors))
    response = client.chat.completions.create(
        model=os.getenv("CHAT_MODEL", "gpt-4o-mini"),
        messages=[
            {"role": "system", "content": (
                "You answer questions using ONLY the context below. "
                "If the answer is not in the context, say you don't know.\n\n"
                f"Context:\n{context}"
            )},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return response.choices[0].message.content


print(answer("How long do I have to return an item?", chunks, chunk_vectors))

Two details carry the reliability. First, temperature=0 makes the output as repeatable and conservative as the model allows rather than creative, which is what you want when it should be reading off your documents; it does not by itself guarantee correct answers. Second, the "answer ONLY from the context" instruction is the single line that stops most invented answers; without it the model will happily blend your context with whatever it half-remembers. Building strict, format-controlling system prompts is a skill in itself, covered in Write System Prompts that Control Output Format.
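
If you also want the model to point at the passage it used, one possible variation is to number the retrieved chunks and ask for the number in the reply. This is a sketch, not part of the script above: build_context and build_system_prompt are hypothetical helpers, and the extra "cite by number" sentence is an assumption about wording that you may need to tune.

```python
def build_context(chunks: list[str]) -> str:
    """Label each retrieved chunk so the model can refer to it by number."""
    return "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))


def build_system_prompt(chunks: list[str]) -> str:
    """Same strict instruction as above, plus a request to cite the passage."""
    return (
        "You answer questions using ONLY the context below. "
        "Cite the passage you used by its number, like [1]. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{build_context(chunks)}"
    )


print(build_system_prompt([
    "Returns are accepted within 30 days of purchase with a valid receipt.",
    "Refunds are issued to the original payment method within 5 business days.",
]))
```

The string returned by build_system_prompt would replace the system message content in the chat call; the user message stays the plain question.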

Parameter quick reference

These three knobs control retrieval quality. Tune chunk_size and top_k first; the embedding model rarely needs changing.

Parameter	Typical value	Effect
`chunk_size`	200-500 words	Smaller chunks give sharper, more precise matches; larger chunks keep related sentences together but dilute relevance.
`top_k`	3-5 chunks	How many passages to inject. More gives the model fuller context but costs tokens and can bury the key fact; fewer is cheaper but risks missing the answer.
`EMBED_MODEL`	`text-embedding-3-small`	The cheap, fast default fits almost every use. Switch to `text-embedding-3-large` only if matches are noticeably weak and the extra cost is justified.
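
To see how chunk_size and top_k trade off against prompt size, a back-of-the-envelope estimate can help. The sketch below assumes roughly 1.33 tokens per English word, which is only a rule of thumb; the real count depends on the model's tokenizer and your text.

```python
def estimate_context_tokens(chunk_size: int, top_k: int,
                            tokens_per_word: float = 1.33) -> int:
    """Rough size of the injected context, in tokens (rule-of-thumb ratio)."""
    return round(chunk_size * top_k * tokens_per_word)


for chunk_size, top_k in [(200, 5), (300, 3), (500, 5)]:
    tokens = estimate_context_tokens(chunk_size, top_k)
    print(f"chunk_size={chunk_size}, top_k={top_k} -> about {tokens} context tokens")
```

Every question pays for that context again, so the estimate is a quick way to compare settings before measuring answer quality.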

Troubleshooting

The bot answers "I don't know" when the answer clearly exists — Retrieval missed the right chunk. Cause: chunks too large so the relevant fact was diluted, or top_k too low. Fix: shrink chunk_size toward 200 words and raise top_k to 5, then print the retrieved chunks to confirm the fact is in them.
openai.BadRequestError about input length on embeddings.create — A single chunk is too long for the embedding model. Cause: a document with no whitespace or an overly large chunk_size. Fix: lower chunk_size, and confirm chunk_text actually split the text rather than returning one giant chunk.
All similarity scores look almost identical — Your chunks are too similar or too generic to tell apart. Cause: boilerplate text repeated across passages, or chunks so large every one touches every topic. Fix: chunk more finely and strip repeated headers or footers before embedding.
BadRequestError: maximum context length when answering — The injected context plus the question is too long for the chat model. Cause: top_k too high or chunks too large. Fix: lower top_k, shrink chunk_size, or read Fix the Context-Length-Exceeded Error in Python.
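
Most of these fixes start with looking at what retrieval actually returned. The sketch below is a hypothetical debugging helper that shows each retrieved chunk next to its score; it takes an already-embedded question vector so it runs offline with toy data, and in the real script you would pass embed([question])[0] along with your own chunks and chunk_vectors.

```python
import numpy as np


def rank_chunks(q: np.ndarray, chunks: list[str], chunk_vectors: np.ndarray,
                top_k: int = 5) -> list[tuple[float, str]]:
    """Return (score, chunk) pairs for the top_k matches, best first."""
    scores = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    best = scores.argsort()[::-1][:top_k]
    return [(float(scores[i]), chunks[i]) for i in best]


# Toy data standing in for real chunks and embeddings (assumed values).
toy_chunks = ["Returns are accepted within 30 days.",
              "Kids' helmets come in red, blue, and matte black.",
              "Free local delivery on orders over $75."]
toy_vectors = np.array([[0.9, 0.2], [0.1, 0.9], [0.5, 0.5]])

ranked = rank_chunks(np.array([1.0, 0.1]), toy_chunks, toy_vectors)
for score, text in ranked:
    print(f"{score:.3f}  {text[:60]}")
print("score spread:", round(ranked[0][0] - ranked[-1][0], 3))
```

A tiny spread between the best and worst score suggests the "all scores look identical" problem; a healthy spread with the wrong chunk on top points at chunking or top_k instead.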

Worked example: a runnable RAG bot

This script ties all four steps into one program. It chunks a small in-line knowledge base, embeds it once, then answers questions from the terminal grounded in those chunks. Save it as rag_bot.py, make sure your .env is in place, and run python rag_bot.py.

import os
import numpy as np
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(timeout=20.0)
CHAT_MODEL = os.getenv("CHAT_MODEL", "gpt-4o-mini")
EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-3-small")

# A tiny knowledge base. In a real app, load these from your files.
DOCUMENTS = """
Returns are accepted within 30 days of purchase with a valid receipt.
Refunds are issued to the original payment method within 5 business days.
Kids' helmets are available in red, blue, and matte black, sizes XS to L.
Free local delivery applies to all orders over $75 within the city.
Our workshop offers free safety checks every Saturday from 9am to noon.
"""


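def chunk_text(text: str, chunk_size: int = 15, overlap: int = 3) -> list[str]:
    # Small sizes so the ~60-word sample above yields several chunks and
    # retrieval has something to choose between; use ~300/50 for real files.
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break                      # final window reached the end of the text
        start += chunk_size - overlap
    return chunks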


def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([item.embedding for item in resp.data])


CHUNKS = chunk_text(DOCUMENTS)          # split the knowledge base
CHUNK_VECTORS = embed(CHUNKS)           # embed it once at startup


def retrieve(question: str, top_k: int = 3) -> str:
    q = embed([question])[0]
    scores = CHUNK_VECTORS @ q / (np.linalg.norm(CHUNK_VECTORS, axis=1) * np.linalg.norm(q))
    best = scores.argsort()[::-1][:top_k]
    return "\n\n".join(CHUNKS[i] for i in best)


def answer(question: str) -> str:
    context = retrieve(question)
    response = client.chat.completions.create(
        model=CHAT_MODEL,
        messages=[
            {"role": "system", "content": "Answer using ONLY the context below. "
             "If it is not in the context, say you don't know.\n\nContext:\n" + context},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print("Docs bot ready. Type 'quit' to exit.")
    while True:
        msg = input("You: ").strip()
        if msg.lower() in {"quit", "exit"}:
            break
        print("Bot:", answer(msg))

That is a complete RAG chatbot in about sixty lines, with no vector database and no framework. Swap the in-line DOCUMENTS string for the contents of your real files and you have a bot that answers from your knowledge base.
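
Loading real files could look like the sketch below. It assumes your documents are UTF-8 .txt or .md files in a single folder; the helper name load_documents is hypothetical, and the demonstration writes two throwaway files to a temporary folder so it runs anywhere.

```python
import tempfile
from pathlib import Path


def load_documents(folder: str, patterns: tuple[str, ...] = ("*.txt", "*.md")) -> dict[str, str]:
    """Read every matching file in `folder` into a {filename: text} dict."""
    docs: dict[str, str] = {}
    for pattern in patterns:
        for path in sorted(Path(folder).glob(pattern)):
            docs[path.name] = path.read_text(encoding="utf-8")
    return docs


# Demonstration with a temporary folder standing in for your docs directory.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "returns.txt").write_text("Returns are accepted within 30 days.", encoding="utf-8")
    Path(folder, "delivery.md").write_text("Free local delivery over $75.", encoding="utf-8")
    docs = load_documents(folder)

print(sorted(docs))
```

In rag_bot.py you would then build the chunk list from every file, for example CHUNKS = [c for text in docs.values() for c in chunk_text(text)], and embed that list once at startup as before.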

When to use this vs. alternatives

RAG is one of three ways to make a model speak with your knowledge. Pick by what you actually need to change:

Use RAG when the model needs your facts — policies, prices, product details, anything that changes or that the model could not have memorised. It is cheap, updates instantly when you re-embed, and lets the model cite the exact passage it used. This is the right default for documentation and support bots.
Use fine-tuning when you need to change style or format, not facts — a consistent brand voice, a strict output shape, or a behaviour the model resists. Fine-tuning bakes that pattern in, but it requires a training run, is awkward to update, and is a poor way to teach facts that change.
Use long context (paste everything) only for small, one-off documents — if your whole knowledge fits comfortably in one prompt and rarely changes, skip retrieval and paste it directly. It is the simplest option, but it gets slow and expensive fast and breaks once your documents outgrow the model's context window.

In short: facts that change → RAG; behaviour that persists → fine-tuning; a small fixed document → long context. Most business bots want RAG.

Next steps

Now that your bot answers from your data, deepen the surrounding chatbot. Make replies appear word by word with Stream Chatbot Responses with Python, and give it durable, per-user history with Add Memory to a Python Chatbot. To wrap retrieval, memory, and routing in a framework, see Build a Customer Support Chatbot with LangChain.

