A chatbot you build yourself does things the drag-and-drop platforms never will: it answers from your documents, follows your tone, plugs into your tools, and never charges you per seat. The catch is that most tutorials jump straight to heavyweight frameworks and leave you with a black box you cannot debug. This guide does the opposite. You will build a working chatbot from a single API call, then add the three features that separate a toy from something you can put in front of customers: memory, retrieval, and graceful error handling.
You do not need a machine learning background. The model you will use is already trained and hosted by a provider; your job is to send it the right messages and handle the replies. If you can write a for loop and read a dictionary, you can finish this guide. By the end you will have a runnable chatbot script and a clear map of which child guide to read next for each feature you want to deepen.
This is one section of Building AI-Powered Business Applications. If you are brand new to calling models from Python, read Understanding LLM APIs first to see how requests, keys, and responses fit together.
What a chatbot actually does
Before the code, hold one picture in your head. A chatbot is a loop. The user sends a message, your Python app gathers some context (the conversation so far, maybe a few relevant document snippets), sends all of it to the model, and shows the reply back to the user. Everything else in this guide is just making each part of that loop smarter.
The single most important thing to understand is that the model is stateless. It has no memory between calls and no live connection to your business. Each API request is a clean slate: it sees only the text you send in that one request and nothing else. That sounds like a limitation, but it is actually freeing, because it means every "smart" feature reduces to the same job — deciding what text to put into the next request. Memory is text you re-send. Retrieval is text you look up and attach. Tools are descriptions of functions you include. Once that clicks, the rest of this guide is mechanical.
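To make "every feature is text you put into the next request" concrete, here is a minimal sketch of a helper that assembles one request. It is illustrative only: `build_messages` and its parameters are hypothetical names, not part of any SDK, and the layout of the context block is an assumption.

```python
def build_messages(system_prompt, history, snippets, user_text):
    """Assemble the message list for one stateless request.

    Hypothetical helper for illustration: memory is the re-sent history,
    retrieval is the snippets attached to the system message.
    """
    system_content = system_prompt
    if snippets:
        # Retrieval: looked-up text is attached to the prompt.
        system_content += "\n\nContext:\n" + "\n".join(snippets)
    messages = [{"role": "system", "content": system_content}]
    messages.extend(history)  # Memory: earlier turns, sent again.
    messages.append({"role": "user", "content": user_text})
    return messages


example = build_messages(
    system_prompt="You are a support assistant for a bike shop.",
    history=[
        {"role": "user", "content": "Do you sell helmets for kids?"},
        {"role": "assistant", "content": "Yes, we do."},
    ],
    snippets=["Kids' helmets come in red, blue, and matte black."],
    user_text="What colours do they come in?",
)
for message in example:
    print(message["role"], "->", message["content"][:40])
```

Nothing here talks to a model. The point is that the list you would pass as `messages` is ordinary Python data that you build fresh on every turn.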
Prerequisites
You need Python 3.10 or newer. Check with python3 --version. If that command fails or shows an older version, follow Setting Up Python for AI first.
Work inside a virtual environment so this project's packages stay isolated from the rest of your system. If virtual environments are new to you, Create a Python Virtual Environment for AI walks through it.
```bash
python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install "openai>=1.40" "httpx>=0.27" python-dotenv numpy
```
You also need an API key from your model provider. Create a .env file in your project folder to hold it so it never gets pasted into code:
```
# .env
OPENAI_API_KEY=sk-your-real-key-here
CHAT_MODEL=gpt-4o-mini
EMBED_MODEL=text-embedding-3-small
```
Add .env to your .gitignore immediately so your secret key is never committed to version control. One leaked key can run up a real bill.
```bash
echo ".env" >> .gitignore
```
If a request later fails with an authentication error, Fix the 401 Unauthorized Error in OpenAI Python covers every cause.
Step 1: Send your first chat message
Start with the smallest possible chatbot: one message in, one reply out. The openai SDK reads your key from the environment automatically, so you only describe what you want, not how to authenticate.
Two ideas to know. The system message sets the bot's personality and rules; the user never sees it. The user message is what the person typed. You send both as a list of dictionaries, each with a role and content.
```python
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model=os.getenv("CHAT_MODEL", "gpt-4o-mini"),
    messages=[
        {"role": "system", "content": "You are a concise, friendly support assistant for a bike shop."},
        {"role": "user", "content": "Do you sell helmets for kids?"},
    ],
    temperature=0.3,
)
print(response.choices[0].message.content)
```
Run it. You should see a short, on-brand answer. The reply lives at response.choices[0].message.content — that nesting trips up everyone once, so commit it to memory. The choices part is a list because the model can return several alternative answers in one call; you almost always want the first one. temperature=0.3 keeps the answer focused; raise it toward 1.0 for more creative, varied wording.
Why a small model like gpt-4o-mini? For a support bot, speed and cost matter more than raw reasoning power, and a small model answers a grounded question just as well as a large one for a fraction of the price. Start small; only reach for a bigger model if you see the bot fumbling genuinely hard reasoning. If you are still deciding which provider and model to start with, Best Free AI APIs for Beginners compares the options without commitment.
The system prompt does a lot of heavy lifting here — it is where you set tone, scope, and the rules the bot must never break. A vague system prompt produces a vague, rambling bot, so be specific about what it should and should not do. To learn how to make it reliably enforce tone, format, and refusals, read Write System Prompts that Control Output Format.
Step 2: Add conversation memory
The call above forgets everything the instant it returns. Ask "what colours does it come in?" next and the model has no idea what "it" means. The fix is simple: keep a growing list of messages and resend the whole list every turn. The model has no hidden memory of its own — you are the memory.
```python
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI()

# The running history. The system message stays pinned at the front.
messages = [
    {"role": "system", "content": "You are a concise, friendly support assistant for a bike shop."},
]


def ask(user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(
        model=os.getenv("CHAT_MODEL", "gpt-4o-mini"),
        messages=messages,
        temperature=0.3,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})  # remember the answer too
    return reply


print(ask("Do you sell helmets for kids?"))
print(ask("What colours does it come in?"))  # 'it' now resolves correctly
```
Notice that you append both the user message and the assistant's reply. If you only stored user turns, the bot would lose its own answers and contradict itself. This list-based approach is everything memory really is. The catch is that the list grows forever, and every model has a limit on how much text it can read at once. When a long chat eventually hits that wall you will see a context-length error; Fix the Context-Length-Exceeded Error in Python shows how to trim or summarise old turns. For production-grade strategies like storing history in a database keyed by session, see Add Memory to a Python Chatbot.
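As a rough illustration of the "keep only recent turns" idea, the sketch below trims a history list before it is sent. `trim_history` and `max_messages` are hypothetical names, and counting messages rather than tokens is a simplifying assumption; a real limit is measured in tokens.

```python
def trim_history(messages, max_messages=6):
    """Keep the system message plus the most recent messages.

    Hypothetical helper: max_messages counts messages, not tokens.
    """
    system = [m for m in messages if m["role"] == "system"][:1]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-max_messages:] if max_messages > 0 else []
    # Avoid starting mid-exchange: drop leading replies whose question was cut off.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return system + recent


# Build a fake conversation of five question/answer pairs.
history = [{"role": "system", "content": "You are a support assistant."}]
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_messages=4)
print(len(history), "->", len(trimmed))
```

In the `ask` function you could pass `trim_history(messages)` to the API instead of the full list. The system message always survives, so the bot keeps its rules even when old turns are dropped.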
Step 3: Ground answers in your own data with retrieval
A general model knows nothing about your return policy or your product catalogue, and it will happily invent an answer rather than admit it. Retrieval fixes this. Before answering, you find the snippets of your own documents most relevant to the question and paste them into the prompt as context. This pattern is called RAG — Retrieval-Augmented Generation.
The matching step uses embeddings: numeric fingerprints of text where similar meanings sit close together in mathematical space. The phrase "kids' helmet colours" and the sentence "Kids' helmets come in red, blue, and matte black" produce vectors that point in nearly the same direction, even though they share few exact words. That is the magic — embeddings match on meaning, not keywords, so a customer does not have to phrase their question exactly the way your document is written.
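To see what "point in nearly the same direction" means numerically, here is the cosine similarity arithmetic in plain Python. The three-dimensional vectors are made up for illustration; real embeddings come from the embedding model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings (assumption for illustration).
question = [0.9, 0.1, 0.0]      # "kids' helmet colours"
helmet_doc = [0.8, 0.2, 0.1]    # the helmet colours sentence
delivery_doc = [0.0, 0.2, 0.9]  # an unrelated delivery sentence

helmet_score = cosine_similarity(question, helmet_doc)
delivery_score = cosine_similarity(question, delivery_doc)
print(helmet_score > delivery_score)
```

A score near 1 means the two vectors point almost the same way, and a score near 0 means they are unrelated. Ranking documents by this score is all that retrieval does at this scale.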
The workflow is three steps: embed your documents once and keep the vectors, embed the user's question at query time, then measure which document vectors point most nearly the same way as the question vector. Here is a minimal, dependency-light version using numpy for the similarity maths so you can see exactly what happens with no framework hiding the logic.
```python
import os

import numpy as np
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI()
EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-3-small")

# Your knowledge base. In a real app these come from files or a database.
docs = [
    "Returns are accepted within 30 days with a receipt.",
    "Kids' helmets come in red, blue, and matte black.",
    "We offer free local delivery on orders over $75.",
]


def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])


doc_vectors = embed(docs)  # embed the knowledge base once at startup


def retrieve(question: str, top_k: int = 2) -> list[str]:
    q_vector = embed([question])[0]
    # cosine similarity = dot product of unit-normalised vectors
    scores = doc_vectors @ q_vector / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vector)
    )
    best = scores.argsort()[::-1][:top_k]
    return [docs[i] for i in best]


def answer_with_context(question: str) -> str:
    context = "\n".join(retrieve(question))
    response = client.chat.completions.create(
        model=os.getenv("CHAT_MODEL", "gpt-4o-mini"),
        messages=[
            {"role": "system", "content": (
                "Answer using only the context below. "
                "If the answer is not in the context, say you don't know.\n\n"
                f"Context:\n{context}"
            )},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return response.choices[0].message.content


print(answer_with_context("What colours do the kids helmets come in?"))
```
The system prompt does the safety work: it tells the model to answer only from the context and to admit ignorance otherwise. That single instruction cuts invented answers sharply, although it cannot rule them out entirely. For a few documents, numpy is plenty; once you have thousands of snippets you will want a proper vector store. The full pipeline (chunking files, persisting vectors, and scaling search) lives in Connect a Chatbot to Your Docs with RAG.
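For a sense of what chunking involves before you reach the dedicated guide, here is a minimal sketch that splits a long text into overlapping word-based pieces. `chunk_text`, `max_words`, and `overlap` are hypothetical names, and the sizes are arbitrary starting values rather than recommendations.

```python
def chunk_text(text, max_words=50, overlap=10):
    """Split text into overlapping word-based chunks.

    Hypothetical helper; the overlap keeps sentences that straddle a
    boundary visible in both neighbouring chunks.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(120))  # a fake 120-word document
pieces = chunk_text(sample, max_words=50, overlap=10)
print(len(pieces))
```

Each chunk would then be embedded and stored exactly like the short strings in `docs`, so the retrieval code itself does not change.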
Step 4: Handle errors and rate limits
A chatbot that crashes on the first network hiccup is not ready for anyone. Three things go wrong in the real world: the network drops, the provider is briefly overloaded (a rate limit), and the model returns something your code did not expect. Wrap your call so a single failure degrades into a polite message instead of a stack trace, and retry the temporary ones.
```python
import os
import time

from dotenv import load_dotenv
from openai import OpenAI, RateLimitError, APIError

load_dotenv()
client = OpenAI(timeout=20.0)  # never wait forever on a slow response


def safe_chat(messages: list[dict], retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model=os.getenv("CHAT_MODEL", "gpt-4o-mini"),
                messages=messages,
                temperature=0.3,
            )
            return response.choices[0].message.content
        except RateLimitError:
            wait = 2 ** attempt  # back off: 1s, 2s, 4s
            print(f"Rate limited, retrying in {wait}s...")
            time.sleep(wait)
        except APIError as exc:
            print(f"API error: {exc}")
            break
    return "Sorry, I'm having trouble right now. Please try again in a moment."
```
The 2 ** attempt pattern is exponential backoff: each retry waits longer (one second, then two, then four), giving the provider room to recover instead of hammering it with instant retries that would only deepen the overload. Notice that the two except branches behave differently. A RateLimitError is temporary, so you retry it. Any other APIError ends the loop and the user gets the fallback message, because repeating a request the provider has rejected only loops uselessly. Be aware that in the openai SDK, APIError is the base class for connection failures and timeouts as well as rejected requests, so this version does not retry a dropped connection itself; the client already retries some transient failures on its own (twice by default, adjustable with the max_retries option). Setting timeout=20.0 on the client means a stuck request fails fast rather than freezing your whole app while one user waits on a hung connection. If rate limits become a regular problem rather than a rare blip, Fix the 429 Rate-Limit Error in Python explains the limits and how to stay under them, and Rate-Limit AI API Calls in a SaaS with Python shows how to throttle your own users fairly.
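The backoff schedule is easier to trust once you can run it without a network. The sketch below is a generic version of the same idea, assuming a hypothetical `TransientError` as a stand-in for a rate limit and a hypothetical `call_with_backoff` helper; the `sleep` parameter exists only so the delays can be recorded instead of waited out.

```python
import time


class TransientError(Exception):
    """Stand-in for a temporary failure such as a rate limit (hypothetical)."""


def call_with_backoff(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on TransientError with exponential backoff.

    Returns func()'s result, or re-raises after the final attempt.
    """
    for attempt in range(retries):
        try:
            return func()
        except TransientError:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller decide what to show
            sleep(base_delay * 2 ** attempt)  # delay doubles on every retry


# Demonstration with a fake call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("simulated rate limit")
    return "ok"


delays = []
result = call_with_backoff(flaky, sleep=delays.append)  # record delays, do not wait
print(result, delays)
```

One design difference from `safe_chat`: this sketch does not sleep after the last failed attempt, since there is nothing left to wait for. It also re-raises instead of returning a fallback string, which leaves the wording of the apology to the caller.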
Plain SDK or LangChain?
You have now built a complete chatbot with nothing but the openai SDK, numpy, and Python lists. That is deliberate: you can see and debug every moving part, and there is no framework version churn to chase. For most business bots, this is all you ever need.
LangChain earns its place when your loop grows complicated — when the bot must decide between several tools, chain many steps, swap between providers behind one interface, or use prebuilt retrieval and memory components so you write less plumbing. The trade is a steeper learning curve and more abstraction between you and the API, which can make bugs harder to trace. A practical rule: start with the plain SDK as shown here, and reach for LangChain only when you feel yourself rebuilding its features by hand. The fully framework-based version, with routing and fallbacks, is covered in Build a Customer Support Chatbot with LangChain.
Parameter reference
These are the settings you will reach for most when calling the chat endpoint. Tune temperature first; leave the rest at their defaults until you have a reason to change them.
| Parameter | Type | Default | Effect |
|---|---|---|---|
| `model` | str | none (required) | Which model answers. gpt-4o-mini is cheap and fast; larger models reason better at higher cost. |
| `messages` | list[dict] | none (required) | The full conversation: system, user, and assistant turns in order. |
| `temperature` | float | 1.0 | Randomness. 0 gives the most repeatable, focused output; 1.0 and above gives more varied wording. Use low values for support bots. |
| `max_tokens` | int | model max | Hard cap on reply length. Set it to control cost and stop runaway answers. |
| `top_p` | float | 1.0 | Alternative to temperature that limits word choice to the most likely options. Change one, not both. |
| `timeout` | float | 600 (SDK default, 10 minutes) | Seconds to wait before giving up on a request. This is an SDK option rather than a model setting; set it on the client so a slow call cannot hang your app. |
| `stream` | bool | False | When True, tokens arrive as they are generated for a live typing effect. |
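To show what `stream=True` changes on the receiving side, here is a sketch of a loop that consumes streamed chunks. It assumes each chunk exposes its text at `chunk.choices[0].delta.content`, which is how the openai SDK shapes streamed chat chunks as far as this guide's version range goes. `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake objects exist only so the loop can run without an API call.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_text=print):
    """Pass each piece of text to on_text as it arrives; return the full reply."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks may carry no choices at all
        text = chunk.choices[0].delta.content
        if text:  # content can be None, for example on the final chunk
            on_text(text)
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like one streamed chunk (test data only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk(None), fake_chunk("Yes, "), fake_chunk("we do."), fake_chunk(None)]
shown = []
full_reply = collect_stream(fake_stream, on_text=shown.append)
print(full_reply)
```

With a real client you would pass the object returned by `client.chat.completions.create(..., stream=True)` as `chunks`, and use something like `lambda t: print(t, end="", flush=True)` as `on_text` to get the typing effect. The joined string is what you append to the history.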
Troubleshooting
- `AuthenticationError: No API key provided`: Your key is not loaded. Cause: `load_dotenv()` was not called, or `.env` is in the wrong folder. Fix: call `load_dotenv()` before creating the client and run your script from the folder that holds `.env`.
- `RateLimitError: 429`: You sent requests faster than your plan allows, or you are out of credit. Cause: a loop firing calls with no pause, or an empty balance. Fix: add the exponential backoff from Step 4 and check your billing dashboard.
- `BadRequestError: maximum context length`: The conversation plus context is too long for the model. Cause: an ever-growing `messages` list. Fix: keep only the last several turns, or summarise older ones into one short note.
- `AttributeError: 'NoneType' object has no attribute ...`: You read the reply from the wrong place. Cause: the reply is at `response.choices[0].message.content`, not `response.content`. Fix: use the full path.
- The bot answers from general knowledge instead of your documents: Retrieval context was empty or ignored. Cause: no snippets matched, or the system prompt did not forbid outside answers. Fix: confirm `retrieve()` returns text and add "answer only from the context" to the system message.
- `httpx.ReadTimeout`: The request took longer than your timeout. Cause: a large request or a slow network. Fix: raise the client `timeout`, shorten the prompt, or switch to a faster model.
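Several of these problems trace back to settings that never loaded, so it can help to check them before the first request. The sketch below is one possible check; `missing_settings` is a hypothetical helper, and the `env` parameter exists only so the demonstration does not depend on your real environment.

```python
import os


def missing_settings(required=("OPENAI_API_KEY",), env=None):
    """Return the names of required settings that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Demonstration with explicit dictionaries instead of the real environment.
loaded = missing_settings(env={"OPENAI_API_KEY": "sk-example"})
blank = missing_settings(env={"OPENAI_API_KEY": "  "})
print(loaded, blank)
```

In a script you would call `missing_settings()` right after `load_dotenv()` and stop with a clear message if the list is not empty, rather than waiting for the first API call to fail.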
Worked example: a complete chatbot
This script ties every step together into one runnable program: it loads a small knowledge base, retrieves relevant context per question, remembers the conversation, and survives errors with retries. Save it as chatbot.py, make sure your .env is in place, and run python chatbot.py.
```python
import os
import time

import numpy as np
from dotenv import load_dotenv
from openai import OpenAI, RateLimitError

load_dotenv()                                   # load keys from .env
client = OpenAI(timeout=20.0)                   # fail fast on slow calls
CHAT_MODEL = os.getenv("CHAT_MODEL", "gpt-4o-mini")
EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-3-small")

KNOWLEDGE = [                                   # your facts live here
    "Returns are accepted within 30 days with a receipt.",
    "Kids' helmets come in red, blue, and matte black.",
    "Free local delivery applies to orders over $75.",
]


def embed(texts: list[str]) -> np.ndarray:      # turn text into vectors
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])


DOC_VECTORS = embed(KNOWLEDGE)                  # embed knowledge once


def retrieve(question: str, top_k: int = 2) -> str:  # find relevant facts
    q = embed([question])[0]
    scores = DOC_VECTORS @ q / (np.linalg.norm(DOC_VECTORS, axis=1) * np.linalg.norm(q))
    return "\n".join(KNOWLEDGE[i] for i in scores.argsort()[::-1][:top_k])


history = [{"role": "system", "content": "You are a concise bike-shop assistant. "
            "Answer only from the provided context; otherwise say you don't know."}]


def reply(question: str, retries: int = 3) -> str:  # one full turn
    context = retrieve(question)
    history.append({"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"})
    for attempt in range(retries):
        try:
            out = client.chat.completions.create(model=CHAT_MODEL, messages=history, temperature=0)
            answer = out.choices[0].message.content
            history.append({"role": "assistant", "content": answer})  # remember reply
            return answer
        except RateLimitError:
            time.sleep(2 ** attempt)            # exponential backoff
    return "Sorry, I'm having trouble right now. Please try again shortly."


if __name__ == "__main__":                      # simple terminal loop
    print("Bike-shop bot ready. Type 'quit' to exit.")
    while True:
        msg = input("You: ").strip()
        if msg.lower() in {"quit", "exit"}:
            break
        print("Bot:", reply(msg))
```
That is a real chatbot: roughly forty lines, no framework, grounded in your data, with memory and error handling. Everything past this point is depth on one of these four building blocks.
Next steps
Pick the feature your project needs next and follow its dedicated guide:
- Make it feel instant. Show the reply word by word instead of after a pause with Stream Chatbot Responses with Python.
- Make memory durable. Move history out of a Python list and into a database keyed by user session with Add Memory to a Python Chatbot.
- Scale retrieval. Replace the `numpy` search with a real document pipeline in Connect a Chatbot to Your Docs with RAG.
- Productise it. Add routing, fallbacks, and a framework layer with Build a Customer Support Chatbot with LangChain.
When your bot is solid, wire it into the rest of your stack: feed conversations into your pipeline with CRM Data Integration with AI, or package it as a paid product with SaaS MVP with Python and AI.