Do I need machine learning skills to build a custom AI chatbot in Python?

No. You call a hosted model like GPT-4o-mini over an API, so the model is already trained. You only need basic Python to send messages and handle the replies. Most of the work is wiring, prompts, and memory, not training a model.

How much does it cost to run a Python chatbot?

With a small model such as gpt-4o-mini, a typical short exchange costs a fraction of a cent. Costs scale with how much text you send and receive, so trimming conversation history and using small models keeps spend low. You can set hard usage caps in your provider dashboard.

What is the difference between a chatbot with memory and a stateless one?

A stateless chatbot forgets every previous turn, so each message is answered in isolation. A chatbot with memory resends past turns so the model can refer back to them, which makes follow-up questions like 'and what about the cheaper plan?' work correctly.

Do I have to use LangChain to build a chatbot?

No. You can build a complete, production-ready chatbot with just the openai SDK and a Python list for history. LangChain helps when you add retrieval, tools, or many chained steps, but it adds a learning curve, so start without it.

How do I stop my chatbot from making up answers?

Ground it in your own documents using retrieval, instruct it in the system prompt to say 'I don't know' when the context lacks an answer, and keep the temperature low. Retrieval plus a strict system prompt removes most invented answers.

Custom AI Chatbot Development with Python

A chatbot you build yourself does things the drag-and-drop platforms never will: it answers from your documents, follows your tone, plugs into your tools, and never charges you per seat. The catch is that most tutorials jump straight to heavyweight frameworks and leave you with a black box you cannot debug. This guide does the opposite. You will build a working chatbot from a single API call, then add the three features that separate a toy from something you can put in front of customers: memory, retrieval, and graceful error handling.

You do not need a machine learning background. The model you will use is already trained and hosted by a provider; your job is to send it the right messages and handle the replies. If you can write a for loop and read a dictionary, you can finish this guide. By the end you will have a runnable chatbot script and a clear map of which child guide to read next for each feature you want to deepen.

This is one section of Building AI-Powered Business Applications. If you are brand new to calling models from Python, read Understanding LLM APIs first to see how requests, keys, and responses fit together.

What a chatbot actually does

Before the code, hold one picture in your head. A chatbot is a loop. The user sends a message, your Python app gathers some context (the conversation so far, maybe a few relevant document snippets), sends all of it to the model, and shows the reply back to the user. Everything else in this guide is just making each part of that loop smarter.

The single most important thing to understand is that the model is stateless. It has no memory between calls and no live connection to your business. Each API request is a clean slate: it sees only the text you send in that one request and nothing else. That sounds like a limitation, but it is actually freeing, because it means every "smart" feature reduces to the same job — deciding what text to put into the next request. Memory is text you re-send. Retrieval is text you look up and attach. Tools are descriptions of functions you include. Once that clicks, the rest of this guide is mechanical.

Every chatbot is the same loop: gather context, call the model, return the reply. Memory and retrieval just enrich what you send.
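The loop described above can be sketched without touching an API at all. This is a minimal illustration, not the guide's final code: `run_turn`, `fake_context`, and `fake_model` are hypothetical names, and the two stand-in functions take the place of real retrieval and a real model call.

```python
from typing import Callable

Message = dict[str, str]


def run_turn(
    user_text: str,
    history: list[Message],
    gather_context: Callable[[str], str],
    call_model: Callable[[list[Message]], str],
) -> str:
    """One pass through the loop: gather context, call the model, return the reply."""
    context = gather_context(user_text)
    request = history + [
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_text}"}
    ]
    reply = call_model(request)
    # Store the plain question and the reply so the next turn can see them.
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": reply})
    return reply


# Stand-ins (assumptions) so the sketch runs without an API key.
def fake_context(question: str) -> str:
    return "Kids' helmets come in red, blue, and matte black."


def fake_model(messages: list[Message]) -> str:
    return f"(model saw {len(messages)} messages)"


history: list[Message] = [{"role": "system", "content": "You are a bike-shop assistant."}]
print(run_turn("Do you sell kids' helmets?", history, fake_context, fake_model))
print(run_turn("What colours?", history, fake_context, fake_model))
```

The second call sees more messages than the first because the history list has grown. That growth is the only "memory" involved.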

Prerequisites

You need Python 3.10 or newer. Check with python3 --version. If that command fails or shows an older version, follow Setting Up Python for AI first.

Work inside a virtual environment so this project's packages stay isolated from the rest of your system. If virtual environments are new to you, Create a Python Virtual Environment for AI walks through it.

python3 -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate
pip install "openai>=1.40" "httpx>=0.27" python-dotenv numpy

You also need an API key from your model provider. Create a .env file in your project folder to hold it so it never gets pasted into code:

# .env
OPENAI_API_KEY=sk-your-real-key-here
CHAT_MODEL=gpt-4o-mini
EMBED_MODEL=text-embedding-3-small

Add .env to your .gitignore immediately so your secret key is never committed to version control. One leaked key can run up a real bill.

echo ".env" >> .gitignore

If a request later fails with an authentication error, Fix the 401 Unauthorized Error in OpenAI Python covers every cause.
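It can help to check your configuration before the first request, so a missing key fails with a readable message rather than deep inside an API call. The helper below is a small sketch; `require_env` is a hypothetical name, and `DEMO_KEY` is a stand-in variable used only so the example runs anywhere.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Check that .env exists in the folder you run "
            "the script from and that load_dotenv() ran before this check."
        )
    return value


# Stand-in value (an assumption for the demo); in your script you would call
# require_env("OPENAI_API_KEY") right after load_dotenv().
os.environ["DEMO_KEY"] = "sk-demo"
print(len(require_env("DEMO_KEY")), "characters loaded")
```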

Step 1: Send your first chat message

Start with the smallest possible chatbot: one message in, one reply out. The openai SDK reads your key from the environment automatically, so you only describe what you want, not how to authenticate.

Two ideas to know. The system message sets the bot's personality and rules; the user never sees it. The user message is what the person typed. You send both as a list of dictionaries, each with a role and content.

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model=os.getenv("CHAT_MODEL", "gpt-4o-mini"),
    messages=[
        {"role": "system", "content": "You are a concise, friendly support assistant for a bike shop."},
        {"role": "user", "content": "Do you sell helmets for kids?"},
    ],
    temperature=0.3,
)

print(response.choices[0].message.content)

Run it. You should see a short, on-brand answer. The reply lives at response.choices[0].message.content, a nesting that trips up everyone once, so commit it to memory. choices is a list because the API can return several alternative answers in one call if you ask for them with the n parameter; by default there is exactly one, so you read index 0. temperature=0.3 keeps the answer focused; raise it toward 1.0 for more creative, varied wording.
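Because that path is easy to get wrong, some people wrap it in a small helper. The sketch below uses a hand-built stand-in object with the same nesting as a real response, so it runs without an API key. `extract_reply` is a hypothetical name, and the None check reflects that the content field can be empty, for example when a model returns a tool call instead of text.

```python
from types import SimpleNamespace


def extract_reply(response) -> str:
    """Pull the text out of an object shaped like response.choices[0].message.content."""
    if not response.choices:
        return ""
    content = response.choices[0].message.content
    return content.strip() if content else ""


# A stand-in (assumption) with the same nesting as a real chat completion.
fake = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content=" Yes, we stock kids' helmets. "))]
)
print(extract_reply(fake))
```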

Why a small model like gpt-4o-mini? For a support bot, speed and cost matter more than raw reasoning power, and a small model answers a grounded question just as well as a large one for a fraction of the price. Start small; only reach for a bigger model if you see the bot fumbling genuinely hard reasoning. If you are still deciding which provider and model to start with, Best Free AI APIs for Beginners compares the options without commitment.

The system prompt does a lot of heavy lifting here — it is where you set tone, scope, and the rules the bot must never break. A vague system prompt produces a vague, rambling bot, so be specific about what it should and should not do. To learn how to make it reliably enforce tone, format, and refusals, read Write System Prompts that Control Output Format.

Step 2: Add conversation memory

The call above forgets everything the instant it returns. Ask "what colours does it come in?" next and the model has no idea what "it" means. The fix is simple: keep a growing list of messages and resend the whole list every turn. The model has no hidden memory of its own — you are the memory.

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI()

# The running history. The system message stays pinned at the front.
messages = [
    {"role": "system", "content": "You are a concise, friendly support assistant for a bike shop."},
]


def ask(user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(
        model=os.getenv("CHAT_MODEL", "gpt-4o-mini"),
        messages=messages,
        temperature=0.3,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})  # remember the answer too
    return reply


print(ask("Do you sell helmets for kids?"))
print(ask("What colours does it come in?"))  # 'it' now resolves correctly

Notice that you append both the user message and the assistant's reply. If you only stored user turns, the bot would lose its own answers and contradict itself. This list is all that memory really is. The catch is that the list grows forever, and every model has a limit on how much text it can read at once (its context window, measured in tokens). When a long chat eventually hits that wall you will see a context-length error; Fix the Context-Length-Exceeded Error in Python shows how to trim or summarise old turns. For production-grade strategies like storing history in a database keyed by session, see Add Memory to a Python Chatbot.
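One simple way to stop the list growing forever is to keep the system message and only the most recent turns. The sketch below counts turns rather than tokens, which is an approximation: real limits are measured in tokens, so treat `max_turns` as a rough knob. `trim_history` is a hypothetical helper name.

```python
Message = dict[str, str]


def trim_history(messages: list[Message], max_turns: int = 6) -> list[Message]:
    """Keep system messages plus the last max_turns user/assistant pairs."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Guard against max_turns == 0, because rest[-0:] would return everything.
    keep = rest[-2 * max_turns:] if max_turns > 0 else []
    return system + keep


# Build a long fake conversation to show the effect.
history: list[Message] = [{"role": "system", "content": "You are a bike-shop assistant."}]
for i in range(10):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=3)
print(len(history), "->", len(trimmed))
```

You would pass the trimmed list as messages on each call while keeping the full list elsewhere if you need a transcript. Dropping old turns does mean the bot forgets them, which is why summarising is the usual next refinement.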

Step 3: Ground answers in your own data with retrieval

A general model knows nothing about your return policy or your product catalogue, and it will happily invent an answer rather than admit it. Retrieval fixes this. Before answering, you find the snippets of your own documents most relevant to the question and paste them into the prompt as context. This pattern is called RAG — Retrieval-Augmented Generation.

The matching step uses embeddings: numeric fingerprints of text where similar meanings sit close together in mathematical space. The phrase "kids' helmet colours" and the sentence "Kids' helmets come in red, blue, and matte black" produce vectors that point in nearly the same direction, even though they share few exact words. That is the magic — embeddings match on meaning, not keywords, so a customer does not have to phrase their question exactly the way your document is written.
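To make "point in nearly the same direction" concrete, here is cosine similarity computed by hand on made-up three-dimensional vectors. The numbers are invented for illustration; real embeddings come from the embedding model and have hundreds or thousands of dimensions. The arithmetic is the same.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors (assumptions), not real embeddings.
question = [0.9, 0.1, 0.0]      # "kids' helmet colours"
helmet_doc = [0.8, 0.2, 0.1]    # a document about helmet colours
returns_doc = [0.0, 0.2, 0.9]   # a document about returns

print("helmet doc:", round(cosine_similarity(question, helmet_doc), 3))
print("returns doc:", round(cosine_similarity(question, returns_doc), 3))
```

The helmet document scores far higher than the returns document, and that ranking is all retrieval relies on.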

The workflow is three steps: embed your documents once and keep the vectors, embed the user's question at query time, then measure which document vectors point most nearly the same way as the question vector. Here is a minimal, dependency-light version using numpy for the similarity maths so you can see exactly what happens with no framework hiding the logic.

import os
import numpy as np
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI()
EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-3-small")

# Your knowledge base. In a real app these come from files or a database.
docs = [
    "Returns are accepted within 30 days with a receipt.",
    "Kids' helmets come in red, blue, and matte black.",
    "We offer free local delivery on orders over $75.",
]


def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])


doc_vectors = embed(docs)  # embed the knowledge base once at startup


def retrieve(question: str, top_k: int = 2) -> list[str]:
    q_vector = embed([question])[0]
    # cosine similarity = dot product of unit-normalised vectors
    scores = doc_vectors @ q_vector / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vector)
    )
    best = scores.argsort()[::-1][:top_k]
    return [docs[i] for i in best]


def answer_with_context(question: str) -> str:
    context = "\n".join(retrieve(question))
    response = client.chat.completions.create(
        model=os.getenv("CHAT_MODEL", "gpt-4o-mini"),
        messages=[
            {"role": "system", "content": (
                "Answer using only the context below. "
                "If the answer is not in the context, say you don't know.\n\n"
                f"Context:\n{context}"
            )},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return response.choices[0].message.content


print(answer_with_context("What colours do the kids helmets come in?"))

The system prompt does the safety work: it tells the model to answer only from the context and to admit ignorance otherwise. That single instruction removes most invented answers. For a few documents, numpy is plenty; once you have thousands of snippets you will want a proper vector store. The full pipeline — chunking files, persisting vectors, and scaling search — lives in Connect a Chatbot to Your Docs with RAG.
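One thing to watch: retrieve() as written always returns its top_k snippets, even when none of them is actually relevant to the question. A possible refinement is a minimum similarity cutoff. The sketch below is an assumption-laden illustration: `select_snippets` is a hypothetical helper, and the 0.3 threshold is arbitrary and would need tuning against your own documents and embedding model.

```python
def select_snippets(
    docs: list[str],
    scores: list[float],
    top_k: int = 2,
    min_score: float = 0.3,  # arbitrary cutoff (assumption); tune on your data
) -> list[str]:
    """Return up to top_k docs, best first, dropping any below min_score."""
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:top_k] if score >= min_score]


docs = [
    "Returns are accepted within 30 days with a receipt.",
    "Kids' helmets come in red, blue, and matte black.",
    "We offer free local delivery on orders over $75.",
]

# Made-up scores: one question that matches a doc, one that matches nothing.
print(select_snippets(docs, [0.12, 0.81, 0.25]))
print(select_snippets(docs, [0.05, 0.10, 0.08]))
```

When the result is empty, the context passed to the model is empty too, and the "say you don't know" instruction in the system prompt has a clear case to act on.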

Step 4: Handle errors and rate limits

A chatbot that crashes on the first network hiccup is not ready for anyone. Three things go wrong in the real world: the network drops, the provider is briefly overloaded (a rate limit), and the model returns something your code did not expect. Wrap your call so a single failure degrades into a polite message instead of a stack trace, and retry the temporary ones.

import os
import time
from dotenv import load_dotenv
from openai import OpenAI, RateLimitError, APIError

load_dotenv()
client = OpenAI(timeout=20.0)  # never wait forever on a slow response


def safe_chat(messages: list[dict], retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model=os.getenv("CHAT_MODEL", "gpt-4o-mini"),
                messages=messages,
                temperature=0.3,
            )
            return response.choices[0].message.content
        except RateLimitError:
            wait = 2 ** attempt  # back off: 1s, 2s, 4s
            print(f"Rate limited, retrying in {wait}s...")
            time.sleep(wait)
        except APIError as exc:
            print(f"API error: {exc}")
            break
    return "Sorry, I'm having trouble right now. Please try again in a moment."

The 2 ** attempt pattern is exponential backoff: each retry waits longer (one second, then two, then four), giving the provider room to recover instead of hammering it with instant retries that would only deepen the overload. The two error types are handled differently: a RateLimitError is temporary, so you retry it, while any other APIError stops the loop and is surfaced rather than retried. Be aware that APIError is a broad base class that also covers connection failures and timeouts, which are often temporary too; this function treats them as final only to keep the example short. The SDK also retries some failures on its own before raising (controlled by the client's max_retries option), so this loop runs on top of that. Setting timeout=20.0 on the client means a stuck request fails fast rather than freezing your whole app while one user waits on a hung connection. If rate limits become a regular problem rather than a rare blip, Fix the 429 Rate-Limit Error in Python explains the limits and how to stay under them, and Rate-Limit AI API Calls in a SaaS with Python shows how to throttle your own users fairly.
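The same backoff idea can be pulled into a reusable helper. The sketch below adds random jitter, a common refinement so that many clients do not all retry at the same instant. It is deliberately independent of the openai package: `TransientError` is a hypothetical stand-in for whichever exceptions you decide are worth retrying (for example RateLimitError), and the `sleep` parameter exists only so the demo runs instantly.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for errors worth retrying (rate limits, timeouts)."""


def retry_with_backoff(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on TransientError with exponential backoff plus jitter."""
    for attempt in range(retries):
        try:
            return func()
        except TransientError:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller decide what to show
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
    raise ValueError("retries must be at least 1")


# Demo: a function that fails twice, then succeeds.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("busy")
    return "ok"


waits: list[float] = []
print(retry_with_backoff(flaky, sleep=waits.append))  # record waits instead of sleeping
print("waited", len(waits), "times")
```

Unlike the loop in safe_chat, this version does not sleep after the final failed attempt, since there is nothing left to wait for.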

Plain SDK or LangChain?

You have now built a complete chatbot with nothing but the openai SDK, numpy, and Python lists. That is deliberate: you can see and debug every moving part, and there is no framework version churn to chase. For most business bots, this is all you ever need.

LangChain earns its place when your loop grows complicated — when the bot must decide between several tools, chain many steps, swap between providers behind one interface, or use prebuilt retrieval and memory components so you write less plumbing. The trade is a steeper learning curve and more abstraction between you and the API, which can make bugs harder to trace. A practical rule: start with the plain SDK as shown here, and reach for LangChain only when you feel yourself rebuilding its features by hand. The fully framework-based version, with routing and fallbacks, is covered in Build a Customer Support Chatbot with LangChain.

Parameter reference

These are the settings you will reach for most when calling the chat endpoint. Tune temperature first; leave the rest at their defaults until you have a reason to change them.

Parameter	Type	Default	Effect
`model`	str	none (required)	Which model answers. `gpt-4o-mini` is cheap and fast; larger models reason better at higher cost.
`messages`	list[dict]	none (required)	The full conversation: system, user, and assistant turns in order.
`temperature`	float	`1.0`	Randomness, from `0` to `2`. `0` gives the most repeatable, focused output; `1.0` and above is more creative and varied. Use low values for support bots.
`max_tokens`	int	model max	Hard cap on reply length. Set it to control cost and stop runaway answers.
`top_p`	float	`1.0`	Alternative to temperature that limits word choice to the most likely options. Change one, not both.
`timeout`	float	`600` (10 minutes)	Seconds to wait before giving up on a request. Set a lower value on the client so a slow call cannot hang your app.
`stream`	bool	`False`	When `True`, tokens arrive as they are generated for a live typing effect.
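To show what stream=True changes on the receiving side, here is a sketch of a consumer loop. It assumes each streamed chunk exposes its text at chunk.choices[0].delta.content, which is the shape the openai v1 SDK uses for chat completions, and it runs against hand-built stand-in chunks so no API key is needed. `collect_stream` and `fake_chunk` are hypothetical names.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Print streamed text as it arrives and return the full reply."""
    parts: list[str] = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        piece = chunk.choices[0].delta.content
        if piece:  # some chunks carry no text, for example the final one
            print(piece, end="", flush=True)
            parts.append(piece)
    print()
    return "".join(parts)


def fake_chunk(text):
    """Stand-in (assumption) with the same nesting as a streamed chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


full = collect_stream([fake_chunk("Yes, "), fake_chunk("we do."), fake_chunk(None)])
```

In a real script you would pass the object returned by client.chat.completions.create(..., stream=True) in place of the list of fake chunks, then append the returned string to your history as usual.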

Troubleshooting

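OpenAIError: The api_key client option must be set (or AuthenticationError: 401 on the first request) — Your key is missing or invalid. Cause: load_dotenv() was not called, .env is in the wrong folder, or the key is mistyped or revoked. Fix: call load_dotenv() before creating the client, run your script from the folder that holds .env, and confirm the key in your provider dashboard.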
RateLimitError: 429 — You sent requests faster than your plan allows, or you are out of credit. Cause: a loop firing calls with no pause, or an empty balance. Fix: add the exponential backoff from Step 4 and check your billing dashboard.
BadRequestError: maximum context length — The conversation plus context is too long for the model. Cause: an ever-growing messages list. Fix: keep only the last several turns, or summarise older ones into one short note.
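AttributeError: 'ChatCompletion' object has no attribute 'content' — You read the reply from the wrong place. Cause: the reply is at response.choices[0].message.content, not response.content. Fix: use the full path. If the message mentions 'NoneType' instead, the content field itself was None (for example, the model returned a tool call rather than text), so check for None before calling string methods on it.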
The bot answers from general knowledge instead of your documents — Retrieval context was empty or ignored. Cause: no snippets matched, or the system prompt did not forbid outside answers. Fix: confirm retrieve() returns text and add "answer only from the context" to the system message.
APITimeoutError (the SDK's wrapper around an httpx timeout such as httpx.ReadTimeout) — The request took longer than your timeout. Cause: a large request or a slow network. Fix: raise the client timeout, shorten the prompt, or switch to a faster model.

Worked example: a complete chatbot

This script ties every step together into one runnable program: it loads a small knowledge base, retrieves relevant context per question, remembers the conversation, and survives errors with retries. Save it as chatbot.py, make sure your .env is in place, and run python chatbot.py.

import os
import time
import numpy as np
from dotenv import load_dotenv
from openai import OpenAI, APIError, RateLimitError

load_dotenv()                                              # load keys from .env
client = OpenAI(timeout=20.0)                              # fail fast on slow calls
CHAT_MODEL = os.getenv("CHAT_MODEL", "gpt-4o-mini")
EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-3-small")

KNOWLEDGE = [                                              # your facts live here
    "Returns are accepted within 30 days with a receipt.",
    "Kids' helmets come in red, blue, and matte black.",
    "Free local delivery applies to orders over $75.",
]


def embed(texts: list[str]) -> np.ndarray:                # turn text into vectors
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])


DOC_VECTORS = embed(KNOWLEDGE)                             # embed knowledge once


def retrieve(question: str, top_k: int = 2) -> str:       # find relevant facts
    q = embed([question])[0]
    scores = DOC_VECTORS @ q / (np.linalg.norm(DOC_VECTORS, axis=1) * np.linalg.norm(q))
    return "\n".join(KNOWLEDGE[i] for i in scores.argsort()[::-1][:top_k])


history = [{"role": "system", "content": "You are a concise bike-shop assistant. "
           "Answer only from the provided context; otherwise say you don't know."}]


def reply(question: str, retries: int = 3) -> str:        # one full turn
    for attempt in range(retries):
        try:
            context = retrieve(question)                   # embedding call can fail too
            user_msg = {"role": "user",
                        "content": f"Context:\n{context}\n\nQuestion: {question}"}
            out = client.chat.completions.create(
                model=CHAT_MODEL, messages=history + [user_msg], temperature=0)
            answer = out.choices[0].message.content or ""
            history.append(user_msg)                       # store the turn only on success
            history.append({"role": "assistant", "content": answer})  # remember reply
            return answer
        except RateLimitError:
            time.sleep(2 ** attempt)                       # exponential backoff
        except APIError as exc:                            # other API failures: stop retrying
            print(f"API error: {exc}")
            break
    return "Sorry, I'm having trouble right now. Please try again shortly."


if __name__ == "__main__":                                # simple terminal loop
    print("Bike-shop bot ready. Type 'quit' to exit.")
    while True:
        msg = input("You: ").strip()
        if msg.lower() in {"quit", "exit"}:
            break
        print("Bot:", reply(msg))

That is a real chatbot: roughly forty lines, no framework, grounded in your data, with memory and error handling. Everything past this point is depth on one of these four building blocks.

Next steps

Pick the feature your project needs next and follow its dedicated guide:

Make it feel instant. Show the reply word by word instead of after a pause with Stream Chatbot Responses with Python.
Make memory durable. Move history out of a Python list and into a database keyed by user session with Add Memory to a Python Chatbot.
Scale retrieval. Replace the numpy search with a real document pipeline in Connect a Chatbot to Your Docs with RAG.
Productise it. Add routing, fallbacks, and a framework layer with Build a Customer Support Chatbot with LangChain.

When your bot is solid, wire it into the rest of your stack: feed conversations into your pipeline with CRM Data Integration with AI, or package it as a paid product with SaaS MVP with Python and AI.

Back to Building AI-Powered Business Applications.
