Add Memory to a Python Chatbot

This guide shows you how to give a Python chatbot real conversation memory in under fifteen minutes. By the end, your bot will remember earlier turns, stay inside a token budget so it does not fail with context-length errors, and compress long chats into a rolling summary.

If you have ever built a quick chatbot and watched it forget your name one message later, the cause is simple: a chat API call is stateless, meaning the model only sees exactly what you send in that one request. It has no hidden memory of previous calls. Send only the latest question and the model answers in a vacuum. Memory is not a feature you switch on; it is a habit your code builds by resending the conversation each time. This guide builds that habit step by step, then upgrades it so long conversations stay cheap and fast.

Prerequisites

You only need a working Python setup and an API key. If you are starting from zero, follow Create a Python Virtual Environment for AI first, then come back here.

Install the three packages used below. The openai package is the API client, python-dotenv loads your key from a .env file, and tiktoken uses the same tokenizer as the model, so your budgets stay close to the real counts.

pip install openai tiktoken python-dotenv

Create a .env file in your project folder with your key:

OPENAI_API_KEY=sk-your-key-here

Add .env to your .gitignore so your key never lands in version control.

If you are still fuzzy on how chat messages and roles work, the section on Understanding LLM APIs explains the request format this guide builds on.

Step 1: Keep a running message history

Memory starts as a plain Python list. Every entry is a dictionary with a role (one of system, user, or assistant) and the content text. The trick is to append both sides of every exchange and resend the entire list on each call.

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# The system message sets behavior and stays at the top for the whole chat.
messages = [
    {"role": "system", "content": "You are a friendly assistant. Be concise."}
]


def chat(user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply


print(chat("Hi, my name is Sam."))
print(chat("What's my name?"))  # It now answers "Sam" because history was resent.

The second call works because messages carried the first exchange along with the new question. Send only the newest message instead of the full list and the bot forgets instantly. That is the whole idea of memory: keep the list, resend the list.

Step 2: Count tokens and trim to a budget

A growing list has a hard ceiling. Every model has a context window, the maximum number of tokens (chunks of text, roughly four characters each) it can read at once. Push past it and you get a context-length error and a failed call. Even before that limit, longer histories cost more, because you pay for input tokens on every turn.
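A quick back-of-the-envelope sketch shows how fast that cost climbs. It assumes every message is about the same size (the 50-token figure is an illustrative assumption, not a measurement) and ignores the system message; both function names are hypothetical.

```python
def input_tokens_for_turn(turn: int, tokens_per_message: int = 50) -> int:
    """Input tokens sent on a given turn (1-based) when the full history is resent.

    Before turn n the list holds 2 * (n - 1) earlier messages (user + assistant),
    plus the new user message.
    """
    messages_sent = 2 * (turn - 1) + 1
    return messages_sent * tokens_per_message


def cumulative_input_tokens(turns: int, tokens_per_message: int = 50) -> int:
    """Total input tokens sent across all turns of the chat so far."""
    return sum(input_tokens_for_turn(t, tokens_per_message) for t in range(1, turns + 1))


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, input_tokens_for_turn(n), cumulative_input_tokens(n))
```

Under these assumptions, turn 10 sends 950 input tokens and turn 100 sends 9,950, while the running total goes from 5,000 after ten turns to 500,000 after a hundred. Per-turn cost grows linearly, but the total grows with the square of the chat length.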

The fix is a token budget: pick a comfortable ceiling, measure the history, and drop the oldest turns once you cross it. Always keep the system message and trim from the front.

import tiktoken

ENCODING = tiktoken.get_encoding("o200k_base")  # used by gpt-4o family models


def count_tokens(msgs: list[dict]) -> int:
    # ~4 tokens of overhead per message wrap the role and formatting.
    return sum(len(ENCODING.encode(m["content"])) + 4 for m in msgs)


def trim_to_budget(msgs: list[dict], max_tokens: int = 3000) -> list[dict]:
    system = msgs[0]
    rest = msgs[1:]
    while rest and count_tokens([system] + rest) > max_tokens:
        rest.pop(0)  # drop the oldest non-system turn
    return [system] + rest

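Call trim_to_budget right before each API request and send the list it returns. The function builds a new list rather than changing messages in place, so assign the result back if you want the stored history to shrink too. The request then stays under the budget no matter how long the chat runs, as long as the system message and the newest message fit on their own. The trade-off is real: trimmed turns are gone, so the bot may forget early details. Step 3 fixes that.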
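Here is a minimal sketch of writing the trimmed list back to the shared history. It uses a rough characters-divided-by-four counter (an assumption, standing in for the tiktoken-based count_tokens) so it runs offline, and remember_and_trim is a hypothetical helper name. The tiny 36-token budget is only there to make the trimming visible.

```python
def rough_count(msgs: list[dict]) -> int:
    # Stand-in for count_tokens: ~4 characters per token plus 4 tokens of overhead.
    return sum(len(m["content"]) // 4 + 4 for m in msgs)


def trim_to_budget(msgs: list[dict], max_tokens: int) -> list[dict]:
    # Same logic as the version above, but using the stand-in counter.
    system, rest = msgs[0], msgs[1:]
    while rest and rough_count([system] + rest) > max_tokens:
        rest.pop(0)  # drop the oldest non-system turn
    return [system] + rest


messages = [{"role": "system", "content": "You are a friendly assistant."}]


def remember_and_trim(role: str, text: str, max_tokens: int = 36) -> None:
    messages.append({"role": role, "content": text})
    # Slice assignment updates the shared list in place, so no `global` is needed.
    messages[:] = trim_to_budget(messages, max_tokens)


remember_and_trim("user", "Hi, my name is Sam.")
remember_and_trim("assistant", "Nice to meet you, Sam!")
remember_and_trim("user", "Can you recommend a book about Python?")
print([m["role"] for m in messages])
```

After the third message the history crosses the budget, so the oldest user turn is dropped while the system message stays at the front.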

Step 3: Add rolling-summary memory

Trimming protects you from crashes but throws away context. Rolling-summary memory gives you the best of both worlds: instead of deleting old turns, you ask the model to compress them into a short recap, then store that recap in the system message. The bot keeps a running gist of the whole conversation plus the most recent turns word for word.

def summarize(old_msgs: list[dict], prior_summary: str = "") -> str:
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_msgs)
    prompt = (
        "Update the running summary of this conversation. "
        "Keep names, decisions, and open questions. Be under 120 words.\n\n"
        f"Previous summary:\n{prior_summary or '(none)'}\n\n"
        f"New messages:\n{transcript}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

When the history grows too large, peel off the oldest turns, feed them to summarize, and fold the result back into the system message. The recap costs a fraction of the tokens the raw turns would, so the bot remembers a long chat without resending all of it.
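The peel-and-fold step can be tried without spending tokens. The sketch below is an illustration under assumptions: fold_oldest and fake_summarize are hypothetical names, and the stub replaces the real summarize call so the split logic can be checked offline.

```python
from typing import Callable


def fold_oldest(
    recent: list[dict],
    summary: str,
    keep_recent: int,
    summarize_fn: Callable[[list[dict], str], str],
) -> tuple[str, list[dict]]:
    """Summarize all but the last `keep_recent` messages; return (summary, kept)."""
    cut = max(len(recent) - keep_recent, 0)
    if cut == 0:
        return summary, recent  # nothing old enough to fold
    old, kept = recent[:cut], recent[cut:]
    return summarize_fn(old, summary), kept


def fake_summarize(old_msgs: list[dict], prior_summary: str = "") -> str:
    # Stub standing in for the model call: it only records how many messages were folded.
    return f"{prior_summary} [{len(old_msgs)} messages folded]".strip()


history = [{"role": "user", "content": f"message {i}"} for i in range(5)]
summary, kept = fold_oldest(history, "", keep_recent=2, summarize_fn=fake_summarize)
print(summary)
print([m["content"] for m in kept])
```

Passing the summarizer in as an argument is a design choice for this sketch: it lets you swap the stub for the real summarize function once the split behaves the way you expect.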

Step 4: Wrap it in a reusable chat loop

Now combine the pieces into one small class you can drop into any project. It holds the recent messages, the running summary, and the budget logic in one place.

class ChatMemory:
    def __init__(self, system: str, max_tokens: int = 3000, keep_recent: int = 6):
        self.base_system = system
        self.summary = ""
        self.recent: list[dict] = []
        self.max_tokens = max_tokens
        self.keep_recent = keep_recent  # turns to keep verbatim after summarizing

    def _system_message(self) -> dict:
        text = self.base_system
        if self.summary:
            text += f"\n\nConversation so far:\n{self.summary}"
        return {"role": "system", "content": text}

    def _maybe_summarize(self) -> None:
        msgs = [self._system_message()] + self.recent
        if count_tokens(msgs) <= self.max_tokens:
            return
        # Summarize everything except the most recent turns, then drop them.
        to_summarize = self.recent[: -self.keep_recent] or self.recent[:1]
        self.summary = summarize(to_summarize, self.summary)
        self.recent = self.recent[len(to_summarize):]

    def ask(self, user_text: str) -> str:
        self.recent.append({"role": "user", "content": user_text})
        self._maybe_summarize()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[self._system_message()] + self.recent,
        )
        reply = response.choices[0].message.content
        self.recent.append({"role": "assistant", "content": reply})
        return reply


if __name__ == "__main__":
    bot = ChatMemory(system="You are a friendly assistant. Be concise.")
    print("Chatbot ready. Type 'quit' to exit.")
    while True:
        text = input("\nYou: ").strip()
        if text.lower() in {"quit", "exit"}:
            break
        if not text:
            continue
        print(f"\nBot: {bot.ask(text)}")

Run it, chat for a while, and watch it stay responsive past the point a naive bot would crash or forget. To keep that memory after the program closes, save bot.summary and bot.recent to a file or database between sessions.
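Persistence can be as simple as a JSON file. The sketch below is one hedged way to do it: save_memory, load_memory, and the chat_memory.json file name are hypothetical, and the functions work on the plain summary string and recent list so they can be tried without calling the API.

```python
import json
from pathlib import Path


def save_memory(path: Path, summary: str, recent: list[dict]) -> None:
    """Write the rolling summary and the verbatim turns to a JSON file."""
    payload = {"summary": summary, "recent": recent}
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")


def load_memory(path: Path) -> tuple[str, list[dict]]:
    """Return (summary, recent), or empty values if no file exists yet."""
    if not path.exists():
        return "", []
    payload = json.loads(path.read_text(encoding="utf-8"))
    return payload.get("summary", ""), payload.get("recent", [])
```

With the ChatMemory class above, you would assign the result of load_memory(Path("chat_memory.json")) to bot.summary and bot.recent right after creating the bot, and call save_memory with those two attributes before the program exits.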

Key parameters

Parameter | Type | Default | Effect
max_tokens | int | 3000 | Token ceiling for system plus recent messages before a summary is triggered. Raise it for richer memory, lower it to cut cost.
keep_recent | int | 6 | How many of the latest turns stay word for word after summarizing. Higher keeps more verbatim detail.
model | str | "gpt-4o-mini" | The chat model, set directly in the API calls in ask and summarize rather than passed to ChatMemory. Cheap models are fine for both replies and summaries.

Troubleshooting

  1. openai.BadRequestError mentioning maximum context length. Your history outgrew the window. Make sure trim_to_budget or _maybe_summarize runs before every call, and lower max_tokens to leave headroom for the model's reply. See Fix the Context-Length-Exceeded Error in Python for the full breakdown.
  2. The bot still forgets things after summarizing. Your summary prompt is dropping key facts. Tell it explicitly to preserve names, decisions, and numbers, and raise keep_recent so more recent turns stay verbatim.
  3. KeyError: 'OPENAI_API_KEY'. Your key did not load. Confirm .env sits in the folder you run the script from and that load_dotenv() runs before you read the variable.
  4. Token counts look wrong or off by a lot. You picked the wrong encoding. Use o200k_base for the gpt-4o family; older models use cl100k_base. A mismatched encoding makes your budget unreliable.
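For the KeyError case in item 3, a small fail-fast check can turn a bare traceback into an actionable message. This is a sketch, not part of the original script: require_api_key is a hypothetical helper, and it reads only from the process environment, so call load_dotenv() first as in Step 1.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment or stop with a helpful message."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise SystemExit(
            f"{name} is not set. Check that .env is in the folder you run "
            "the script from and that load_dotenv() ran before this call."
        )
    return key
```

You could then pass the returned value as api_key when creating the client instead of indexing os.environ directly.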

When to use this vs. alternatives

  • Full message history (Step 1) is best for short, self-contained chats: support tickets, quick Q&A, demos. It is the simplest and most faithful, but cost and risk grow with every turn, so it does not suit long sessions.
  • Rolling-summary memory (Steps 3-4) fits long, ongoing conversations where the gist matters more than every word: coaching bots, multi-step assistants, tutors. It keeps per-turn cost roughly flat, at the price of an occasional extra summarization call and some lost detail in the summarized parts.
  • Vector memory is the right tool when the bot must recall specific facts from a large knowledge base or many past sessions, rather than just the current thread. Instead of resending text, you retrieve only the relevant chunks per question. That is the approach in Connect a Chatbot to Your Docs with RAG.

Many production bots combine all three: a rolling summary for the live thread, vector memory for long-term recall, and recent turns kept verbatim.

Next steps

With memory in place, make replies feel instant by sending tokens as they arrive in Stream Chatbot Responses with Python, then give your bot real knowledge to draw on in Connect a Chatbot to Your Docs with RAG.