Stream Chatbot Responses with Python

This guide shows you, in under fifteen minutes, how to make your chatbot's replies appear word by word, both in a terminal and over a web connection. By the end you will have a command-line bot that types its answer live and a FastAPI endpoint a browser can read token by token.

Streaming matters for one reason: how the wait feels. A blocking call returns nothing until the entire answer is ready, so the user watches a frozen screen for several seconds. Streaming sends the reply in small pieces (called tokens, roughly word fragments) as soon as the model produces each one. The first words land in a fraction of a second, so the bot feels alive even though the total time to finish is essentially unchanged. That difference is why most polished chat products stream.

This guide builds directly on the chatbot from Custom AI Chatbot Development. If you have not sent a basic chat message from Python yet, start there, then come back to add live output.

Prerequisites

You need Python 3.10 or newer (python3 --version to check) and an OpenAI API key. If Python or virtual environments are new, Setting Up Python for AI and Create a Python Virtual Environment for AI cover both.

Work inside a virtual environment and install the packages this guide uses: the openai SDK, python-dotenv for loading the .env file, FastAPI, and uvicorn, the server that runs the FastAPI app:

python3 -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate
pip install "openai>=1.40" python-dotenv "fastapi>=0.110" "uvicorn[standard]>=0.29"

Store your key in a .env file so it never ends up in your code:

# .env
OPENAI_API_KEY=sk-your-real-key-here
CHAT_MODEL=gpt-4o-mini

Add .env to your .gitignore immediately so the secret key is never committed:

echo ".env" >> .gitignore

If a request later fails with an authentication error, Fix the 401 Unauthorized Error in OpenAI Python lists every cause.

Step 1: Stream tokens in a command-line chatbot

A normal call hands you the whole reply at once. Adding stream=True flips a switch: instead of one finished answer, the SDK returns an iterator — an object you loop over — that yields a small chunk each time the model produces more text. You print each chunk the moment it lands, and the answer types itself out.

Each chunk carries its new text at chunk.choices[0].delta.content. The word delta means "the bit that changed since the last chunk." That field is not always usable text: the opening chunk typically carries an empty string and the closing chunk carries None, so you guard against both before printing.

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model=os.getenv("CHAT_MODEL", "gpt-4o-mini"),
    messages=[
        {"role": "system", "content": "You are a concise, friendly bike-shop assistant."},
        {"role": "user", "content": "Recommend a first road bike for a commuter."},
    ],
    temperature=0.4,
    stream=True,
)

for chunk in stream:
    token = chunk.choices[0].delta.content
    if token:                          # skip the empty opening and closing chunks
        print(token, end="", flush=True)
print()                                # final newline after the answer

Two details make the live effect work. end="" stops print from adding a line break after every token, so the words flow into one continuous answer. flush=True forces Python to push each token to the screen immediately instead of holding it in a buffer. Without it, the terminal may show nothing until a newline arrives or the whole reply is done, which defeats the point. Run the script and you will watch the recommendation appear a few words at a time.
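
The guard can also live in one small helper so every loop shares it. The sketch below is an illustration, not part of the SDK: iter_text and fake_chunk are hypothetical names, and the fake chunks only mimic the chunk.choices[0].delta.content shape described above so the example runs without an API call. It also skips chunks whose choices list is empty, which happens on the usage-only chunk sent when stream_options={"include_usage": True} is set.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def iter_text(chunks: Iterable) -> Iterator[str]:
    """Yield only the non-empty text pieces from a stream of chat chunks."""
    for chunk in chunks:
        if not chunk.choices:          # usage-only chunks have an empty list
            continue
        token = chunk.choices[0].delta.content
        if token:                      # drops both None and ""
            yield token


def fake_chunk(content):
    """Hypothetical stand-in that mimics the shape of a streamed chunk."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    demo = [
        fake_chunk(""),                # typical opening chunk: empty string
        fake_chunk("Hel"),
        fake_chunk("lo"),
        fake_chunk(None),              # typical closing chunk: no text
        SimpleNamespace(choices=[]),   # usage-only chunk: no choices at all
    ]
    for token in iter_text(demo):
        print(token, end="", flush=True)
    print()
```

With a real stream, the loop from Step 1 would shrink to for token in iter_text(stream): print(token, end="", flush=True).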

Step 2: Collect the full reply while streaming

Printing live is great for a person watching, but your program usually needs the finished text too — to store it in conversation memory, log it, or send it on. The fix is simple: build up a string as the tokens fly past. You print and save in the same loop.

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI()
MODEL = os.getenv("CHAT_MODEL", "gpt-4o-mini")

# Conversation memory: the system message stays pinned at the front.
messages = [
    {"role": "system", "content": "You are a concise, friendly bike-shop assistant."},
]


def stream_reply(user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    stream = client.chat.completions.create(
        model=MODEL, messages=messages, temperature=0.4, stream=True,
    )
    parts: list[str] = []
    for chunk in stream:
        token = chunk.choices[0].delta.content
        if token:
            print(token, end="", flush=True)
            parts.append(token)        # keep every piece
    print()
    full = "".join(parts)              # the complete reply
    messages.append({"role": "assistant", "content": full})  # remember it
    return full


if __name__ == "__main__":
    print("Streaming bot ready. Type 'quit' to exit.")
    while True:
        msg = input("\nYou: ").strip()
        if msg.lower() in {"quit", "exit"}:
            break
        print("Bot: ", end="")
        stream_reply(msg)

Collecting into a list of parts and joining once at the end is faster than gluing strings together inside the loop, and it gives you the whole answer to append to messages. Because you store the assistant's reply, the bot keeps context across turns just like a blocking one. To go deeper on storing history beyond a single session, see Add Memory to a Python Chatbot.
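
The same collecting loop is a convenient place to measure the number this guide is really about: how long the user waits for the first token versus the whole reply. The sketch below is a rough illustration under stated assumptions. collect_with_timing and slow_tokens are hypothetical names, and slow_tokens stands in for a model by sleeping 50 ms per token, so nothing here calls the API.

```python
import time
from typing import Callable, Iterable


def collect_with_timing(tokens: Iterable[str],
                        clock: Callable[[], float] = time.perf_counter) -> dict:
    """Join tokens into one reply and record when the first and last arrived."""
    start = clock()
    first_at = None
    parts: list[str] = []
    for token in tokens:
        if first_at is None:
            first_at = clock() - start   # time to first token
        parts.append(token)
    total = clock() - start              # time until the stream ended
    return {"text": "".join(parts), "first_token_s": first_at, "total_s": total}


def slow_tokens():
    """Hypothetical stand-in for a model that emits one token every 50 ms."""
    for token in ["Try ", "a ", "hybrid ", "bike."]:
        time.sleep(0.05)
        yield token


if __name__ == "__main__":
    stats = collect_with_timing(slow_tokens())
    print(stats["text"])
    print(f"first token after {stats['first_token_s']:.2f}s, "
          f"done after {stats['total_s']:.2f}s")
```

Against a real stream you would pass in a generator of text tokens instead of slow_tokens(). The gap between the two numbers is the wait that streaming hides from the user.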

Step 3: Serve a streaming endpoint with FastAPI

A terminal is fine for testing, but real users sit in a browser. To stream to them, you keep one HTTP connection open and push tokens down it as they arrive. The standard, dependency-free way to do this is Server-Sent Events (SSE), a simple text format where each message is one or more lines beginning with data:, followed by a blank line. Browsers read SSE natively with the built-in EventSource API.

In FastAPI you return a StreamingResponse fed by a generator — a function that yields values one at a time instead of returning all at once. Each token the model produces becomes one SSE message.

import os
from dotenv import load_dotenv
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

load_dotenv()
client = OpenAI()
MODEL = os.getenv("CHAT_MODEL", "gpt-4o-mini")
app = FastAPI()


def token_stream(question: str):
    stream = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "You are a concise bike-shop assistant."},
            {"role": "user", "content": question},
        ],
        temperature=0.4,
        stream=True,
    )
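    for chunk in stream:
        token = chunk.choices[0].delta.content
        if token:
            # A raw newline inside a token would end the SSE line early,
            # so send each line of the token as its own data: line.
            body = "".join(f"data: {line}\n" for line in token.split("\n"))
            yield body + "\n"            # the blank line closes the SSE message
    yield "data: [DONE]\n\n"             # tell the browser the answer is complete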


@app.get("/chat")
def chat(q: str):
    return StreamingResponse(
        token_stream(q),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )

Save this as server.py and start it with uvicorn server:app --reload. Then watch tokens arrive in your terminal with curl:

curl -N "http://127.0.0.1:8000/chat?q=Recommend%20a%20commuter%20bike"

The -N flag disables curl's own buffering so you see each token land. The media_type="text/event-stream" argument sets the Content-Type header that marks the response as SSE, and X-Accel-Buffering: no asks reverse proxies like nginx not to hold the tokens back. A browser would consume the same endpoint like this:

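// Encode the question so characters like & or # cannot break the query string.
const question = encodeURIComponent("Recommend a commuter bike");
const source = new EventSource("/chat?q=" + question);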
source.onmessage = (event) => {
  if (event.data === "[DONE]") { source.close(); return; }
  document.getElementById("reply").textContent += event.data;
};

Each token appends to the page the instant it arrives, giving the same live-typing effect a polished chat app has. When you are ready to put this behind real users, pair it with Rate-Limit AI API Calls in a SaaS with Python so one user cannot exhaust your quota.
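
To see what EventSource does with the bytes on the wire, it can help to read the format from the consuming side. The parser below is a simplified sketch, not a complete SSE client: parse_sse is a hypothetical name, it handles only data: lines and comments, and it ignores the event:, id:, and retry: fields. The sample text in the demo is made up to match the messages the endpoint above would send.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE message from an iterable of text lines."""
    data: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                     # blank line: the message is complete
            if data:
                yield "\n".join(data)      # several data: lines join with newlines
                data = []
        elif line.startswith(":"):         # comment line, e.g. keep-alive padding
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):      # the format drops one leading space
                value = value[1:]
            data.append(value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.


if __name__ == "__main__":
    wire = "data: Try\n\ndata:  a hybrid\n\n: ok\n\ndata: [DONE]\n\n"
    for payload in parse_sse(wire.splitlines()):
        if payload == "[DONE]":
            break
        print(payload, end="", flush=True)
    print()
```

Only the first space after data: is removed, which is why a token that starts with a space, as most mid-sentence tokens do, still arrives with its spacing intact. To read the live endpoint from Python, you would feed parse_sse the decoded lines of a streaming HTTP response.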

Parameter quick reference

These are the settings specific to streaming. Everything else (model, temperature, messages) works exactly as it does in a blocking call.

Parameter      | Type | Default | Effect
---------------|------|---------|-------
stream         | bool | False   | When True, the call returns an iterator of token chunks instead of one finished reply.
stream_options | dict | None    | Pass {"include_usage": True} to receive a token-count summary in the final chunk.
media_type     | str  | None    | Set to "text/event-stream" on the StreamingResponse so browsers treat the output as SSE.
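
The stream_options row deserves one caution: the extra summary chunk arrives with an empty choices list, so a loop that indexes chunk.choices[0] unconditionally would raise IndexError on it. The sketch below shows one way to collect both the text and the usage summary. collect_with_usage and fake are hypothetical names, and the token counts in the demo are invented for illustration.

```python
from types import SimpleNamespace
from typing import Iterable, Optional, Tuple


def collect_with_usage(chunks: Iterable) -> Tuple[str, Optional[object]]:
    """Return the joined reply text and the usage summary, if one was sent."""
    parts: list[str] = []
    usage = None
    for chunk in chunks:
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage            # only the summary chunk carries usage
        if not chunk.choices:              # the summary chunk has no choices
            continue
        token = chunk.choices[0].delta.content
        if token:
            parts.append(token)
    return "".join(parts), usage


def fake(content=None, usage=None, empty=False):
    """Hypothetical stand-in that mimics the shape of a streamed chunk."""
    choice = SimpleNamespace(delta=SimpleNamespace(content=content))
    return SimpleNamespace(choices=[] if empty else [choice], usage=usage)


if __name__ == "__main__":
    summary = SimpleNamespace(prompt_tokens=12, completion_tokens=3, total_tokens=15)
    demo = [fake("Hel"), fake("lo"), fake(None), fake(usage=summary, empty=True)]
    text, usage = collect_with_usage(demo)
    print(text)
    print("total tokens:", usage.total_tokens if usage else "not reported")
```

With the real client, you would pass stream_options={"include_usage": True} alongside stream=True in client.chat.completions.create and hand the returned stream to collect_with_usage.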

Troubleshooting

  1. The reply prints all at once instead of typing out. Output is being buffered. Cause: a missing flush=True in a CLI, or a proxy buffering the HTTP response. Fix: add flush=True to print, and send the X-Accel-Buffering: no header plus an early token so proxies release the stream.
  2. The output contains the word None, or you get TypeError: can only concatenate str (not "NoneType") to str. You used delta.content without checking it. Cause: some chunks, such as the closing one, carry None instead of text. Fix: guard with if token: before printing or appending, as in every example here.
  3. The browser shows nothing but curl -N works. Something between the server and the page is holding back a tiny first response. Cause: some older browsers and some intermediaries wait for more bytes before delivering anything to onmessage. Fix: yield a short padding comment line such as ": ok\n\n" right after the connection opens.
  4. response.choices[0].message.content raises AttributeError when streaming. You used the blocking access path on a stream. Cause: streamed chunks expose delta, not message. Fix: read chunk.choices[0].delta.content inside the loop instead.
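
The early-output fixes in items 1 and 3 can be sketched as a thin wrapper around any SSE generator. This is one possible approach, not the only one: with_preamble is a hypothetical name, and the wrapper simply sends an SSE comment (a line starting with a colon, which EventSource ignores) before passing every real event through unchanged.

```python
from typing import Iterable, Iterator


def with_preamble(events: Iterable[str], preamble: str = ": ok\n\n") -> Iterator[str]:
    """Send an SSE comment first, then pass every event through unchanged."""
    yield preamble                 # gives proxies and browsers bytes right away
    yield from events              # the real data: messages follow untouched


if __name__ == "__main__":
    sample = ["data: Hi\n\n", "data: [DONE]\n\n"]
    for event in with_preamble(sample):
        print(repr(event))
```

In the FastAPI endpoint from Step 3, this would mean passing with_preamble(token_stream(q)) to StreamingResponse instead of token_stream(q).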

When to use this vs. alternatives

  • Stream when a person is watching a long reply. Chat windows, support bots, and anything that writes more than a sentence feel dramatically more responsive when tokens appear live, even though the total time is identical.
  • Stay blocking for short or machine-read replies. If the answer is a single label, a JSON object, or feeds another program rather than a human, streaming adds parsing complexity for no benefit — just take the whole reply at once.
  • Skip streaming when you must validate before showing anything. If you need to check, reformat, or moderate the full answer before the user sees a word, a blocking call is simpler because you have the complete text in hand before deciding what to display.

When your streaming bot also needs to answer from your own documents, layer retrieval on top with Connect a Chatbot to Your Docs with RAG — you stream the final answer exactly the same way.
