What does stream=True actually do in the OpenAI SDK?

It tells the API to send the reply in small pieces called tokens as the model generates them, instead of waiting for the whole answer. Your code receives a stream of chunks you can print or forward the moment each one arrives, so the user sees text appear live.

Does streaming make the chatbot answer faster?

No, the total time to finish is about the same. Streaming changes when the user sees the first words: instead of staring at a blank screen for several seconds, they see text within a fraction of a second. It only improves the feeling of speed, not the real speed.

How do I stream responses to a web browser from Python?

Expose a FastAPI endpoint that returns a StreamingResponse driven by a generator. Yield each token as a Server-Sent Events line so the browser receives them one at a time over a single open connection. The browser reads them with the EventSource API.

Can I still count tokens or get the full reply when streaming?

Yes. Collect each chunk into a string as it arrives and you end up with the complete reply once the stream finishes. For usage totals, pass stream_options with include_usage set to true and read the usage field from the final chunk. That final chunk has an empty choices list, so check for it before indexing choices[0].

Why is my streamed text arriving all at once instead of gradually?

Something between the model and the user is buffering. Common causes are a proxy or browser buffering small responses, or forgetting to flush output in a CLI. Disable proxy buffering, flush after each token, and for browsers send a short padding line as soon as the connection opens.

Stream Chatbot Responses with Python

This guide shows you how to make your chatbot's replies appear word by word in under fifteen minutes, both in a terminal and over a web connection. By the end you will have a command-line bot that types its answer live and a FastAPI endpoint a browser can read token by token.

The reason to bother is one thing: how the wait feels. A blocking call returns nothing until the entire answer is ready, so the user watches a frozen screen for two, three, four seconds. Streaming sends the reply in tiny pieces (called tokens, roughly word fragments) the instant the model produces each one. The first words land in a fraction of a second, so the bot feels alive even though the total time to finish is unchanged. That single difference is why most polished chat products stream.

This guide builds directly on the chatbot from Custom AI Chatbot Development. If you have not sent a basic chat message from Python yet, start there, then come back to add live output.

Prerequisites

You need Python 3.10 or newer (python3 --version to check) and an OpenAI API key. If Python or virtual environments are new, Setting Up Python for AI and Create a Python Virtual Environment for AI cover both.

Work inside a virtual environment and install the packages this guide uses: the openai SDK, python-dotenv for loading the .env file, FastAPI, and the uvicorn server that runs it:

python3 -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate
pip install "openai>=1.40" python-dotenv "fastapi>=0.110" "uvicorn[standard]>=0.29"

Store your key in a .env file so it never ends up in your code:

# .env
OPENAI_API_KEY=sk-your-real-key-here
CHAT_MODEL=gpt-4o-mini

Add .env to your .gitignore immediately so the secret key is never committed:

echo ".env" >> .gitignore

If a request later fails with an authentication error, Fix the 401 Unauthorized Error in OpenAI Python lists every cause.

Step 1: Stream tokens in a command-line chatbot

A normal call hands you the whole reply at once. Adding stream=True flips a switch: instead of one finished answer, the SDK returns an iterator — an object you loop over — that yields a small chunk each time the model produces more text. You print each chunk the moment it lands, and the answer types itself out.

Each chunk carries its new text at chunk.choices[0].delta.content. The word delta means "the bit that changed since the last chunk." Some chunks carry no text: the first one typically holds only the assistant role with an empty string, and the last one holds None alongside the finish reason, so you guard against both before printing.

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model=os.getenv("CHAT_MODEL", "gpt-4o-mini"),
    messages=[
        {"role": "system", "content": "You are a concise, friendly bike-shop assistant."},
        {"role": "user", "content": "Recommend a first road bike for a commuter."},
    ],
    temperature=0.4,
    stream=True,
)

for chunk in stream:
    token = chunk.choices[0].delta.content
    if token:                          # skip the empty opening and closing chunks
        print(token, end="", flush=True)
print()                                # final newline after the answer

Two details make the live effect work. end="" stops print from adding a line break after every token, so the words flow into one continuous answer. flush=True forces Python to push each token to the screen immediately instead of holding it in a buffer — without it, your terminal might show nothing until the whole reply is done, defeating the point. Run the script and you will watch the recommendation appear a few words at a time.
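To put numbers on the difference between "first words" and "finished answer", the sketch below times both. It is only an illustration: fake_token_stream is a hypothetical stand-in for a model stream with an artificial delay, and measure_stream is a helper name invented here, not part of the OpenAI SDK.

```python
import time
from typing import Iterable, Iterator, Optional


def fake_token_stream(tokens: list[str], delay: float = 0.02) -> Iterator[str]:
    """Hypothetical stand-in for a model stream: one token every `delay` seconds."""
    for token in tokens:
        time.sleep(delay)
        yield token


def measure_stream(tokens: Iterable[str]) -> tuple[str, float, float]:
    """Return (full_text, seconds_to_first_token, total_seconds)."""
    start = time.perf_counter()
    first_token_at: Optional[float] = None
    parts: list[str] = []
    for token in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(token)
    total = time.perf_counter() - start
    if first_token_at is None:      # the stream was empty
        first_token_at = total
    return "".join(parts), first_token_at, total


if __name__ == "__main__":
    words = ["A ", "steel ", "hybrid ", "is ", "a ", "solid ", "first ", "bike."]
    text, first, total = measure_stream(fake_token_stream(words))
    print(f"first token after {first:.3f}s, full reply after {total:.3f}s")
```

To time a real call, pass in a generator that yields chunk.choices[0].delta.content for each non-empty chunk instead of the fake stream.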

Step 2: Collect the full reply while streaming

Printing live is great for a person watching, but your program usually needs the finished text too — to store it in conversation memory, log it, or send it on. The fix is simple: build up a string as the tokens fly past. You print and save in the same loop.

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI()
MODEL = os.getenv("CHAT_MODEL", "gpt-4o-mini")

# Conversation memory: the system message stays pinned at the front.
messages = [
    {"role": "system", "content": "You are a concise, friendly bike-shop assistant."},
]


def stream_reply(user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    stream = client.chat.completions.create(
        model=MODEL, messages=messages, temperature=0.4, stream=True,
    )
    parts: list[str] = []
    for chunk in stream:
        token = chunk.choices[0].delta.content
        if token:
            print(token, end="", flush=True)
            parts.append(token)        # keep every piece
    print()
    full = "".join(parts)              # the complete reply
    messages.append({"role": "assistant", "content": full})  # remember it
    return full


if __name__ == "__main__":
    print("Streaming bot ready. Type 'quit' to exit.")
    while True:
        msg = input("\nYou: ").strip()
        if msg.lower() in {"quit", "exit"}:
            break
        print("Bot: ", end="")
        stream_reply(msg)

Collecting into a list of parts and joining once at the end is faster than gluing strings together inside the loop, and it gives you the whole answer to append to messages. Because you store the assistant's reply, the bot keeps context across turns just like a blocking one. To go deeper on storing history beyond a single session, see Add Memory to a Python Chatbot.
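The print-and-save loop can also be pulled out into a small helper that does not depend on the SDK, which makes it easy to test without an API key. This is a sketch under that assumption; relay_and_collect and its emit callback are hypothetical names introduced here.

```python
from typing import Callable, Iterable, Optional


def relay_and_collect(tokens: Iterable[Optional[str]], emit: Callable[[str], None]) -> str:
    """Pass each non-empty token to `emit` and return the joined reply."""
    parts: list[str] = []
    for token in tokens:
        if token:                  # same guard as in the loop above: skip None and ""
            emit(token)
            parts.append(token)
    return "".join(parts)          # join once at the end


if __name__ == "__main__":
    shown: list[str] = []
    reply = relay_and_collect(["Try ", None, "a ", "", "hybrid."], shown.append)
    print(reply)
```

In the CLI bot, emit could be a function that calls print(token, end="", flush=True), while a test can pass a list's append method and inspect what was shown.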

Step 3: Serve a streaming endpoint with FastAPI

A terminal is fine for testing, but real users sit in a browser. To stream to them, you keep one HTTP connection open and push tokens down it as they arrive. The standard, dependency-free way to do this is Server-Sent Events (SSE), a simple text format where each message is one or more lines beginning with data: followed by a blank line. A payload that contains a newline has to be split across several data: lines, which the browser joins back together. Browsers read SSE natively with the built-in EventSource API.
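The wire format described above can be sketched as a pair of helpers. Both format_sse and parse_sse are hypothetical names written for this illustration; the parser is deliberately simplified (it only understands data: lines and comments, and assumes "\n" line endings), so treat it as a way to see the framing rather than a full implementation of the standard.

```python
def format_sse(data: str) -> str:
    """Frame `data` as one SSE message, using one data: line per line of text."""
    lines = data.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"


def parse_sse(raw: str) -> list[str]:
    """Simplified parser: return the data payload of each message in `raw`."""
    messages: list[str] = []
    data_lines: list[str] = []
    for line in raw.split("\n"):
        if line == "":                         # a blank line ends the message
            if data_lines:
                messages.append("\n".join(data_lines))
                data_lines = []
        elif line.startswith(":"):             # comment line, e.g. padding
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):          # exactly one leading space is dropped
                value = value[1:]
            data_lines.append(value)
    return messages


if __name__ == "__main__":
    wire = format_sse(" bike") + ": ok\n\n" + format_sse("line one\nline two")
    print(repr(wire))
    print(parse_sse(wire))
```

Note that a token such as " bike" keeps its leading space, because only the single space after data: is stripped on the receiving side.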

In FastAPI you return a StreamingResponse fed by a generator — a function that yields values one at a time instead of returning all at once. Each token the model produces becomes one SSE message.

import os
from dotenv import load_dotenv
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

load_dotenv()
client = OpenAI()
MODEL = os.getenv("CHAT_MODEL", "gpt-4o-mini")
app = FastAPI()


def token_stream(question: str):
    stream = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "You are a concise bike-shop assistant."},
            {"role": "user", "content": question},
        ],
        temperature=0.4,
        stream=True,
    )
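    for chunk in stream:
        token = chunk.choices[0].delta.content
        if token:
            # SSE data cannot hold a raw newline, so send one data: line per
            # line of text; EventSource joins them back together with "\n".
            payload = "".join(f"data: {line}\n" for line in token.split("\n"))
            yield payload + "\n"         # the blank line ends one SSE message
    yield "data: [DONE]\n\n"             # tell the browser the answer is complete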


@app.get("/chat")
def chat(q: str):
    return StreamingResponse(
        token_stream(q),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )

Save this as server.py and start it with uvicorn server:app --reload. Then watch tokens arrive in your terminal with curl:

curl -N "http://127.0.0.1:8000/chat?q=Recommend%20a%20commuter%20bike"

The -N flag disables curl's own output buffering so you see each token land. The media_type="text/event-stream" argument sets the Content-Type header that marks the response as SSE, and X-Accel-Buffering: no asks reverse proxies like nginx not to hold the tokens back. A browser would consume the same endpoint like this:

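const question = encodeURIComponent("Recommend a commuter bike");
const source = new EventSource(`/chat?q=${question}`);
source.onmessage = (event) => {
  if (event.data === "[DONE]") { source.close(); return; }
  document.getElementById("reply").textContent += event.data;
};
// EventSource reconnects automatically after an error; close it so a failed
// request does not silently re-ask the question.
source.onerror = () => source.close();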

Each token appends to the page the instant it arrives, giving the same live-typing effect a polished chat app has. When you are ready to put this behind real users, pair it with Rate-Limit AI API Calls in a SaaS with Python so one user cannot exhaust your quota.

Parameter quick reference

These are the settings specific to streaming. Everything else (model, temperature, messages) works exactly as it does in a blocking call.

Parameter	Type	Default	Effect
`stream`	bool	`False`	When `True`, the call returns an iterator of token chunks instead of one finished reply.
`stream_options`	dict	`None`	Only valid with `stream=True`. Pass `{"include_usage": True}` to receive a token-count summary in the final chunk, whose `choices` list is empty.
`media_type`	str	none	Set to `"text/event-stream"` on the `StreamingResponse` so browsers treat the output as SSE.
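
As a sketch of how include_usage changes the loop: the final chunk carries usage but an empty choices list, so the loop has to check before indexing. The helper name collect_text_and_usage and the fake_chunk test double are assumptions made for this example; the fake objects only mimic the attributes the loop reads, so the sketch runs without calling the API.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def collect_text_and_usage(chunks: Iterable[Any]) -> tuple[str, Any]:
    """Join the streamed text and return it with the usage object (or None)."""
    parts: list[str] = []
    usage = None
    for chunk in chunks:
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage            # set only on the final, usage-only chunk
        if not chunk.choices:              # that chunk has an empty choices list
            continue
        token = chunk.choices[0].delta.content
        if token:
            parts.append(token)
    return "".join(parts), usage


def fake_chunk(text=None, usage=None):
    """Hypothetical test double shaped like a streamed chunk."""
    if usage is not None:
        choices = []
    else:
        choices = [SimpleNamespace(delta=SimpleNamespace(content=text))]
    return SimpleNamespace(choices=choices, usage=usage)


if __name__ == "__main__":
    totals = SimpleNamespace(prompt_tokens=12, completion_tokens=2, total_tokens=14)
    chunks = [fake_chunk(""), fake_chunk("Hello"), fake_chunk(" there"),
              fake_chunk(None), fake_chunk(usage=totals)]
    text, usage = collect_text_and_usage(chunks)
    print(text, usage.total_tokens)
```

With a real client, the same helper would take the object returned by client.chat.completions.create(..., stream=True, stream_options={"include_usage": True}).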

Troubleshooting

The reply prints all at once instead of typing out. Output is being buffered. Cause: a missing flush=True in a CLI, or a proxy buffering the HTTP response. Fix: add flush=True to print, send the X-Accel-Buffering: no header, and yield a short SSE comment line such as ": ok\n\n" as soon as the connection opens so proxies release the stream.
TypeError: can only concatenate str (not "NoneType") to str, or the word None printed in the reply. You used delta.content without checking it. Cause: some chunks carry no text, so the field is None. Fix: guard with if token: before printing or appending, as in every example here.
The browser shows nothing but curl -N works. The browser is buffering a tiny first response. Cause: some browsers wait for a few hundred bytes before firing onmessage. Fix: yield a short padding comment line such as ": ok\n\n" right after the connection opens.
response.choices[0].message.content raises AttributeError when streaming. You used the blocking access path on a stream. Cause: streamed chunks expose delta, not message. Fix: read chunk.choices[0].delta.content inside the loop instead.

When to use this vs. alternatives

Stream when a person is watching a long reply. Chat windows, support bots, and anything that writes more than a sentence feel dramatically more responsive when tokens appear live, even though the total time is identical.
Stay blocking for short or machine-read replies. If the answer is a single label, a JSON object, or feeds another program rather than a human, streaming adds parsing complexity for no benefit — just take the whole reply at once.
Skip streaming when you must validate before showing anything. If you need to check, reformat, or moderate the full answer before the user sees a word, a blocking call is simpler because you have the complete text in hand before deciding what to display.

When your streaming bot also needs to answer from your own documents, layer retrieval on top with Connect a Chatbot to Your Docs with RAG — you stream the final answer exactly the same way.

