This guide shows you how to fix the "maximum context length exceeded" error in Python in under fifteen minutes, with runnable code for every fix.
You sent a long document, a fat chat history, or a big batch of text to a model, and instead of an answer you got a red wall of text complaining about a context length. The fix is never mysterious once you understand one rule: a model can only hold a fixed number of tokens (small chunks of text, roughly three-quarters of a word each) at one time, and that budget covers both what you send and what you ask it to write back. Go over the budget and the request is refused before the model even starts.
This is one of the Understanding LLM APIs guides, written for creators, marketers, founders, and students who can run a Python file but have never had to manage token budgets by hand. By the end you will measure tokens precisely, cut your input down to fit, and pick the right model so the error stops coming back.
The exact error you are seeing
When the total goes over the limit, the openai SDK raises a BadRequestError. The message looks like this:
openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's
maximum context length is 16385 tokens. However, your messages resulted in
17421 tokens. Please reduce the length of the messages.", 'type':
'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
The two numbers are everything. 16385 is the model's window — the total token budget. 17421 is what you sent. You are 1,036 tokens over, so you need to free at least that much, plus enough room for the reply. The error code context_length_exceeded is the machine-readable name for exactly this problem.
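The same arithmetic can be pulled out of the message text in code. This is a minimal sketch, assuming the message keeps the wording shown above; the helper name parse_overage and its regular expressions are illustrative and not part of the openai SDK.

```python
import re
from typing import Optional

# Message text copied from the error above; the wording may differ between API versions.
ERROR_MESSAGE = (
    "This model's maximum context length is 16385 tokens. However, your "
    "messages resulted in 17421 tokens. Please reduce the length of the messages."
)


def parse_overage(message: str) -> Optional[tuple]:
    """Return (window, sent, overage), or None if the wording does not match."""
    window = re.search(r"maximum context length is (\d+) tokens", message)
    sent = re.search(r"resulted in (\d+) tokens", message)
    if window is None or sent is None:
        return None
    window_size, sent_size = int(window.group(1)), int(sent.group(1))
    return window_size, sent_size, sent_size - window_size


print(parse_overage(ERROR_MESSAGE))  # (16385, 17421, 1036)
```

Treat the parsed overage as a lower bound on what to cut: you still need room for the reply on top of it.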
How tokens and context windows work
A context window is the maximum number of tokens a model can read and write in a single call. Think of it as a fixed-size table: your prompt, the system instructions, the chat history, and the reply the model generates all have to sit on that table at once. Nothing spills over the edge.
The total breaks into two parts:
- Input tokens: your system message, user message, and any conversation history. The provider counts these before the model runs.
- Output tokens: the reply, capped by the max_tokens value you set. The provider reserves this much space in advance.
The rule the API enforces is simple:
input_tokens + max_tokens must be <= context_window
If max_tokens is large, it eats into the room left for input. That is why the same prompt can succeed with a short reply cap and fail with a long one. Every fix below is just a way to make one side of that inequality smaller.
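The inequality is easy to check by hand. The sketch below uses the 16385-token window from the error message and a hypothetical 15,000-token prompt; the function name fits_window is an assumption made for illustration.

```python
def fits_window(input_tokens: int, max_tokens: int, context_window: int) -> bool:
    """True when the input plus the reserved reply space fits the window."""
    return input_tokens + max_tokens <= context_window


CONTEXT_WINDOW = 16385  # window size from the error message above
INPUT_TOKENS = 15000    # hypothetical prompt size

# The same prompt passes or fails depending only on the reply cap.
for reply_cap in (500, 1000, 2000):
    verdict = "fits" if fits_window(INPUT_TOKENS, reply_cap, CONTEXT_WINDOW) else "too big"
    print(f"max_tokens={reply_cap}: {verdict}")
```

With these numbers the first two caps fit and the 2,000-token cap does not, even though the prompt never changed.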
Quick reference: cause to fix
| What is too big | Why it happens | Fastest fix |
|---|---|---|
| Input alone over the window | One huge document or file pasted into the prompt | Chunk the document and process pieces |
| Growing chat history | Every turn is appended, so it never shrinks | Trim old turns or summarize the history |
| max_tokens set too high | Reply cap reserves more space than is free | Lower max_tokens |
| Input near the limit on a small model | Model window is only a few thousand tokens | Switch to a larger-context model |
Prerequisites
You only need the openai SDK, the tiktoken token counter, and python-dotenv for your key. Confirm you are on Python 3.10 or newer, then install:
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install "openai>=1.40" "tiktoken>=0.7" "python-dotenv>=1.0"
Store your key in a .env file and keep it out of version control:
OPENAI_API_KEY=sk-your-real-key-goes-here
echo ".env" >> .gitignore
That .gitignore line is non-negotiable: a key pushed to a public repository can be found and used at your expense within minutes. If your environment is not set up yet, the Understanding LLM APIs section covers the full installation first.
Fix 1: Count tokens with tiktoken before you send
The first move is to stop guessing. tiktoken is OpenAI's own tokenizer library, so it splits text into the same tokens the model sees. Measure your messages before sending and you will know whether you are over the limit and by how much.
import tiktoken


def count_tokens(messages: list[dict], model: str = "gpt-4o-mini") -> int:
    """Count the tokens a list of chat messages will use."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # approximate fallback for model names tiktoken does not know
        encoding = tiktoken.get_encoding("cl100k_base")
    tokens = 0
    for message in messages:
        tokens += 4  # every message carries a few tokens of overhead
        tokens += len(encoding.encode(message["content"]))
    tokens += 2  # the reply is primed with a couple of tokens
    return tokens


messages = [
    {"role": "system", "content": "You summarize reports concisely."},
    {"role": "user", "content": "Summarize this: " + "lorem ipsum " * 500},
]

input_tokens = count_tokens(messages)
print(f"Input tokens: {input_tokens}")
The per-message overhead exists because the model wraps each message in a little structure of its own. The count will not be perfect to the last token, but it is close enough to keep you safely under any window. Compare the result against your model's limit and the planned reply size before you ever make a call.
Fix 2: Trim or chunk the input
If a single document is the problem, you have two choices. Trimming keeps only the part that fits. Chunking splits the document into pieces that each fit the window, processes them one at a time, then combines the results. Chunking is the right answer when you cannot afford to throw text away.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")


def chunk_text(text: str, max_tokens: int = 3000) -> list[str]:
    """Split text into chunks that each stay under max_tokens."""
    tokens = encoding.encode(text)
    chunks = []
    for start in range(0, len(tokens), max_tokens):
        piece = tokens[start:start + max_tokens]
        chunks.append(encoding.decode(piece))
    return chunks


long_document = "word " * 20000
pieces = chunk_text(long_document, max_tokens=3000)
print(f"Split into {len(pieces)} chunks")

for i, piece in enumerate(pieces, start=1):
    print(f"Chunk {i}: {len(encoding.encode(piece))} tokens")
    # send each chunk to the model separately, then combine the replies
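Encoding the whole text, slicing the token list, and decoding each slice measures every chunk in tokens rather than characters, so each piece holds at most max_tokens tokens as originally encoded. A slice boundary can fall inside a multi-byte character, so a chunk that is decoded and re-encoded may differ by a token or two. Leave headroom (here 3,000 tokens out of a larger window) for that drift, the system message, and the reply.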
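Trimming, the other choice mentioned above, is a single slice on the token list. The sketch below is one possible shape for it, not a fixed recipe: trim_tokens and its keep parameter are hypothetical names, and the token ids are stand-ins so the example runs without downloading an encoding.

```python
def trim_tokens(tokens: list, budget: int, keep: str = "start") -> list:
    """Keep at most `budget` tokens from the start or the end of a token list."""
    if keep not in ("start", "end"):
        raise ValueError("keep must be 'start' or 'end'")
    if budget <= 0:
        return []  # guard: tokens[-0:] would return the whole list
    if keep == "end":
        return tokens[-budget:]
    return tokens[:budget]


# Stand-in token ids. With tiktoken you would pass encoding.encode(text)
# and call encoding.decode() on the result.
fake_tokens = list(range(10))
print(trim_tokens(fake_tokens, 4))              # [0, 1, 2, 3]
print(trim_tokens(fake_tokens, 4, keep="end"))  # [6, 7, 8, 9]
```

Keeping the start suits documents whose key points come first; keeping the end suits logs and transcripts where the latest lines matter most.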
Fix 3: Summarize the conversation history
In a chatbot, the message list grows with every turn, so a long conversation eventually overflows the window on its own. The fix is to replace the old turns with a short summary the model writes for you, keeping the meaning while shedding most of the tokens.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


def summarize_history(history: list[dict], keep_last: int = 4) -> list[dict]:
    """Replace old turns with a summary, keeping the most recent ones."""
    if len(history) <= keep_last:
        return history
    old_turns = history[:-keep_last]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_turns)
    summary = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize this chat in 3 sentences."},
            {"role": "user", "content": transcript},
        ],
        max_tokens=150,
    )
    summary_text = summary.choices[0].message.content
    return [
        {"role": "system", "content": f"Earlier conversation: {summary_text}"},
        *history[-keep_last:],
    ]
This keeps the most recent turns verbatim, where detail matters most, and compresses everything older into three sentences. The conversation can run indefinitely without the token count creeping upward. If you are building a chatbot, the deeper version of this pattern lives in Add Memory to a Python Chatbot.
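If a summary call is more than you need, the cheaper option from the quick-reference table is to drop the oldest turns outright. The sketch below is one way to do that, under two assumptions: system messages belong at the front and must survive, and the caller supplies the counting function. The names trim_history and word_count are hypothetical; in real use you would pass count_tokens from Fix 1 instead of the word-count stand-in.

```python
from typing import Callable


def trim_history(
    history: list,
    budget: int,
    count: Callable[[list], int],
) -> list:
    """Drop the oldest non-system turns until count(history) fits the budget."""
    system = [m for m in history if m["role"] == "system"]
    turns = [m for m in history if m["role"] != "system"]
    while turns and count(system + turns) > budget:
        turns.pop(0)  # discard the oldest turn first
    return system + turns


def word_count(messages: list) -> int:
    """Stand-in counter: words instead of tokens, only so the sketch runs offline."""
    return sum(len(m["content"].split()) for m in messages)


history = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "one two three four"},
    {"role": "assistant", "content": "five six seven"},
    {"role": "user", "content": "eight nine"},
]
trimmed = trim_history(history, budget=8, count=word_count)
print([m["content"] for m in trimmed])
# ['You are helpful.', 'five six seven', 'eight nine']
```

Unlike summarizing, this loses the dropped turns completely, so it fits chats where older context rarely matters.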
Fix 4: Lower max_tokens to reserve less reply space
Remember the rule: input_tokens + max_tokens must fit the window. When max_tokens is set high "just in case," it reserves space your input could be using. If your reply genuinely needs only a few hundred tokens, cap it there and free the rest for input.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You write one-line summaries."},
        {"role": "user", "content": "Summarize: " + "data " * 1000},
    ],
    max_tokens=60,  # a one-line reply needs little space; free the rest for input
)
print(response.choices[0].message.content)
This only helps when the input itself fits. If your prompt alone already overflows the window, no max_tokens value will save you — go back to Fix 2 or Fix 3 to shrink the input first.
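To pick a cap that always fits, subtract the measured input from the window and take the smaller of that and the reply size you wanted. This is a rough sketch; reply_cap is a hypothetical helper, and the window value is the one from the error message earlier in this guide.

```python
def reply_cap(input_tokens: int, context_window: int, wanted: int) -> int:
    """Largest max_tokens value, up to `wanted`, that still fits the window."""
    room = context_window - input_tokens
    if room <= 0:
        # No cap can help here: the input must shrink first (Fix 2 or Fix 3).
        raise ValueError("input alone fills the window; shrink the input first")
    return min(wanted, room)


print(reply_cap(input_tokens=16000, context_window=16385, wanted=1000))  # 385
print(reply_cap(input_tokens=4000, context_window=16385, wanted=1000))   # 1000
```

Raising an error when no room is left makes the "shrink the input first" case impossible to miss.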
Fix 5: Choose a larger-context model
Some inputs are simply large and should not be cut. A long contract, a full transcript, or a research paper may need to be seen whole. The cleanest fix there is a model with a bigger window. A small model might hold around 16,000 tokens; a larger one such as gpt-4o holds about 128,000 — roughly eight times the room.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# A 128k-token window absorbs far more input in a single call.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You answer questions about documents."},
        {"role": "user", "content": "Here is a long report:\n" + "fact " * 30000},
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)
A larger window is the least-effort fix, but it is not free: bigger models often cost more per token, and even a 128,000-token window has a ceiling. For genuinely huge data, pair a large model with the chunking from Fix 2. To compare windows and prices across providers, see OpenAI vs Anthropic API for Beginners and Best Free AI APIs for Beginners.
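Choosing the model can also be automated: keep a small table of window sizes and pick the smallest window that holds your input plus the reply cap. The sketch below uses the two sizes quoted in this guide. The name "small-16k-model" is a placeholder, pick_model is a hypothetical helper, and the numbers are assumptions to confirm against the provider's current model documentation.

```python
from typing import Optional

# Window sizes as quoted in this guide; confirm current values before relying on them.
WINDOWS = {
    "small-16k-model": 16_385,  # placeholder name for a 16,385-token model
    "gpt-4o": 128_000,
}


def pick_model(needed_tokens: int, windows: dict) -> Optional[str]:
    """Return the model with the smallest window that fits, or None if none does."""
    candidates = [(size, name) for name, size in windows.items() if size >= needed_tokens]
    return min(candidates)[1] if candidates else None


print(pick_model(17_421 + 500, WINDOWS))  # gpt-4o
print(pick_model(5_000, WINDOWS))         # small-16k-model
print(pick_model(200_000, WINDOWS))       # None
```

A None result is the signal to fall back on chunking, since no listed window is large enough.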
Key parameters at a glance
| Parameter | Effect on the limit |
|---|---|
| model | Sets the window size. A larger-context model raises the total token budget. |
| max_tokens | Reserves space for the reply. Lower it to leave more room for input. |
| messages | Holds all input tokens. Trim or summarize this to shrink input. |
Troubleshooting
- The error returns even after lowering max_tokens. Your input alone is over the window, so reserving less reply space cannot help. Count the input with tiktoken (Fix 1); if it is already near the limit, you must chunk or summarize the input rather than touch max_tokens.
- tiktoken.encoding_for_model raises a KeyError for a new model. The library does not yet know that model's name. Fall back to tiktoken.get_encoding("cl100k_base"), as the count_tokens helper in Fix 1 does. The fallback count is approximate, because newer models such as gpt-4o use the o200k_base encoding, so leave extra margin or upgrade tiktoken.
- Your token count looks right but the request still fails by a few tokens. Token counting is an estimate that misses small per-message overhead. Leave a safety margin: aim to stay 5-10% under the window rather than right at the edge.
- A chunked job gives disconnected or repetitive answers. Each chunk was processed with no memory of the others. Summarize every chunk first, then make one final call that combines the summaries, so the model sees the whole picture at the end.
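The summarize-then-combine pattern from the last item can be sketched without committing to a specific API call. In this sketch the summarize argument stands in for whatever function sends text to the model and returns its reply; summarize_document and fake_summarize are hypothetical names, and the stub only truncates text so the example runs offline.

```python
from typing import Callable


def summarize_document(chunks: list, summarize: Callable[[str], str]) -> str:
    """Summarize each chunk, then summarize the joined summaries in one final pass."""
    partials = [summarize(chunk) for chunk in chunks]
    combined = "\n".join(f"Part {i}: {p}" for i, p in enumerate(partials, start=1))
    return summarize(combined)


def fake_summarize(text: str) -> str:
    """Stand-in for a model call: keeps the first five words."""
    return " ".join(text.split()[:5])


chunks = ["alpha beta gamma delta epsilon zeta", "one two three four five six"]
print(summarize_document(chunks, fake_summarize))  # Part 1: alpha beta gamma
```

The job costs one call per chunk plus one final call, and the combined summaries must themselves fit the window.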
When to use this vs. alternatives
- Trim or summarize when the task is a chat that keeps growing — it is cheap and keeps recent detail sharp, which matters most in conversation.
- Chunk the input when you must process a large document in full and cannot drop any of it, accepting the extra calls that come with it.
- Switch to a larger-context model when the input is large, indivisible, and worth the higher per-call cost — the least code, but not the cheapest.
Related guides
- Understanding LLM APIs — the main guide for this track, covering setup, keys, and parameters.
- Fix the 401 Unauthorized Error in OpenAI Python — when your key is missing or mistyped.
- Fix the 429 Rate-Limit Error in Python — when you send calls faster than your tier allows.
- Fix JSONDecodeError with AI API Responses in Python — when the model's reply will not parse as JSON.