Fix the Context-Length-Exceeded Error in Python

This guide shows you how to fix the "maximum context length exceeded" error in Python in under fifteen minutes, with runnable code for every fix.

You sent a long document, a fat chat history, or a big batch of text to a model, and instead of an answer you got a red wall of text complaining about a context length. The fix is never mysterious once you understand one rule: a model can only hold a fixed number of tokens (small chunks of text, roughly three-quarters of a word each) at one time, and that budget covers both what you send and what you ask it to write back. Go over the budget and the request is refused before the model even starts.

This is one of the Understanding LLM APIs guides, written for creators, marketers, founders, and students who can run a Python file but have never had to manage token budgets by hand. By the end you will measure tokens precisely, cut your input down to fit, and pick the right model so the error stops coming back.

The exact error you are seeing

When the total goes over the limit, the openai SDK raises a BadRequestError. The message looks like this:

openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's
maximum context length is 16385 tokens. However, your messages resulted in
17421 tokens. Please reduce the length of the messages.", 'type':
'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}

The two numbers are everything. 16385 is the model's window — the total token budget. 17421 is what you sent. You are 1,036 tokens over, so you need to free at least that much, plus enough room for the reply. The error code context_length_exceeded is the machine-readable name for exactly this problem.
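
If you want your script to read those two numbers instead of you, a small parser is enough. The sketch below assumes the message wording shown above, which a provider may change at any time; parse_overflow is a hypothetical helper name, not part of the openai SDK.

```python
import re
from typing import Optional

# Assumes the exact wording of the error message quoted above.
_OVERFLOW = re.compile(
    r"maximum context length is (\d+) tokens.*?resulted in (\d+) tokens",
    re.DOTALL,
)


def parse_overflow(message: str) -> Optional[tuple]:
    """Return (window, sent, overage), or None if the wording does not match."""
    match = _OVERFLOW.search(message)
    if match is None:
        return None
    window, sent = int(match.group(1)), int(match.group(2))
    return window, sent, sent - window


error_text = (
    "This model's maximum context length is 16385 tokens. However, your "
    "messages resulted in 17421 tokens. Please reduce the length of the messages."
)
print(parse_overflow(error_text))  # (16385, 17421, 1036)
```

In a real script you would pass str(error) from the caught BadRequestError. Treat a None result as "wording changed" rather than "no overflow".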

How tokens and context windows work

A context window is the maximum number of tokens a model can read and write in a single call. Think of it as a fixed-size table: your prompt, the system instructions, the chat history, and the reply the model generates all have to sit on that table at once. Nothing spills over the edge.

The total breaks into two parts:

  • Input tokens — your system message, user message, and any conversation history. The provider counts these before the model runs.
  • Output tokens — the reply, capped by the max_tokens value you set. The provider reserves this much space in advance.

The rule the API enforces is simple:

input_tokens + max_tokens  must be  <=  context_window

If max_tokens is large, it eats into the room left for input. That is why the same prompt can succeed with a short reply cap and fail with a long one. Every fix below is just a way to make one side of that inequality smaller.
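
The inequality is plain arithmetic, so it can be checked before any request goes out. This is a minimal sketch; fits and room_for_input are illustrative names, and the 16385 window is the figure from the error message above.

```python
def fits(input_tokens: int, max_tokens: int, context_window: int) -> bool:
    """True when the input plus the reserved reply space fits the window."""
    return input_tokens + max_tokens <= context_window


def room_for_input(max_tokens: int, context_window: int) -> int:
    """Tokens left for input once the reply cap has been reserved."""
    return context_window - max_tokens


WINDOW = 16385  # window size taken from the error message above

print(fits(15000, 500, WINDOW))      # True:  15500 <= 16385
print(fits(15000, 2000, WINDOW))     # False: 17000 >  16385
print(room_for_input(2000, WINDOW))  # 14385
```

The same 15,000-token prompt passes with a 500-token reply cap and fails with a 2,000-token one, which is the behavior described above.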

Quick reference: cause to fix

What is too big | Why it happens | Fastest fix
Input alone over the window | One huge document or file pasted into the prompt | Chunk the document and process pieces
Growing chat history | Every turn is appended, so it never shrinks | Trim old turns or summarize the history
max_tokens set too high | Reply cap reserves more space than is free | Lower max_tokens
Input near the limit on a small model | Model window is only a few thousand tokens | Switch to a larger-context model

Prerequisites

You only need the openai SDK, the tiktoken token counter, and python-dotenv for your key. Confirm you are on Python 3.10 or newer, then install:

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install "openai>=1.40" "tiktoken>=0.7" "python-dotenv>=1.0"

Store your key in a file named .env in your project folder:

OPENAI_API_KEY=sk-your-real-key-goes-here

Then keep that file out of version control by running this in your terminal:

echo ".env" >> .gitignore

That .gitignore line is non-negotiable: a key pushed to a public repository can be found and used at your expense within minutes. If your environment is not set up yet, the Understanding LLM APIs section covers the full installation first.

Fix 1: Count tokens with tiktoken before you send

The first move is to stop guessing. tiktoken is OpenAI's open-source tokenizer library, so it splits text into the same tokens the model sees. Measure your messages before sending and you will know whether you are over the limit and by how much.

import tiktoken


def count_tokens(messages: list[dict], model: str = "gpt-4o-mini") -> int:
    """Count the tokens a list of chat messages will use."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")  # approximate fallback for unknown model names
    tokens = 0
    for message in messages:
        tokens += 4  # every message carries a few tokens of overhead
        tokens += len(encoding.encode(message["content"]))
    tokens += 2  # the reply is primed with a couple of tokens
    return tokens


messages = [
    {"role": "system", "content": "You summarize reports concisely."},
    {"role": "user", "content": "Summarize this: " + "lorem ipsum " * 500},
]

input_tokens = count_tokens(messages)
print(f"Input tokens: {input_tokens}")

The per-message overhead exists because the model wraps each message in a little structure of its own. The overhead values are approximations, so the count will not be perfect to the last token, but it is close enough once you leave a small safety margin. Compare the result against your model's limit and the planned reply size before you ever make a call.

Fix 2: Trim or chunk the input

If a single document is the problem, you have two choices. Trimming keeps only the part that fits. Chunking splits the document into pieces that each fit the window, processes them one at a time, then combines the results. Chunking is the right answer when you cannot afford to throw text away.

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")


def chunk_text(text: str, max_tokens: int = 3000) -> list[str]:
    """Split text into chunks that each stay under max_tokens."""
    tokens = encoding.encode(text)
    chunks = []
    for start in range(0, len(tokens), max_tokens):
        piece = tokens[start:start + max_tokens]
        chunks.append(encoding.decode(piece))
    return chunks


long_document = "word " * 20000
pieces = chunk_text(long_document, max_tokens=3000)
print(f"Split into {len(pieces)} chunks")

for i, piece in enumerate(pieces, start=1):
    print(f"Chunk {i}: {len(encoding.encode(piece))} tokens")
    # send each chunk to the model separately, then combine the replies

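Encoding the whole text, slicing the token list, and decoding each slice measures every chunk in tokens rather than characters, so each chunk stays at or under the limit for the encoding you used. Use the encoding that matches your model (tiktoken.encoding_for_model picks it for you); cl100k_base counts will be slightly off for models that use a different encoding, such as gpt-4o. A slice can also cut through a word at a chunk edge, so treat the boundaries as rough. Leave headroom (here 3,000 tokens out of a larger window) for the system message and the reply.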
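
Trimming, the other option mentioned at the start of this fix, follows the same encode, slice, decode pattern but keeps a single piece. The sketch below is an illustration under stated assumptions: trim_to_budget is a hypothetical helper, and the whitespace tokenizer is a toy stand-in so the example runs without any downloads. In practice you would pass encoding.encode and encoding.decode from tiktoken instead.

```python
from typing import Callable


def trim_to_budget(
    text: str,
    max_tokens: int,
    encode: Callable[[str], list],
    decode: Callable[[list], str],
    keep: str = "start",
) -> str:
    """Keep only the first (or last) max_tokens tokens of text."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text  # already fits, nothing to cut
    kept = tokens[:max_tokens] if keep == "start" else tokens[-max_tokens:]
    return decode(kept)


# Toy tokenizer (assumption for illustration): one token per whitespace-separated word.
def toy_encode(text: str) -> list:
    return text.split()


def toy_decode(tokens: list) -> str:
    return " ".join(tokens)


report = "one two three four five six seven eight"
print(trim_to_budget(report, 3, toy_encode, toy_decode))              # one two three
print(trim_to_budget(report, 3, toy_encode, toy_decode, keep="end"))  # six seven eight
```

Keeping the start suits documents that lead with their summary; keeping the end suits logs and transcripts where the latest text matters most.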

Fix 3: Summarize the conversation history

In a chatbot, the message list grows with every turn, so a long conversation eventually overflows the window on its own. The fix is to replace the old turns with a short summary the model writes for you, keeping the meaning while shedding most of the tokens.

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


def summarize_history(history: list[dict], keep_last: int = 4) -> list[dict]:
    """Replace old turns with a summary, keeping the most recent ones."""
    if len(history) <= keep_last:
        return history

    old_turns = history[:-keep_last]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_turns)

    summary = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize this chat in 3 sentences."},
            {"role": "user", "content": transcript},
        ],
        max_tokens=150,
    )
    summary_text = summary.choices[0].message.content

    return [
        {"role": "system", "content": f"Earlier conversation: {summary_text}"},
        *history[-keep_last:],
    ]

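This keeps the most recent turns verbatim, where detail matters most, and compresses everything older into three sentences. Call it whenever the history grows past a threshold you choose and the token count stays bounded, so the conversation can keep going. Two caveats: the function folds every older message into the summary, including an original system prompt, so add that prompt back if you rely on it; and the summarizing call is itself subject to the window, so run it before the old turns grow too large to send. If you are building a chatbot, the deeper version of this pattern lives in Add Memory to a Python Chatbot.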
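
Summarizing costs an extra API call. Where losing older turns is acceptable, the cheaper option from the quick reference table is to drop them until the history fits a budget. The sketch below is one possible way to do that, not a definitive implementation: trim_history is a hypothetical helper, and rough_count is a crude word-based stand-in for the count_tokens function from Fix 1.

```python
from typing import Callable


def trim_history(
    history: list,
    budget: int,
    count: Callable[[list], int],
) -> list:
    """Drop the oldest non-system turns until count(history) <= budget."""
    # System messages are kept and moved to the front (a simplifying assumption).
    system = [m for m in history if m["role"] == "system"]
    turns = [m for m in history if m["role"] != "system"]
    while turns and count(system + turns) > budget:
        turns.pop(0)  # the oldest turn goes first
    return system + turns


def rough_count(messages: list) -> int:
    """Stand-in counter: words per message plus 4 tokens of assumed overhead."""
    return sum(len(m["content"].split()) + 4 for m in messages)


history = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "first question here"},
    {"role": "assistant", "content": "first answer here"},
    {"role": "user", "content": "second question here"},
]

trimmed = trim_history(history, budget=21, count=rough_count)
print(len(trimmed))  # 3: the oldest user turn was dropped
```

If even the system messages alone exceed the budget, the function returns them anyway, so check the result's count before sending.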

Fix 4: Lower max_tokens to reserve less reply space

Remember the rule: input_tokens + max_tokens must fit the window. When max_tokens is set high "just in case," it reserves space your input could be using. If your reply genuinely needs only a few hundred tokens, cap it there and free the rest for input.

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You write one-line summaries."},
        {"role": "user", "content": "Summarize: " + "data " * 1000},
    ],
    max_tokens=60,   # a one-line reply needs little space; free the rest for input
)

print(response.choices[0].message.content)

This only helps when the input itself fits. If your prompt alone already overflows the window, no max_tokens value will save you — go back to Fix 2 or Fix 3 to shrink the input first.
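
That decision can be written down as a few comparisons. The sketch below is illustrative only; diagnose is a hypothetical name, and the numbers in the calls reuse the window and token count from the error message earlier in this guide.

```python
def diagnose(input_tokens: int, max_tokens: int, context_window: int) -> str:
    """Say which side of the budget has to shrink."""
    if input_tokens >= context_window:
        # No reply cap can help: the input leaves no room at all.
        return "shrink input"
    if input_tokens + max_tokens > context_window:
        largest_cap = context_window - input_tokens
        return f"lower max_tokens to {largest_cap} or less"
    return "ok"


print(diagnose(17421, 60, 16385))    # shrink input
print(diagnose(15000, 2000, 16385))  # lower max_tokens to 1385 or less
print(diagnose(15000, 500, 16385))   # ok
```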

Fix 5: Choose a larger-context model

Some inputs are simply large and should not be cut. A long contract, a full transcript, or a research paper may need to be seen whole. The cleanest fix there is a model with a bigger window. An older model such as gpt-3.5-turbo holds 16,385 tokens, the limit shown in the error above; gpt-4o holds 128,000, roughly eight times the room. Check your provider's model page for the current figure before relying on it.

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# A 128k-token window absorbs far more input in a single call.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You answer questions about documents."},
        {"role": "user", "content": "Here is a long report:\n" + "fact " * 30000},
    ],
    max_tokens=500,
)

print(response.choices[0].message.content)

A larger window is the least-effort fix, but it is not free: bigger models often cost more per token, and even a 128,000-token window has a ceiling. For genuinely huge data, pair a large model with the chunking from Fix 2. To compare windows and prices across providers, see OpenAI vs Anthropic API for Beginners and Best Free AI APIs for Beginners.
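
Picking a model can be automated once you know each window. The sketch below is a rough illustration: pick_model is a hypothetical helper, the window sizes in WINDOWS are assumptions you should verify against your provider's documentation, and the 10% margin follows the safety-margin advice in the Troubleshooting section.

```python
from typing import Optional

# Assumed window sizes; confirm them in the provider's model documentation.
WINDOWS = {
    "gpt-3.5-turbo": 16_385,
    "gpt-4o-mini": 128_000,
    "gpt-4o": 128_000,
}


def pick_model(
    input_tokens: int,
    max_tokens: int,
    candidates: list,
    margin: float = 0.10,
) -> Optional[str]:
    """Return the first candidate whose window holds the request, else None."""
    needed = input_tokens + max_tokens
    for name in candidates:
        usable = int(WINDOWS[name] * (1 - margin))  # stay under the edge
        if needed <= usable:
            return name
    return None


# List candidates in the order you prefer them, for example cheapest first.
order = ["gpt-3.5-turbo", "gpt-4o"]
print(pick_model(10_000, 500, order))   # gpt-3.5-turbo
print(pick_model(17_421, 500, order))   # gpt-4o
print(pick_model(200_000, 500, order))  # None: chunk the input instead
```

A None result means no listed model is large enough, which is the cue to combine a large model with chunking.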

Key parameters at a glance

Parameter | Effect on the limit
model | Sets the window size. A larger-context model raises the total token budget.
max_tokens | Reserves space for the reply. Lower it to leave more room for input.
messages | Holds all input tokens. Trim or summarize this to shrink input.

Troubleshooting

  1. The error returns even after lowering max_tokens. Your input alone is over the window, so reserving less reply space cannot help. Count the input with tiktoken (Fix 1); if it is already near the limit, you must chunk or summarize the input rather than touch max_tokens.
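  2. tiktoken.encoding_for_model raises a KeyError for a new model. The library does not yet know that model's name; upgrading tiktoken usually fixes it. Until then, fall back to a named encoding, exactly as the count_tokens helper in Fix 1 does. tiktoken.get_encoding("cl100k_base") matches gpt-4 and gpt-3.5-turbo, while gpt-4o and gpt-4o-mini use "o200k_base". A mismatched encoding still gives a usable estimate, just not an exact count.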
  3. Your token count looks right but the request still fails by a few tokens. Token counting is an estimate that misses small per-message overhead. Leave a safety margin — aim to stay 5-10% under the window rather than right at the edge.
  4. A chunked job gives disconnected or repetitive answers. Each chunk was processed with no memory of the others. Summarize every chunk first, then make one final call that combines the summaries, so the model sees the whole picture at the end.
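
The summarize-then-combine approach from item 4 can be sketched as a two-pass function. This is a minimal illustration, not a definitive implementation: summarize_in_two_passes is a hypothetical helper, and stub_summarize stands in for a real model call so the example runs offline. In practice you would pass a function that wraps client.chat.completions.create.

```python
from typing import Callable


def summarize_in_two_passes(chunks: list, summarize: Callable[[str], str]) -> str:
    """Summarize each chunk, then summarize the joined summaries once."""
    partials = [summarize(chunk) for chunk in chunks]
    combined = "\n".join(
        f"Part {i}: {summary}" for i, summary in enumerate(partials, start=1)
    )
    return summarize(combined)  # the final call sees every part at once


calls: list = []


def stub_summarize(text: str) -> str:
    """Stand-in for a model call: records the input, returns its first 3 words."""
    calls.append(text)
    return " ".join(text.split()[:3])


chunks = ["alpha section about sales", "beta section about costs"]
final = summarize_in_two_passes(chunks, stub_summarize)
print(len(calls))  # 3: one call per chunk, plus the combining call
```

With many chunks the joined summaries can themselves outgrow the window, so count their tokens before the final call and repeat the pass if needed.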

When to use this vs. alternatives

  • Trim or summarize when the task is a chat that keeps growing — it is cheap and keeps recent detail sharp, which matters most in conversation.
  • Chunk the input when you must process a large document in full and cannot drop any of it, accepting the extra calls that come with it.
  • Switch to a larger-context model when the input is large, indivisible, and worth the higher per-call cost — the least code, but not the cheapest.
