Fix the 429 Rate-Limit Error in Python

This guide shows you how to diagnose and fix the 429 rate-limit error from AI APIs in Python in under fifteen minutes. The 429 status code (HTTP's "Too Many Requests") means the provider received your request but refused to process it because you sent too many requests or tokens in too short a time. Your code is almost certainly fine; you just need to slow down and retry politely. By the end you will have a drop-in helper that handles this automatically.

This page sits under Understanding LLM APIs, so it assumes you already have a working API call. If you are still wiring one up, start with Best Free AI APIs for Beginners.

What the error actually looks like

When you call the OpenAI SDK and trip a limit, you get something like this:

openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit
reached for gpt-4o in organization org-abc123 on requests per min (RPM):
Limit 3500, Used 3500, Requested 1. Please try again in 17ms.', 'type':
'requests', 'code': 'rate_limit_exceeded'}}

The single most useful habit is to print the whole message. It tells you which limit you hit — requests per min (RPM), tokens per min (TPM), or a billing quota — and that decides the fix.

import openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

try:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Say hello"}],
    )
    print(response.choices[0].message.content)
except openai.RateLimitError as error:
    # Print the full body so you can see WHICH limit you hit
    print("Rate limited:", error.message)
    print("Response headers:", dict(error.response.headers))

Store your key in a .env file and load it with python-dotenv (call load_dotenv() before creating the client) rather than pasting it in code. Always add .env to your .gitignore so your key never reaches GitHub.

Cause and fix quick-reference

| Message says | Cause | Fastest fix |
| --- | --- | --- |
| requests per min (RPM) | Too many calls in 60 seconds | Add backoff and retry; batch work into fewer calls |
| tokens per min (TPM) | Prompts or max_tokens too large | Lower max_tokens; shorten input; spread calls out |
| quota / insufficient_quota | Out of credit or no billing set up | Add a payment method; raise your usage limit |
| Please try again in Ns | Short, temporary spike | Sleep for that duration, then retry once |
| 429 under heavy parallelism | Too many concurrent requests | Cap concurrency with a semaphore |

The crucial split: a rate limit (RPM or TPM) clears in seconds and retrying works. A quota error (the words quota or insufficient_quota) will never clear by retrying — you must add billing or raise your limit. Retrying a quota error just wastes time.
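
The split above can be sketched as a small classifier that decides whether a retry is worth attempting. This is a minimal sketch, not part of any SDK: the function names classify_429 and should_retry are hypothetical, and the substrings it looks for are taken only from the messages quoted in this guide, so other providers may word things differently.

```python
def classify_429(message: str) -> str:
    """Return 'quota', 'tpm', 'rpm', or 'unknown' for a 429 error message.

    Assumption: the message uses the wording quoted in this guide.
    """
    text = message.lower()
    # Check quota first: retrying can never fix it.
    if "quota" in text:  # also matches "insufficient_quota"
        return "quota"
    if "tokens per min" in text or "(tpm)" in text:
        return "tpm"
    if "requests per min" in text or "(rpm)" in text:
        return "rpm"
    return "unknown"


def should_retry(message: str) -> bool:
    """Retry everything except quota errors (unknown cases get the benefit of the doubt)."""
    return classify_429(message) != "quota"


sample = (
    "Rate limit reached for gpt-4o in organization org-abc123 on requests "
    "per min (RPM): Limit 3500, Used 3500, Requested 1. Please try again in 17ms."
)
print(classify_429(sample), should_retry(sample))  # rpm True
```

In a real handler you would pass error.message from the caught RateLimitError into should_retry before sleeping.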

Step 1: Add exponential backoff with tenacity

Exponential backoff means: when a call fails, wait a bit and try again; if it fails again, wait twice as long; keep doubling up to a ceiling. This gives the per-minute window time to reset. The tenacity library does it in a few lines.

pip install tenacity openai python-dotenv

from openai import OpenAI
import openai
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
)

client = OpenAI()


@retry(
    retry=retry_if_exception_type(openai.RateLimitError),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    stop=stop_after_attempt(6),
)
def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


print(ask("Write one sentence about backoff."))

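With multiplier=1, wait_exponential(min=2, max=60) computes a doubling delay and clamps it between 2 and 60 seconds, so the waits run roughly 2, 2, 4, 8, 16 seconds and would cap at 60 with more attempts. retry_if_exception_type(openai.RateLimitError) makes sure you only retry rate-limit failures, not real bugs like a typo in the model name. stop_after_attempt(6) prevents an infinite loop if the limit never clears; after the sixth failure tenacity raises a RetryError wrapping the last exception unless you pass reraise=True to @retry.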
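
To see what the decorator settings above actually wait, you can work the schedule out by hand. The sketch below assumes tenacity computes multiplier * 2 ** (attempt - 1) and clamps the result between min and max; that matches recent releases, but it is worth checking against the version you have installed. The helper name exponential_waits is hypothetical.

```python
from typing import List


def exponential_waits(
    retries: int,
    multiplier: float = 1.0,
    minimum: float = 2.0,
    maximum: float = 60.0,
) -> List[float]:
    """Waits (in seconds) before each retry, under the assumed formula."""
    waits = []
    for attempt in range(1, retries + 1):
        raw = multiplier * 2 ** (attempt - 1)  # 1, 2, 4, 8, ...
        waits.append(max(minimum, min(raw, maximum)))  # clamp to [min, max]
    return waits


# Six attempts means at most five waits between them.
print(exponential_waits(5))  # [2.0, 2.0, 4.0, 8.0, 16.0]
```

Under these assumptions the six-attempt setup sleeps about 32 seconds in total before giving up, which may be less than one full per-minute window.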

Step 2: Do the same thing manually (no extra library)

If you would rather not add a dependency, the same logic is a short loop. This is worth understanding even if you use tenacity, because it shows exactly what backoff does.

import time
import openai
from openai import OpenAI

client = OpenAI()


def ask_with_backoff(prompt: str, max_retries: int = 6) -> str:
    delay = 2.0  # seconds, doubles each failure
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except openai.RateLimitError as error:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            print(f"429 hit. Waiting {delay:.0f}s (attempt {attempt + 1}).")
            time.sleep(delay)
            delay = min(delay * 2, 60)  # cap the wait at 60s
    raise RuntimeError("Unreachable")

The pattern is always the same: catch RateLimitError, sleep, double the delay, cap it, and re-raise on the last attempt so genuine outages still surface.
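
If you want to watch that pattern run without spending API calls, here is a minimal offline sketch. Everything in it is a stand-in: FakeRateLimitError plays the role of openai.RateLimitError, make_flaky fakes an endpoint that fails a set number of times, and the sleep function is injectable so the waits can be recorded instead of actually slept.

```python
import time
from typing import Callable, List


class FakeRateLimitError(Exception):
    """Stand-in for openai.RateLimitError so this sketch runs offline."""


def call_with_backoff(
    func: Callable[[], str],
    max_retries: int = 6,
    first_delay: float = 2.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    delay = first_delay
    for attempt in range(max_retries):
        try:
            return func()
        except FakeRateLimitError:
            if attempt == max_retries - 1:
                raise  # last attempt: let the error surface
            sleep(delay)
            delay = min(delay * 2, max_delay)  # double, then cap
    raise RuntimeError("max_retries must be at least 1")


def make_flaky(failures: int) -> Callable[[], str]:
    """Build a function that raises `failures` times, then succeeds."""
    state = {"calls": 0}

    def flaky() -> str:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise FakeRateLimitError("429")
        return "ok"

    return flaky


waits: List[float] = []
result = call_with_backoff(make_flaky(3), sleep=waits.append)
print(result, waits)  # ok [2.0, 4.0, 8.0]
```

Passing waits.append as the sleep function records each delay, which makes the doubling visible and keeps the example instant.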

Step 3: Respect the Retry-After header

Many providers tell you exactly how long to wait in a Retry-After response header (usually a number of seconds; HTTP also allows a date). Honouring it is more polite and more efficient than guessing, because you wait the real amount rather than a doubling estimate.

import time
import openai
from openai import OpenAI

client = OpenAI()


def ask_respecting_retry_after(prompt: str, max_retries: int = 6) -> str:
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except openai.RateLimitError as error:
            if attempt == max_retries - 1:
                raise
            # Use the server's suggested wait if present, else fall back
            retry_after = error.response.headers.get("retry-after")
            try:
                wait = float(retry_after)
            except (TypeError, ValueError):
                wait = 2 ** attempt  # header missing or not a plain number
            print(f"429. Server asked to wait {wait:.1f}s.")
            time.sleep(wait)
    raise RuntimeError("Unreachable")

When Retry-After is missing, the code falls back to 2 ** attempt (1, 2, 4, 8 seconds) so you always have a sane delay.
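
The HTTP specification lets Retry-After carry either a number of seconds or an HTTP date, so a defensive parser is worth having. The sketch below uses only the standard library; the name parse_retry_after and the optional now argument (there to make the function easy to test) are hypothetical choices, not part of the OpenAI SDK.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(
    value: Optional[str], attempt: int, now: Optional[datetime] = None
) -> float:
    """Seconds to wait, falling back to 2 ** attempt if the header is unusable."""
    fallback = float(2 ** attempt)
    if not value:
        return fallback
    try:
        return max(0.0, float(value))  # the common case: delay in seconds
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # the HTTP-date form
    except (TypeError, ValueError):
        return fallback  # neither a number nor a date
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # assume UTC if unspecified
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(parse_retry_after("3", attempt=0))   # 3.0
print(parse_retry_after(None, attempt=2))  # 4.0
```

You would call it as parse_retry_after(error.response.headers.get("retry-after"), attempt) inside the except block.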

Step 4: Batch requests and lower concurrency

Backoff handles bursts, but the real cure for repeated 429s is sending less. Two levers matter.

Lower max_tokens. Token-per-minute limits count both your input and the model's output. A generous max_tokens reserves a big slice of your budget on every call. Cap it to what you actually need:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize in one line."}],
    max_tokens=60,  # don't reserve thousands of tokens you won't use
)
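
A quick back-of-the-envelope calculation shows why this matters. The sketch below assumes the provider counts the prompt plus the full max_tokens against your tokens-per-minute budget on each call, as described above; real accounting varies by provider, and the 200,000 TPM limit and 500-token prompt are hypothetical numbers chosen only for illustration.

```python
def calls_per_minute(tpm_limit: int, prompt_tokens: int, max_tokens: int) -> int:
    """Rough upper bound on calls per minute under a tokens-per-minute limit."""
    per_call = prompt_tokens + max_tokens  # tokens each call is assumed to reserve
    return tpm_limit // per_call


TPM_LIMIT = 200_000  # hypothetical limit, check your own dashboard

print(calls_per_minute(TPM_LIMIT, prompt_tokens=500, max_tokens=4000))  # 44
print(calls_per_minute(TPM_LIMIT, prompt_tokens=500, max_tokens=60))    # 357
```

Under these assumptions, trimming max_tokens from 4000 to 60 lets roughly eight times as many calls fit in the same minute.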

Cap concurrency. If you fan out requests with threads or async, you can blow past the per-minute limit in seconds. A semaphore (a counter that only lets N tasks run at once) keeps you under the ceiling.

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(5)  # at most 5 calls in flight at once


async def ask_async(prompt: str) -> str:
    async with semaphore:  # blocks until a slot is free
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=80,
        )
        return response.choices[0].message.content


async def main() -> None:
    prompts = [f"Give me fact #{i}." for i in range(40)]
    results = await asyncio.gather(*(ask_async(p) for p in prompts))
    for line in results:
        print(line)


asyncio.run(main())

Wherever you can, also batch logically related work into one prompt — asking for ten summaries in a single call uses one request slot instead of ten.
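
One way to batch is to number the items in a single prompt and split the numbered reply afterwards. This is a sketch under a big assumption: that the model answers with exactly one numbered line per item, which is not guaranteed, so the splitter raises if the reply does not match. The helper names build_batch_prompt and split_numbered_reply are hypothetical, and the reply here is a hand-written stand-in rather than real model output.

```python
import re
from typing import Dict, List


def build_batch_prompt(texts: List[str]) -> str:
    """Combine several items into one numbered prompt."""
    lines = [
        "Summarize each item below in one line.",
        "Answer as a numbered list with exactly one line per item.",
        "",
    ]
    lines += [f"{i}. {text}" for i, text in enumerate(texts, start=1)]
    return "\n".join(lines)


def split_numbered_reply(reply: str, expected: int) -> List[str]:
    """Split a '1. ...' style reply back into one answer per item."""
    answers: Dict[int, str] = {}
    for line in reply.splitlines():
        match = re.match(r"\s*(\d+)[.)]\s+(.*)", line)
        if match:
            answers[int(match.group(1))] = match.group(2).strip()
    if sorted(answers) != list(range(1, expected + 1)):
        raise ValueError("Reply did not contain the expected numbered lines")
    return [answers[i] for i in range(1, expected + 1)]


texts = ["Backoff doubles the wait.", "Semaphores cap concurrency."]
prompt = build_batch_prompt(texts)  # send this as ONE request
fake_reply = "1. Wait longer after each failure.\n2. Limit calls in flight."
print(split_numbered_reply(fake_reply, expected=len(texts)))
```

Batching trades request slots for tokens: one call uses a single slot against the requests-per-minute limit, but the combined prompt and reply still count toward tokens per minute.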

Key parameters quick-reference

| Parameter | What it controls | Sensible starting value |
| --- | --- | --- |
| wait_exponential(min, max) | Backoff floor and ceiling, in seconds | min=2, max=60 |
| stop_after_attempt(n) | How many tries before giving up | 6 |
| Semaphore(n) | Max requests running at the same time | 5 (raise slowly) |

Troubleshooting

  1. Retrying forever and never recovering. If the message contains insufficient_quota or quota, you are out of credit, not rate-limited. No amount of backoff helps. Add a payment method in your provider dashboard and raise your monthly usage limit.
  2. Still getting 429 with only a handful of calls. You are almost certainly hitting the tokens-per-minute limit, not requests-per-minute. Print the message to confirm it says (TPM), then lower max_tokens and shorten your prompt. Very long contexts are a frequent cause — see Fix the Context-Length-Exceeded Error in Python.
  3. Backoff catches the wrong errors. If your retry wrapper also swallows authentication or JSON failures, it will retry pointlessly. Catch only openai.RateLimitError. A 401 is a credential problem — see Fix the 401 Unauthorized Error in OpenAI Python — and a malformed body is covered in Fix JSONDecodeError with AI API Responses in Python.
  4. Limit returns the moment your script ends. Concurrency spikes are bursty. Confirm a single sequential call works, then reintroduce parallelism behind a Semaphore and increase the count one step at a time until you stay clean.

Worked example: a resilient, rate-aware client

This script combines everything above — Retry-After-aware backoff, capped concurrency, a modest max_tokens — into one reusable async helper you can paste into a project.

import asyncio
import openai
from openai import AsyncOpenAI
from dotenv import load_dotenv

load_dotenv()  # remember: add .env to your .gitignore

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(5)  # cap concurrent calls


async def ask(prompt: str, max_retries: int = 6) -> str:
    """Call the API with backoff that honours Retry-After."""
    async with semaphore:
        for attempt in range(max_retries):
            try:
                response = await client.chat.completions.create(
                    model="gpt-4o-mini",
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=120,
                )
                return response.choices[0].message.content
            except openai.RateLimitError as error:
                # A quota error will never clear — fail loudly instead of looping.
                if "insufficient_quota" in str(error):
                    raise RuntimeError("Out of quota: add billing.") from error
                if attempt == max_retries - 1:
                    raise
                retry_after = error.response.headers.get("retry-after")
                try:
                    wait = float(retry_after)
                except (TypeError, ValueError):
                    wait = 2 ** attempt  # header missing or not a plain number
                print(f"429 on '{prompt[:20]}...'  waiting {wait:.1f}s")
                await asyncio.sleep(wait)
        raise RuntimeError("Exhausted retries")


async def main() -> None:
    prompts = [f"One fun fact about the number {n}." for n in range(20)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for prompt, answer in zip(prompts, answers):
        print(f"Q: {prompt}\nA: {answer}\n")


if __name__ == "__main__":
    asyncio.run(main())

Run it and you will see most calls succeed instantly, the occasional 429 get absorbed by a short wait, and quota errors stop the program cleanly with an actionable message.

When to use this vs. alternatives

  • Use client-side backoff (this guide) when you are a single user or a small script and just need calls to stop crashing on bursts. It is the right tool for almost every beginner case.
  • Use a server-side rate limiter when you are building an app that exposes AI features to your own users and need to throttle each of them fairly. That is a different problem — see Rate-Limit AI API Calls in a SaaS with Python.
  • Raise your limit instead of retrying when you consistently need more throughput than backoff can buy you. Upgrade your usage tier in the provider dashboard rather than fighting the ceiling in code.
