This guide shows you how to diagnose and fix the 429 rate-limit error from AI APIs in Python in under fifteen minutes. The 429 status code (the number HTTP uses for "Too Many Requests") means the provider received your request but refused to process it because you sent too much, too fast. Your code is almost certainly fine; you just need to slow down and retry politely. By the end you will have a drop-in helper that handles this automatically.
This page sits under Understanding LLM APIs, so it assumes you already have a working API call. If you are still wiring one up, start with Best Free AI APIs for Beginners.
What the error actually looks like
When you call the OpenAI SDK and trip a limit, you get something like this:
```text
openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit
reached for gpt-4o in organization org-abc123 on requests per min (RPM):
Limit 3500, Used 3500, Requested 1. Please try again in 17ms.', 'type':
'requests', 'code': 'rate_limit_exceeded'}}
```
The single most useful habit is to print the whole message. It tells you which limit you hit — requests per min (RPM), tokens per min (TPM), or a billing quota — and that decides the fix.
```python
import openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

try:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Say hello"}],
    )
    print(response.choices[0].message.content)
except openai.RateLimitError as error:
    # Print the full body so you can see WHICH limit you hit
    print("Rate limited:", error.message)
    print("Response headers:", dict(error.response.headers))
```
Store your key in a .env file and load it with python-dotenv rather than pasting it in code. Always add .env to your .gitignore so your key never reaches GitHub.
Cause and fix quick-reference
| Message says | Cause | Fastest fix |
|---|---|---|
| requests per min (RPM) | Too many calls in 60 seconds | Add backoff and retry; batch work into fewer calls |
| tokens per min (TPM) | Prompts or max_tokens too large | Lower max_tokens; shorten input; spread calls out |
| quota / insufficient_quota | Out of credit or no billing set up | Add a payment method; raise your usage limit |
| Please try again in Ns | Short, temporary spike | Sleep for that duration, then retry once |
| 429 under heavy parallelism | Too many concurrent requests | Cap concurrency with a semaphore |
The crucial split: a rate limit (RPM or TPM) clears in seconds and retrying works. A quota error (the words quota or insufficient_quota) will never clear by retrying — you must add billing or raise your limit. Retrying a quota error just wastes time.
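To make that split concrete, here is a minimal sketch that sorts a 429 message into one of the three buckets. The function name `classify_429` and the substrings it looks for are assumptions based on the message wording shown in this guide; providers can change their wording, so treat it as a starting point rather than a reliable parser.

```python
def classify_429(message: str) -> str:
    """Return 'quota', 'tokens', 'requests', or 'unknown' for a 429 message."""
    text = message.lower()
    # Check quota first: it is the one case that retrying can never fix.
    if "quota" in text:
        return "quota"
    if "tokens per min" in text or "(tpm)" in text:
        return "tokens"
    if "requests per min" in text or "(rpm)" in text:
        return "requests"
    return "unknown"


# The sample message from the top of this guide.
sample = (
    "Rate limit reached for gpt-4o in organization org-abc123 on requests "
    "per min (RPM): Limit 3500, Used 3500, Requested 1. Please try again in 17ms."
)
print(classify_429(sample))  # requests
```

Only the "requests" and "tokens" results are worth retrying. A "quota" result should stop the program with a clear message instead.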
Step 1: Add exponential backoff with tenacity
Exponential backoff means: when a call fails, wait a bit and try again; if it fails again, wait twice as long; keep doubling up to a ceiling. This gives the per-minute window time to reset. The tenacity library does it in a few lines.
```shell
pip install tenacity openai python-dotenv
```

```python
from openai import OpenAI
import openai
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
)

client = OpenAI()


@retry(
    retry=retry_if_exception_type(openai.RateLimitError),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    stop=stop_after_attempt(6),
)
def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


print(ask("Write one sentence about backoff."))
```
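wait_exponential(multiplier=1, min=2, max=60) doubles the wait on each retry and clamps it between 2 and 60 seconds. In recent tenacity releases the raw sequence starts at 1 second, so the waits come out as roughly 2, 2, 4, 8, 16 seconds rather than doubling from the very first retry. retry_if_exception_type(openai.RateLimitError) makes sure you only retry rate-limit failures, not real bugs like a typo in the model name. stop_after_attempt(6) prevents an infinite loop if the limit never clears; once the sixth attempt fails, tenacity raises a RetryError that wraps the last exception unless you pass reraise=True to @retry.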
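One common refinement, not used in the code above, is to add random jitter so that several workers that were limited at the same moment do not all retry at the same moment too. The sketch below uses only the standard library; `jittered_delay` and its parameters are hypothetical names chosen for illustration. tenacity offers a comparable built-in strategy, `wait_random_exponential`, if you prefer to stay with the decorator.

```python
import random
from typing import Optional


def jittered_delay(
    attempt: int,
    base: float = 2.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a random wait between 0 and the capped exponential ceiling."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * 2 ** attempt)  # 2, 4, 8, ... capped at 60
    return source.uniform(0, ceiling)


rng = random.Random(0)  # seeded only so the demo is repeatable
for attempt in range(5):
    ceiling = min(60.0, 2.0 * 2 ** attempt)
    wait = jittered_delay(attempt, rng=rng)
    print(f"attempt {attempt}: ceiling {ceiling:.0f}s, jittered wait {wait:.2f}s")
```

The ceiling still doubles on every attempt, so the average wait grows the same way. Only the exact moment of each retry is randomized.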
Step 2: Do the same thing manually (no extra library)
If you would rather not add a dependency, the same logic is a short loop. This is worth understanding even if you use tenacity, because it shows exactly what backoff does.
```python
import time

import openai
from openai import OpenAI

client = OpenAI()


def ask_with_backoff(prompt: str, max_retries: int = 6) -> str:
    delay = 2.0  # seconds, doubles each failure
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except openai.RateLimitError as error:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            print(f"429 hit. Waiting {delay:.0f}s (attempt {attempt + 1}).")
            time.sleep(delay)
            delay = min(delay * 2, 60)  # cap the wait at 60s
    raise RuntimeError("Unreachable")
```
The pattern is always the same: catch RateLimitError, sleep, double the delay, cap it, and re-raise on the last attempt so genuine outages still surface.
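If you want to check that pattern without spending API calls, you can run it against a stub. The sketch below is an offline illustration under stated assumptions: `FakeRateLimitError` is a hypothetical stand-in for openai.RateLimitError, and `call_with_backoff` accepts a replaceable `sleep` function so the waits can be recorded instead of actually slept.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class FakeRateLimitError(Exception):
    """Hypothetical stand-in for openai.RateLimitError so this runs offline."""


def call_with_backoff(
    func: Callable[[], T],
    max_retries: int = 6,
    first_delay: float = 2.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    delay = first_delay
    for attempt in range(max_retries):
        try:
            return func()
        except FakeRateLimitError:
            if attempt == max_retries - 1:
                raise  # last attempt: let the error surface
            sleep(delay)
            delay = min(delay * 2, cap)  # double, but never beyond the cap
    raise RuntimeError("Unreachable")


# A stub that fails twice, then succeeds.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] <= 2:
        raise FakeRateLimitError("429")
    return "ok"


waits = []
result = call_with_backoff(flaky, sleep=waits.append)  # record waits, don't sleep
print(result, waits)  # ok [2.0, 4.0]
```

Passing `sleep` as a parameter is a design choice for testability. In real code you would leave the default so the helper actually pauses.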
Step 3: Respect the Retry-After header
Many providers tell you exactly how long to wait in a Retry-After response header (usually a number of seconds, although the HTTP specification also allows a date). Honouring it is more polite and more efficient than guessing, because you wait the real amount rather than a doubling estimate.
```python
import time

import openai
from openai import OpenAI

client = OpenAI()


def ask_respecting_retry_after(prompt: str, max_retries: int = 6) -> str:
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except openai.RateLimitError as error:
            if attempt == max_retries - 1:
                raise
            # Use the server's suggested wait if present, else fall back
            retry_after = error.response.headers.get("retry-after")
            wait = float(retry_after) if retry_after else 2 ** attempt
            print(f"429. Waiting {wait:.1f}s before retrying.")
            time.sleep(wait)
    raise RuntimeError("Unreachable")
```
When Retry-After is missing, the code falls back to 2 ** attempt (1, 2, 4, 8, 16 seconds across the five retries) so you always have a sane delay.
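The HTTP specification allows Retry-After to carry either a number of seconds or a date, and a plain float() call fails on the date form. Here is a hedged sketch of a more tolerant parser using only the standard library. The name `parse_retry_after` and the 60-second cap are assumptions for illustration, and the API you call may only ever send seconds.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(
    value: Optional[str],
    fallback: float,
    now: Optional[datetime] = None,
    cap: float = 60.0,
) -> float:
    """Convert a Retry-After header value into a wait in seconds."""
    if not value:
        return fallback
    try:
        return min(cap, max(0.0, float(value)))  # seconds form, e.g. "3"
    except ValueError:
        pass
    try:
        target = parsedate_to_datetime(value)  # date form
    except (TypeError, ValueError):
        return fallback  # unparseable header: use the backoff estimate
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return min(cap, max(0.0, (target - current).total_seconds()))


fixed_now = datetime(2015, 10, 21, 7, 27, 30, tzinfo=timezone.utc)
print(parse_retry_after("3", fallback=2.0))   # 3.0
print(parse_retry_after(None, fallback=2.0))  # 2.0
print(parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT", 2.0, now=fixed_now))  # 30.0
```

In the retry loop you would pass 2 ** attempt as the fallback, so a missing or malformed header still produces the doubling delay.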
Step 4: Batch requests and lower concurrency
Backoff handles bursts, but the real cure for repeated 429s is sending less. Two levers matter.
Lower max_tokens. Token-per-minute limits count both your input and the model's output. A generous max_tokens reserves a big slice of your budget on every call. Cap it to what you actually need:
```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize in one line."}],
    max_tokens=60,  # don't reserve thousands of tokens you won't use
)
```
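A little arithmetic shows why this matters. The sketch below assumes each call counts its input tokens plus its full max_tokens against the per-minute budget, as described above; the exact accounting varies by provider, and the limit and token counts are made-up numbers chosen only to show the calculation.

```python
def max_calls_per_minute(tpm_limit: int, input_tokens: int, max_tokens: int) -> int:
    """Rough ceiling on calls per minute if each call reserves input + max_tokens."""
    per_call = input_tokens + max_tokens
    return tpm_limit // per_call


# Hypothetical numbers, not any provider's real limits.
TPM_LIMIT = 200_000

print(max_calls_per_minute(TPM_LIMIT, input_tokens=500, max_tokens=4000))  # 44
print(max_calls_per_minute(TPM_LIMIT, input_tokens=500, max_tokens=60))    # 357
```

Under these assumptions, trimming max_tokens from 4000 to 60 raises the ceiling from 44 calls a minute to 357 without touching the prompt.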
Cap concurrency. If you fan out requests with threads or async, you can blow past the per-minute limit in seconds. A semaphore (a counter that only lets N tasks run at once) keeps you under the ceiling.
```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(5)  # at most 5 calls in flight at once


async def ask_async(prompt: str) -> str:
    async with semaphore:  # blocks until a slot is free
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=80,
        )
        return response.choices[0].message.content


async def main() -> None:
    prompts = [f"Give me fact #{i}." for i in range(40)]
    results = await asyncio.gather(*(ask_async(p) for p in prompts))
    for line in results:
        print(line)


asyncio.run(main())
```
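To see the semaphore doing its job without calling a real API, you can swap the request for a short sleep and count how many tasks overlap. This is an offline sketch: `run_demo` and `fake_call` are hypothetical names, and the 0.01-second sleep merely stands in for network time.

```python
import asyncio


async def run_demo(limit: int = 5, tasks: int = 40) -> int:
    """Run `tasks` fake calls behind a semaphore and return the peak overlap."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_call(i: int) -> None:
        nonlocal in_flight, peak
        async with semaphore:  # waits here once `limit` calls are running
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the network round trip
            in_flight -= 1

    await asyncio.gather(*(fake_call(i) for i in range(tasks)))
    return peak


peak = asyncio.run(run_demo())
print(f"peak concurrent calls: {peak}")
```

Forty tasks are started at once, yet the peak never rises above the semaphore's limit.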
Wherever you can, also batch logically related work into one prompt — asking for ten summaries in a single call uses one request slot instead of ten.
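A rough sketch of that batching idea follows. The helper names, the prompt wording, and the numbered-list reply format are all assumptions for illustration. A model will not always follow the requested format, which is why the splitter checks the item count before trusting the result.

```python
import re
from typing import List


def build_batch_prompt(texts: List[str]) -> str:
    """Combine several inputs into one numbered prompt."""
    lines = ["Summarize each item in one line. Answer as a numbered list."]
    for number, text in enumerate(texts, start=1):
        lines.append(f"{number}. {text}")
    return "\n".join(lines)


def split_numbered_reply(reply: str, expected: int) -> List[str]:
    """Pull the items back out of a reply shaped like '1. ...' or '1) ...'."""
    items = re.findall(r"^[ \t]*\d+[.)][ \t]*(.+)$", reply, flags=re.MULTILINE)
    if len(items) != expected:
        raise ValueError(f"expected {expected} items, got {len(items)}")
    return [item.strip() for item in items]


prompt = build_batch_prompt(["First article text", "Second article text"])
print(prompt)

# A made-up reply, standing in for what the model might send back.
fake_reply = "1. Summary of the first.\n2. Summary of the second."
print(split_numbered_reply(fake_reply, expected=2))
```

Batching saves request slots but not tokens: the combined prompt and reply still count in full against a tokens-per-minute limit, so raise max_tokens to fit all the answers.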
Key parameters quick-reference
| Parameter | What it controls | Sensible starting value |
|---|---|---|
| wait_exponential(min, max) | Backoff floor and ceiling, in seconds | min=2, max=60 |
| stop_after_attempt(n) | How many tries before giving up | 6 |
| Semaphore(n) | Max requests running at the same time | 5 (raise slowly) |
Troubleshooting
- Retrying forever and never recovering. If the message contains insufficient_quota or quota, you are out of credit, not rate-limited. No amount of backoff helps. Add a payment method in your provider dashboard and raise your monthly usage limit.
- Still getting 429 with only a handful of calls. You are almost certainly hitting the tokens-per-minute limit, not requests-per-minute. Print the message to confirm it says (TPM), then lower max_tokens and shorten your prompt. Very long contexts are a frequent cause; see Fix the Context-Length-Exceeded Error in Python.
- Backoff catches the wrong errors. If your retry wrapper also swallows authentication or JSON failures, it will retry pointlessly. Catch only openai.RateLimitError. A 401 is a credential problem (see Fix the 401 Unauthorized Error in OpenAI Python), and a malformed body is covered in Fix JSONDecodeError with AI API Responses in Python.
- Limit returns the moment your script ends. Concurrency spikes are bursty. Confirm a single sequential call works, then reintroduce parallelism behind a Semaphore and increase the count one step at a time until you stay clean.
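The first and third items boil down to one decision: retry only when the failure is a genuine rate limit. A minimal sketch of that rule is below. `should_retry` is a hypothetical helper, and the codes it compares against are the ones quoted in this guide. With the OpenAI SDK you would presumably read the two values from the caught error (for example its status_code and code attributes), but treat that mapping as an assumption and check it against your SDK version.

```python
from typing import Optional


def should_retry(status_code: int, error_code: Optional[str] = None) -> bool:
    """Retry only 429s that are real rate limits, never quota or auth errors."""
    if status_code != 429:
        return False  # 401, 400 and friends will not fix themselves
    return error_code != "insufficient_quota"  # quota never clears by waiting


print(should_retry(429, "rate_limit_exceeded"))  # True
print(should_retry(429, "insufficient_quota"))   # False
print(should_retry(401))                         # False
```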
Worked example: a resilient, rate-aware client
This script combines everything above — Retry-After-aware backoff, capped concurrency, a modest max_tokens — into one reusable async helper you can paste into a project.
```python
import asyncio

import openai
from openai import AsyncOpenAI
from dotenv import load_dotenv

load_dotenv()  # remember: add .env to your .gitignore

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(5)  # cap concurrent calls


async def ask(prompt: str, max_retries: int = 6) -> str:
    """Call the API with backoff that honours Retry-After."""
    async with semaphore:
        for attempt in range(max_retries):
            try:
                response = await client.chat.completions.create(
                    model="gpt-4o-mini",
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=120,
                )
                return response.choices[0].message.content
            except openai.RateLimitError as error:
                # A quota error will never clear, so fail loudly instead of looping.
                if "insufficient_quota" in str(error):
                    raise RuntimeError("Out of quota: add billing.") from error
                if attempt == max_retries - 1:
                    raise
                retry_after = error.response.headers.get("retry-after")
                wait = float(retry_after) if retry_after else 2 ** attempt
                print(f"429 on '{prompt[:20]}...' waiting {wait:.1f}s")
                await asyncio.sleep(wait)
    raise RuntimeError("Exhausted retries")


async def main() -> None:
    prompts = [f"One fun fact about the number {n}." for n in range(20)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for prompt, answer in zip(prompts, answers):
        print(f"Q: {prompt}\nA: {answer}\n")


if __name__ == "__main__":
    asyncio.run(main())
```
Run it and you will see most calls succeed instantly, the occasional 429 get absorbed by a short wait, and quota errors stop the program cleanly with an actionable message.
When to use this vs. alternatives
- Use client-side backoff (this guide) when you are a single user or a small script and just need calls to stop crashing on bursts. It is the right tool for almost every beginner case.
- Use a server-side rate limiter when you are building an app that exposes AI features to your own users and need to throttle each of them fairly. That is a different problem — see Rate-Limit AI API Calls in a SaaS with Python.
- Raise your limit instead of retrying when you consistently need more throughput than backoff can buy you. Upgrade your usage tier in the provider dashboard rather than fighting the ceiling in code.
Related guides
- Understanding LLM APIs — the main guide for working with AI APIs in Python.
- Fix the 401 Unauthorized Error in OpenAI Python — when the problem is your key, not your rate.
- Fix JSONDecodeError with AI API Responses in Python — handle malformed or non-JSON responses.
- Fix the Context-Length-Exceeded Error in Python — shrink prompts that blow the token budget.
- Rate-Limit AI API Calls in a SaaS with Python — throttle your own users server-side.