
Understanding LLM APIs: A Step-by-Step Python Guide for Beginners

You have a task an AI could clearly help with: drafting replies, summarising a report, sorting messy notes. You have heard that large language models can do this. But every tutorial seems to assume you already know what an "endpoint" is, what a "token" costs, and why your first attempt returns a wall of red error text instead of an answer. That gap is what this guide closes.

A large language model API (application programming interface: in practice, a web address your program talks to) lets you borrow a powerful AI model over the internet. You never train it, host it, or even download it. Your Python script sends some text; a model running on the provider's servers writes a reply; the reply comes back to your script as data you can read and reuse. By the end of this guide you will install the right tools, store your key without leaking it, send a real request, understand the settings you will tune most often, and fix the handful of errors that trip up almost everyone in their first week.

This is one section of Python AI Fundamentals for Non-Developers, written for creators, marketers, founders, and students who are comfortable copying a command but have never shipped production code.

Who this is for and what you will build

You need this guide if you can run a Python file but have never made one talk to an AI service. The task is simple to state and surprisingly easy to get slightly wrong: take a string of text, send it to a model, and get a useful reply back, reliably, without exposing your billing key or blowing your budget.

We will build that piece by piece. First the environment, so nothing conflicts. Then secure key handling, so your credentials stay yours. Then a real request and a careful read of the response. Then the knobs you can turn to change how the model behaves. Each step is a runnable Python file, not a fragment, so you can paste it and watch it work.

The flow below is the whole mental model. Everything in this guide is a piece of this picture.

[Figure: How a Python request travels through an LLM API. Your Python script → Tokenizer (text to tokens) → Model (predicts tokens) → JSON response (text + usage). The request carries the prompt plus parameters; the response travels back the same way.]
Every call is a round trip: your script sends text and settings, the service turns text into tokens, the model predicts a reply, and a JSON response carries the text and token usage back to you.
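
To make the picture concrete, here is a minimal offline sketch of the same round trip. Nothing in it touches the network: fake_service is a hypothetical stand-in for the provider, and its one-token-per-word count is a deliberate simplification, since real tokenizers split text into smaller sub-word pieces.

```python
import json


def fake_service(request_body: str) -> str:
    """Hypothetical stand-in for the provider's server (no network involved)."""
    request = json.loads(request_body)
    prompt = request["messages"][-1]["content"]
    reply = "Hello from a pretend model."
    # Crude assumption for illustration: one token per whitespace-separated word.
    prompt_tokens = len(prompt.split())
    completion_tokens = len(reply.split())
    response = {
        "choices": [{"message": {"role": "assistant", "content": reply}}],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
    return json.dumps(response)


# 1. Your script builds a request: a prompt plus parameters.
request_body = json.dumps({
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
})

# 2. The service receives the request and sends JSON back.
response_body = fake_service(request_body)

# 3. Your script parses the JSON and reads the text and the usage.
data = json.loads(response_body)
print(data["choices"][0]["message"]["content"])  # Hello from a pretend model.
print(data["usage"]["total_tokens"])             # 10
```

The shape of the dictionaries mirrors what the real calls later in this guide send and receive, so the same indexing works once a real service replaces the stand-in.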

Prerequisites: setting up a clean environment

A clean, isolated workspace stops one project's libraries from breaking another's. Confirm you are on Python 3.10 or newer, since older versions reached end-of-life and miss features the modern SDK relies on:

python --version

If that prints anything below 3.10, install a current version first — the Setting Up Python for AI section walks through it for each operating system.
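
If you prefer to check from inside Python, a small sketch like the one below does the same job. The helper name python_is_new_enough is made up for this example; only sys.version_info comes from the standard library.

```python
import sys

MINIMUM = (3, 10)  # the version floor this guide assumes


def python_is_new_enough(version=None, minimum=MINIMUM) -> bool:
    """Return True when the (major, minor) version meets the minimum."""
    if version is None:
        version = sys.version_info
    # Compare only major and minor, e.g. (3, 9) < (3, 10).
    return tuple(version[:2]) >= tuple(minimum)


current = ".".join(str(part) for part in sys.version_info[:3])
if python_is_new_enough():
    print(f"Python {current} is fine for this guide.")
else:
    print(f"Python {current} is too old; install 3.10 or newer first.")
```

Comparing tuples avoids a classic trap: as strings, "3.9" sorts after "3.10", but as tuples (3, 9) is correctly smaller than (3, 10).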

Now create a virtual environment (a private folder that holds this project's libraries) and install what you need:

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install "openai>=1.40" "httpx>=0.27" "python-dotenv>=1.0"
pip freeze > requirements.txt

We install three things. The openai SDK is the friendly, official wrapper that turns a network call into one Python function. httpx is a modern HTTP library; the SDK uses it under the hood, and we will use it directly once to show what is really happening on the wire. python-dotenv loads secrets from a file so they never live in your code. Pinning versions with pip freeze means the same code runs the same way next month and on a teammate's machine.

Next, store your key. Generate one in your provider's dashboard, then create a file named .env in your project folder:

OPENAI_API_KEY=sk-your-real-key-goes-here

Treat that key like the password to your bank. Immediately add .env to your .gitignore file so it is never committed or shared:

echo ".env" >> .gitignore

That one line is the difference between a private credential and a public one. A key pushed to a repository can be found and used by strangers within minutes, and the charges land on you.
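
A small guard at the top of your script can catch a missing key before any request is sent. The sketch below uses only the standard library; require_api_key and mask are hypothetical helper names, and the demo value is obviously not a real key.

```python
import os


def require_api_key(env=None, name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, or raise a clear error if it is absent."""
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()  # strip stray spaces around the value
    if not key:
        raise RuntimeError(
            f"{name} is not set. Did load_dotenv() run, and is .env in this folder?"
        )
    return key


def mask(key: str) -> str:
    """Hide everything except the last four characters, so logs never show the key."""
    if len(key) <= 4:
        return "*" * len(key)
    return "*" * (len(key) - 4) + key[-4:]


# Demo with a fake environment so the sketch runs without a real key.
demo_env = {"OPENAI_API_KEY": "sk-demo-not-a-real-key"}
print(mask(require_api_key(demo_env)))
```

In a real script you would call require_api_key() with no arguments after load_dotenv(), and print only the masked form if you ever need to confirm which key is loaded.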

Step 1: Send your first request with the openai SDK

With the environment ready, a working call is only a few lines. The pattern is always the same: load the key, create a client, then send a list of messages.

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads .env and puts the key into the environment

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain what a token is in one sentence."},
    ],
)

print(response.choices[0].message.content)

The messages list is a short conversation. A system message sets the model's role and rules; a user message is your actual request. The model reads both and writes an assistant message in reply. You get that reply at response.choices[0].message.content. The gpt-4o-mini model is small, fast, and cheap — perfect for learning. Run this file and you should see a single tidy sentence print to your terminal.
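
Chat-style endpoints like this one are generally stateless: the model sees only the messages you send in that call. To continue a conversation, you append the assistant's reply and your next question to the list and send the whole thing again. The sketch below shows that bookkeeping offline; start_conversation and add_turn are hypothetical helpers, and the assistant text is a placeholder rather than a real model reply.

```python
def start_conversation(system_prompt: str) -> list:
    """Begin a history with the system message that sets the rules."""
    return [{"role": "system", "content": system_prompt}]


def add_turn(messages: list, role: str, content: str) -> list:
    """Return a new history with one more message appended."""
    if role not in ("user", "assistant"):
        raise ValueError("role must be 'user' or 'assistant'")
    return messages + [{"role": role, "content": content}]


history = start_conversation("You are a concise assistant.")
history = add_turn(history, "user", "Explain what a token is in one sentence.")
# Placeholder standing in for the text a real call would return.
history = add_turn(history, "assistant",
                   "A token is a small chunk of text the model reads or writes.")
history = add_turn(history, "user", "Now give me an example.")

for message in history:
    print(f"{message['role']:>9}: {message['content']}")
```

Passing history as the messages argument of the next call gives the model the context it needs to answer the follow-up. Keep in mind that every earlier message is sent, and counted as tokens, again on each call.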

Step 2: Read the response and track your usage

The reply text is the headline, but the response carries more. The most important extra is usage — the count of tokens consumed, which is exactly what you are billed on. Logging it from day one keeps costs from surprising you.

print("Reply:", response.choices[0].message.content)
print("Why it stopped:", response.choices[0].finish_reason)

usage = response.usage
print(f"Prompt tokens:     {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")
print(f"Total tokens:      {usage.total_tokens}")

finish_reason tells you why the model stopped. "stop" means it finished naturally; "length" means it hit your max_tokens cap and was cut off mid-thought, a sign to raise the limit. The token counts let you estimate cost: providers typically price prompt (input) and completion (output) tokens separately, so multiply each count by its per-token price from the provider's pricing page. A reply that costs a fraction of a cent today can cost real money at scale, so make this visible early.
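
The arithmetic is simple enough to wrap in a helper. This is a sketch under stated assumptions: estimate_cost is a hypothetical function, and the two prices are placeholders, not current rates. Providers usually quote prices per million tokens and charge input and output at different rates, so look up the real figures for your model.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Estimate the cost of one call in the same currency as the prices."""
    input_cost = prompt_tokens * input_price_per_million
    output_cost = completion_tokens * output_price_per_million
    return (input_cost + output_cost) / 1_000_000


# Placeholder prices for illustration only; check your provider's pricing page.
INPUT_PRICE = 0.15    # assumed price per million prompt tokens
OUTPUT_PRICE = 0.60   # assumed price per million completion tokens

one_call = estimate_cost(1200, 300, INPUT_PRICE, OUTPUT_PRICE)
print(f"One call:         ${one_call:.5f}")
print(f"100,000 calls:    ${one_call * 100_000:.2f}")
```

In your own script you would pass usage.prompt_tokens and usage.completion_tokens straight from the response instead of the fixed numbers used here.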

Step 3: See the raw HTTP call with httpx

The SDK hides the network so you can focus on your task, but it helps to see what it sends just once. Underneath, the SDK makes an ordinary HTTPS request — a POST with a header carrying your key and a JSON body carrying your prompt. Here is that same call written by hand with httpx:

import os
import httpx
from dotenv import load_dotenv

load_dotenv()

response = httpx.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"},
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
    },
    timeout=30.0,
)

response.raise_for_status()          # turns a 4xx/5xx status into an exception
data = response.json()               # parse the JSON body into a dict
print(data["choices"][0]["message"]["content"])

Three things are worth noticing. The Authorization header is how the server knows the request is yours — a wrong or missing key here is exactly what causes a 401 error. The json= body is the payload the SDK builds for you automatically. And raise_for_status() plus response.json() are the manual steps the SDK normally does on your behalf. You will almost always prefer the SDK, but now the magic is no longer a mystery.
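
When you work at the HTTP level, the status code is your first clue about what went wrong. The sketch below maps the codes mentioned in this guide to plain-language hints. explain_status is a hypothetical helper, and the advice for 5xx codes is a common rule of thumb rather than a guarantee from any provider.

```python
STATUS_HINTS = {
    400: "Bad request: the payload was rejected, for example a prompt that is too long.",
    401: "Unauthorized: the API key is missing or wrong. Check the Authorization header.",
    429: "Too many requests: you hit a rate limit or spending cap. Wait, then retry.",
}


def explain_status(status_code: int) -> str:
    """Translate an HTTP status code into a beginner-friendly hint."""
    if status_code in STATUS_HINTS:
        return STATUS_HINTS[status_code]
    if 200 <= status_code < 300:
        return "Success: the body should contain the JSON reply."
    if 500 <= status_code < 600:
        return "Server-side problem: usually worth retrying after a short wait."
    return "Unexpected status: read the response body for details."


for code in (200, 401, 429, 503):
    print(code, "->", explain_status(code))
```

With httpx you could call explain_status(response.status_code) before raise_for_status() to print a friendlier message than the raw exception.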

Step 4: Tune the model's behaviour with parameters

The same prompt can produce a wide range of replies depending on a handful of settings you pass alongside it. These control length, randomness, and format. Understanding them is the difference between fighting the model and directing it.

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You write short marketing taglines."},
        {"role": "user", "content": "A tagline for a calm productivity app."},
    ],
    temperature=0.9,     # higher = more varied, creative wording
    max_tokens=30,       # hard cap on the reply length
    top_p=1.0,           # alternative way to limit randomness
    n=3,                 # ask for three separate options at once
)

for i, choice in enumerate(response.choices, start=1):
    print(f"Option {i}: {choice.message.content}")

For a creative task, a higher temperature gives you variety; for a factual one, drop it near zero so answers stay consistent. Asking for n=3 returns three candidates in a single call, which is handy for brainstorming. The next section explains each setting in full. To go deeper on writing the messages themselves, the Prompt Engineering Basics section covers system prompts and output control.
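
If you are curious why temperature changes the variety of replies, the sketch below shows the usual textbook mechanism: the model's raw scores for candidate words are divided by the temperature before being turned into probabilities. The words and scores are invented for illustration, and a real service may implement details differently.

```python
import math


def softmax_with_temperature(scores: list, temperature: float) -> list:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # This sketch does not model temperature 0, which services typically
        # treat as "always pick the most likely word".
        raise ValueError("temperature must be greater than 0 in this sketch")
    scaled = [score / temperature for score in scores]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Invented candidate words and scores, purely for illustration.
words = ["calm", "focus", "breathe"]
scores = [2.0, 1.0, 0.5]

for temperature in (0.2, 1.0, 1.8):
    probs = softmax_with_temperature(scores, temperature)
    summary = ", ".join(f"{w}={p:.2f}" for w, p in zip(words, probs))
    print(f"temperature={temperature}: {summary}")
```

At a low temperature almost all of the probability lands on the top-scoring word, which is why replies become consistent; at a high temperature the probabilities even out, so less likely words are chosen more often.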

Parameter reference

These are the settings you will reach for most. Pass them as keyword arguments to chat.completions.create. Defaults shown are the common OpenAI defaults; other providers are similar but check their docs.

Name | Type | Default | Effect
model | string | none (required) | Which model answers. gpt-4o-mini is cheap and fast; larger models reason better but cost more.
messages | list of dicts | none (required) | The conversation. Each item has a role (system, user, or assistant) and content.
temperature | float | 1.0 | Randomness, from 0.0 to 2.0. Low values give consistent, focused replies; high values give varied, creative ones.
max_tokens | integer | model max | Hard ceiling on the reply length in tokens. Set it low while testing to cap costs.
top_p | float | 1.0 | Nucleus sampling, an alternative to temperature. Lower values narrow word choice. Tune one, not both.
n | integer | 1 | How many separate replies to generate per call. Each one is billed.
stop | string or list | null | Text that, when produced, ends the reply early. Useful for fixed formats.
stream | boolean | false | When true, the reply arrives token by token instead of all at once.
response_format | dict | null | Set to {"type": "json_object"} to force valid JSON output.
timeout | float | SDK default | Seconds to wait before giving up on a slow request.
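
The top_p setting is the least intuitive entry in the table, so here is a minimal sketch of the idea behind nucleus sampling: keep the most likely words until their combined probability reaches top_p, then discard the rest. The function name nucleus_filter and the probabilities are invented for this example; a real model works over a vocabulary of many thousands of tokens.

```python
def nucleus_filter(probs: dict, top_p: float) -> dict:
    """Keep the likeliest words whose combined probability reaches top_p."""
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be greater than 0 and at most 1")
    kept = {}
    cumulative = 0.0
    # Walk through the words from most to least likely.
    for word, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[word] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    # Rescale so the surviving probabilities add up to 1 again.
    total = sum(kept.values())
    return {word: prob / total for word, prob in kept.items()}


# Invented probabilities for the next word, purely for illustration.
next_word = {"calm": 0.5, "focus": 0.3, "breathe": 0.15, "zebra": 0.05}

print(nucleus_filter(next_word, 0.75))  # only "calm" and "focus" survive
print(nucleus_filter(next_word, 1.0))   # every word is still possible
```

Lowering top_p removes the unlikely tail ("zebra" here) entirely, while temperature only makes it less probable. That overlap in effect is the reason for the "tune one, not both" advice.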

Troubleshooting common errors

These are the errors almost everyone hits in their first week. Each gets a dedicated guide if you need the deep version.

  1. AuthenticationError: Error code: 401 - Incorrect API key provided — Your key is missing, mistyped, or not being loaded. Most often .env was never read, so the variable is empty. Confirm load_dotenv() runs before you create the client and that the key in .env has no quotes or stray spaces. Full walkthrough: Fix the 401 Unauthorized Error in OpenAI Python.
  2. RateLimitError: Error code: 429 - Rate limit reached for requests — You sent calls faster than your tier allows, or you have hit a spending cap. Wait, then retry with an increasing delay (exponential backoff). Step-by-step fix: Fix the 429 Rate-Limit Error in Python.
  3. json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) — You tried to parse the model's text as JSON, but it wrapped the JSON in prose or code fences. Add response_format={"type": "json_object"} and ask explicitly for JSON. Details: Fix JSONDecodeError with AI API Responses in Python.
  4. BadRequestError: ... maximum context length is N tokens, however you requested M — Your prompt plus requested reply is larger than the model's window. Shorten the input, summarise long documents, or lower max_tokens. How to fix it: Fix the Context-Length-Exceeded Error in Python.
  5. APITimeoutError: Request timed out — The model took longer than your timeout allowed, common with large prompts or long replies. Raise the timeout value (for example timeout=60) and consider streaming so partial output arrives sooner.
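  6. AttributeError: 'NoneType' object has no attribute 'content' — The object you read .content from was None, which means the reply you expected never arrived in that slot, often because the request was filtered or stopped early. A close cousin is content itself being None, which fails as soon as you call a string method on it. Check finish_reason before using the text and handle the empty case instead of assuming a string is always present.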
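
For the JSONDecodeError in item 3, a defensive parser is a useful fallback even when you set response_format. The sketch below is one possible approach, not an official utility: extract_json is a hypothetical helper that first tries a normal parse, then falls back to the text between the first "{" and the last "}". It handles JSON objects only, not top-level lists.

```python
import json


def extract_json(text: str) -> dict:
    """Parse a JSON object from a reply, tolerating code fences and extra prose."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass  # fall through to the more forgiving search below
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in the reply.")
    return json.loads(text[start:end + 1])


# Simulated model reply that wraps the JSON in prose and a code fence.
fence = "`" * 3
reply = (
    "Sure! Here is the data you asked for:\n"
    + fence + "json\n"
    + '{"title": "Calm", "score": 9}\n'
    + fence
)

print(extract_json(reply))  # {'title': 'Calm', 'score': 9}
```

If the text between the braces is still not valid JSON, json.loads raises JSONDecodeError (a kind of ValueError), so the caller can catch ValueError for both failure modes.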

Worked example: a small, safe API client

This script ties everything together. It loads the key safely, sends a request, retries politely when rate-limited, and reports both the reply and the token cost. Save it as ask.py and run it.

import os
import time
from dotenv import load_dotenv
from openai import OpenAI, RateLimitError, APITimeoutError

load_dotenv()  # pulls OPENAI_API_KEY from .env (remember: .env is in .gitignore)

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"), timeout=30.0)


def ask(prompt: str, max_retries: int = 3) -> str:
    """Send one prompt, retry on rate limits, and return the reply text."""
    messages = [
        {"role": "system", "content": "You are a concise, helpful assistant."},
        {"role": "user", "content": prompt},
    ]
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages,
                temperature=0.3,   # low = consistent answers
                max_tokens=200,    # cap the reply to control cost
            )
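            usage = response.usage
            print(f"[tokens: {usage.total_tokens} total]")  # keep cost visible
            # content can be None (for example, a filtered reply), so fall back to ""
            return response.choices[0].message.content or ""
        except (RateLimitError, APITimeoutError) as error:
            if attempt == max_retries - 1:
                # No retries left: stop now instead of sleeping one more time
                raise RuntimeError(f"Gave up after {max_retries} attempts.") from error
            wait = 2 ** attempt + 0.5  # 1.5s, 2.5s, 4.5s, ...: exponential backoff
            print(f"Attempt {attempt + 1} failed ({error.__class__.__name__}); "
                  f"retrying in {wait}s...")
            time.sleep(wait)
    raise RuntimeError(f"Gave up after {max_retries} attempts.")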


if __name__ == "__main__":
    answer = ask("Summarise what an LLM API does in two sentences.")
    print(answer)

Run it with python ask.py. You get a clean answer, a one-line token report, and automatic recovery if the service briefly throttles you — the three habits that separate a toy script from one you can trust.
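
The retry idea in ask.py can also be pulled into a reusable helper and tried without any network access. The sketch below is one way to do that: retry_with_backoff, FakeRateLimit, and flaky are all hypothetical names, and the sleep function is passed in so the demo can record the waits instead of actually pausing. The helper skips the wait after the final failure and re-raises the original error.

```python
import time


def retry_with_backoff(func, max_retries: int = 3, retry_on=(Exception,),
                       sleep=time.sleep):
    """Call func(), retrying with exponential backoff on the given exceptions."""
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: let the caller see the real error
            sleep(2 ** attempt + 0.5)  # 1.5s, 2.5s, 4.5s, ...


class FakeRateLimit(Exception):
    """Stand-in for a rate-limit error, used only in this demo."""


calls = {"count": 0}


def flaky():
    """Fail twice, then succeed, like a briefly throttled service."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeRateLimit("slow down")
    return "ok"


waits = []  # record the delays instead of really sleeping
result = retry_with_backoff(flaky, retry_on=(FakeRateLimit,), sleep=waits.append)
print(result, waits)  # ok [1.5, 2.5]
```

With the real SDK you would pass a function that makes the API call and set retry_on to the error types you consider temporary, such as RateLimitError and APITimeoutError.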

Next steps

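You can now call a model, read its reply, tune its behaviour, and recover from the common failures. From here, the Prompt Engineering Basics section goes deeper on writing the messages themselves, and the error guides linked in the troubleshooting list cover each failure in detail.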
