You have probably typed a question into an AI chat box, got a decent answer, then tried the same thing the next day and got something messy, off-topic, or in the wrong format. That gap between "it worked once" and "it works every time" is what prompt engineering closes. Prompt engineering is simply the craft of writing the instructions you send to a language model so it returns the result you actually need, reliably and in a shape you can use.
This guide is for creators, marketers, founders, and students who can run a Python file but are not full-time programmers. You will move from typing prompts by hand into a chat window to sending them from a short, repeatable Python script. That shift matters because a script lets you reuse a proven prompt, run it across hundreds of inputs, and check the output automatically instead of eyeballing each result.
A "language model" (or LLM, short for large language model) is the AI behind tools like ChatGPT. When you call it from Python, you send it a list of messages and a few settings, and it sends back text. The model has no memory of earlier calls and no hidden knowledge of your intent; everything it acts on lives in the messages you send and the settings you attach. That is liberating once it clicks, because a vague result is almost never the model being stubborn. It is the prompt leaving room for interpretation, and a prompt is something you can edit and test until that room disappears.
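Because the model keeps no memory between calls, any "conversation" is something your script rebuilds and resends on every request. The following is a minimal sketch of that idea with no API call involved. The helper names start_conversation and add_turn and the sample texts are illustrative assumptions, not part of the OpenAI SDK, and the role names it uses are explained in Step 1 and Step 2.

```python
# Illustrative only: these helpers are not part of the OpenAI SDK.
# The model sees nothing except the list you send, so a follow-up
# question only makes sense if the earlier turns are sent again.

def start_conversation(system_prompt: str) -> list[dict]:
    """Begin a message list with the standing rules."""
    return [{"role": "system", "content": system_prompt}]


def add_turn(messages: list[dict], role: str, content: str) -> list[dict]:
    """Return a new list with one more message appended."""
    return messages + [{"role": role, "content": content}]


history = start_conversation("You are a helpful travel assistant.")
history = add_turn(history, "user", "Suggest a city for a weekend trip.")
history = add_turn(history, "assistant", "Lisbon is a good choice.")
# Without the two turns above, "there" would mean nothing to the model.
history = add_turn(history, "user", "What should I eat there?")

for message in history:
    print(f"{message['role']:>9}: {message['content']}")
```

The whole list, not just the last line, is what a real request would carry.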
The whole skill is therefore learning what to put in those messages and how to tune the settings. By the end you will know how to split instructions into system and user prompts, teach the model by example, force clean structured output like JSON, and iterate until a prompt is dependable. Every code block runs once the short setup boilerplate from the Prerequisites section sits at the top of your file, so you can paste it in and try it as soon as you finish reading.
If you are brand new to Python and AI, the broader Python AI Fundamentals for Non-Developers hub walks through the surrounding pieces. To understand how the requests below actually travel to the model and come back, read the sibling section on Understanding LLM APIs.
Prerequisites
You need Python 3.10 or newer and a working virtual environment. If you have not set one up, follow Setting Up Python for AI first, then come back here. You also need an OpenAI API key, which is a secret password that lets your code talk to the model.
Install the two packages used throughout this guide. The openai package is the official SDK (software development kit, the ready-made code that talks to the API for you), and python-dotenv reads secrets from a file so they never end up in your code.
pip install openai python-dotenv
Create a file named .env in your project folder and paste your key into it:
OPENAI_API_KEY=sk-proj-your-real-key-here
Before you do anything else, add .env to your .gitignore file so your secret key is never committed to version control or pushed to a public repository. One leaked key can run up a large bill on your account.
echo ".env" >> .gitignore
With that in place, the boilerplate at the top of every script in this guide loads your key once:
import os
from dotenv import load_dotenv
from openai import OpenAI
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
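A missing key is easier to diagnose before the first request than after it. One possible check, meant to run right after load_dotenv(), is sketched below. The helper describe_key is a hypothetical name, not part of the openai or python-dotenv packages, and it deliberately shows only the last four characters so the secret never lands in a terminal log.

```python
import os


# Hypothetical helper, not part of the openai or dotenv packages.
def describe_key(name: str = "OPENAI_API_KEY") -> str:
    """Report whether a secret is set without printing the secret itself."""
    value = os.getenv(name)
    if not value:
        return f"{name} is missing: check that .env sits next to your script"
    # Show only the last four characters so the key never lands in logs.
    return f"{name} is set (ends in ...{value[-4:]})"


print(describe_key())
```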
Step 1 — Separate the system prompt from the user prompt
Every request you send is a list of messages, and each message has a role. The two roles you will use most are system and user. The system prompt is where you tell the model who it is, what rules to follow, and how to behave for the entire conversation. The user prompt is the specific thing you want done right now. Keeping them separate is the single biggest upgrade over typing one long blob into a chat box: the rules stay constant while the task changes.
Think of it like hiring an assistant. The system prompt is the job description and house style you give them on day one: who they are, what they must never do, the tone they speak in, and the format every deliverable should follow. The user prompt is the task you hand them each morning. You write the job description once and reuse it forever, while the morning tasks change endlessly without ever touching the rules.
There is a practical reason this matters beyond tidiness. When the rules and the task share one blob of text, the model has to guess where your standing instructions end and the request begins, and it sometimes treats data as an instruction or vice versa. Splitting them into separate roles removes that ambiguity. The system message carries weight across the whole exchange, so anything you put there ("always reply in one sentence") applies to every user message that follows, even ones you have not written yet.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


def summarize(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a concise editor. Summarize the user's text in "
                    "exactly one sentence. Never add opinions or extra detail."
                ),
            },
            {"role": "user", "content": text},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content


print(summarize("Our quarterly sales rose 12 percent, driven mostly by repeat customers in Europe."))
The model now follows the system rules no matter what text arrives in the user message. Swap the user text for any other paragraph and you still get a single-sentence summary. That separation is what makes a prompt reusable. Notice two deliberate choices in the system content. The instruction is specific ("exactly one sentence") rather than soft ("keep it short"), because the model honours measurable rules far more reliably than fuzzy ones. And the rule that bans opinions and extra detail closes off the most common ways a summary drifts: editorialising and padding. A good system prompt spends as much energy saying what not to do as what to do.
The low temperature=0.2 is also intentional. Summaries should be steady, so the same paragraph produces nearly the same one-liner each time. The habit to form now is simple: whenever the task has a single correct shape, push temperature toward zero.
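To build intuition for why a low temperature steadies the output, here is a toy illustration using the standard softmax-with-temperature formula. It is an assumption that this textbook formula is a fair picture of what happens inside the API; the three candidate words and their scores are invented, and a real model chooses among a vastly larger vocabulary.

```python
import math


# Toy illustration only: the scores below are invented for the example.
def word_probabilities(scores: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn raw scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        # Treat zero as "always pick the top-scoring word".
        best = max(scores, key=scores.get)
        return {word: 1.0 if word == best else 0.0 for word in scores}
    scaled = {word: score / temperature for word, score in scores.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = {word: math.exp(value - top) for word, value in scaled.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


scores = {"rose": 2.0, "grew": 1.5, "exploded": 0.5}
for temperature in (0.2, 0.7, 1.5):
    probs = word_probabilities(scores, temperature)
    summary = ", ".join(f"{word} {p:.2f}" for word, p in probs.items())
    print(f"temperature={temperature}: {summary}")
```

At the low setting nearly all of the probability sits on the top word, which is why repeated runs look alike; at the high setting the alternatives get a real chance of being picked.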
Step 2 — Teach the model with few-shot examples
Sometimes a plain instruction is not enough and the model guesses at the format you want. The fix is "few-shot" prompting: you show the model two or three completed examples before giving it the real input. The examples are the most reliable way to lock in tone, structure, and edge cases, because the model copies the pattern instead of inventing one.
You provide examples by adding fake user and assistant turns to the messages list. The assistant role represents the model's own past replies, so an assistant message you write by hand reads as "here is exactly how you should respond". The model treats those handwritten turns as proof of its own style, which is why few-shot prompting is so effective: you are not describing the output you want, you are demonstrating it, and demonstration leaves nothing to interpretation. The order matters too. Each example is a complete user then assistant pair, and the real request comes last as a lone user turn so the model knows that is the one to answer. Below, we teach the model to turn a raw product note into a punchy tagline.
# Assumes the setup boilerplate (client) from Prerequisites is at the top of the file.
def write_tagline(product_note: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You write short marketing taglines, 6 words max."},
            # Two handwritten user/assistant pairs demonstrate the pattern.
            {"role": "user", "content": "A water bottle that keeps drinks cold for 24 hours."},
            {"role": "assistant", "content": "Cold sips, all day long."},
            {"role": "user", "content": "Noise-cancelling headphones with a 40-hour battery."},
            {"role": "assistant", "content": "Silence that lasts the whole week."},
            # The real request comes last as a lone user turn.
            {"role": "user", "content": product_note},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content


print(write_tagline("A standing desk that adjusts height with one tap."))
Two examples are usually enough. If the model still drifts, add one more example that covers the case it gets wrong, rather than writing a longer paragraph of instructions. Examples almost always beat explanations, because an explanation describes the target from the outside while an example places the model directly inside it. There is a cost to weigh, though: every example becomes part of the request and counts toward the token budget you pay for. So choose examples that each teach something distinct. Two taglines that solve the same problem the same way waste space; two that show different lengths, tones, or edge cases earn their place. When the model fails on a specific kind of input, the fix is rarely a fourth generic example. It is one targeted example built from the exact case it fumbled. For ready-made example sets aimed at content work, see Prompt Engineering Templates for Marketers.
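Writing the fake turns by hand gets tedious once you start swapping examples in and out. A small builder can assemble the list from pairs and give a rough sense of what the examples cost. This is a sketch under stated assumptions: few_shot_messages and rough_token_count are hypothetical helper names, and the estimate relies on the rule of thumb that a token is roughly four characters, so treat the number as a ballpark rather than a bill.

```python
# Hypothetical helpers for illustration; not part of the OpenAI SDK.

def few_shot_messages(
    system_prompt: str,
    examples: list[tuple[str, str]],
    real_input: str,
) -> list[dict]:
    """Build system + (user, assistant) pairs + the final user turn."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    # The real request always goes last, as a lone user turn.
    messages.append({"role": "user", "content": real_input})
    return messages


def rough_token_count(messages: list[dict]) -> int:
    """Very rough estimate using the four-characters-per-token rule of thumb."""
    characters = sum(len(message["content"]) for message in messages)
    return characters // 4


examples = [
    ("A water bottle that keeps drinks cold for 24 hours.", "Cold sips, all day long."),
    ("Noise-cancelling headphones with a 40-hour battery.", "Silence that lasts the whole week."),
]
messages = few_shot_messages(
    "You write short marketing taglines, 6 words max.",
    examples,
    "A standing desk that adjusts height with one tap.",
)
print(f"{len(messages)} messages, roughly {rough_token_count(messages)} tokens")
```

Keeping the examples in a plain list also makes it easy to add the one targeted example a failing input calls for and see how much it adds to the request.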
Step 3 — Control the output format and get clean JSON
If you plan to use the model's answer in another program, you need it in a predictable, machine-readable shape. Free-form prose is fine for a human to read but painful for code to parse. The cleanest target is JSON (JavaScript Object Notation, a simple text format of keys and values that nearly every tool understands).
There are two halves to forcing JSON, and you need both because each does a different job. First, describe the exact keys you want in the prompt. The response_format parameter makes the reply syntactically valid JSON, but it does not care which keys you expect, so the prompt is the only place that decides whether you get name or full_name, and whether a missing field comes back as null or an empty string. Second, set response_format to {"type": "json_object"}, which tells the API to return a single JSON object and nothing else. This mode also requires the word "JSON" to appear somewhere in your messages, which the key description from the first half already takes care of. Skip the second half and the model will sometimes wrap its JSON in a markdown code fence or add a sentence before it, and your json.loads call will crash on the stray text. With both halves in place you can call json.loads without any defensive string-cleaning. The main remaining risk is a reply cut short by a max_tokens limit that is too small, so leave enough room for the full object.
import json


# Assumes the setup boilerplate (client) from Prerequisites is at the top of the file.
def extract_contact(raw_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract contact details. Respond with JSON containing the keys "
                    "name, email, and company. Use null if a field is missing."
                ),
            },
            {"role": "user", "content": raw_text},
        ],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)


print(extract_contact("Hi, I'm Dana Lee from Northwind. Reach me at dana@northwind.io."))
Setting temperature=0 here is deliberate: extraction should give the same answer every time, not creative variety. For deeper control over structure and a fuller treatment of validation, follow Write System Prompts that Control Output Format.
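Valid JSON can still have the wrong shape, so it helps to check the parsed object before anything downstream relies on it. The guard below is one way to do that, offered as a sketch: check_contact and EXPECTED_KEYS are hypothetical names, the key set mirrors the contact-extraction prompt above, and the sample dictionaries are made up to stand in for model replies.

```python
# Hypothetical guard for illustration; the key names match the prompt above.
EXPECTED_KEYS = {"name", "email", "company"}


def check_contact(data: dict) -> dict:
    """Return a dict with exactly the expected keys, or raise ValueError."""
    missing = EXPECTED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    # Drop any extra keys the model decided to add.
    cleaned = {key: data[key] for key in sorted(EXPECTED_KEYS)}
    # Treat empty strings the same as null so downstream code has one case.
    return {key: (value or None) for key, value in cleaned.items()}


good = {"name": "Dana Lee", "email": "dana@northwind.io", "company": "Northwind", "note": "hi"}
print(check_contact(good))

try:
    # A reply that used full_name instead of name, and dropped company.
    check_contact({"full_name": "Dana Lee", "email": "dana@northwind.io"})
except ValueError as err:
    print(f"rejected: {err}")
```

Raising on a missing key, rather than quietly filling it in, makes a drifting prompt visible the first time it happens.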
Step 4 — Iterate on a prompt until it is reliable
A first-draft prompt rarely behaves perfectly. Iteration is the loop where you change one element, rerun the script against the same inputs, and keep the version that performs best. The key discipline is changing only one thing at a time so you know what caused any improvement, exactly like adjusting a single ingredient in a recipe.
The reason for changing one element at a time is not pedantry. If you reword the instruction and lower the temperature in the same edit and the output improves, you have learned nothing about which change helped. Disciplined iteration turns prompt engineering into a controlled experiment: a fixed set of inputs, one variable changed, a result you can attribute to it.
The simplest way to compare versions is to run several prompts over a fixed set of test inputs and print the results side by side. The inputs should stay the same every time and should deliberately include the awkward cases, not just the easy ones, because a prompt that handles "Refund my order" but mangles "I love this but want a refund" is not yet reliable. Below, a small helper runs any system prompt against a shared list so you can judge two phrasings fairly.
# Assumes the setup boilerplate (client) from Prerequisites is at the top of the file.
def try_prompt(system_prompt: str, inputs: list[str]) -> None:
    for text in inputs:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": text},
            ],
            temperature=0,
            max_tokens=60,
        )
        print(f"IN: {text}\nOUT: {response.choices[0].message.content}\n")


# A fixed test set. The last entry is a deliberately awkward, mixed-intent case.
tests = [
    "Refund my order #4471",
    "Where is my package?",
    "I love this product!",
    "I love this but want a refund",
]

print("--- Version A ---")
try_prompt("Classify the message as: refund, shipping, or praise.", tests)

print("--- Version B ---")
try_prompt("Classify the message in one lowercase word: refund, shipping, or praise.", tests)
Version B adds "one lowercase word", which tends to produce cleaner, more uniform output. Run both against the same inputs, compare the results line by line, and adopt the winner. Save your winning prompts in a file so you can track changes over time, like any other working asset.
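Comparing by eye stops scaling after a dozen inputs, so a natural next move is to pair each test input with the label you expect and let the script count exact matches. The sketch below runs offline: fake_version_a and fake_version_b are stand-ins that imitate a loosely worded and a tightly worded prompt, and in real use you would swap them for functions that call the API. All names and the expected labels are assumptions made for the example.

```python
from typing import Callable

# Hypothetical test set: each input is paired with the label a human chose.
TEST_CASES = [
    ("Refund my order #4471", "refund"),
    ("Where is my package?", "shipping"),
    ("I love this product!", "praise"),
    ("I love this but want a refund", "refund"),
]


def score(classify: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the share of cases where the output exactly equals the label."""
    hits = sum(1 for text, expected in cases if classify(text) == expected)
    return hits / len(cases)


def _keyword_label(text: str) -> str:
    """Stand-in for the model so the sketch runs without an API key."""
    lowered = text.lower()
    if "refund" in lowered:
        return "refund"
    if "package" in lowered:
        return "shipping"
    return "praise"


def fake_version_a(text: str) -> str:
    # Imitates a loose prompt: right idea, but capitalised with a full stop.
    return _keyword_label(text).capitalize() + "."


def fake_version_b(text: str) -> str:
    # Imitates the "one lowercase word" prompt: bare label only.
    return _keyword_label(text)


for name, version in [("A", fake_version_a), ("B", fake_version_b)]:
    print(f"Version {name}: {score(version, TEST_CASES):.0%} exact matches")
```

Exact matching is strict on purpose: an answer like "Refund." is correct to a human but breaks code that routes on the label, which is exactly the kind of failure the tighter wording is meant to remove.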
Parameter reference
These are the settings you pass alongside your messages. Tuning them is half of prompt engineering, so keep this table handy.
| Parameter | Type | Default | Effect |
|---|---|---|---|
| model | string | none (required) | Which model answers, e.g. gpt-4o-mini for cheap fast work or gpt-4o for harder reasoning. |
| temperature | float | 1.0 | Randomness from 0.0 to 2.0. Use 0-0.3 for facts and JSON, 0.7-1.0 for creative copy. |
| top_p | float | 1.0 | Nucleus sampling, an alternative to temperature. Lower values narrow word choice. Tune one or the other, not both. |
| max_tokens | int | model limit | Caps the length of the reply. A token is roughly four characters; set this to avoid runaway, costly responses. |
| response_format | object | {"type": "text"} | Set to {"type": "json_object"} to guarantee parseable JSON output. |
| stop | list of strings | null | Strings that, when produced, end the response early. Useful for trimming trailing chatter. |
| seed | int | null | Asks for repeatable output across runs when paired with temperature=0. Best-effort, not a guarantee. |
| n | int | 1 | How many separate completions to return for one prompt. Raising it multiplies cost. |
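Rather than retyping settings in every call, you can keep a few named bundles and unpack the one that fits the task. The presets below are a sketch: the keys are the parameters from the table, but PRESETS, settings_for, and the specific values are assumptions chosen as starting points, not recommendations from the API's documentation.

```python
# Hypothetical presets for illustration. The keys are parameters from the
# table above; the values are starting points, not rules.
PRESETS = {
    "extraction": {
        "temperature": 0,
        "max_tokens": 200,
        "response_format": {"type": "json_object"},
    },
    "summary": {"temperature": 0.2, "max_tokens": 120},
    "creative": {"temperature": 0.9, "max_tokens": 300, "n": 3},
}


def settings_for(task: str, **overrides) -> dict:
    """Copy a preset and apply any one-off overrides."""
    if task not in PRESETS:
        raise KeyError(f"unknown task {task!r}; choose from {sorted(PRESETS)}")
    return {**PRESETS[task], **overrides}


settings = settings_for("summary", max_tokens=60)
print(settings)
# The dict can then be unpacked into a call, for example:
# client.chat.completions.create(model="gpt-4o-mini", messages=messages, **settings)
```

Because settings_for returns a copy, a one-off override never changes the shared preset.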
Troubleshooting
These are the errors you will most likely hit while building the scripts above, with the real cause and a one-line fix.
- openai.AuthenticationError (Incorrect API key provided): Your key is missing, mistyped, or the .env file is in the wrong folder. Run your script from the same directory as .env, or check that os.getenv("OPENAI_API_KEY") returns a value to confirm it loaded. For a full walkthrough, see Fix the 401 Unauthorized Error in OpenAI Python.
- openai.RateLimitError (Rate limit reached): You sent requests faster than your account tier allows, or you are out of credit. Slow down with a short time.sleep() between calls and check your billing balance. The deeper fix is in Fix the 429 Rate-Limit Error in Python.
- json.decoder.JSONDecodeError (Expecting value): The model wrapped its JSON in a markdown code fence or added a sentence around it. Add response_format={"type": "json_object"} to the call so the reply comes back as bare, parseable JSON.
- openai.BadRequestError (... context length): Your messages plus the requested max_tokens exceed the model's window. Shorten the input or lower max_tokens. See Fix the Context-Length-Exceeded Error in Python.
- The model ignores your system prompt: The instruction is too vague or buried under conflicting examples. Make the rule specific ("reply in one lowercase word") and remove examples that contradict it.
- Output changes on every run: Your temperature is high. Drop it to 0 for extraction and classification tasks; reserve higher values for creative writing.
- The JSON parses but a key is missing or misspelled: response_format only guarantees valid JSON, never the right shape. The model returned {"full_name": ...} when your code expected name, or dropped a key entirely. Name every key explicitly in the system prompt, say what to put when a value is unknown, and validate the parsed object in code before trusting it (the worked example below shows the guard pattern).
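For the rate-limit case, a fixed time.sleep() works, but waiting a little longer after each failure usually recovers faster with fewer wasted calls. Here is a minimal sketch of that retry pattern. The helper call_with_retries is a hypothetical name, and the demo uses a stand-in error class so it runs offline; in a real script you would pass openai.RateLimitError as retry_on and wrap the API call in the task function.

```python
import time


# Hypothetical helper for illustration; not part of the OpenAI SDK.
def call_with_retries(task, retry_on, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run task(); on a listed error, wait longer each time and try again."""
    for attempt in range(attempts):
        try:
            return task()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the error surface
            sleep(base_delay * 2 ** attempt)  # 1s, then 2s, then 4s


# Offline demo: a stand-in that fails twice, then succeeds.
class FakeRateLimit(Exception):
    pass


calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeRateLimit("slow down")
    return "ok"


waits = []  # record the delays instead of really sleeping
result = call_with_retries(flaky, FakeRateLimit, sleep=waits.append)
print(result, waits)
```

Passing sleep as a parameter is only there so the demo finishes instantly; leave it at the default in real use.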
Worked example: a reusable classifier with validation
This script ties together everything above. It defines one system prompt with few-shot examples, forces JSON output, validates the result against a known set of labels, and runs across a batch of inputs while staying resilient when a single message fails. Save it as classify.py and run it with python classify.py. The inline comments explain why each part exists, not just what it does.
import json
import os

from dotenv import load_dotenv
from openai import OpenAI

# Load the API key from .env so the secret never lives in the code itself.
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# The single source of truth for valid labels. Validation below checks
# against this set, so adding a new category means editing one line here.
ALLOWED = {"refund", "shipping", "praise", "other"}

# The system prompt names every key the JSON must contain and spells out the
# allowed values. response_format guarantees valid JSON; this text guarantees
# the right *shape*. The two work together; neither is enough alone.
SYSTEM = (
    "You label customer messages. Respond with JSON only, no prose: "
    '{"category": "<one of refund, shipping, praise, other>", '
    '"urgent": true or false}. Set urgent to true only when the customer '
    "needs action today."
)


def classify(message: str) -> dict:
    """Return a validated {category, urgent} dict for one customer message."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # extraction must be repeatable
        response_format={"type": "json_object"},  # forces parseable JSON
        max_tokens=40,  # the reply is tiny; cap the cost
        messages=[
            {"role": "system", "content": SYSTEM},
            # One few-shot pair shows the exact JSON style we expect back.
            {"role": "user", "content": "My package never arrived and I need it today!"},
            {"role": "assistant", "content": '{"category": "shipping", "urgent": true}'},
            # The real message comes last as a lone user turn.
            {"role": "user", "content": message},
        ],
    )
    data = json.loads(response.choices[0].message.content)
    # response_format cannot guarantee the right keys or values, so validate.
    # Any unexpected label is folded into "other" rather than trusted blindly.
    if data.get("category") not in ALLOWED:
        data["category"] = "other"
    # Coerce urgent to a real bool. bool("false") would be True, so a text
    # value is compared explicitly instead of being passed through bool().
    urgent = data.get("urgent")
    data["urgent"] = urgent is True or str(urgent).strip().lower() == "true"
    return data


def classify_safely(message: str) -> dict:
    """Wrap classify so one bad response never stops the whole batch."""
    try:
        return classify(message)
    except (json.JSONDecodeError, KeyError, TypeError) as err:
        # TypeError covers an empty reply (content is None).
        # Log the failure and return a safe default so the loop continues.
        print(f"  (could not classify, defaulting to other: {err})")
        return {"category": "other", "urgent": False}


if __name__ == "__main__":
    inbox = [
        "Please refund order #4471, it was the wrong size.",
        "This is the best gadget I have ever bought!",
        "When will my order ship?",
        "Love the product, but it arrived broken. Can I get a refund today?",
    ]
    for note in inbox:
        result = classify_safely(note)
        flag = "URGENT" if result["urgent"] else "normal"
        print(f"[{result['category']:>8}] [{flag:>6}] {note}")
Run it and you get a clean, validated label for every message, ready to route into a spreadsheet, a database, or another script. Two details are doing quiet but important work. The validation step means a stray or hallucinated label can never leak downstream; it is rewritten to other. And the classify_safely wrapper means a single malformed reply, which will eventually happen across a large batch, logs a warning and continues instead of crashing the run. That combination of a tight prompt and defensive code is the payoff of prompt engineering: the same prompt, applied at scale, with output you can trust even on the awkward, mixed-intent message.
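Routing the labels into a spreadsheet mostly means writing them out as CSV. The sketch below shows one way to do that with Python's built-in csv module. The rows are made-up samples shaped like the classifier's output with the original message added, and to_csv is a hypothetical helper name.

```python
import csv
import io

# Sample rows shaped like the classifier's output; the values are made up.
results = [
    {"message": "Please refund order #4471, it was the wrong size.", "category": "refund", "urgent": False},
    {"message": "When will my order ship?", "category": "shipping", "urgent": False},
]


def to_csv(rows: list[dict]) -> str:
    """Serialise result dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["message", "category", "urgent"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(results)
print(csv_text)
# To keep the result, write csv_text to a file such as labels.csv.
```

The csv module quotes any message that contains a comma, so customer text cannot spill into the wrong column when the file is opened in a spreadsheet.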
Next steps
You can now structure prompts, teach by example, enforce JSON, and iterate. Build on that in this order:
- Turn proven prompts into ready-to-paste sets with Prompt Engineering Templates for Marketers.
- Get strict, reliable structure every time by following Write System Prompts that Control Output Format.
- Understand the requests behind the SDK with Understanding LLM APIs.
- Put your classifier to work on real chores in Automating Repetitive Tasks with Python.