SaaS MVP with Python & AI: A Step-by-Step Build Guide

You have an idea for a product where users type something in, an AI model does the hard part, and they pay you for it. The gap between that idea and a working, billable web service is smaller than it looks, but it has a few sharp edges: you have to know which user is calling, stop one person from running up a bill that wipes out your margin, and record enough about each call to charge for it later. This guide walks you through a minimum viable product (MVP) for exactly that, using Python.

An MVP is the smallest version of your product that real users can pay for. For an AI SaaS, that means one endpoint that takes a request, checks who is asking, makes sure they are allowed to ask, runs the model, and returns a result you can bill. We will build that endpoint with FastAPI (a modern Python web framework that validates requests for you), the openai SDK, and httpx for any other outbound calls. By the end you will have a runnable service and a clear map of what to add next. This guide sits under Building AI-Powered Business Applications, the main guide for turning AI features into products.

Who this is for and what you are building

This is for founders, indie hackers, and creators who can read Python but have never shipped a paid web service. You do not need a front end, a payment processor account, or a cloud provider to follow along. You need Python on your machine and an OpenAI key.

The mental model matters more than any single line of code. Every request to an AI SaaS travels the same path: it arrives, you confirm the caller is a real user (authentication), you confirm they have requests left (rate limiting), you do the expensive AI work, and you record what it cost so you can charge for it (billing). If any link in that chain is missing, you either leak money or you cannot collect it. A service with no auth lets anyone burn your OpenAI credit; one with auth but no rate limit lets a single buggy client loop forever and hand you a four-figure bill overnight; one that does both but records nothing has no way to send an invoice. The diagram below shows the full lifecycle so you can hold it in your head while you read.

[Figure: AI SaaS request lifecycle. Client (request + key) → Auth (who is this?) → Rate limit (any left?) → LLM call (openai SDK) → Record usage (for billing). A failed auth or rate-limit check rejects the request back to the client with 401 / 429.]
Every paid AI request passes auth and a rate-limit check before the model runs, and successful calls record usage so you can bill.
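
The same lifecycle can be sketched in plain Python before any web framework is involved. This is only an illustration of the order of the checks: the names handle and fake_model are hypothetical, the model call is a stub, and the status codes are returned as plain integers.

```python
# Sketch of the request lifecycle with a stubbed model call (no network).
# USERS, fake_model, and handle are hypothetical names for illustration.
from collections import defaultdict

USERS = {"key_free_abc": {"id": "u_1", "monthly_limit": 2}}
usage: dict[str, int] = defaultdict(int)


def fake_model(prompt: str) -> str:
    """Stand-in for the paid model call."""
    return prompt.upper()


def handle(api_key: str, prompt: str) -> tuple[int, str]:
    """Return (HTTP-style status, body) for one request."""
    user = USERS.get(api_key)  # 1. authentication: who is this?
    if user is None:
        return 401, "invalid key"
    if usage[user["id"]] >= user["monthly_limit"]:  # 2. rate limit: any left?
        return 429, "limit reached"
    text = fake_model(prompt)  # 3. the expensive work
    usage[user["id"]] += 1  # 4. record usage only after success
    return 200, text
```

With a limit of 2, the third call from the same key gets a 429 without the model stub ever running, which is the property the rest of this guide builds toward.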

Prerequisites

You need Python 3.10 or newer. Check with python --version. If you have never set up an isolated Python workspace, read Create a Python Virtual Environment for AI first, then come back.

Create and activate a virtual environment, then install the four packages this guide uses:

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install "fastapi>=0.110" "uvicorn[standard]>=0.29" "openai>=1.30" "httpx>=0.27"

fastapi is the web framework, uvicorn runs it, openai talks to the model, and httpx handles any other HTTP call you make later. Now store your secret key. Never paste an API key directly into your code, because anything in your code can end up in your Git history where it is hard to remove.

# .env
OPENAI_API_KEY=sk-your-real-key-here

Add .env to your .gitignore immediately so the file with your key is never committed:

echo ".env" >> .gitignore

If you are unsure how API keys and model calls work at all, the Understanding LLM APIs section explains the basics before you wire them into a service.

Step 1: Stand up a FastAPI service that reads your key

Start with the smallest possible service that loads your key and answers a health check. The openai SDK reads OPENAI_API_KEY from the environment automatically, so you only need to load the .env file into the environment first. We will use os.environ and a tiny loader rather than an extra dependency.

# main.py
import os
from pathlib import Path

from fastapi import FastAPI

# Minimal .env loader: read KEY=VALUE lines into the environment.
for line in Path(".env").read_text().splitlines():
    if line and not line.startswith("#") and "=" in line:
        key, value = line.split("=", 1)
        os.environ.setdefault(key.strip(), value.strip())

app = FastAPI(title="AI SaaS MVP")


@app.get("/health")
def health() -> dict:
    has_key = bool(os.environ.get("OPENAI_API_KEY"))
    return {"status": "ok", "openai_key_loaded": has_key}

Run it with uvicorn main:app --reload and open http://127.0.0.1:8000/health. You should see "openai_key_loaded": true. If it says false, the loader found a .env file but no usable OPENAI_API_KEY line in it, so check the spelling of the variable name. If the server instead fails to start with FileNotFoundError, the loader could not find .env at all: it reads Path(".env"), which is resolved relative to the directory you launched uvicorn from, not the directory main.py lives in. Start the server from the project root, or pass an absolute path. FastAPI also gives you free interactive docs at http://127.0.0.1:8000/docs, which is where you will test the AI endpoint in the next steps.
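
The loader in main.py raises FileNotFoundError when .env is missing. One possible variation, sketched here under the assumption that you would rather start without the file and let /health report the missing key, wraps the same logic in a function. The name load_env is hypothetical, and the quote stripping is an added convenience rather than part of the original loader.

```python
import os
from pathlib import Path


def load_env(path: Path) -> bool:
    """Load KEY=VALUE lines into os.environ; return False if the file is missing."""
    if not path.is_file():
        return False
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, value = line.split("=", 1)
        # Strip optional surrounding quotes; existing variables are not overwritten.
        os.environ.setdefault(key.strip(), value.strip().strip("\"'"))
    return True
```

In main.py you could call load_env(Path(__file__).parent / ".env") so the path resolves next to the code regardless of the working directory.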

A health check is more than a convenience: your hosting platform pings an endpoint like this to decide whether your process is alive and should keep receiving traffic. Keep it cheap and dependency-free, so a temporary OpenAI outage does not make the platform think your whole service is down and restart it in a loop.

Step 2: Authenticate every request

Authentication is just answering "who is this caller?" before you do any work. For an MVP, the simplest reliable scheme is an API key per user: you hand each customer a secret string, they send it on every request, and you look it up. We will keep the user table in a Python dict for now; swapping it for a real database is a later step covered in Add User Authentication to a Python AI App.

FastAPI's dependency system lets you attach this check to any endpoint with one line. The caller sends their key in an X-API-Key header, and the dependency turns that key into a user record or rejects the request with HTTP 401.

# auth.py
from fastapi import Header, HTTPException

# In a real app this lives in a database. Each user has a plan limit.
USERS = {
    "key_free_abc": {"id": "u_1", "plan": "free", "monthly_limit": 20},
    "key_pro_xyz": {"id": "u_2", "plan": "pro", "monthly_limit": 5000},
}


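def current_user(x_api_key: str | None = Header(default=None)) -> dict:
    # The header is optional at the validation layer so that a missing key
    # reaches this check and gets a 401 instead of FastAPI's default 422.
    user = USERS.get(x_api_key) if x_api_key else None
    if user is None:
        raise HTTPException(status_code=401, detail="Invalid or missing API key")
    return user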

Any endpoint that adds user: dict = Depends(current_user) to its signature is now protected: unauthenticated calls never reach your model. This is the same idea as a login, just expressed as a key instead of a username and password, which suits machine-to-machine SaaS APIs well.

Two details separate a toy version of this from one you can ship. First, store keys hashed, not in plain text: if your user table leaks, plain-text keys hand an attacker every customer's account, whereas a hash reveals nothing usable. Keep a short non-secret prefix in plain text to look the row up quickly, then verify the rest against the hash. Second, compare keys in constant time with Python's secrets.compare_digest, which avoids the timing leaks a naive == on secrets can introduce. Neither matters for a localhost demo, but both are cheap to add before your first real customer.

Step 3: Rate-limit and meter usage

This is the step that protects your bank account. Before you call the model, you check how many requests the user has made this period against their plan limit. If they are over, you reject with HTTP 429 (the standard "too many requests" status) and never touch the paid API. If they are under, you let the call through and increment their counter only after it succeeds, so failed calls do not eat someone's quota.

For an MVP, an in-memory counter is fine. The catch, which trips up almost everyone, is that an in-memory dict resets when the process restarts and is not shared across multiple workers. Once you run more than one process you must move this state to a shared store like Redis, which is the focus of Rate-Limit AI API Calls in a SaaS with Python.

# usage.py
from collections import defaultdict

# user_id -> number of AI calls this period. Reset monthly in production.
_usage: dict[str, int] = defaultdict(int)


def remaining(user: dict) -> int:
    return user["monthly_limit"] - _usage[user["id"]]


def record_call(user: dict) -> None:
    _usage[user["id"]] += 1

Checking before and recording after keeps the logic honest: a user is only charged a request when they actually got a result. We will wire the 429 rejection into the endpoint in the next step.
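
The comment in usage.py says to reset the counter monthly in production. One simple way to get that without a scheduled job is to key the counter on the period as well as the user, sketched below. It assumes calendar months rather than per-customer billing cycles, and passing today as an argument is a choice made here to keep the functions easy to test.

```python
from collections import defaultdict
from datetime import date

# (user_id, "YYYY-MM") -> calls in that month. Hypothetical layout.
_usage: dict[tuple[str, str], int] = defaultdict(int)


def period(today: date) -> str:
    return today.strftime("%Y-%m")


def remaining(user: dict, today: date) -> int:
    return user["monthly_limit"] - _usage[(user["id"], period(today))]


def record_call(user: dict, today: date) -> None:
    _usage[(user["id"], period(today))] += 1
```

A new month produces a new key, so the count starts from zero on its own and the old months stay available for invoicing.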

It helps to separate two limits people often conflate. A plan limit is a business rule, the number of calls a customer paid for this month, resetting on their billing cycle. A burst limit is an abuse guard, a cap on calls per second or minute, that stops one client from hammering you regardless of how much they paid. The counter above handles the plan limit; the burst limit wants a separate short-lived window ("no more than five requests in ten seconds"), which is far easier in Redis than a plain dict because Redis can expire keys for you. Whenever you return a 429, include a Retry-After header so well-behaved clients back off instead of retrying in a tight loop and making the problem worse.
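
For a single process, the burst limit can be sketched as a sliding window of timestamps per user. The class name BurstLimiter and the injectable clock are assumptions for illustration, and like the plan counter this state is lost on restart and not shared across workers.

```python
import math
import time
from collections import defaultdict, deque


class BurstLimiter:
    """Allow at most max_calls per window seconds for each user (single process only)."""

    def __init__(self, max_calls: int = 5, window: float = 10.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window = window
        self.clock = clock
        self._hits: dict[str, deque] = defaultdict(deque)

    def check(self, user_id: str) -> int:
        """Return 0 if the call is allowed, else seconds to put in Retry-After."""
        now = self.clock()
        hits = self._hits[user_id]
        while hits and now - hits[0] >= self.window:
            hits.popleft()  # drop timestamps that have left the window
        if len(hits) < self.max_calls:
            hits.append(now)
            return 0
        # Time until the oldest call in the window expires, at least one second.
        return max(1, math.ceil(hits[0] + self.window - now))
```

In an endpoint, a non-zero return value would become the Retry-After header on the 429 response.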

Step 4: Call the model and return a billable result

Now the actual AI work. The openai SDK gives you a client whose chat.completions.create method sends your prompt and returns the model's reply plus a usage object with exact token counts. Those token counts are what you eventually turn into money, so capture them and store them with the call. Returning them in the response also lets your front end show users how much they have spent.

# ai.py
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def run_completion(prompt: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a concise, helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        max_tokens=400,
    )
    return {
        "text": response.choices[0].message.content,
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
    }

This function is intentionally narrow: it takes a prompt and returns text plus token counts. Everything else, the auth check, the limit check, and the usage record, wraps around it in the endpoint. The worked example below ties all four steps into one file you can run. To turn those recorded tokens into invoices, see Add Stripe Billing to an AI SaaS with Python.

Two production habits belong here. The first is cost control beyond the call count. Counting requests is blunt, because one user's prompts might be ten times longer than another's and you pay per token, not per request. Even on a flat plan, watch these counts: a spike in average tokens per call is an early warning that someone is pasting huge documents into your prompt and eroding your margin. Capping max_tokens bounds the output side; validating prompt length (as the worked example does with max_length) bounds the input side. Together they put a predictable ceiling on the cost of any single call.
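
To see why per-token cost matters, it helps to turn token counts into money. The sketch below does that arithmetic with Decimal. The two prices are hypothetical placeholders, not quoted rates, so substitute the figures from your provider's current price list.

```python
from decimal import Decimal

# HYPOTHETICAL prices in USD per one million tokens; check your provider's price list.
PRICE_PER_M_INPUT = Decimal("0.15")
PRICE_PER_M_OUTPUT = Decimal("0.60")


def call_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimated USD cost of one call from its token counts."""
    total = input_tokens * PRICE_PER_M_INPUT + output_tokens * PRICE_PER_M_OUTPUT
    return total / Decimal(1_000_000)
```

Under these assumed prices, a call with 2,000 input tokens and 400 output tokens costs (2000 × 0.15 + 400 × 0.60) / 1,000,000 = 0.00054 USD. A second call with ten times the input costs roughly six times as much even though both count as one request.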

The second habit is failing gracefully. Model calls go over the network, so they will occasionally time out, get rate-limited by the provider, or return a refusal. Wrap the call in a try/except and translate provider errors into clean HTTP responses, so a hiccup at OpenAI surfaces as a tidy 503 rather than an unhandled traceback. Pass a timeout to the client so a stalled request fails fast instead of holding a worker open and blocking other customers.

Parameter reference

These are the values you will tune most often as your MVP grows. The model parameters are passed to chat.completions.create; the others come from the code above.

Parameter | Type | Default | Effect
--- | --- | --- | ---
model | str | gpt-4o-mini | Which model runs. Cheaper models cut cost; larger ones improve quality.
max_tokens | int | 400 | Caps output length. Lower values reduce cost and runaway responses.
temperature | float | 1.0 | Randomness of output. Set 0.2 for factual tasks, higher for creative ones.
monthly_limit | int | per plan | Maximum AI calls a user may make before getting HTTP 429.
X-API-Key | header | required | The caller's secret. Resolves to a user or triggers HTTP 401.
timeout | float | SDK default | Seconds to wait on the model before failing. Pass to OpenAI(timeout=30).

Troubleshooting

These are the errors you will actually hit while building this service, with the cause and the one-line fix.

  1. openai.AuthenticationError: Incorrect API key provided — Your OPENAI_API_KEY is missing, mistyped, or not loaded. Confirm /health shows openai_key_loaded: true and that the key starts with sk-. See Fix the 401 Unauthorized Error in OpenAI Python.
  2. fastapi returns 422 Unprocessable Entity — The request body did not match your Pydantic model, usually a missing field or wrong type. Check the response detail; it names the exact field that failed.
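  3. 401 Invalid or missing API key on every call — The key you sent is not in the USERS dict, or the X-API-Key header is missing or misspelled (depending on how the header parameter is declared, a missing header can surface as a 422 instead). In /docs, fill in the x-api-key field on the endpoint before sending.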
  4. openai.RateLimitError: 429 from OpenAI itself — This is the provider throttling you, not your own limit. Slow your calls or add retries with backoff. See Fix the 429 Rate-Limit Error in Python.
  5. Usage counter resets after a code change: the --reload flag restarts the process, wiping the in-memory _usage dict. This is expected in development; move the counter to Redis for anything real.
  6. AttributeError: 'NoneType' object has no attribute 'content' — The model returned no message, often because the request was filtered or max_tokens was 0. Log the full response object and check finish_reason.
  7. openai.APITimeoutError under load — The default timeout is generous, and a slow model call can hold a worker open while other requests queue behind it. Pass an explicit OpenAI(timeout=30) and surface a 503 to the caller so one stalled call cannot back up your whole service.
  8. Two workers report different calls_used for the same user — Each uvicorn worker has its own in-memory usage dict, so the count splits across processes and your limit effectively multiplies by the worker count. This is the sign that you have outgrown the dict and need a shared store like Redis.
  9. FileNotFoundError: '.env' on startup — The loader runs at import time and cannot find the file because the server started from a different directory. Use Path(__file__).parent / ".env" so it resolves next to your code regardless of the working directory.
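
Item 4 suggests retries with backoff. A generic sketch of that idea follows. The helper name with_backoff and its parameters are hypothetical, and it is written against any callable so it can be tried without a network. The openai client also accepts a max_retries setting of its own, so check whether that already covers your case before adding a wrapper.

```python
import random
import time


def with_backoff(call, retry_on=(Exception,), attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run call(); on a retryable error wait base_delay * 2**n plus jitter, then retry."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: let the endpoint translate this into a 503
            delay = base_delay * (2 ** attempt)
            # Jitter spreads retries out so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

In the endpoint this might wrap the model call as with_backoff(lambda: client.chat.completions.create(...), retry_on=(RateLimitError,)). Keep attempts small, because every retry holds the worker for longer.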

Worked example: a complete AI SaaS endpoint

This single file ties together all four steps: it loads your key, authenticates the caller, enforces their rate limit, calls the model, records usage, and returns a billable result. Save it as app.py, run uvicorn app:app --reload, and test it from http://127.0.0.1:8000/docs.

# app.py — a minimal, runnable AI SaaS endpoint
import os
from collections import defaultdict
from pathlib import Path

from fastapi import Depends, FastAPI, Header, HTTPException
from openai import OpenAI, APIError, RateLimitError
from pydantic import BaseModel, Field

# Load .env into the environment (keep .env in .gitignore).
# Resolve the path next to this file so it works from any working directory.
for _line in (Path(__file__).parent / ".env").read_text().splitlines():
    if _line and not _line.startswith("#") and "=" in _line:
        _k, _v = _line.split("=", 1)
        os.environ.setdefault(_k.strip(), _v.strip())

app = FastAPI(title="AI SaaS MVP")
client = OpenAI(timeout=30)  # fail fast instead of holding a worker open

# Stand-in for a user table and a usage table (use a database in production).
USERS = {"key_pro_xyz": {"id": "u_2", "plan": "pro", "monthly_limit": 5000}}
usage: dict[str, int] = defaultdict(int)


class GenerateRequest(BaseModel):
    # max_length bounds the *input* cost; the model's max_tokens bounds output.
    prompt: str = Field(..., min_length=3, max_length=2000)


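def current_user(x_api_key: str | None = Header(default=None)) -> dict:
    # Optional at the validation layer, so a missing header gets 401, not 422.
    user = USERS.get(x_api_key) if x_api_key else None  # auth: resolve the key to a user
    if user is None:
        raise HTTPException(status_code=401, detail="Invalid or missing API key")
    return user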


@app.get("/health")
def health() -> dict:
    # Cheap, dependency-free check your host pings to decide you are alive.
    return {"status": "ok"}


@app.post("/v1/generate")
def generate(req: GenerateRequest, user: dict = Depends(current_user)) -> dict:
    # Rate limit: refuse before spending money if the user is out of calls.
    if usage[user["id"]] >= user["monthly_limit"]:
        raise HTTPException(
            status_code=429,
            detail="Monthly limit reached",
            headers={"Retry-After": "3600"},  # tell good clients when to retry
        )
    try:
        resp = client.chat.completions.create(  # the paid AI work
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": req.prompt}],
            max_tokens=400,  # caps output length, and therefore cost
        )
    except RateLimitError:  # OpenAI throttled us, not the user's plan
        raise HTTPException(status_code=503, detail="Model busy, retry shortly")
    except APIError:  # any other provider-side failure
        raise HTTPException(status_code=502, detail="Model call failed")

    usage[user["id"]] += 1  # record only after a successful call, for billing
    return {
        "text": resp.choices[0].message.content,
        "input_tokens": resp.usage.prompt_tokens,   # store these per call
        "output_tokens": resp.usage.completion_tokens,
        "tokens": resp.usage.total_tokens,           # the number you bill on
        "calls_used": usage[user["id"]],
        "calls_remaining": user["monthly_limit"] - usage[user["id"]],
    }

Send a POST to /v1/generate with the header X-API-Key: key_pro_xyz and a JSON body like {"prompt": "Write a tagline for a dog-walking app"}. You get back the model's text, the input and output token counts for billing, and how many calls the user has spent and has left. Notice how the four guardrails read top to bottom: validate the body (Pydantic), confirm the caller (Depends), check the limit, do the paid work inside a try, and only then record usage. Each fails closed with a clear status code rather than letting a bad request reach the expensive part. That is a complete, paid AI service in under 60 lines, missing only a real database and a payment processor, both of which are next.

The split between input_tokens and output_tokens is deliberate: providers price the two differently and output is usually dearer. Storing both per call, rather than just the total, lets you later answer "which customers cost us the most to serve" and "is our flat plan still profitable," and it costs nothing to capture now.
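
As a rough illustration of storing both counts per call, the sketch below logs each call to an in-memory SQLite table and totals tokens per user. The table name usage_log and the helper names are hypothetical. The single shared connection is only suitable for a single-threaded demo, not for a FastAPI app serving concurrent requests.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # use a file or a real database in practice
db.execute(
    "CREATE TABLE usage_log ("
    " user_id TEXT NOT NULL,"
    " input_tokens INTEGER NOT NULL,"
    " output_tokens INTEGER NOT NULL,"
    " created_at TEXT DEFAULT CURRENT_TIMESTAMP)"
)


def log_call(user_id: str, input_tokens: int, output_tokens: int) -> None:
    """Append one row per successful model call."""
    db.execute(
        "INSERT INTO usage_log (user_id, input_tokens, output_tokens) VALUES (?, ?, ?)",
        (user_id, input_tokens, output_tokens),
    )
    db.commit()


def tokens_by_user() -> list[tuple[str, int, int]]:
    """Total input and output tokens per user, heaviest output first."""
    return db.execute(
        "SELECT user_id, SUM(input_tokens), SUM(output_tokens)"
        " FROM usage_log GROUP BY user_id ORDER BY SUM(output_tokens) DESC"
    ).fetchall()
```

Because the rows keep input and output separate, the "which customers cost us the most" question becomes a single GROUP BY query, and a price change later does not require re-collecting anything.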

Next steps

You have the core loop working. Build out the production pieces in this order, each in its own guide:

  1. Replace the dict-based users with real sign-up and sessions in Add User Authentication to a Python AI App.
  2. Move the usage counter to a shared store so it survives restarts and multiple workers, following Rate-Limit AI API Calls in a SaaS with Python.
  3. Turn recorded usage into revenue with Add Stripe Billing to an AI SaaS with Python.
  4. If your product is conversational, add session memory and prompt design from Custom AI Chatbot Development; if it works with customer records, see CRM Data Integration with AI.

Back to Building AI-Powered Business Applications.