Manual keyword research falls apart the moment your spreadsheet passes a few hundred rows. You scroll, you eyeball, you guess which terms belong together, and you miss obvious overlaps because two phrases happen to use different words for the same thing. The classic failure mode is duplication: you write one page about "cheap flights" and a second about "budget airfare" without realising they serve the same searcher, so the two pages compete with each other in Google instead of one strong page winning. A Python workflow fixes this: it reads your keywords, measures how similar they are in meaning, groups the related ones automatically, and ranks the groups so you know what to write first.
This guide is for creators, marketers, and founders who can run a Python script but are not full-time developers. By the end you will have a small program that takes a flat list of search terms and returns tidy groups of related keywords, each scored by how worthwhile it is to target. You do not need a degree in machine learning, and you do not need an expensive all-in-one SEO platform. You need a list of keywords, an OpenAI API key, and about thirty minutes.
We will build the pipeline in four steps: fetch your keywords, turn each one into numbers that capture its meaning (an embedding), group the related ones with a standard clustering algorithm, and finally prioritise the groups. Each step is a small, self-contained function, so you can run them one at a time in a notebook while you learn, then chain them once you trust the output — the same four functions become the worked example at the end of the page. This page sits under the AI Content Creation & Marketing Automation hub, and it feeds directly into the writing guides over in AI Copywriting Workflows.
Prerequisites
You need Python 3.10 or newer. Check your version with python --version. If it prints anything lower, follow How to Install Python for AI on Windows or How to Install Python for AI Projects on Mac first.
Create an isolated workspace so these libraries do not collide with other projects. If virtual environments are new to you, Create a Python Virtual Environment for AI walks through it.
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install openai httpx pandas scikit-learn numpy python-dotenv
You will need an OpenAI API key. Create one in your OpenAI account dashboard, then store it in a file named .env in your project folder. A .env file keeps secrets out of your code so you never paste a key into a script by accident.
OPENAI_API_KEY=sk-your-real-key-here
Add .env to your .gitignore immediately so the key is never committed to version control. One careless git push with a live key in it is the single most common way beginners leak credentials.
echo ".env" >> .gitignore
That is the entire setup. Everything below assumes this environment is active, which you can confirm at any time: the shell prompt should show (.venv) and pip show openai should print a version number rather than "not found". If you close your terminal and come back later, re-run source .venv/bin/activate first — the activation only lasts for the current session.
One cost note before we start, because beginners often worry about surprise bills. Embedding text is one of the cheapest things you can do with the OpenAI API: text-embedding-3-small costs roughly two cents per million tokens, and an average keyword is only a few tokens long, so embedding ten thousand keywords costs a fraction of a cent. The grouping and prioritising steps run entirely on your own machine with scikit-learn and cost nothing. You can run this whole workflow dozens of times while you experiment without it showing up meaningfully on your bill.
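The arithmetic behind that estimate can be checked in a few lines. This is a rough sketch: the four-tokens-per-keyword average is an assumption, the price is the approximate figure quoted above, and estimate_embedding_cost is a hypothetical helper name, so check current pricing before relying on the result.

```python
def estimate_embedding_cost(
    n_keywords: int,
    tokens_per_keyword: float = 4.0,       # assumption: short phrases average a few tokens
    usd_per_million_tokens: float = 0.02,  # approximate price quoted above; verify before use
) -> float:
    """Return the estimated cost in US dollars of embedding n_keywords."""
    total_tokens = n_keywords * tokens_per_keyword
    return total_tokens / 1_000_000 * usd_per_million_tokens


cost = estimate_embedding_cost(10_000)
print(f"10,000 keywords: about ${cost:.4f}")
```

Under these assumptions ten thousand keywords come to well under one cent, which matches the claim above.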
Step 1: Fetch and clean your keyword list
Every workflow starts with a flat list of search terms. The easiest source is a CSV export from Google Search Console, a free keyword tool, or a spreadsheet you already keep. If you want to pull data from a SERP API instead, the httpx pattern below works for any provider that returns JSON. We prefer httpx over the older requests library because it offers a very similar API, applies a timeout by default, and supports HTTP/2 and async requests when you need them.
import httpx
import pandas as pd


def load_keywords_from_csv(path: str) -> pd.DataFrame:
    """Load keywords from a CSV with at least a 'keyword' column."""
    df = pd.read_csv(path)
    df["keyword"] = df["keyword"].astype(str).str.strip().str.lower()
    df = df.drop_duplicates(subset="keyword")
    df = df[df["keyword"].str.len() > 2].reset_index(drop=True)
    return df


def fetch_keywords_from_api(seed: str, api_key: str) -> pd.DataFrame:
    """Example SERP-API call. Swap the URL and field names for your provider."""
    response = httpx.get(
        "https://api.your-serp-provider.com/v1/keywords",
        params={"q": seed, "api_key": api_key, "limit": 100},
        timeout=30.0,
    )
    response.raise_for_status()
    rows = response.json()["data"]
    return pd.DataFrame(rows)


keywords_df = load_keywords_from_csv("keywords.csv")
print(f"Loaded {len(keywords_df)} unique keywords")
The cleaning matters more than it looks, and it is worth understanding each line rather than copying it blindly. astype(str) guards against pandas parsing a column of numeric-looking keywords as numbers, which would make the .str methods that follow raise an error. str.strip() removes leading and trailing spaces, which are invisible in a spreadsheet but make "python seo" and "python seo " look like two different terms. str.lower() folds case so "Python SEO" collapses into the same term. drop_duplicates then removes the exact repeats those steps just exposed, often a surprising fraction of a raw export. Finally, the length filter drops rows shorter than three characters, clearing out stray single letters and other junk that would otherwise waste an embedding call and pollute a group. One gap to know about: astype(str) turns an empty cell into the three-character string "nan", which passes this filter, so if your export has blank rows, call df.dropna(subset=["keyword"]) before the astype line.
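To watch each cleaning rule act on real values, the same steps can be mirrored in plain Python on a tiny list. This is only an illustration of the rules, not the pandas code itself; the sample values are invented. It also shows that the string "nan" has exactly three characters, so the length filter on its own lets it through.

```python
raw = ["Python SEO", "python seo ", "a", float("nan"), "keyword clustering"]

# Same rules as the pandas version: stringify, strip whitespace, lower-case.
normalised = [str(k).strip().lower() for k in raw]

# Drop exact repeats while keeping first-seen order.
unique = list(dict.fromkeys(normalised))

# Keep only terms longer than two characters.
cleaned = [k for k in unique if len(k) > 2]

print(cleaned)
```

The two spellings of "python seo" merge into one, the single letter disappears, and "nan" remains, which is why blank rows are best dropped before the values are converted to strings.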
A CSV with a keyword column (and ideally a volume column for monthly search volume) is all the rest of this guide needs. If your export carries extra columns — clicks, impressions, position, difficulty — leave them in. They ride along untouched through the whole pipeline and are waiting in the final output, which means you can sort or filter on them later without re-running anything. The one column the prioritising step looks for by name is volume; everything else is along for the trip.
Step 2: Turn each keyword into an embedding
An embedding is a list of numbers that represents the meaning of a piece of text. The OpenAI API reads each keyword and returns a vector — for text-embedding-3-small that is 1,536 numbers — positioned so that phrases with similar meaning end up close together in mathematical space. A helpful mental picture: imagine every keyword as a pin dropped onto an enormous map with 1,536 dimensions instead of two. Pins for related ideas land near each other, pins for unrelated ideas land far apart, and the distance between two pins measures how similar their meanings are. This is what lets a computer see that "cheap flights" and "budget airfare" belong together even though they share no words, and that "python tutorial" and "python snake care" belong apart even though they share one.
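A toy version of the "pins on a map" picture can make this concrete. The sketch below uses hypothetical three-number vectors invented for illustration (real embeddings have 1,536 numbers and come from the API) and measures closeness with cosine similarity, which compares the direction of two vectors.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-number "embeddings", made up for this example only.
toy = {
    "cheap flights": [0.9, 0.1, 0.0],
    "budget airfare": [0.8, 0.2, 0.1],
    "python snake care": [0.0, 0.1, 0.9],
}

flights_vs_airfare = cosine_similarity(toy["cheap flights"], toy["budget airfare"])
flights_vs_snake = cosine_similarity(toy["cheap flights"], toy["python snake care"])

print(f"cheap flights vs budget airfare:    {flights_vs_airfare:.2f}")
print(f"cheap flights vs python snake care: {flights_vs_snake:.2f}")
```

The first pair scores close to 1 and the second close to 0, which is the numerical form of "near each other" and "far apart".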
The endpoint accepts a whole batch of texts in one call, so embedding a few thousand keywords is a handful of requests, not thousands. We use the official openai SDK, which reads your key from the environment automatically.
import os

import numpy as np
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed_keywords(keywords: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    """Return a 2-D array where each row is one keyword's embedding."""
    vectors: list[list[float]] = []
    for start in range(0, len(keywords), 1000):
        batch = keywords[start : start + 1000]
        response = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in response.data)
    return np.array(vectors, dtype="float32")


embeddings = embed_keywords(keywords_df["keyword"].tolist())
print(f"Embedded into shape {embeddings.shape}")
Batching in chunks of 1,000 keeps each request comfortably under the API's per-request input limit of 2,048 inputs and well within the token budget per call. Just as importantly, the order is preserved: the API returns embeddings in exactly the order you sent the keywords, so row five of the array always corresponds to row five of the DataFrame. That alignment is what lets the later steps match a vector back to its keyword by position alone, with no IDs to track. We store the result as float32 rather than the default float64 because it halves the memory footprint with no measurable loss in grouping quality — a real saving once lists run into the hundreds of thousands.
If you process tens of thousands of keywords, wrap the call in a try/except and add a brief time.sleep between batches so a momentary rate limit does not abandon work you have already paid for. The result is a NumPy array with one row per keyword, ready to feed straight into the grouping step.
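One way to sketch that retry idea is a small wrapper that re-runs a call after a growing pause. with_retries is a hypothetical helper, not part of the openai SDK, and it catches every exception for brevity; in a real pipeline you would likely catch only the rate-limit error. The demonstration uses a stand-in function instead of a real API call, and swaps the sleep for a no-op so it runs instantly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on any exception with a doubling delay between tries."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... by default
    raise RuntimeError("attempts must be at least 1")


# Stand-in for an API call that fails twice and then succeeds.
calls = {"count": 0}


def flaky_embed() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = with_retries(flaky_embed, sleep=lambda seconds: None)  # no real waiting in the demo
print(result, "after", calls["count"], "attempts")
```

Inside embed_keywords the same wrapper could surround the request, for example with_retries(lambda: client.embeddings.create(model=model, input=batch)), so batches that already succeeded are never re-sent.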
Step 3: Group related keywords with k-means
Now we let the machine find the structure. K-means clustering is a classic algorithm that sorts points into a chosen number of groups so that each point sits with its nearest neighbours. Applied to keyword embeddings, it produces groups of related keywords — the natural topic pages you should consider building.
You pick how many groups to create with the n_clusters setting. A useful rule of thumb is one group for every 15 to 25 keywords; we compute that automatically below so the number scales with your list size. The intuition is simple: a page targeting fewer than fifteen related terms is usually too thin to rank well, while one stretched across more than twenty-five tries to answer too many questions at once. Because the count is derived from len(df), a list of 600 keywords produces about 30 groups and a list of 1,200 about 60 — the granularity stays constant as your data grows.
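The rule of thumb is plain integer division with a floor of two. A quick sketch, using a hypothetical helper name cluster_count, shows how the group count responds to list size and to the per_group target:

```python
def cluster_count(n_keywords: int, per_group: int = 20) -> int:
    """Number of k-means groups: one per `per_group` keywords, never fewer than two."""
    return max(2, n_keywords // per_group)


for n in (30, 600, 1_200):
    print(f"{n} keywords -> {cluster_count(n)} groups")

# The 15-to-25 range from the rule of thumb, applied to 600 keywords.
print(cluster_count(600, per_group=15), "groups at 15 per group")
print(cluster_count(600, per_group=25), "groups at 25 per group")
```

Note the floor of two: a very short list still gets two groups, so a list with fewer than two keywords cannot be clustered at all.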
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize


def group_keywords(df: pd.DataFrame, embeddings: np.ndarray, per_group: int = 20) -> pd.DataFrame:
    """Add a 'group_id' column by clustering the embeddings."""
    n_clusters = max(2, len(df) // per_group)
    # Normalising makes k-means group by direction (meaning), not magnitude.
    unit_vectors = normalize(embeddings)
    model = KMeans(n_clusters=n_clusters, random_state=42, n_init="auto")
    df = df.copy()
    df["group_id"] = model.fit_predict(unit_vectors)
    return df


grouped_df = group_keywords(keywords_df, embeddings)
print(grouped_df.groupby("group_id").size().sort_values(ascending=False).head())
Normalising the vectors before clustering is a small but important detail. By default k-means measures straight-line distance, which is sensitive to how long each vector is as well as which way it points. For embeddings the length is mostly noise and the direction carries the meaning, so normalize() rescales every vector to the same length and leaves only its direction to compare. That is the difference between groups that track topics and groups that cluster by wording quirks. The random_state=42 argument fixes the random seed, so the same input produces the same groups every run. n_init="auto" lets scikit-learn choose how many initialisations to try: with the default k-means++ initialisation that is a single run, so pass an integer such as n_init=10 if you want several restarts with the best result kept, which guards against an unlucky start that lands in a poor split.
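The effect of normalising is easy to see on two-number vectors. In this sketch (toy values invented for illustration, with hypothetical helper names), two vectors point the same way but differ in length, and a third has the same length as the first but points elsewhere. Straight-line distance gets the comparison backwards until every vector is rescaled to length one.

```python
import math


def euclidean(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def to_unit(v: list[float]) -> list[float]:
    """Rescale a vector to length 1, keeping its direction."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


short = [1.0, 1.0]
long_same_direction = [10.0, 10.0]  # same direction as `short`, ten times longer
other_direction = [1.0, -1.0]       # same length as `short`, different direction

raw_same = euclidean(short, long_same_direction)
raw_other = euclidean(short, other_direction)
unit_same = euclidean(to_unit(short), to_unit(long_same_direction))
unit_other = euclidean(to_unit(short), to_unit(other_direction))

print(f"raw:  same direction {raw_same:.2f}, other direction {raw_other:.2f}")
print(f"unit: same direction {unit_same:.2f}, other direction {unit_other:.2f}")
```

Before normalising, the same-direction pair looks further apart than the different-direction pair; afterwards the same-direction pair sits at essentially zero distance, which is the behaviour you want when direction carries the meaning.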
If you want a deeper treatment of this exact technique, including how to test several values of n_clusters and how to spot when k-means is the wrong tool, the dedicated guide Group Keywords with Python and Embeddings covers tuning and alternatives.
Step 4: Label and prioritise each group
A group of keyword IDs is not actionable until you can read it and rank it. Two things make it useful: a human-readable label (the keyword closest to the group's centre is a great summary) and a score that tells you which group to write for first. A simple, honest score is total search volume across the group — that is the size of the audience a page targeting it could reach.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def summarise_groups(df: pd.DataFrame, embeddings: np.ndarray) -> pd.DataFrame:
    summaries = []
    for group_id, members in df.groupby("group_id"):
        idx = members.index.to_numpy()
        centre = embeddings[idx].mean(axis=0, keepdims=True)
        closest = idx[cosine_similarity(centre, embeddings[idx])[0].argmax()]
        label = df.loc[closest, "keyword"]
        volume = int(members["volume"].sum()) if "volume" in df else len(members)
        summaries.append(
            {"group_id": group_id, "label": label,
             "keywords": len(members), "opportunity": volume}
        )
    return pd.DataFrame(summaries).sort_values("opportunity", ascending=False)


report = summarise_groups(grouped_df, embeddings)
print(report.head(10).to_string(index=False))
The labelling logic is worth pausing on, because it is what turns an anonymous group number into something you can act on. For each group we average all its member vectors to find the group's centre — the mathematical "middle" of that cluster of meanings — and then pick the single real keyword closest to that centre using cosine similarity. That keyword is the most representative phrase in the group, a far better label than an arbitrary first row.
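The same centre-then-closest-member logic can be traced by hand on a tiny group. This sketch uses plain Python and hypothetical two-number vectors invented for the example, so the arithmetic is easy to follow; the pipeline does the identical thing with NumPy on the real embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# One hypothetical group: keyword -> made-up 2-number embedding.
members = {
    "cheap flights": [1.0, 0.0],
    "budget airfare": [0.8, 0.6],
    "low cost plane tickets": [0.0, 1.0],
}

# The centre is the average of the member vectors, one dimension at a time.
n_dims = 2
centre = [
    sum(vec[i] for vec in members.values()) / len(members)
    for i in range(n_dims)
]

# The label is the real keyword whose vector points most nearly at the centre.
label = max(members, key=lambda kw: cosine(members[kw], centre))
print("centre:", [round(c, 3) for c in centre], "label:", label)
```

Here the middle phrase wins because it sits between the other two, which is exactly why the central keyword tends to read as a fair summary of the group.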
The output is a ranked table: each row is one topic page, named after its most representative keyword, with the number of keywords it covers and an opportunity score. Write for the top rows first. If your CSV has no volume column the score falls back to group size, which still surfaces the broadest, most-supported topics. Search volume is an honest first-pass score, but two refinements pay off quickly: weight the score by how easy each group looks to rank for if your export carries a difficulty column, and skim the smallest groups by hand, because a tight cluster of three high-intent buying terms can be worth more than a sprawling group of fifty informational ones. The numbers point you at the opportunities; your judgement still picks the order.
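As one possible way to fold difficulty into the score, volume can be scaled down as difficulty rises. Everything in this sketch is an assumption for illustration: the difficulty values are taken to run from 0 (easy) to 100 (hard), the group figures are invented, and the weighting formula is a simple choice rather than a standard.

```python
def weighted_opportunity(volume: float, difficulty: float) -> float:
    """Scale volume by ease of ranking; difficulty is assumed to be on a 0-100 scale."""
    return volume * (1 - difficulty / 100)


# Invented example groups with a total volume and an average difficulty.
groups = [
    {"label": "cheap flights", "volume": 12_000, "difficulty": 80},
    {"label": "flight price alerts", "volume": 5_000, "difficulty": 20},
]

ranked = sorted(
    groups,
    key=lambda g: weighted_opportunity(g["volume"], g["difficulty"]),
    reverse=True,
)

for g in ranked:
    score = weighted_opportunity(g["volume"], g["difficulty"])
    print(f'{g["label"]}: {score:.0f}')
```

With these numbers the smaller but easier group moves to the top, which is the kind of reordering the refinement is meant to surface.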
Parameter reference
| Name | Type | Default | Effect |
|---|---|---|---|
| model (embeddings) | str | "text-embedding-3-small" | Which OpenAI embedding model to use. -small is cheap and accurate; -large is more precise but slower and pricier. |
| input (embeddings) | list[str] | required | The batch of keywords to embed. Up to 2,048 items per request. |
| n_clusters | int | computed | How many groups k-means creates. More groups means tighter, more specific topics. |
| per_group | int | 20 | Target keywords per group; drives the n_clusters calculation. Lower it for finer splits. |
| n_init | str / int | "auto" | How many initialisations k-means tries, keeping the best. With the default k-means++ initialisation, "auto" means a single run; pass an integer such as 10 for several restarts. |
| random_state | int | 42 | Fixes the random seed so the same input gives the same groups on every run. |
| timeout (httpx) | float | 30.0 | Seconds before an API request is abandoned. Raise it on slow connections. |
Troubleshooting
- openai.AuthenticationError: Incorrect API key provided. The key in .env is wrong, expired, or not being loaded. Confirm load_dotenv() runs before you create the client, and check there are no quotes or spaces around the key. The full walk-through is in Fix the 401 Unauthorized Error in OpenAI Python.
- openai.RateLimitError: Rate limit reached. You sent batches too quickly. Add a short time.sleep(1) between requests, or shrink the batch size. See Fix the 429 Rate-Limit Error in Python for a retry pattern.
- KeyError: 'keyword'. Your CSV's column is named something else (often Keyword or Query). Rename it with df = df.rename(columns={"Query": "keyword"}) right after loading, or set the correct name in pd.read_csv.
- ValueError: n_samples=… should be >= n_clusters=…. You asked for more groups than you have keywords. This happens on tiny lists; the max(2, len(df) // per_group) guard prevents it for any list of at least two keywords, so make sure you are using the helper rather than a hard-coded n_clusters.
- httpx.ReadTimeout. The SERP API took longer than your timeout. Raise it to 60.0, and confirm the endpoint URL is correct, since a wrong path often hangs instead of returning an error.
- Groups look random or mixed. You probably skipped normalising the vectors, or your keyword list is too small for embeddings to find structure. Confirm normalize() runs before fit_predict, and aim for at least a few hundred keywords.
- ModuleNotFoundError: No module named 'openai' (or sklearn). The library is missing from the active environment, almost always because the virtual environment is not activated or you installed into a different one. Re-run source .venv/bin/activate, confirm the prompt shows (.venv), then re-run the pip install line. Note the import name for scikit-learn is sklearn, not the package name scikit-learn.
- The same keyword seems to label several groups. Each keyword belongs to exactly one group, so what you are seeing is near-duplicate keywords that survived cleaning (for example, differing only in trailing punctuation), with distinct groups centred on almost the same phrase. Tighten the cleaning step, or lower per_group so the algorithm draws sharper boundaries, and re-run.
- KeyError: 'data' or TypeError: 'NoneType' object is not subscriptable when reading response.json(). The SERP API returned an error body without the data field you indexed (KeyError), or a JSON null in its place (TypeError). Print response.text before parsing to see the real message; it is usually an auth failure or a malformed query parameter rather than a bug in your code.
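For the SERP API case in particular, a small guard can turn a confusing indexing error into a readable message. This is a sketch under the same assumption as the Step 1 example, namely that a successful response is a JSON object with a data list; extract_rows is a hypothetical helper, and your provider's field names may differ.

```python
def extract_rows(payload: object) -> list:
    """Return the 'data' list from a parsed JSON body, or raise a clear error."""
    if not isinstance(payload, dict):
        raise ValueError(f"Expected a JSON object, got {type(payload).__name__}: {payload!r}")
    rows = payload.get("data")
    if not isinstance(rows, list):
        # Typical for error bodies such as {"error": "invalid api key"}.
        raise ValueError(f"Response has no 'data' list: {payload!r}")
    return rows


# A well-formed body passes through unchanged.
print(extract_rows({"data": [{"keyword": "cheap flights", "volume": 1200}]}))

# An error body produces a message that shows what the API actually said.
try:
    extract_rows({"error": "invalid api key"})
except ValueError as exc:
    print("Problem:", exc)
```

In fetch_keywords_from_api this would replace the direct ["data"] lookup, as in rows = extract_rows(response.json()).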
Worked example: full pipeline in one script
This script ties the four steps together. Save it as keyword_research.py, put a keywords.csv (with keyword and optional volume columns) beside it, and run python keyword_research.py. It writes a ranked keyword_groups.csv you can open in any spreadsheet.
import os

import numpy as np
import pandas as pd
from dotenv import load_dotenv
from openai import OpenAI
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

load_dotenv()  # load OPENAI_API_KEY from .env into the environment
client = OpenAI()  # SDK reads the key automatically; no key in code

# 1. Load and clean the keyword list.
df = pd.read_csv("keywords.csv")  # expects a 'keyword' column
df["keyword"] = df["keyword"].astype(str).str.strip().str.lower()  # normalise casing/spaces
df = df.drop_duplicates("keyword")  # remove the repeats that cleaning exposes
df = df[df["keyword"].str.len() > 2].reset_index(drop=True)  # drop junk; reset index to 0..n
print(f"Loaded {len(df)} keywords")

# 2. Embed every keyword in batches of 1,000 (well under the 2,048-input limit).
vectors: list[list[float]] = []
for start in range(0, len(df), 1000):  # walk the list in 1,000-row windows
    batch = df["keyword"].iloc[start : start + 1000].tolist()
    resp = client.embeddings.create(model="text-embedding-3-small", input=batch)
    vectors.extend(item.embedding for item in resp.data)  # order is preserved by the API
embeddings = normalize(np.array(vectors, dtype="float32"))  # to unit vectors: compare by meaning

# 3. Group related keywords with k-means (≈20 keywords per group).
n_clusters = max(2, len(df) // 20)  # scales with list size; min of 2
df["group_id"] = KMeans(n_clusters=n_clusters, random_state=42,
                        n_init="auto").fit_predict(embeddings)  # reproducible for the same input

# 4. Label each group by its central keyword and score it by volume.
rows = []
for gid, members in df.groupby("group_id"):  # one iteration per group
    idx = members.index.to_numpy()  # row positions of this group
    centre = embeddings[idx].mean(axis=0, keepdims=True)  # the group's average direction
    # The member closest to that centre is the most representative phrase -> the label.
    label = df.loc[idx[cosine_similarity(centre, embeddings[idx])[0].argmax()], "keyword"]
    volume = int(members["volume"].sum()) if "volume" in df else len(members)  # score, with fallback
    rows.append({"group": gid, "topic": label, "keywords": len(members), "opportunity": volume})

report = pd.DataFrame(rows).sort_values("opportunity", ascending=False)  # best opportunities first
report.to_csv("keyword_groups.csv", index=False)  # open this in any spreadsheet
print(report.head(10).to_string(index=False))
Roughly forty lines turn a messy export into a prioritised content plan. Run it again whenever your keyword list grows. The fixed random_state makes the output reproducible for an identical input, but adding keywords changes both the data and n_clusters, so expect some group boundaries and group_id numbers to shift between runs rather than assuming earlier groups stay untouched.
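If you would rather keep earlier groups fixed as the list grows, one option is to assign each new embedding to the nearest existing group centre instead of re-clustering the whole list. The sketch below assumes you saved the centres from the earlier run (a fitted scikit-learn KMeans model exposes them as cluster_centers_, which the script above does not keep) and that new vectors are normalised the same way. The two-number centres and the nearest_group helper are hypothetical, for illustration only.

```python
import math

# Hypothetical centres saved from an earlier run: group_id -> centre vector.
centres = {
    0: [1.0, 0.0],
    1: [0.0, 1.0],
}


def nearest_group(vector: list[float], centres: dict[int, list[float]]) -> int:
    """Return the id of the group whose centre is closest to `vector`."""
    return min(centres, key=lambda gid: math.dist(vector, centres[gid]))


new_keyword_vector = [0.9, 0.1]  # made-up embedding for a newly added keyword
assigned = nearest_group(new_keyword_vector, centres)
print("new keyword joins group", assigned)
```

This keeps published plans stable at the cost of never creating new groups, so a full re-cluster is still worth doing once the list has grown substantially.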
To read the result, open keyword_groups.csv and start at the top. Each row names a page to write, the topic column gives you a working title, the keywords column tells you how much material you have to cover it, and opportunity ranks the rows by audience size. To inspect the keywords inside a single group rather than just its label, add a line like df[df["group_id"] == 3].to_csv("group_3.csv", index=False) to dump that group's full membership — the fastest way to sanity-check that the grouping matches your own sense of the topic before you commit to writing.
Next steps
You now have ranked groups of related keywords. Here is where each part of the workflow goes deeper:
- Find what competitors rank for that you do not. Run the Python Script for Competitor Keyword Analysis to fill gaps in your list before you embed it.
- Tune the grouping. Group Keywords with Python and Embeddings covers choosing the right number of groups and labelling them more cleanly.
- Turn topics into pages. Feed your top groups into Generate Meta Descriptions in Bulk with Python and the drafting recipes in AI Copywriting Workflows.
- Distribute what you publish. Once pages are live, push them out with Automated Social Media Posting.
Related guides
- Python Script for Competitor Keyword Analysis — find the terms rivals rank for and you miss.
- Group Keywords with Python and Embeddings — a deeper dive into the clustering step.
- Generate Meta Descriptions in Bulk with Python — turn each group into publish-ready metadata.
- AI Copywriting Workflows — draft the pages your keyword groups point to.
- AI Content Creation & Marketing Automation — the main hub for every guide in this track.