This guide shows you how to turn a flat list of keywords into named topic groups in under fifteen minutes, using OpenAI embeddings to measure meaning, k-means clustering to gather similar keywords, and an LLM to give each group a readable label.
If you have ever exported a thousand keywords from a research tool and stared at a wall of phrases, this is the fix. Instead of sorting by hand, you let the maths group keywords that mean the same thing — even when they share no words — and then ask a model to name what each group is about. The result is a CSV where every keyword carries a group number and a plain-English label you can plan content around.
This is a focused task page under the main SEO Keyword Research with Python guide. If you have not pulled a keyword list yet, the Python Script for Competitor Keyword Analysis guide produces exactly the kind of CSV this one expects as input.
Prerequisites
You need Python 3.10 or newer and an OpenAI API key. If you are setting Python up for the first time, follow Create a Python Virtual Environment for AI first, then come back here.
Create a project folder, activate a virtual environment, and install the four libraries this guide uses:
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install pandas numpy scikit-learn openai
Store your API key in a .env file so it never ends up in your code:
OPENAI_API_KEY=sk-your-key-here
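Add .env to your .gitignore immediately so you never commit the key by accident. The OpenAI SDK reads OPENAI_API_KEY from the process environment automatically, but it does not read the .env file itself: either export the variable in your shell, or load the file at the top of your script with a tool such as python-dotenv (a separate install, not included in the command above). As long as the variable is set, you do not have to pass the key in code. If you hit an authentication wall, the Fix the 401 Unauthorized Error in OpenAI Python guide covers the usual causes.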
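If you would rather not add another dependency, a .env file of simple KEY=VALUE lines can be loaded with the standard library alone. The sketch below is one possible approach, not part of the OpenAI SDK: load_env_file is a hypothetical helper name, and it assumes one assignment per line with no multi-line values.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Read simple KEY=VALUE lines into os.environ (hypothetical helper).

    Blank lines and lines starting with # are skipped. Variables that are
    already set in the environment are left untouched.
    """
    loaded: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")  # drop optional quotes
        if key and key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Call load_env_file() once, before creating the OpenAI client, so the key is in the environment by the time the SDK looks for it.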
You also need a keyword file. For this guide, assume a keywords.csv with a single keyword column, one phrase per row.
Step 1: Load your keywords into a DataFrame
Start by reading the CSV into pandas. A DataFrame is just a table in memory — think of a spreadsheet you can manipulate in code. Strip whitespace and drop blanks and duplicates so they do not waste embedding calls.
import pandas as pd
df = pd.read_csv("keywords.csv")
df = df.dropna(subset=["keyword"])  # empty cells load as NaN; drop them before casting to str
df["keyword"] = df["keyword"].astype(str).str.strip()
df = df[df["keyword"] != ""].drop_duplicates("keyword").reset_index(drop=True)
print(f"Loaded {len(df)} unique keywords")
Keeping the list clean here matters because every keyword you send becomes a billed embedding, and duplicates only inflate the cost without changing the result. The reset_index(drop=True) call renumbers the rows from zero after the filtering, which keeps the DataFrame aligned with the embedding array you build in the next step — if the row order and the vector order ever drift apart, every group label will be wrong, so it is worth getting right now. If your raw export is messy, the Cleaning CSV Data with Pandas for AI guide goes deeper on the tidy-up step, including how to handle stray casing and near-duplicate phrases.
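As a small illustration of the casing point, here is a rough sketch in plain Python. It assumes that keywords differing only in case or whitespace should count as one; dedupe_keywords is a hypothetical helper name, and true near-duplicates (plurals, word order) would need more than this.

```python
def dedupe_keywords(keywords: list[str]) -> list[str]:
    """Drop blanks and case/whitespace variants, keeping first-seen order."""
    seen: set[str] = set()
    unique: list[str] = []
    for keyword in keywords:
        cleaned = " ".join(keyword.split())  # trim and collapse inner whitespace
        key = cleaned.casefold()             # case-insensitive comparison key
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


print(dedupe_keywords(["Yoga Mat", "yoga  mat ", "", "espresso machine"]))
# ['Yoga Mat', 'espresso machine']
```

Because the function keeps first-seen order, the cleaned list can still be embedded row by row without losing alignment.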
Step 2: Create embeddings with the OpenAI SDK
An embedding turns each keyword into a list of numbers (a vector) that captures its meaning. Two keywords that mean similar things get similar vectors, which is what makes grouping by meaning possible.
Send the keywords in batches rather than one request per keyword — the embedding endpoint accepts many inputs at once, which is far faster and cheaper. The code below batches in chunks of 256 and stacks the results into a single NumPy array, where each row is one keyword's vector.
import numpy as np
from openai import OpenAI
client = OpenAI() # reads OPENAI_API_KEY from the environment
EMBED_MODEL = "text-embedding-3-small"
def embed_keywords(keywords: list[str], batch_size: int = 256) -> np.ndarray:
    """Embed keywords in batches and return one row per keyword."""
    vectors = []
    for start in range(0, len(keywords), batch_size):
        batch = keywords[start : start + batch_size]
        response = client.embeddings.create(model=EMBED_MODEL, input=batch)
        # response.data follows the order of the inputs in this batch
        vectors.extend(item.embedding for item in response.data)
    return np.array(vectors, dtype=np.float32)
embeddings = embed_keywords(df["keyword"].tolist())
print(f"Embedded into a {embeddings.shape} matrix")
The text-embedding-3-small model returns 1,536 numbers per keyword, so a 1,000-keyword list becomes a 1,000-by-1,536 matrix. That is the input k-means needs. Notice that the order of the returned vectors matches the order you sent the keywords in, which is why the cleaning step preserved row order — response.data comes back in the same sequence as your batch. Storing the array as float32 rather than the default float64 halves the memory it uses with no meaningful loss of accuracy for this task. If you see a rate-limit error on a very large list, the Fix the 429 Rate-Limit Error in Python guide shows how to add backoff between batches.
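For the rate-limit case, one common pattern is exponential backoff: wait, retry, and double the wait each time. The sketch below is generic and makes no API calls itself. with_backoff and its parameters are hypothetical names, and in real use you would narrow retry_on to the SDK's rate-limit exception (in recent versions of the OpenAI Python SDK that should be openai.RateLimitError; check your installed version).

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    retry_on: tuple[type[BaseException], ...] = (Exception,),
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying with exponential backoff plus jitter (hypothetical helper)."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            # Wait 1s, 2s, 4s, ... plus a little jitter so retries do not align.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")
```

Inside embed_keywords, the batch request could then be wrapped as with_backoff(lambda: client.embeddings.create(model=EMBED_MODEL, input=batch)). The sleep parameter exists only so the wait can be swapped out in tests.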
Save the matrix to disk so you never have to pay for the same embeddings twice:
np.save("embeddings.npy", embeddings)
# Reload later with: embeddings = np.load("embeddings.npy")
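To make the reuse automatic, the save and reload can be folded into one helper. This is a minimal sketch, assuming the keyword list has not changed since the file was written; load_or_compute is a hypothetical name, and the row-count check is only a weak guard against a stale cache, not proof that the rows still match.

```python
from pathlib import Path
from typing import Callable

import numpy as np


def load_or_compute(
    path: str, n_rows: int, compute: Callable[[], np.ndarray]
) -> np.ndarray:
    """Reuse a saved matrix when its row count matches, else compute and save it.

    The path should end in .npy, because np.save appends that suffix otherwise.
    """
    cache = Path(path)
    if cache.exists():
        cached = np.load(cache)
        if cached.shape[0] == n_rows:
            return cached
    fresh = compute()
    np.save(cache, fresh)
    return fresh
```

With the names used in this guide, the call would look like load_or_compute("embeddings.npy", len(df), lambda: embed_keywords(df["keyword"].tolist())).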
Step 3: Group the embeddings with k-means
K-means clustering is a classic algorithm that sorts points into a fixed number of groups, where each group is built around its own centre and every point joins the nearest centre. Here the "points" are your keyword vectors, so keywords with similar meaning land in the same group.
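To make the nearest-centre idea concrete, here is a toy sketch of a single assign-and-update round in plain Python on 2-D points. It is an illustration only: scikit-learn's KMeans repeats these two steps many times with smarter starting centres, and the function names here are hypothetical.

```python
import math

Point = tuple[float, float]


def assign(points: list[Point], centres: list[Point]) -> list[int]:
    """Give each point the index of its nearest centre."""
    return [
        min(range(len(centres)), key=lambda i: math.dist(p, centres[i]))
        for p in points
    ]


def update(points: list[Point], labels: list[int], k: int) -> list[Point]:
    """Move each centre to the mean of its members (assumes no group is empty)."""
    centres = []
    for i in range(k):
        members = [p for p, label in zip(points, labels) if label == i]
        centres.append(
            (sum(x for x, _ in members) / len(members),
             sum(y for _, y in members) / len(members))
        )
    return centres


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = assign(points, centres=[(0.0, 0.0), (10.0, 10.0)])
print(labels)                        # [0, 0, 1, 1]
print(update(points, labels, k=2))   # [(0.0, 0.5), (10.0, 10.5)]
```

Keyword embeddings work the same way, only with 1,536 coordinates per point instead of two.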
You have to tell k-means how many groups to find. A solid starting guess is the square root of your keyword count. The code below normalizes the vectors first — scaling each to the same length — so that distance reflects direction (meaning) rather than magnitude, which works well for text embeddings.
import math
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
n_clusters = max(2, round(math.sqrt(len(df))))
normalized = normalize(embeddings) # unit-length vectors, cosine-friendly
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init="auto")
df["group_id"] = kmeans.fit_predict(normalized)
print(df["group_id"].value_counts().sort_index())
After fit_predict, every row in df has a group_id, an integer from 0 to n_clusters - 1. Keywords sharing a number belong together. The random_state=42 argument makes the result reproducible, so re-running the script gives the same groups; change the number or remove it if you want to see how stable the grouping is across runs. The n_init="auto" setting lets scikit-learn choose how many times to restart: with the default k-means++ initialization that is a single run, which is usually adequate because k-means++ already spreads the starting centres apart. To guard more strongly against k-means settling on a poor arrangement by chance, pass an integer such as n_init=10 so scikit-learn tries several starting positions and keeps the best one.
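The normalize call in this step is what lets ordinary (Euclidean) k-means behave like grouping by cosine similarity: for unit-length vectors, the squared distance between two points equals 2 minus twice their cosine similarity. The sketch below checks that identity in plain Python on two made-up vectors.

```python
import math


def unit(v: list[float]) -> list[float]:
    """Scale a vector to length 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors of any length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]  # arbitrary example vectors
ua, ub = unit(a), unit(b)
squared_distance = sum((x - y) ** 2 for x, y in zip(ua, ub))

# For unit vectors: squared Euclidean distance == 2 - 2 * cosine similarity.
print(math.isclose(squared_distance, 2 - 2 * cosine(a, b)))  # True
```

So on normalized vectors, "nearest centre" and "most similar direction" rank neighbours the same way.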
Print the value counts to sanity-check the sizes: a healthy result has groups of roughly comparable size. If one group swallows most of your keywords while the rest are tiny, that usually means n_clusters is too low for how varied your list is — raise it and re-run. Because the embeddings are already saved to disk, this clustering step is fast and free to repeat as many times as you like, so treat the group count as a dial to experiment with rather than a value you must get right on the first try.
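The "one group swallows most of your keywords" check can be turned into a number. A minimal sketch, using a hypothetical helper name and made-up assignments; where to draw the line (say, above half the list) is a judgment call rather than a rule.

```python
from collections import Counter


def largest_group_share(group_ids: list[int]) -> float:
    """Fraction of all keywords that sit in the single biggest group."""
    counts = Counter(group_ids)
    return max(counts.values()) / len(group_ids)


group_ids = [0, 0, 0, 0, 0, 0, 0, 1, 2, 2]  # made-up assignments
share = largest_group_share(group_ids)
print(f"{share:.0%} of keywords are in the largest group")  # 70% ...
```

With the DataFrame from this step, the call would be largest_group_share(df["group_id"].tolist()).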
Step 4: Label each group with an LLM
A group number is not actionable on its own. To make each group readable, send a sample of its keywords to a chat model and ask for a short label. Sampling a handful of keywords per group keeps the prompt small and the cost low while still giving the model enough to work with.
import json
def label_group(keywords: list[str]) -> str:
    """Ask the chat model for a short label describing one group."""
    sample = keywords[:25]  # the first 25 keywords are enough to name the group
    prompt = (
        "These keywords belong to one topic group. Reply with a JSON object "
        '{"label": "..."} where label is a concise 2-4 word name for the group.\n\n'
        + "\n".join(sample)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["label"]
labels = {}
# One chat call per group, not per keyword
for group_id, rows in df.groupby("group_id"):
    labels[group_id] = label_group(rows["keyword"].tolist())
df["group_label"] = df["group_id"].map(labels)
Setting temperature=0 keeps labels stable and literal, and response_format={"type": "json_object"} forces clean JSON so you do not have to parse loose text. The groupby("group_id") loop runs one cheap chat call per group, not one per keyword, so even a list split into thirty groups costs only thirty small requests. Capping the sample at 25 keywords keeps each prompt short while still giving the model a clear picture of the group — sending all of a large group's keywords would cost more without improving the label. If a response ever fails to parse, the Fix JSONDecodeError with AI API Responses in Python guide explains how to harden this. For tips on writing prompts that return tidy, structured output, see Write System Prompts that Control Output Format.
Step 5: Save the labelled result
Finally, write a CSV sorted by group so related keywords sit next to each other. This is the file you actually plan around.
df = df.sort_values(["group_id", "keyword"]).reset_index(drop=True)
df[["group_id", "group_label", "keyword"]].to_csv(
    "grouped_keywords.csv", index=False, encoding="utf-8-sig"
)
print(f"Wrote {len(df)} keywords in {df['group_id'].nunique()} groups")
Open grouped_keywords.csv and you will see every keyword tagged with a number and a name like "Beginner Yoga Poses" or "Home Espresso Machines." That is your raw list turned into a map you can build content against.
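For a quick overview without opening a spreadsheet, the file can be summarised with the standard library. The sketch below assumes the column names written in this step; group_sizes is a hypothetical helper name.

```python
import csv
from collections import Counter


def group_sizes(path: str = "grouped_keywords.csv") -> list[tuple[str, int]]:
    """Return (group_label, keyword_count) pairs, largest group first."""
    # utf-8-sig strips the byte-order mark that the export step wrote.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        counts = Counter(row["group_label"] for row in csv.DictReader(handle))
    return counts.most_common()
```

Looping over group_sizes() and printing each pair gives a one-screen list of topics ordered by how many keywords support them.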
Parameter quick reference
| Parameter | Typical value | Effect |
|---|---|---|
| EMBED_MODEL | text-embedding-3-small | Which model turns keywords into vectors. The -small model is cheap and accurate enough for grouping; text-embedding-3-large is pricier and slightly sharper. |
| n_clusters | round(sqrt(n)) | How many groups k-means creates. Raise it for finer, narrower groups; lower it for fewer, broader ones. |
| metric / normalize | unit-length (cosine) | Normalizing vectors makes distance reflect meaning rather than length, which is the right choice for text embeddings. |
Troubleshooting
- AuthenticationError on the first embedding call. The SDK could not find a valid OPENAI_API_KEY. Confirm the variable is exported in your current shell (echo $OPENAI_API_KEY) and that your .env is loaded. See Fix the 401 Unauthorized Error in OpenAI Python.
- ValueError: n_samples=... should be >= n_clusters. You asked for more groups than you have keywords. Lower n_clusters, or check that your CSV actually loaded rows: an empty or near-empty list triggers this.
- One group contains almost everything. Your keywords may be very similar, or n_clusters is too low. Raise n_clusters and re-run; the embeddings are saved, so only the fast clustering step repeats.
- Labels come back as messy text instead of JSON. Make sure you kept response_format={"type": "json_object"} and that the word "JSON" appears in the prompt. OpenAI requires both for guaranteed JSON output.
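The n_samples error is easier to diagnose if the script fails early with its own message. A small sketch of that guard, wrapping the same square-root guess used in Step 3; choose_n_clusters is a hypothetical helper name.

```python
import math


def choose_n_clusters(n_keywords: int) -> int:
    """Square-root starting guess, with a clear error for tiny or empty lists."""
    if n_keywords < 2:
        raise ValueError(
            f"Need at least two keywords to form groups, got {n_keywords}. "
            "Check that the CSV loaded rows."
        )
    # For two or more keywords this guess never exceeds the keyword count.
    return max(2, round(math.sqrt(n_keywords)))


print(choose_n_clusters(1000))  # 32
print(choose_n_clusters(3))     # 2
```

Using n_clusters = choose_n_clusters(len(df)) keeps the original behaviour for normal lists and replaces the scikit-learn error with a hint about the likely cause.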
When to use this vs. alternatives
- Use embeddings + k-means when you have a large, mixed keyword list and want groups based on meaning, including synonyms and phrasing the keywords do not literally share. This is the most robust option for messy real-world exports.
- Use simple word or stem matching when your list is small and the keywords are tidy variations of the same root, and you want zero API cost. It is fast but blind to synonyms, so "trainers" and "sneakers" stay apart.
- Use rule-based intent tagging when you care about buyer intent (informational vs. transactional) rather than topic. Grouping by meaning and tagging by intent answer different questions — the Python Script for Competitor Keyword Analysis guide does the intent side.
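To show what the word-matching alternative looks like, and why it misses synonyms, here is a deliberately crude sketch in plain Python. It groups keywords by the stem of their last word; crude_stem and group_by_last_word are hypothetical helpers, and a real stemmer would handle far more cases than stripping a trailing "s".

```python
from collections import defaultdict


def crude_stem(word: str) -> str:
    """Very rough stemmer: lowercase and strip a trailing "s" (illustration only)."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def group_by_last_word(keywords: list[str]) -> dict[str, list[str]]:
    """Group keywords by the crude stem of their final word."""
    groups: dict[str, list[str]] = defaultdict(list)
    for keyword in keywords:
        words = keyword.split()
        if words:
            groups[crude_stem(words[-1])].append(keyword)
    return dict(groups)


print(group_by_last_word(["running trainers", "best trainer", "white sneakers"]))
# {'trainer': ['running trainers', 'best trainer'], 'sneaker': ['white sneakers']}
```

The two "trainer" phrases land together at zero API cost, but "white sneakers" stays in its own group, which is the gap embeddings close.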
Once your keywords are grouped, a natural next step is writing the pages: feed each group's label into Generate Meta Descriptions in Bulk with Python to draft snippets at scale.
Related guides
- SEO Keyword Research with Python — the main guide this task sits under.
- Python Script for Competitor Keyword Analysis — produce the keyword list this guide groups.
- Generate Meta Descriptions in Bulk with Python — turn each group into ready-to-use snippets.
- Cleaning CSV Data with Pandas for AI — tidy a messy keyword export before embedding it.