What is a keyword embedding?

An embedding is a list of numbers that captures the meaning of a phrase. Keywords with similar meaning get similar number lists, so you can measure how close two keywords are even when they share no words.

Do I need a paid OpenAI account to group keywords this way?

Yes, embeddings are billed per token, but the cost is tiny. The text-embedding-3-small model prices a few thousand keywords at well under one US dollar, so a single run is effectively pocket change.

How many groups should I ask k-means for?

Start with roughly the square root of your keyword count, then adjust. If groups feel too broad, raise the number; if you see near-duplicate groups, lower it. There is no single correct value.

Why use embeddings instead of just matching shared words?

Word matching misses synonyms and intent. Embeddings place 'cheap running shoes' and 'affordable trainers' close together even though they share no words, which word matching cannot do.

Can I run this without an internet connection?

The embedding step needs the OpenAI API, so it requires internet. Once you have saved the embeddings to disk, the k-means clustering itself runs fully offline on your machine.

Group Keywords with Python and Embeddings

This guide shows you how to turn a flat list of keywords into named topic groups in under fifteen minutes, using OpenAI embeddings to measure meaning, k-means clustering to gather similar keywords, and an LLM to give each group a readable label.

If you have ever exported a thousand keywords from a research tool and stared at a wall of phrases, this is the fix. Instead of sorting by hand, you let the maths group keywords that mean the same thing — even when they share no words — and then ask a model to name what each group is about. The result is a CSV where every keyword carries a group number and a plain-English label you can plan content around.

This is a focused task page under the main SEO Keyword Research with Python guide. If you have not pulled a keyword list yet, the Python Script for Competitor Keyword Analysis guide produces exactly the kind of CSV this one expects as input.

Prerequisites

You need Python 3.10 or newer and an OpenAI API key. If you are setting Python up for the first time, follow Create a Python Virtual Environment for AI first, then come back here.

Create a project folder, activate a virtual environment, and install the four libraries this guide uses:

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install pandas numpy scikit-learn openai

Store your API key in a .env file so it never ends up in your code:

OPENAI_API_KEY=sk-your-key-here

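Add .env to your .gitignore immediately so you never commit the key by accident. The OpenAI SDK reads OPENAI_API_KEY from the environment automatically, so as long as the variable is set you do not have to pass the key in code. The SDK does not read the .env file itself, though: either export the variable in your shell or load the file at the top of your script, for example with load_dotenv() from the python-dotenv package (a separate pip install python-dotenv). If you hit an authentication wall, the Fix the 401 Unauthorized Error in OpenAI Python guide covers the usual causes.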
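
If you would rather not add a dependency just to read the file, the loading step can be sketched with the standard library alone. The helper name load_env_file is made up for this illustration, and it only understands plain KEY=value lines, so treat it as a minimal stand-in and not a full .env parser:

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> None:
    """Hypothetical helper: copy KEY=value lines from a file into os.environ."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values already set in the shell win over the file
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


load_env_file()
print("Key loaded:", "OPENAI_API_KEY" in os.environ)
```

Call it once before you create the OpenAI client, so the key is in the environment by the time the SDK looks for it.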

You also need a keyword file. For this guide, assume a keywords.csv with a single keyword column, one phrase per row.

Step 1: Load your keywords into a DataFrame

Start by reading the CSV into pandas. A DataFrame is just a table in memory — think of a spreadsheet you can manipulate in code. Strip whitespace and drop blanks and duplicates so they do not waste embedding calls.

import pandas as pd

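df = pd.read_csv("keywords.csv")
# Drop missing cells first; astype(str) would otherwise turn them into the text "nan"
df = df.dropna(subset=["keyword"])
df["keyword"] = df["keyword"].astype(str).str.strip()
df = df[df["keyword"] != ""].drop_duplicates("keyword").reset_index(drop=True)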

print(f"Loaded {len(df)} unique keywords")

Keeping the list clean here matters because every keyword you send becomes a billed embedding, and duplicates only inflate the cost without changing the result. The reset_index(drop=True) call renumbers the rows from zero after the filtering, which keeps the DataFrame aligned with the embedding array you build in the next step — if the row order and the vector order ever drift apart, every group label will be wrong, so it is worth getting right now. If your raw export is messy, the Cleaning CSV Data with Pandas for AI guide goes deeper on the tidy-up step, including how to handle stray casing and near-duplicate phrases.

Step 2: Create embeddings with the OpenAI SDK

An embedding turns each keyword into a list of numbers (a vector) that captures its meaning. Two keywords that mean similar things get similar vectors, which is what makes grouping by meaning possible.
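
To make "similar vectors" concrete, here is a small sketch that compares vectors with cosine similarity, a common way to score how closely two vectors point in the same direction. The three-number vectors are invented for illustration; they are not real model output, and real embeddings are far longer:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 3-number vectors, invented for illustration (real ones have 1,536 numbers)
toy = {
    "cheap running shoes": np.array([0.9, 0.1, 0.2]),
    "affordable trainers": np.array([0.8, 0.2, 0.3]),
    "espresso machine": np.array([0.1, 0.9, 0.1]),
}

shoes_vs_trainers = cosine_similarity(toy["cheap running shoes"], toy["affordable trainers"])
shoes_vs_espresso = cosine_similarity(toy["cheap running shoes"], toy["espresso machine"])
print(f"shoes vs trainers: {shoes_vs_trainers:.2f}")
print(f"shoes vs espresso: {shoes_vs_espresso:.2f}")
```

The two shoe phrases score much higher against each other than either does against the espresso phrase, which is the property the grouping step relies on.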

Send the keywords in batches rather than one request per keyword. The embedding endpoint accepts many inputs at once, which is far faster and uses far fewer requests against your rate limit; the bill itself depends on tokens, so batching does not change the cost. The code below batches in chunks of 256 and stacks the results into a single NumPy array, where each row is one keyword's vector.

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
EMBED_MODEL = "text-embedding-3-small"


def embed_keywords(keywords: list[str], batch_size: int = 256) -> np.ndarray:
    vectors = []
    for start in range(0, len(keywords), batch_size):
        batch = keywords[start : start + batch_size]
        response = client.embeddings.create(model=EMBED_MODEL, input=batch)
        vectors.extend(item.embedding for item in response.data)
    return np.array(vectors, dtype=np.float32)


embeddings = embed_keywords(df["keyword"].tolist())
print(f"Embedded into a {embeddings.shape} matrix")

The text-embedding-3-small model returns 1,536 numbers per keyword, so a 1,000-keyword list becomes a 1,000-by-1,536 matrix. That is the input k-means needs. Notice that the order of the returned vectors matches the order you sent the keywords in, which is why the cleaning step preserved row order — response.data comes back in the same sequence as your batch. Storing the array as float32 rather than the default float64 halves the memory it uses with no meaningful loss of accuracy for this task. If you see a rate-limit error on a very large list, the Fix the 429 Rate-Limit Error in Python guide shows how to add backoff between batches.
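
As a rough idea of what backoff between batches can look like, the sketch below retries a call with exponentially growing pauses. The helper name call_with_backoff and its defaults are assumptions made for this example, and it catches every exception to stay self-contained; in a real script you would narrow that to the SDK's rate-limit error:

```python
import random
import time


def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """Hypothetical helper: retry fn() with exponentially growing pauses."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # in real code, catch only the SDK's rate-limit error
            if attempt == max_retries - 1:
                raise
            # 1s, 2s, 4s, ... plus a little jitter so retries do not align
            time.sleep(base_delay * 2**attempt + random.uniform(0, 0.1))
```

Inside embed_keywords, you could wrap the request as call_with_backoff(lambda: client.embeddings.create(model=EMBED_MODEL, input=batch)) and leave the rest of the loop unchanged.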

Save the matrix to disk so you never have to pay for the same embeddings twice:

np.save("embeddings.npy", embeddings)
# Reload later with: embeddings = np.load("embeddings.npy")
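
One caveat with a saved matrix is that it only stays valid while the keyword list is unchanged, since rows and vectors must line up. A possible way to guard against that, sketched here under assumptions, is to store the keywords next to the vectors and re-embed when they differ. The helper load_or_embed and the embeddings.npz file name are hypothetical, and embed_fn stands for any function that maps a list of keywords to an array, such as embed_keywords above:

```python
from pathlib import Path

import numpy as np


def load_or_embed(keywords: list[str], embed_fn, path: str = "embeddings.npz") -> np.ndarray:
    """Hypothetical helper: reuse saved vectors only if the keyword list is unchanged."""
    cache = Path(path)
    if cache.exists():
        with np.load(str(cache)) as saved:
            # Same keywords in the same order means the rows still line up
            if saved["keywords"].tolist() == keywords:
                return saved["vectors"]
    vectors = embed_fn(keywords)
    np.savez(str(cache), keywords=np.array(keywords), vectors=vectors)
    return vectors
```

With this in place, embeddings = load_or_embed(df["keyword"].tolist(), embed_keywords) would call the API only on the first run or after the list changes.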

Step 3: Group the embeddings with k-means

K-means clustering is a classic algorithm that sorts points into a fixed number of groups, where each group is built around its own centre and every point joins the nearest centre. Here the "points" are your keyword vectors, so keywords with similar meaning land in the same group.

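You have to tell k-means how many groups to find. A solid starting guess is the square root of your keyword count. The code below normalizes the vectors first, scaling each to unit length, so that distance reflects direction (meaning) rather than magnitude, which works well for text embeddings. OpenAI embeddings already arrive at unit length, so here the step is mainly a safeguard that keeps the code correct if you swap in another embedding source.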

import math
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

n_clusters = max(2, round(math.sqrt(len(df))))
normalized = normalize(embeddings)  # unit-length vectors, cosine-friendly

kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init="auto")
df["group_id"] = kmeans.fit_predict(normalized)

print(df["group_id"].value_counts().sort_index())

After fit_predict, every row in df has a group_id, an integer from 0 to n_clusters - 1. Keywords sharing a number belong together. The random_state=42 argument makes the result reproducible, so re-running the script gives the same groups; change the number or remove it if you want to see how stable the grouping is across runs. The n_init="auto" setting lets scikit-learn pick the number of restarts from the initialization method: with the default k-means++ initialization that is a single run. If you want several starting positions tried and the best one kept, which guards against k-means settling on a poor arrangement by chance, pass an integer such as n_init=10.

Print the value counts to sanity-check the sizes: a healthy result has groups of roughly comparable size. If one group swallows most of your keywords while the rest are tiny, that usually means n_clusters is too low for how varied your list is — raise it and re-run. Because the embeddings are already saved to disk, this clustering step is fast and free to repeat as many times as you like, so treat the group count as a dial to experiment with rather than a value you must get right on the first try.
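
The size check can also be automated. The sketch below finds the largest group and the share of keywords it holds; the function name largest_group_share and the 0.5 threshold are arbitrary choices for this example, not recommended values:

```python
from collections import Counter


def largest_group_share(group_ids: list[int]) -> tuple[int, float]:
    """Return the biggest group's id and the fraction of keywords it holds."""
    group_id, size = Counter(group_ids).most_common(1)[0]
    return group_id, size / len(group_ids)


# Toy assignments: group 0 holds 7 of 10 keywords
example_ids = [0, 0, 0, 0, 0, 0, 0, 1, 2, 2]
group_id, share = largest_group_share(example_ids)
if share > 0.5:  # 0.5 is an arbitrary threshold chosen for this sketch
    print(f"Group {group_id} holds {share:.0%} of keywords; consider raising n_clusters")
```

On your own data you would pass df["group_id"].tolist() in place of the toy list.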

Step 4: Label each group with an LLM

A group number is not actionable on its own. To make each group readable, send a sample of its keywords to a chat model and ask for a short label. Sampling a handful of keywords per group keeps the prompt small and the cost low while still giving the model enough to work with.

import json


def label_group(keywords: list[str]) -> str:
    sample = keywords[:25]
    prompt = (
        "These keywords belong to one topic group. Reply with a JSON object "
        '{"label": "..."} where label is a concise 2-4 word name for the group.\n\n'
        + "\n".join(sample)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["label"]


labels = {}
for group_id, rows in df.groupby("group_id"):
    labels[group_id] = label_group(rows["keyword"].tolist())

df["group_label"] = df["group_id"].map(labels)

Setting temperature=0 keeps labels stable and literal, and response_format={"type": "json_object"} forces clean JSON so you do not have to parse loose text. The groupby("group_id") loop runs one cheap chat call per group, not one per keyword, so even a list split into thirty groups costs only thirty small requests. Capping the sample at 25 keywords keeps each prompt short while still giving the model a clear picture of the group — sending all of a large group's keywords would cost more without improving the label. If a response ever fails to parse, the Fix JSONDecodeError with AI API Responses in Python guide explains how to harden this. For tips on writing prompts that return tidy, structured output, see Write System Prompts that Control Output Format.
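
If you want the labelling loop to survive an odd reply instead of crashing, one option is to wrap the parsing in a small function with a fallback. This is a sketch under assumptions: parse_label and the "Unlabelled" default are names invented here, not part of the OpenAI SDK:

```python
import json
from typing import Optional


def parse_label(raw: Optional[str], fallback: str = "Unlabelled") -> str:
    """Hypothetical helper: pull "label" out of a JSON reply, or fall back."""
    try:
        label = json.loads(raw or "")["label"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return fallback
    # Guard against an empty or non-string label as well
    return label.strip() if isinstance(label, str) and label.strip() else fallback


print(parse_label('{"label": "Beginner Yoga Poses"}'))
print(parse_label("Sure! Here is your label: Yoga"))
```

In label_group, the last line would then become return parse_label(response.choices[0].message.content), so one bad reply marks a single group as unlabelled without stopping the run.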

Step 5: Save the labelled result

Finally, write a CSV sorted by group so related keywords sit next to each other. This is the file you actually plan around.

df = df.sort_values(["group_id", "keyword"]).reset_index(drop=True)
df[["group_id", "group_label", "keyword"]].to_csv(
    "grouped_keywords.csv", index=False, encoding="utf-8-sig"
)
print(f"Wrote {len(df)} keywords in {df['group_id'].nunique()} groups")

Open grouped_keywords.csv and you will see every keyword tagged with a number and a name like "Beginner Yoga Poses" or "Home Espresso Machines." That is your raw list turned into a map you can build content against.

Parameter quick reference

Parameter	Typical value	Effect
`EMBED_MODEL`	`text-embedding-3-small`	Which model turns keywords into vectors. The `-small` model is cheap and accurate enough for grouping; `text-embedding-3-large` is pricier and slightly sharper.
`n_clusters`	`round(sqrt(n))`	How many groups k-means creates. Raise it for finer, narrower groups; lower it for fewer, broader ones.
metric / `normalize`	unit-length (cosine)	Normalizing vectors makes distance reflect meaning rather than length, which is the right choice for text embeddings.
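
The reason normalizing makes k-means "cosine-friendly" can be checked numerically. K-means works with squared Euclidean distance, and for unit-length vectors that distance equals 2 minus twice the cosine similarity, so the two measures rank pairs the same way. The sketch below uses random stand-in vectors, not real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two random stand-in vectors, scaled to unit length like normalize() does
a, b = rng.normal(size=(2, 1536))
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = float(a @ b)
squared_distance = float(np.sum((a - b) ** 2))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.isclose(squared_distance, 2 - 2 * cosine))
```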

Troubleshooting

AuthenticationError on the first embedding call. The SDK could not find a valid OPENAI_API_KEY. Confirm the variable is exported in your current shell (echo $OPENAI_API_KEY) and that your .env is loaded. See Fix the 401 Unauthorized Error in OpenAI Python.
ValueError: n_samples=... should be >= n_clusters. You asked for more groups than you have keywords. Lower n_clusters, or check that your CSV actually loaded rows — an empty or near-empty list triggers this.
One group contains almost everything. Your keywords may be very similar, or n_clusters is too low. Raise n_clusters and re-run; the embeddings are saved, so only the fast clustering step repeats.
Labels come back as messy text instead of JSON. Make sure you kept response_format={"type": "json_object"} and that the word "JSON" appears in the prompt — OpenAI requires both for guaranteed JSON output.
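
The n_clusters error in particular can be caught before it reaches scikit-learn. A minimal sketch, assuming a helper named safe_n_clusters (invented for this example), applies the square-root starting guess and never returns more groups than there are keywords:

```python
import math
from typing import Optional


def safe_n_clusters(n_keywords: int, requested: Optional[int] = None) -> int:
    """Square-root starting guess by default, capped at the keyword count."""
    if n_keywords < 2:
        raise ValueError("Need at least two keywords to form groups")
    if requested is None:
        requested = max(2, round(math.sqrt(n_keywords)))
    return min(requested, n_keywords)


print(safe_n_clusters(1000))
print(safe_n_clusters(10, requested=50))
```

Failing early with a plain message makes an empty or near-empty CSV easier to spot than the scikit-learn error does.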

When to use this vs. alternatives

Use embeddings + k-means when you have a large, mixed keyword list and want groups based on meaning, so that synonyms and paraphrases land together even when they share no words. This is the most robust option for messy real-world exports.
Use simple word or stem matching when your list is small and the keywords are tidy variations of the same root, and you want zero API cost. It is fast but blind to synonyms, so "trainers" and "sneakers" stay apart.
Use rule-based intent tagging when you care about buyer intent (informational vs. transactional) rather than topic. Grouping by meaning and tagging by intent answer different questions — the Python Script for Competitor Keyword Analysis guide does the intent side.
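
For comparison, the word-matching alternative can be sketched in a few lines. This version scores two phrases by Jaccard overlap of their words, which is one simple choice among several; the function name word_overlap is made up for the example:

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap: shared words divided by all distinct words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


print(word_overlap("cheap running shoes", "running shoes for women"))
print(word_overlap("cheap running shoes", "affordable trainers"))
```

The second pair scores zero because the phrases share no words, which is exactly the blind spot that embeddings cover.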

Once your keywords are grouped, a natural next step is writing the pages: feed each group's label into Generate Meta Descriptions in Bulk with Python to draft snippets at scale.

SEO Keyword Research with Python — the main guide this task sits under.
Python Script for Competitor Keyword Analysis — produce the keyword list this guide groups.
Generate Meta Descriptions in Bulk with Python — turn each group into ready-to-use snippets.
Cleaning CSV Data with Pandas for AI — tidy a messy keyword export before embedding it.
