Python Script for Competitor Keyword Analysis

This guide shows you how to build a Python script that pulls keywords from a competitor's pages, compares them to your own, finds the gaps you should fill, and labels each gap by search intent with an LLM — in about twenty minutes. "Search intent" just means why someone typed a phrase: to learn something, to compare options, or to buy. No SEO subscription required.

The idea is simple. Competitors have often already invested in keyword research, and the words they emphasise on their best pages are a free signal of what is likely to work in your niche. We will turn that signal into a tidy spreadsheet of opportunities.

The script comes in four small parts, and each one stands on its own so you can stop early or swap a piece out. First we gather keywords from competitor pages. Then we compare them with the keywords you already target. Next we filter down to the gaps — phrases they use and you do not. Finally, we ask an LLM to label each gap by intent so you know which ones to act on first. By the end you will have a keyword_gaps.csv file ready to drop into your content plan. This work sits inside the wider SEO Keyword Research with Python track, so once the basics click you can layer on grouping and bulk metadata generation.

Prerequisites

You only need a few things beyond a working Python setup. If Python is not installed yet, start with Create a Python Virtual Environment for AI, then come back here.

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install httpx beautifulsoup4 pandas openai python-dotenv

We use httpx to fetch pages, beautifulsoup4 to read the HTML, pandas to compare keyword sets, and the openai SDK to label intent. The python-dotenv package loads your API key from a file so it never ends up in your code.

Create a file named .env next to your script with one line:

OPENAI_API_KEY=sk-your-key-here

Add .env to your .gitignore immediately so your secret key is never committed to version control. A leaked key can run up real charges on your account.

echo ".env" >> .gitignore

If you have never made an LLM call before, Understanding LLM APIs walks through the basics, and if your key is rejected, Fix the 401 Unauthorized Error in OpenAI Python covers the usual causes.
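
Before making any LLM call, it can help to confirm the key actually reached your environment. The following is a minimal sketch, not part of the original script: the helper name api_key_problem is hypothetical, and it only checks for a missing key or the placeholder value from the .env example above.

```python
import os
from typing import Mapping, Optional


def api_key_problem(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a short description of the problem, or None if the key looks usable."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        return "OPENAI_API_KEY is not set; check that .env sits next to the script"
    if key == "sk-your-key-here":
        return "OPENAI_API_KEY still holds the placeholder value"
    return None


problem = api_key_problem()
print(problem or "API key found")
```

Call it after load_dotenv() so the values from .env are already loaded. It does not verify that the key is valid, only that something plausible is present.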

Step 1: Gather competitor keywords

First we fetch each competitor page and pull out candidate keywords. We read the page title, the meta description, and every heading (h1 through h3), because those are where a page declares the topics it most wants to rank for. Then we break that text into one-, two-, and three-word phrases (called n-grams) and count how often each appears.

We drop common filler words ("the", "and", "with") so they do not crowd out real keywords. Keeping the keyword-gathering logic in its own function means you can test it on a single URL before running the whole pipeline.

import re
import httpx
import pandas as pd
from bs4 import BeautifulSoup

STOPWORDS = {
    "the", "and", "for", "with", "you", "your", "our", "this", "that",
    "are", "from", "have", "has", "was", "can", "will", "how", "what",
    "http", "https", "www", "com", "org", "click", "read", "more", "all",
}


def gather_keywords(urls: list[str]) -> pd.DataFrame:
    """Fetch each URL and return a DataFrame of keyword + frequency."""
    headers = {"User-Agent": "Mozilla/5.0 (keyword-research-bot)"}
    grams: list[str] = []
    with httpx.Client(headers=headers, timeout=10.0, follow_redirects=True) as client:
        for url in urls:
            resp = client.get(url)
            resp.raise_for_status()
            soup = BeautifulSoup(resp.text, "html.parser")
            # get_text() always returns a string; .string can be None for an empty or nested <title>
            parts = [soup.title.get_text() if soup.title else ""]
            meta = soup.find("meta", attrs={"name": "description"})
            if meta and meta.get("content"):
                parts.append(meta["content"])
            parts += [h.get_text() for h in soup.find_all(["h1", "h2", "h3"])]
            text = re.sub(r"[^a-z\s]", " ", " ".join(parts).lower())
            tokens = [t for t in text.split() if t not in STOPWORDS and len(t) > 2]
            for n in (1, 2, 3):
                grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    df = pd.Series(grams).value_counts().reset_index()
    df.columns = ["keyword", "frequency"]
    return df[df["frequency"] >= 2].reset_index(drop=True)

Test it on one URL before going further:

competitor = gather_keywords(["https://example.com/blog/some-strong-post"])
print(competitor.head(15))

You should see a ranked table of phrases. If most rows look like junk, widen the STOPWORDS set or raise the frequency threshold from 2 to 3.
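
To see what the n-gram step does without fetching anything, here is a small offline sketch. It assumes a made-up heading and a shortened stopword list; the helper names ngram_counts and keep_frequent are hypothetical and only mirror the tokenising and frequency filter inside gather_keywords.

```python
import re
from collections import Counter

# Shortened stopword list for the example; the real script uses STOPWORDS.
SAMPLE_STOPWORDS = {"the", "and", "for", "with", "your"}


def ngram_counts(text: str, max_n: int = 3) -> Counter:
    """Count every 1- to max_n-word phrase in a piece of text."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = [t for t in cleaned.split() if t not in SAMPLE_STOPWORDS and len(t) > 2]
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def keep_frequent(counts: Counter, min_freq: int = 2) -> dict[str, int]:
    """Keep only phrases seen at least min_freq times."""
    return {phrase: n for phrase, n in counts.items() if n >= min_freq}


counts = ngram_counts("Email marketing tips: the best email marketing tools for your team")
frequent = keep_frequent(counts, min_freq=2)
print(frequent)
# {'email': 2, 'marketing': 2, 'email marketing': 2}
```

One thing this makes visible: stopwords are removed before the phrases are built, so words that were not adjacent on the page can end up joined (here "tips best" is counted once). The frequency threshold filters most of these out, which is one reason raising min_freq reduces noise.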

Step 2: Compare keyword sets with pandas

Now we load your own keywords — the phrases your pages already target — and compare them with the competitor list. Your keywords can come from a one-column CSV you export from your site, your analytics tool, or even a quick list you type by hand. The goal is a single DataFrame that shows, for every competitor phrase, whether you also use it.

def compare_keywords(competitor_df: pd.DataFrame, my_keywords: list[str]) -> pd.DataFrame:
    """Mark which competitor keywords you already cover."""
    mine = {k.strip().lower() for k in my_keywords}
    merged = competitor_df.copy()
    merged["i_cover_it"] = merged["keyword"].isin(mine)
    return merged.sort_values("frequency", ascending=False).reset_index(drop=True)

Load your keywords from a CSV with a single keyword column and run the comparison:

my_df = pd.read_csv("my_keywords.csv")
compared = compare_keywords(competitor, my_df["keyword"].tolist())
print(compared["i_cover_it"].value_counts())

The value_counts() line gives you a quick overview: how many competitor phrases you already cover versus how many you do not. Building a set from your keywords removes duplicates and normalises case and surrounding whitespace, so "Email Marketing " and "email marketing" count as the same phrase, and isin then checks membership quickly even with thousands of phrases. If you want to go deeper on pandas itself, Cleaning CSV Data with Pandas for AI covers the tidying steps that make merges like this reliable.
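
Here is the comparison on a tiny in-memory example, in case you want to check the behaviour before pointing it at real data. The keywords and frequencies are invented for illustration.

```python
import pandas as pd

# Made-up sample data for illustration only.
competitor_sample = pd.DataFrame(
    {
        "keyword": ["email marketing", "drip campaign", "newsletter ideas"],
        "frequency": [5, 3, 2],
    }
)
my_sample = ["Email Marketing ", "landing pages"]

# Normalise the same way compare_keywords does: trim whitespace, lowercase.
mine = {k.strip().lower() for k in my_sample}
competitor_sample["i_cover_it"] = competitor_sample["keyword"].isin(mine)
print(competitor_sample)
```

Only "email marketing" is marked as covered, even though your list spelled it with capitals and a trailing space. Note that the match is exact after normalising: "email marketing tips" would not count as covering "email marketing".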

Step 3: Find the gaps

A gap is any competitor phrase where i_cover_it is False. Those are the topics worth adding to your plan. We list the most frequent gaps first, because a phrase a competitor repeats across headings and pages is likely one they care about.

def find_gaps(compared_df: pd.DataFrame, top_n: int = 30) -> pd.DataFrame:
    """Return the competitor keywords you do not yet cover."""
    gaps = compared_df[~compared_df["i_cover_it"]].copy()
    gaps = gaps.drop(columns=["i_cover_it"])
    return gaps.head(top_n).reset_index(drop=True)


gaps = find_gaps(compared, top_n=30)
print(gaps)

The ~ operator means "not", so ~compared_df["i_cover_it"] selects the rows where you do not cover the phrase. Trimming to top_n keeps the output focused; thirty strong opportunities are far more useful than three hundred noisy ones. At this point you already have a usable result — a ranked list of topics to write about next.
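
If the ~ mask is new to you, this short sketch shows it on a hand-built table. The rows are invented, and the i_cover_it column is a plain boolean column, which is what ~ expects.

```python
import pandas as pd

# Invented rows standing in for the output of compare_keywords.
compared_sample = pd.DataFrame(
    {
        "keyword": ["email marketing", "drip campaign", "newsletter ideas", "welcome series"],
        "frequency": [5, 3, 2, 2],
        "i_cover_it": [True, False, False, False],
    }
)

not_covered = ~compared_sample["i_cover_it"]  # flips True/False row by row
top_gaps = (
    compared_sample[not_covered]
    .drop(columns=["i_cover_it"])
    .head(2)  # same idea as top_n=2
    .reset_index(drop=True)
)
print(top_gaps)
```

The result keeps "drip campaign" and "newsletter ideas", in the order they already had in the table, which is why the comparison step sorts by frequency first.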

Step 4: Label intent with an LLM

Raw gap phrases are more actionable once you know why people search for them. We send the whole gap list to an LLM in a single request and ask it to tag each phrase as informational, commercial, or transactional. Batching every keyword into one call keeps the run cheap and fast: one request instead of dozens. We ask for JSON so the answer is easy to parse, and we set temperature=0 so the labels stay as consistent as possible between runs (identical output is likely but not guaranteed).

import os
import json
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()  # reads OPENAI_API_KEY from the environment


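def label_intent(gaps_df: pd.DataFrame) -> pd.DataFrame:
    """Label every gap keyword with a search intent in one LLM request."""
    keywords = gaps_df["keyword"].tolist()
    prompt = (
        "For each keyword, label the search intent as one of: "
        "informational, commercial, or transactional. "
        "Copy each keyword exactly as given. "
        "Return JSON: {\"results\": [{\"keyword\": ..., \"intent\": ...}]}. "
        # json.dumps gives the model a clean JSON array instead of a Python list repr
        f"Keywords: {json.dumps(keywords)}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    data = json.loads(resp.choices[0].message.content)
    # Fixed columns keep the merge working even if "results" is missing or empty
    labels = pd.DataFrame(data.get("results", []), columns=["keyword", "intent"])
    # A keyword returned twice would otherwise duplicate rows in the merge
    labels = labels.drop_duplicates(subset="keyword")
    return gaps_df.merge(labels, on="keyword", how="left")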


labelled = label_intent(gaps)
labelled.to_csv("keyword_gaps.csv", index=False, encoding="utf-8-sig")
print(labelled)

We call load_dotenv() so the key from your .env file is available, then create the client with no arguments — the openai SDK picks up OPENAI_API_KEY automatically, which keeps the secret out of your code. The gpt-4o-mini model is inexpensive and more than capable for short labelling tasks. Saving with encoding="utf-8-sig" makes the CSV open cleanly in Excel.

Now sort by intent to plan your work: informational gaps usually become blog posts, while commercial and transactional gaps point to comparison or product pages.

print(labelled.sort_values("intent").groupby("intent")["keyword"].apply(list))
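
Model output is not guaranteed to follow the requested shape, so you may want to validate it before trusting the labels. The sketch below is one possible approach, not part of the original script: parse_intent_labels and ALLOWED_INTENTS are hypothetical names, and the raw string is a hand-written stand-in for a model response, so no API call is made.

```python
import json

ALLOWED_INTENTS = {"informational", "commercial", "transactional"}


def parse_intent_labels(raw: str, keywords: list[str]) -> dict[str, str]:
    """Map each requested keyword to a valid intent, or 'unknown'."""
    try:
        items = json.loads(raw).get("results", [])
    except (json.JSONDecodeError, AttributeError, TypeError):
        items = []  # not JSON, or not a JSON object
    if not isinstance(items, list):
        items = []
    found: dict[str, str] = {}
    for item in items:
        if not isinstance(item, dict):
            continue
        keyword = str(item.get("keyword", "")).strip().lower()
        intent = str(item.get("intent", "")).strip().lower()
        if intent in ALLOWED_INTENTS:
            found[keyword] = intent
    # Anything the model skipped or mislabelled falls back to "unknown".
    return {k: found.get(k.strip().lower(), "unknown") for k in keywords}


# Hand-written stand-in for a model response.
raw = (
    '{"results": ['
    '{"keyword": "Drip Campaign", "intent": "Informational"}, '
    '{"keyword": "buy email tool", "intent": "purchase"}]}'
)
labels = parse_intent_labels(raw, ["drip campaign", "buy email tool", "newsletter ideas"])
print(labels)
```

Here "drip campaign" is accepted despite the different capitalisation, while the invented label "purchase" and the missing keyword both come back as "unknown", so they are easy to spot and re-run.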

Putting it together: a runnable script

Here is the whole pipeline as one file you can save as competitor_gaps.py and run from the command line. It accepts competitor URLs and a CSV of your own keywords, then writes a labelled gap report. The argparse block turns it into a small command-line tool so you can re-run it on different competitors without editing code.

#!/usr/bin/env python3
"""Find and label keyword gaps against a competitor."""
import argparse
import pandas as pd
from dotenv import load_dotenv

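# Paste gather_keywords, compare_keywords, find_gaps and label_intent here,
# along with their imports (httpx, BeautifulSoup, re, OpenAI, json, os).

load_dotenv()
# label_intent relies on this module-level client; create it after load_dotenv()
# so OPENAI_API_KEY from .env is already in the environment.
client = OpenAI()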


def main() -> None:
    parser = argparse.ArgumentParser(description="Competitor keyword gap finder")
    parser.add_argument("--urls", nargs="+", required=True, help="Competitor page URLs")
    parser.add_argument("--mine", required=True, help="CSV with a 'keyword' column")
    parser.add_argument("--out", default="keyword_gaps.csv", help="Output CSV path")
    parser.add_argument("--top", type=int, default=30, help="How many gaps to keep")
    args = parser.parse_args()

    competitor = gather_keywords(args.urls)
    my_keywords = pd.read_csv(args.mine)["keyword"].tolist()
    compared = compare_keywords(competitor, my_keywords)
    gaps = find_gaps(compared, top_n=args.top)
    labelled = label_intent(gaps)
    labelled.to_csv(args.out, index=False, encoding="utf-8-sig")
    print(f"Wrote {len(labelled)} keyword gaps to {args.out}")


if __name__ == "__main__":
    main()

Run it like this, passing one or more competitor URLs:

python competitor_gaps.py \
  --urls https://competitor.com/blog/post-a https://competitor.com/blog/post-b \
  --mine my_keywords.csv \
  --out keyword_gaps.csv

The script prints a confirmation line and leaves a CSV you can open in any spreadsheet. Because each stage is its own function, you can swap out a single piece — for example, replace gather_keywords with a reader that pulls phrases from an export instead of live pages — without touching the rest of the pipeline.
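
As an illustration of that swap, here is a rough sketch of a reader that builds the same keyword and frequency table from an export instead of live pages. Everything about the export is an assumption: the function name gather_keywords_from_export is hypothetical, and the file is assumed to be a CSV with a keyword column.

```python
import io

import pandas as pd


def gather_keywords_from_export(source, column: str = "keyword") -> pd.DataFrame:
    """Build a keyword + frequency table from a CSV path or open file."""
    phrases = pd.read_csv(source)[column].dropna().astype(str).str.strip().str.lower()
    phrases = phrases[phrases != ""]
    df = phrases.value_counts().reset_index()
    df.columns = ["keyword", "frequency"]
    return df


# Stand-in for a real export file; pass a file path instead in practice.
sample_export = io.StringIO("keyword\nDrip Campaign\ndrip campaign \nnewsletter ideas\n")
export_df = gather_keywords_from_export(sample_export)
print(export_df)
```

Because it returns the same two columns, compare_keywords and find_gaps can consume it unchanged. Unlike gather_keywords, this sketch does not drop phrases that appear only once, since an export usually lists each phrase a single time.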

Key parameters quick reference

Parameter | Where | Default | Effect
frequency >= 2 | gather_keywords | 2 | Minimum times a phrase must appear to be kept. Raise it to cut noise on large pages.
top_n | find_gaps | 30 | How many top gaps to keep. Lower it for a tighter content plan.
model | label_intent | gpt-4o-mini | The LLM used for labelling. Cheap and accurate for short phrases.
temperature | label_intent | 0 | Controls randomness. Keep at 0 so a keyword gets the same label on most runs.

Troubleshooting

  1. httpx.HTTPStatusError: 403 Forbidden — the competitor's server blocked the request. Cause: a missing or default user agent. Fix: keep the custom User-Agent header shown above, and slow down by fetching only a few pages at a time.
  2. Empty or junk keyword table — the page loaded its content with JavaScript, so the raw HTML has little text. Cause: client-rendered pages. Fix: pick article or blog URLs that render text server-side, or target the competitor's older static pages.
  3. json.decoder.JSONDecodeError after the LLM call — the model returned text that is not valid JSON. Cause: the response_format line was removed or the model wandered off format. Fix: keep response_format={"type": "json_object"} and temperature=0. See Fix JSONDecodeError with AI API Responses in Python.
  4. RateLimitError (429) on the LLM call — too many requests in a short window. Cause: looping a separate call per keyword instead of batching. Fix: send all keywords in one request as shown, and if it persists, read Fix the 429 Rate-Limit Error in Python.
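
For transient failures such as an occasional 429, one common pattern is to retry with a growing delay. The helper below is a generic sketch under that assumption; with_retries is a hypothetical name, not something the openai or httpx libraries provide, and the flaky function only simulates a call that fails twice.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (Exception,),
) -> T:
    """Run call(); on a listed error, wait base_delay * 2**attempt and try again."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Simulated call that fails twice, then succeeds.
state = {"calls": 0}


def flaky() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = with_retries(flaky, attempts=3, base_delay=0.0)
print(result, state["calls"])
```

In the pipeline you could wrap the labelling step, for example with_retries(lambda: label_intent(gaps), retry_on=(openai.RateLimitError,)) after importing openai. Narrowing retry_on matters: retrying a 401 or a malformed request just repeats the same failure.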

When to use this vs. alternatives

  • Use this Python script when you want a free, repeatable gap analysis you can run on any set of competitor pages and pipe straight into a content plan. It is ideal for small teams without a paid SEO seat.
  • Use a paid SEO tool (Ahrefs, Semrush) when you specifically need search-volume and ranking-difficulty numbers. This script tells you which topics a competitor emphasises, not how many people search them each month — combine the two for the full picture.
  • Group the results instead of listing them when you have hundreds of gaps and want themes rather than a flat list. Feed your gap CSV into Group Keywords with Python and Embeddings, which uses clustering to bundle related phrases into topics automatically.

Once you have your gap list, the natural next step is to turn each opportunity into a page — and to write the snippets that get it clicked with Generate Meta Descriptions in Bulk with Python.
