Python Script for Competitor Keyword Analysis: Exact Execution Guide

This script extracts, deduplicates, and classifies competitor keywords using Python and a single batched LLM prompt, and is designed to finish in under 60 seconds for a small set of URLs. It runs directly from the CLI and writes a structured CSV (keyword, frequency, intent, confidence score) ready for content strategy.

Environment Setup & Core Dependencies

Run these commands to isolate dependencies and install the required stack. Python 3.9+ is required (3.9 introduced the standard-library zoneinfo module and built-in generic type hints such as list[str]).

python -m venv competitor_kw_env
source competitor_kw_env/bin/activate # Windows: competitor_kw_env\Scripts\activate
pip install requests beautifulsoup4 pandas openai spacy
python -m spacy download en_core_web_sm

The data structuring follows the approach described in SEO Keyword Research with Python: requests are retried with backoff to reduce throttling, and results are kept in a pandas DataFrame throughout.

Pipeline Architecture & Function Breakdown

The architecture follows a strict 3-phase modular design:

  1. fetch_html(urls: list) -> list: HTTP retrieval with retry/backoff.
  2. extract_keywords(html_docs: list) -> pd.DataFrame: DOM parsing, n-gram generation, and deduplication.
  3. classify_intent(df: pd.DataFrame, api_key: str) -> pd.DataFrame: Batch LLM prompting with JSON-mode output and a rule-based fallback.

Each function maintains pure input/output contracts to enable isolated testing and pipeline swapping.
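Because each phase depends only on its inputs, the chain can be exercised end to end with stand-ins before any network or API call is made. The sketch below is illustrative only: run_pipeline and the three fake_* functions are hypothetical names, not part of the script, and the stubs return plain lists instead of DataFrames.

```python
def run_pipeline(urls, fetch, extract, classify):
    """Chain the three phases; each phase is passed in so it can be swapped."""
    docs = fetch(urls)
    keywords = extract(docs)
    return classify(keywords)

# Hypothetical stand-ins that mimic each phase's input/output shape.
def fake_fetch(urls):
    return [f"pricing guide for {u}" for u in urls]

def fake_extract(docs):
    return sorted({word for doc in docs for word in doc.split()})

def fake_classify(keywords):
    return [{"keyword": k, "intent": "informational"} for k in keywords]

result = run_pipeline(["a.example"], fake_fetch, fake_extract, fake_classify)
print([row["keyword"] for row in result])
# ['a.example', 'for', 'guide', 'pricing']
```

Swapping fake_fetch for the real fetch_html (or the reverse) changes nothing else in the chain, which is the point of keeping the contracts pure.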

Phase 1: Fetch & Sanitize Competitor HTML

Implements requests with custom headers, exponential backoff on 403/429, and targeted BS4 extraction.

import re
import time

import requests
from bs4 import BeautifulSoup

def fetch_html(urls, retries=3, base_delay=1.0):
    """Return one whitespace-normalized text blob (title, meta description, h1-h3) per URL."""
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    docs = []
    for url in urls:
        for attempt in range(retries):
            try:
                res = requests.get(url, headers=headers, timeout=10)
                res.raise_for_status()
            except requests.exceptions.HTTPError:
                # Back off and retry only on 403/429; any other HTTP error is fatal.
                if res.status_code not in (403, 429):
                    raise
                time.sleep(base_delay * (2 ** attempt))
                continue
            soup = BeautifulSoup(res.text, "html.parser")
            for tag in soup(["script", "style"]):
                tag.decompose()
            # Guard against pages with no <title> or no meta description.
            title = soup.title.get_text() if soup.title else ""
            meta = soup.find("meta", attrs={"name": "description"})
            description = meta.get("content", "") if meta else ""
            headings = [h.get_text() for h in soup.find_all(["h1", "h2", "h3"])]
            text = " ".join([title, description, *headings])
            docs.append(re.sub(r"\s+", " ", text).strip())
            break
        # A URL that fails every attempt is skipped, so docs may be shorter than urls.
    return docs
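To make the retry cost concrete, the delay schedule used above (base_delay * 2 ** attempt) can be computed on its own. This is a small sketch; backoff_delays is a hypothetical helper that does not exist in the script.

```python
def backoff_delays(retries=3, base_delay=1.0):
    """Sleep durations in seconds for attempts 0..retries-1."""
    return [base_delay * (2 ** attempt) for attempt in range(retries)]

delays = backoff_delays()
print(delays)       # [1.0, 2.0, 4.0]
print(sum(delays))  # 7.0
```

With the defaults, a URL that keeps answering 403 or 429 adds up to 7 seconds of sleep on top of request time, which matters when the URL list grows.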

Phase 2: Extract & Deduplicate Keywords

Uses spaCy tokenization and lemmatization, followed by pandas frequency aggregation. Keeps 1- to 3-grams that occur at least twice (the minimum count of 2 is hard-coded in the final filter).

import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")
STOPWORDS = set(nlp.Defaults.stop_words) | {"http", "www", "com", "org", "click", "read"}

def extract_keywords(docs):
    """Count lemmatized 1- to 3-grams across all docs; keep those seen at least twice."""
    grams = []
    for doc in docs:
        # Lemmatize and lowercase, dropping non-alphabetic tokens and stopwords.
        tokens = [t.lemma_.lower() for t in nlp(doc) if t.is_alpha and t.lemma_.lower() not in STOPWORDS]
        for n in range(1, 4):
            grams.extend([" ".join(tokens[i:i+n]) for i in range(len(tokens)-n+1)])
    # value_counts() deduplicates the n-grams and counts occurrences in one step.
    df = pd.Series(grams).value_counts().reset_index()
    df.columns = ["keyword", "frequency"]
    return df[df["frequency"] >= 2].reset_index(drop=True)
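The n-gram window and the frequency cutoff are easier to follow on a tiny input. The sketch below reproduces only that counting step with the standard library; ngram_counts is a hypothetical name, and the token list is assumed to be already lemmatized and stopword-filtered.

```python
from collections import Counter

def ngram_counts(tokens, max_n=3, min_count=2):
    """Count every 1..max_n-gram and keep those seen at least min_count times."""
    counts = Counter(
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )
    return {gram: c for gram, c in counts.items() if c >= min_count}

tokens = ["keyword", "research", "tool", "keyword", "research", "guide"]
print(ngram_counts(tokens))
# {'keyword': 2, 'research': 2, 'keyword research': 2}
```

Note that n-grams are built after stopwords are removed, so a phrase such as "tool for research" would be counted as "tool research".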

Phase 3: AI Intent Classification

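Batches all keywords into one OpenAI call using JSON mode, which guarantees syntactically valid JSON but not a particular schema, so the prompt spells out the expected object shape. Falls back to rule-based scoring on API or parsing failure.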

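import json

import pandas as pd
from openai import OpenAI

def classify_intent(df, api_key):
    """Return a DataFrame with keyword, intent, and confidence columns."""
    prompt = (
        "Classify each keyword as informational, commercial, or transactional. "
        'Return ONLY a JSON object shaped like {"keywords": [{"keyword": "...", '
        '"intent": "...", "confidence": 0.0}]} with confidence between 0 and 1. '
        f"Keywords: {json.dumps(df['keyword'].tolist())}"
    )
    try:
        client = OpenAI(api_key=api_key)
        res = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            # JSON mode returns a single object, never a bare array.
            response_format={"type": "json_object"},
        )
        data = json.loads(res.choices[0].message.content)
        return pd.DataFrame(data["keywords"])[["keyword", "intent", "confidence"]]
    except Exception:
        # Rule-based fallback; 0.7 is a fixed placeholder, not a measured confidence.
        out = df[["keyword"]].copy()
        out["intent"] = out["keyword"].apply(lambda x: "transactional" if any(w in x for w in ["buy", "price", "deal", "cost"]) else "informational")
        out["confidence"] = 0.7
        return out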
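Model replies can be malformed or contain labels outside the three expected intents, so it can help to validate the reply before building a DataFrame. The following is a minimal sketch under stated assumptions: parse_intents is a hypothetical helper, and the reply is assumed to be a JSON object with a "keywords" array of {"keyword", "intent"} items.

```python
import json

ALLOWED = {"informational", "commercial", "transactional"}

def parse_intents(raw, keywords):
    """Map each keyword to a valid intent, or None when missing or invalid."""
    try:
        items = json.loads(raw).get("keywords")
    except (json.JSONDecodeError, AttributeError):
        items = None  # not JSON, or JSON that is not an object
    if not isinstance(items, list):
        items = []
    found = {i.get("keyword"): i.get("intent") for i in items if isinstance(i, dict)}
    return {k: (found[k] if found.get(k) in ALLOWED else None) for k in keywords}

raw = ('{"keywords": [{"keyword": "buy crm", "intent": "transactional"},'
       ' {"keyword": "crm", "intent": "navigational"}]}')
print(parse_intents(raw, ["buy crm", "crm", "crm pricing"]))
# {'buy crm': 'transactional', 'crm': None, 'crm pricing': None}
```

Keywords that come back as None can then be routed to the rule-based fallback instead of failing the whole batch.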

Complete Executable Script

Consolidate all phases into a single, runnable Python file. Save as competitor_kw.py.

#!/usr/bin/env python3
import argparse
import json
import re
import time

import openai
import pandas as pd
import requests
import spacy
from bs4 import BeautifulSoup

# Paste the Phase 1, 2, and 3 code here, including each block's import lines
# and the nlp / STOPWORDS definitions from Phase 2.

def main():
    parser = argparse.ArgumentParser(description="Competitor Keyword Extractor")
    parser.add_argument("--urls", nargs="+", required=True)
    parser.add_argument("--output", default="competitor_keywords.csv")
    parser.add_argument("--openai_key", required=True)
    args = parser.parse_args()

    html_docs = fetch_html(args.urls)
    kw_df = extract_keywords(html_docs)
    intent_df = classify_intent(kw_df, args.openai_key)

    # Select columns explicitly so the merge cannot produce duplicate or suffixed names.
    labels = intent_df[["keyword", "intent", "confidence"]].drop_duplicates("keyword")
    final = kw_df[["keyword", "frequency"]].merge(labels, on="keyword", how="left")
    final = final.rename(columns={"confidence": "confidence_score"})
    final["source_url"] = ", ".join(args.urls)
    final.to_csv(args.output, index=False, encoding="utf-8-sig")
    print(f"Exported {len(final)} keywords to {args.output}")

if __name__ == "__main__":
    main()

Execute via: python competitor_kw.py --urls https://competitor1.com --output results.csv --openai_key sk-...

Validation & Output Formatting

Verify output integrity before downstream use.

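import pandas as pd

# Point this at the file passed to --output (competitor_keywords.csv by default).
df = pd.read_csv("competitor_keywords.csv", encoding="utf-8-sig")
assert df["keyword"].notna().all(), "Missing keywords"
assert df["intent"].isin(["informational", "commercial", "transactional"]).all(), "Invalid intents"
assert df["confidence_score"].notna().all(), "Missing confidence scores"
print(df["intent"].value_counts(normalize=True))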

Validation Checklist:

  1. Row count matches expected extraction volume.
  2. Intent distribution reflects realistic ratios (e.g., ~60% informational).
  3. Zero NaN values in keyword, intent, or confidence_score.
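The checklist can also be run as a single function that reports every problem at once instead of stopping at the first failed assert. This is a dependency-free sketch using only the standard library; check_rows and expected_min_rows are hypothetical names, and the sample CSV is made-up data in the column layout the script writes.

```python
import csv
import io

ALLOWED = {"informational", "commercial", "transactional"}
REQUIRED = ("keyword", "intent", "confidence_score")

def check_rows(rows, expected_min_rows=1):
    """Return a list of human-readable problems; an empty list means the file passed."""
    problems = []
    if len(rows) < expected_min_rows:
        problems.append(f"only {len(rows)} rows, expected at least {expected_min_rows}")
    for n, row in enumerate(rows, start=1):
        for col in REQUIRED:
            # pandas writes NaN as an empty field, so empty means missing.
            if not (row.get(col) or "").strip():
                problems.append(f"row {n}: empty {col}")
        intent = (row.get("intent") or "").strip()
        if intent and intent not in ALLOWED:
            problems.append(f"row {n}: invalid intent {intent!r}")
    return problems

sample = (
    "keyword,frequency,intent,confidence_score\n"
    "crm pricing,4,commercial,0.9\n"
    "crm,7,,\n"
)
rows = list(csv.DictReader(io.StringIO(sample)))
print(check_rows(rows))
# ['row 2: empty intent', 'row 2: empty confidence_score']
```

The intent distribution in item 2 is left to manual review, since a realistic ratio depends on the competitor and niche.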

Scaling & Workflow Integration

Automate daily execution via cron (0 6 * * * cd /path && python competitor_kw.py ...) or GitHub Actions. Pipe the resulting CSV directly into content brief generators or ad campaign planners. This ingestion layer integrates seamlessly into broader AI Content Creation & Marketing Automation ecosystems, enabling automated clustering, brief generation, and performance tracking without manual data wrangling.