Python Script for Competitor Keyword Analysis: Exact Execution Guide
This script extracts, deduplicates, and classifies competitor keywords using Python and lightweight AI prompting, typically in under a minute for a handful of URLs. It keeps dependencies minimal, runs directly from the CLI, and delivers a structured dataset ready for content strategy.
Environment Setup & Core Dependencies
Run these commands to isolate dependencies and install the exact stack required. Python 3.9+ is recommended: it introduced built-in generic type hints (e.g., list[str]) and matches what current releases of these libraries are tested against.
python -m venv competitor_kw_env
source competitor_kw_env/bin/activate # Windows: competitor_kw_env\Scripts\activate
pip install requests beautifulsoup4 pandas openai spacy
python -m spacy download en_core_web_sm
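Before installing, it's worth confirming the interpreter inside the virtual environment meets that floor (a one-line convenience check, not part of the pipeline):
python -c "import sys; assert sys.version_info >= (3, 9), sys.version"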
The pipeline follows established SEO keyword research with Python conventions: HTTP requests are rate-limited with backoff to avoid throttling, and every phase emits a clean pandas DataFrame for downstream use.
Pipeline Architecture & Function Breakdown
The architecture follows a strict 3-phase modular design:
- fetch_html(urls: list) -> list: HTTP retrieval with retry/backoff.
- extract_keywords(html_docs: list) -> pd.DataFrame: DOM parsing, n-gram generation, and deduplication.
- classify_intent(df: pd.DataFrame, key: str) -> pd.DataFrame: batch LLM prompting with JSON schema enforcement and fallback routing.
Each function maintains pure input/output contracts to enable isolated testing and pipeline swapping.
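Because each phase is a pure function, any stage can be exercised in isolation with canned inputs and no network access. A minimal smoke test, assuming the Phase 2 extract_keywords below is already defined (the sample text is invented for illustration):
sample_docs = [
    "Best CRM software for small business. Compare CRM pricing and CRM features.",
    "CRM software guide: how to choose CRM software for small business teams.",
]
df = extract_keywords(sample_docs)
assert {"keyword", "frequency"} <= set(df.columns)
assert (df["frequency"] >= 2).all()  # the min_count filter held
print(df.head())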
Phase 1: Fetch & Sanitize Competitor HTML
Implements requests with custom headers, exponential backoff on 403/429, and targeted BS4 extraction.
import requests, time, re
from bs4 import BeautifulSoup

def fetch_html(urls, retries=3, base_delay=1.0):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    docs = []
    for url in urls:
        for attempt in range(retries):
            try:
                res = requests.get(url, headers=headers, timeout=10)
                res.raise_for_status()
                soup = BeautifulSoup(res.text, "html.parser")
                for tag in soup(["script", "style"]):
                    tag.decompose()
                # Pull only the SEO-relevant fields: <title>, meta description, and H1-H3.
                title = soup.title.string if soup.title and soup.title.string else ""
                meta = soup.find("meta", {"name": "description"})
                description = meta.get("content", "") if meta else ""
                headings = (h.get_text() for h in soup.find_all(["h1", "h2", "h3"]))
                text = " ".join([title, description, *headings])
                docs.append(re.sub(r"\s+", " ", text).strip())
                break
            except requests.exceptions.HTTPError:
                # Exponential backoff on blocking/rate limiting; re-raise anything else.
                if res.status_code in (403, 429):
                    time.sleep(base_delay * (2 ** attempt))
                else:
                    raise
    return docs
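A quick spot-check of the fetcher before wiring it into the pipeline (the URL is a placeholder; substitute a real competitor page):
docs = fetch_html(["https://example.com"])
print(len(docs), "documents;", docs[0][:120] if docs else "nothing retrieved")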
Phase 2: Extract & Deduplicate Keywords
Uses regex tokenization, pandas frequency aggregation, and spaCy lemmatization. Filters to 1–3 grams with min_count=2.
import pandas as pd, spacy

nlp = spacy.load("en_core_web_sm")
STOPWORDS = set(nlp.Defaults.stop_words) | {"http", "www", "com", "org", "click", "read"}

def extract_keywords(docs):
    grams = []
    for doc in docs:
        # Lemmatize and drop stopwords/non-alphabetic tokens before building n-grams.
        tokens = [t.lemma_.lower() for t in nlp(doc)
                  if t.is_alpha and t.lemma_.lower() not in STOPWORDS]
        for n in range(1, 4):  # 1-, 2-, and 3-grams
            grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    df = pd.Series(grams, dtype="object").value_counts().reset_index()
    df.columns = ["keyword", "frequency"]
    return df[df["frequency"] >= 2].reset_index(drop=True)  # min_count=2
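The sliding-window n-gram step is easiest to see on a toy token list (the tokens are invented):
tokens = ["crm", "software", "pricing"]
grams = []
for n in range(1, 4):
    grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
print(grams)
# ['crm', 'software', 'pricing', 'crm software', 'software pricing', 'crm software pricing']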
Phase 3: AI Intent Classification
Sends the full keyword batch to OpenAI in a single call with JSON-mode output enforced, falling back to rule-based scoring on API failure. (JSON mode guarantees well-formed JSON, not a strict schema, so the prompt pins the expected shape.)
import openai, json
import pandas as pd

def classify_intent(df, api_key):
    client = openai.OpenAI(api_key=api_key)
    prompt = (
        "Classify each keyword as informational, commercial, or transactional. "
        'Return ONLY JSON shaped as {"results": [{"keyword": ..., "intent": ..., "confidence": ...}]}. '
        f"Keywords: {df['keyword'].tolist()}"
    )
    try:
        res = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            response_format={"type": "json_object"},  # JSON mode requires an object, not a bare array
        )
        data = json.loads(res.choices[0].message.content)
        return pd.DataFrame(data["results"])
    except Exception:
        # Rule-based fallback: crude lexical scoring on a copy, so the input DataFrame stays untouched.
        out = df[["keyword"]].copy()
        out["intent"] = out["keyword"].apply(
            lambda x: "transactional" if any(w in x for w in ("buy", "price", "deal", "cost"))
            else "informational")
        out["confidence"] = 0.7
        return out
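For keyword lists too large for a single prompt, a thin wrapper can chunk the DataFrame and stitch the results back together. A sketch assuming classify_intent above is in scope (the chunk size of 100 is an arbitrary choice, not a model limit):
def classify_intent_batched(df, api_key, chunk_size=100):
    # Classify at most chunk_size keywords per API call, then concatenate.
    parts = [classify_intent(df.iloc[i:i + chunk_size], api_key)
             for i in range(0, len(df), chunk_size)]
    return pd.concat(parts, ignore_index=True)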
Complete Executable Script
Consolidate all phases into a single, runnable Python file. Save as competitor_kw.py.
#!/usr/bin/env python3
import argparse
import pandas as pd

# Paste the Phase 1, 2, and 3 code here (their imports, the nlp/STOPWORDS setup,
# and the fetch_html, extract_keywords, and classify_intent functions).

def main():
    parser = argparse.ArgumentParser(description="Competitor Keyword Extractor")
    parser.add_argument("--urls", nargs="+", required=True)
    parser.add_argument("--output", default="competitor_keywords.csv")
    parser.add_argument("--openai_key", required=True)
    args = parser.parse_args()

    html_docs = fetch_html(args.urls)
    kw_df = extract_keywords(html_docs)
    intent_df = classify_intent(kw_df, args.openai_key)
    # Left-join intents back onto the frequency table; keywords the model skipped stay NaN.
    final = kw_df.merge(intent_df, on="keyword", how="left")
    final = final.rename(columns={"confidence": "confidence_score"})
    final["source_url"] = ", ".join(args.urls)
    final.to_csv(args.output, index=False, encoding="utf-8-sig")
    print(f"Exported {len(final)} keywords to {args.output}")

if __name__ == "__main__":
    main()
Execute via: python competitor_kw.py --urls https://competitor1.com --output results.csv --openai_key sk-...
Validation & Output Formatting
Verify output integrity before downstream use.
import pandas as pd

df = pd.read_csv("competitor_keywords.csv", encoding="utf-8-sig")
assert df["keyword"].notna().all(), "Missing keywords"
assert df["intent"].isin(["informational", "commercial", "transactional"]).all(), "Invalid intents"
assert df["confidence_score"].notna().all(), "Missing confidence scores"
print(df["intent"].value_counts(normalize=True))
Validation Checklist:
- Row count matches expected extraction volume.
- Intent distribution reflects realistic ratios (e.g., ~60% informational).
- Zero NaN values in keyword, intent, or confidence_score.
Scaling & Workflow Integration
Automate daily execution via cron (0 6 * * * cd /path && python competitor_kw.py ...) or GitHub Actions. Pipe the resulting CSV directly into content brief generators or ad campaign planners. This ingestion layer integrates seamlessly into broader AI Content Creation & Marketing Automation ecosystems, enabling automated clustering, brief generation, and performance tracking without manual data wrangling.
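As a starting point for that handoff, a few lines of pandas can group the export into per-intent keyword lists for a brief generator (the structure below is illustrative, not a fixed contract):
import pandas as pd

df = pd.read_csv("competitor_keywords.csv", encoding="utf-8-sig")
# One frequency-ordered keyword list per intent, ready for a brief template.
briefs = (df.sort_values("frequency", ascending=False)
            .groupby("intent")["keyword"]
            .apply(list)
            .to_dict())
for intent, kws in briefs.items():
    print(f"{intent}: {kws[:10]}")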