This guide shows you how to turn a messy CRM lead into a tidy, structured record in under fifteen minutes: an AI model reads the raw text a lead left behind, infers the industry, company size, and buying intent, writes a one-line summary, and your Python script saves all of it back to the contact.
Most leads arrive half-empty. Someone fills in a name and an email, drops a sentence into a "tell us about your project" box, and that is all your sales team has to work with. The useful facts are buried in that sentence, plus the email domain and a job title, and nobody has time to read and tag hundreds of them by hand. An AI model can do that reading and tagging in seconds, and it can hand back results in a fixed shape your CRM understands.
Prerequisites
You only need a few things beyond a working Python install. If you have not set up Python yet, start with Create a Python Virtual Environment for AI and come back. New to calling AI models from code? The plain-English walkthrough in Understanding LLM APIs covers the request-and-response pattern this guide builds on.
You need Python 3.10 or newer, a funded OpenAI account, and the two packages below.
python -m pip install "openai>=1.40" python-dotenv
Create a file named .env in your project folder and add your key:
OPENAI_API_KEY=sk-your-key-here
Add .env to your .gitignore so your key is never committed to version control.
This guide focuses on the AI enrichment itself: the worked example prints each result and leaves the write-back call commented out. To push results into a live CRM, pair it with Sync HubSpot Contacts with Python, which covers the authenticated update call in detail.
Step 1: Define the fields you want back
Before you call any model, decide exactly what "enriched" means for your business. Vague requests get vague answers. A schema is just a list of the fields you want, their types, and the allowed values, written so the model has no room to improvise. Notice that every category includes an unknown option, so the model has an honest answer when the text gives it nothing to go on.
# schema.py
LEAD_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "industry": {
            "type": "string",
            "description": "Best guess at the lead's industry, e.g. 'SaaS', 'Healthcare', 'E-commerce'. Use 'unknown' if unclear.",
        },
        "company_size": {
            "type": "string",
            "enum": ["1-10", "11-50", "51-200", "201-1000", "1000+", "unknown"],
        },
        "intent": {
            "type": "string",
            "enum": ["ready_to_buy", "evaluating", "researching", "just_browsing", "unknown"],
        },
        "summary": {
            "type": "string",
            "description": "One plain sentence describing who the lead is and what they want.",
        },
        "confidence": {
            "type": "number",
            "description": "How confident you are in these inferences, from 0.0 to 1.0.",
        },
    },
    "required": ["industry", "company_size", "intent", "summary", "confidence"],
}
The enum lists matter. They turn open-ended guesses into a small set of values your sales filters can actually count on, so "company size" is always one of six tidy buckets rather than free text like "smallish" or "a few hundred people".
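To make the "values your filters can count on" idea concrete, here is a minimal sketch of a local check that compares a parsed result with the schema's required list and enum values. The helper name schema_problems is hypothetical (it is not part of the OpenAI SDK), and the schema is a trimmed copy of the one above so the snippet runs on its own.

```python
# Trimmed copy of LEAD_SCHEMA from schema.py so this snippet is self-contained.
LEAD_SCHEMA = {
    "properties": {
        "industry": {"type": "string"},
        "company_size": {
            "type": "string",
            "enum": ["1-10", "11-50", "51-200", "201-1000", "1000+", "unknown"],
        },
        "intent": {
            "type": "string",
            "enum": ["ready_to_buy", "evaluating", "researching", "just_browsing", "unknown"],
        },
        "summary": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["industry", "company_size", "intent", "summary", "confidence"],
}


def schema_problems(enrichment: dict, schema: dict = LEAD_SCHEMA) -> list[str]:
    """Hypothetical helper: list missing fields and values outside an enum."""
    problems = []
    for field in schema["required"]:
        if field not in enrichment:
            problems.append(f"missing field: {field}")
    for field, spec in schema["properties"].items():
        allowed = spec.get("enum")
        if allowed is not None and field in enrichment and enrichment[field] not in allowed:
            problems.append(f"{field}={enrichment[field]!r} is not an allowed value")
    return problems
```

An empty list means the record fits the buckets. A check like this is mostly useful for data that did not come through structured-output mode, such as rows imported from an older tagging process.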
Keep the schema lean. Every field you add is one more thing the model has to reason about and one more column your team has to act on. Start with the four or five fields that drive a real decision, ship them, and add more only when a teammate asks for them. A bloated schema produces slower, costlier calls and rarely earns its keep.
Step 2: Send one lead to the model in structured-output mode
Structured-output mode is the part that makes this reliable. When you attach your schema with response_format, the model is forced to return valid JSON that matches your fields exactly, so you never scrape an answer out of a paragraph of prose. If you have ever wrestled with broken parsing, the background in Fix JSONDecodeError with AI API Responses in Python shows why this approach removes the problem at the source.
# enrich.py
import json

from dotenv import load_dotenv
from openai import OpenAI

from schema import LEAD_SCHEMA

load_dotenv()  # load .env before the client is created
client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You enrich raw CRM leads. Infer each field only from the text provided. "
    "Never invent specific facts. When the text does not support a value, "
    "use 'unknown' and lower the confidence score."
)


def enrich_lead(lead: dict) -> dict:
    """Send one raw lead to the model and return structured enrichment."""
    user_message = (
        f"Name: {lead.get('name', '')}\n"
        f"Email: {lead.get('email', '')}\n"
        f"Job title: {lead.get('job_title', '')}\n"
        f"Message: {lead.get('message', '')}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "lead_enrichment",
                "schema": LEAD_SCHEMA,
                "strict": True,
            },
        },
    )
    content = response.choices[0].message.content
    if content is None:
        # No JSON body came back (for example, the model declined to answer).
        raise ValueError("Model returned no content for this lead")
    return json.loads(content)
The email domain is doing quiet work here. dana@acme-clinic.com nudges the model toward healthcare without you writing a single rule. Setting temperature=0 keeps results steady, so the same lead usually returns the same labels on Monday and Friday, although identical output is not strictly guaranteed.
The system prompt is your guardrail against invented facts. The two instructions that matter most are "infer only from the text provided" and "use 'unknown' when the text does not support a value". Without them, a model will happily guess a headcount it has no way of knowing. With them, a sparse lead returns honest unknown values and a low confidence score, which is exactly what you want feeding into the gate in the next step.
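A sparse lead costs an API call even when there is nothing in it to read. One option, sketched here as an assumption rather than part of the original flow, is a small pre-check that sends a lead straight to review when every text field is blank. The helper name has_usable_text and the TEXT_FIELDS tuple are hypothetical; the keys match the lead dictionaries used in this guide.

```python
# Hypothetical pre-check: the keys below match the lead dicts in this guide.
TEXT_FIELDS = ("email", "job_title", "message")


def has_usable_text(lead: dict) -> bool:
    """Return True when at least one text field holds non-blank content."""
    return any(str(lead.get(field) or "").strip() for field in TEXT_FIELDS)


sparse = {"name": "Unknown Visitor", "message": "   "}
full = {"name": "Dana Reyes", "email": "dana@acme-clinic.com", "message": "We run three clinics."}
print(has_usable_text(sparse), has_usable_text(full))
```

Calling has_usable_text(lead) before enrich_lead(lead) would skip the request for leads like the first one, at the cost of one more branch in your batch script.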
Step 3: Validate before you trust the result
The model returns valid JSON, but "valid JSON" is not the same as "good data". A quick check protects your CRM from low-confidence guesses landing in fields your team treats as fact. Route anything below your threshold to a human queue instead of writing it blindly.
# validate.py
def is_trustworthy(enrichment: dict, threshold: float = 0.6) -> bool:
    """Decide whether enrichment is confident enough to auto-apply."""
    if enrichment["confidence"] < threshold:
        return False
    # An 'unknown' industry on a high-confidence record is contradictory.
    if enrichment["industry"] == "unknown" and enrichment["confidence"] > 0.8:
        return False
    return True
Tune the threshold to your appetite for risk. A team that prizes clean data sets it high and reviews more leads by hand; a team that wants every record tagged sets it low and accepts a few rough guesses.
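To see the trade-off before you commit to a number, you can replay a handful of results against several thresholds. The sketch below copies is_trustworthy from validate.py so it runs on its own; the SAMPLE_RESULTS values are made up for illustration and are not real model output, and count_auto_applied is a hypothetical helper.

```python
def is_trustworthy(enrichment: dict, threshold: float = 0.6) -> bool:
    """Copy of the gate from validate.py."""
    if enrichment["confidence"] < threshold:
        return False
    if enrichment["industry"] == "unknown" and enrichment["confidence"] > 0.8:
        return False
    return True


# Made-up results for illustration only.
SAMPLE_RESULTS = [
    {"industry": "Healthcare", "confidence": 0.9},
    {"industry": "SaaS", "confidence": 0.7},
    {"industry": "E-commerce", "confidence": 0.55},
    {"industry": "unknown", "confidence": 0.3},
]


def count_auto_applied(results: list[dict], threshold: float) -> int:
    """Count how many results would be written back without review."""
    return sum(is_trustworthy(result, threshold) for result in results)


for threshold in (0.5, 0.6, 0.8):
    print(threshold, count_auto_applied(SAMPLE_RESULTS, threshold))
```

With these sample values, three of the four records pass at 0.5, two at 0.6, and one at 0.8. Running the same comparison on a week of your own leads gives a rough sense of how much review work each setting creates.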
Step 4: Write the enriched fields back to the CRM
The final step sends your parsed result to the CRM as an update on the matching contact. The shape of this call depends on your CRM, but the pattern is always the same: an authenticated request that maps your enrichment fields onto the right custom properties. Below is the HubSpot-style version; swap the URL and field names for your own system.
# writeback.py
import os

import httpx


def write_back(contact_id: str, enrichment: dict) -> None:
    """Save enrichment to custom properties on a CRM contact."""
    token = os.environ["HUBSPOT_TOKEN"]
    url = f"https://api.hubapi.com/crm/v3/objects/contacts/{contact_id}"
    payload = {
        "properties": {
            "ai_industry": enrichment["industry"],
            "ai_company_size": enrichment["company_size"],
            "ai_intent": enrichment["intent"],
            "ai_summary": enrichment["summary"],
        }
    }
    response = httpx.patch(
        url,
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30.0,
    )
    response.raise_for_status()
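Create the custom properties (ai_industry, ai_company_size, ai_intent, and ai_summary) in your CRM settings before you run this, or the update will be rejected for unknown fields. Add your HUBSPOT_TOKEN to the same .env file and keep that file out of version control. The script imports httpx, which pip installs automatically as a dependency of the openai package, so no extra install step is needed.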
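If you adapt the write-back to another CRM, it can help to keep the field mapping in one place and test it without touching the network. The following is a minimal sketch under that assumption; FIELD_MAP and build_payload are hypothetical names, and the property names are the ones used in this guide.

```python
# Hypothetical mapping from enrichment keys to the custom CRM property names.
FIELD_MAP = {
    "industry": "ai_industry",
    "company_size": "ai_company_size",
    "intent": "ai_intent",
    "summary": "ai_summary",
}


def build_payload(enrichment: dict) -> dict:
    """Build the update body; confidence is deliberately left out."""
    properties = {crm_name: enrichment[key] for key, crm_name in FIELD_MAP.items()}
    return {"properties": properties}


example = {
    "industry": "Healthcare",
    "company_size": "11-50",
    "intent": "evaluating",
    "summary": "Clinic operator looking to automate patient reminders.",
    "confidence": 0.9,
}
print(build_payload(example))
```

Swapping CRMs then means editing FIELD_MAP and the request URL, while the rest of the function stays the same.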
Worked example: enrich a batch end to end
This script ties the four steps together. It reads leads, enriches each one, applies the confidence gate, and either writes the result back or flags it for review. It uses a small sample list so you can run it immediately; replace SAMPLE_LEADS with rows pulled from your CRM.
# run.py
import time

from enrich import enrich_lead
from validate import is_trustworthy
# from writeback import write_back  # uncomment when your CRM fields exist

SAMPLE_LEADS = [
    {
        "id": "101",
        "name": "Dana Reyes",
        "email": "dana@acme-clinic.com",
        "job_title": "Operations Lead",
        "message": "We run three clinics and need to automate patient reminders soon.",
    },
    {
        "id": "102",
        "name": "Sam Cole",
        "email": "sam@gmail.com",
        "job_title": "",
        "message": "Just looking around, might come back later.",
    },
]


def main() -> None:
    for lead in SAMPLE_LEADS:
        enrichment = enrich_lead(lead)
        if is_trustworthy(enrichment):
            print(f"AUTO {lead['name']}: {enrichment['summary']}")
            # write_back(lead["id"], enrichment)
        else:
            print(f"REVIEW {lead['name']}: low confidence -> human queue")
        time.sleep(0.5)  # gentle pacing keeps you under rate limits


if __name__ == "__main__":
    main()
Run it with python run.py. The clinic lead should come back as healthcare with a clear "evaluating" or "ready_to_buy" intent and a high confidence score, while the browsing lead should land in the review queue. The time.sleep(0.5) is your friend on large batches; for a deeper fix when volume climbs, see Fix the 429 Rate-Limit Error in Python.
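A fixed pause helps with pacing but does nothing once a request has already failed. As a rough sketch of one common alternative, the helper below retries a call with exponentially growing pauses. The name call_with_backoff and its parameters are hypothetical; in real use you would likely pass a narrower exception type through retry_on (for example openai.RateLimitError) instead of the catch-all default used here to keep the snippet self-contained.

```python
import time


def call_with_backoff(func, *args, retries=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(*args), pausing 1x, 2x, 4x... base_delay between failed tries."""
    for attempt in range(retries + 1):
        try:
            return func(*args)
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the original error
            sleep(base_delay * 2 ** attempt)
```

In run.py this would replace the direct call with something like call_with_backoff(enrich_lead, lead). The sleep parameter exists only so tests can record delays instead of actually waiting.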
Parameter quick reference
| Parameter | Type | Default | Effect |
|---|---|---|---|
| model | string | gpt-4o-mini | The model that reads and classifies each lead. The compact mini model is cheap and accurate enough for this task; upgrade only if labels feel weak. |
| temperature | float | 0 | Controls randomness. Keep it at 0 so the same lead gets the same labels as consistently as possible. |
| response_format | object | none | Attaches your JSON schema. With strict: True, the model must return valid JSON matching your fields. |
| threshold | float | 0.6 | Your confidence cut-off. Records below it go to human review instead of being written back. |
Troubleshooting
- openai.AuthenticationError: 401: Your key is missing or wrong. Confirm .env sits in the folder you run from and that load_dotenv() is called before OpenAI(). The step-by-step cure is in Fix the 401 Unauthorized Error in OpenAI Python.
- BadRequestError mentioning response_format or schema: Your schema is malformed. With strict: True, every property must appear in the required list and additionalProperties must be False. Match the schema in Step 1 exactly.
- Every field comes back unknown: The model has nothing to work with. Check that you are actually passing the message text and email into user_message; an empty string in, empty inferences out.
- HubSpot returns 400 Property ... does not exist: You are writing to custom fields that have not been created yet. Add ai_industry, ai_intent, and the others in your CRM's property settings before running the write-back.
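The two strict-mode rules named in the BadRequestError entry are easy to check before you send a request. The sketch below covers only those two rules at the top level of the schema; it is not a full validator, and strict_schema_problems is a hypothetical helper name.

```python
def strict_schema_problems(schema: dict) -> list[str]:
    """Check the two strict-mode rules from this guide (top level only)."""
    problems = []
    if schema.get("additionalProperties") is not False:
        problems.append("additionalProperties must be False")
    properties = set(schema.get("properties", {}))
    required = set(schema.get("required", []))
    for name in sorted(properties - required):
        problems.append(f"property not in required: {name}")
    return problems


draft = {
    "type": "object",
    "properties": {"industry": {"type": "string"}, "intent": {"type": "string"}},
    "required": ["industry"],
}
for problem in strict_schema_problems(draft):
    print(problem)
```

Running strict_schema_problems(LEAD_SCHEMA) after you edit schema.py gives a quick signal before the API does.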
When to use this vs. alternatives
- Use AI enrichment when the signal lives in free text: a project description, a support note, or a job title the model can interpret. This is exactly where rigid rules fall down and a language model shines.
- Use a data-enrichment provider (Clearbit, Apollo) when you need verified firmographics such as headcount, funding, or exact revenue. Those services look companies up in a database; the AI here only infers from the text in front of it, so prefer a database when you need hard facts rather than smart guesses.
- Use simple if/else rules when the mapping is fixed, for example "any .edu email is the education segment". A rule is free, instant, and fully predictable, so do not reach for a model when a one-line condition already settles it.
Related guides
- CRM Data Integration with AI — the main guide covering the full fetch, clean, enrich, and write-back loop.
- Sync HubSpot Contacts with Python — pull and update contacts so enriched data has somewhere to land.
- Summarize Sales Calls to Your CRM with Python — another AI-to-CRM workflow that pairs naturally with lead enrichment.
- Understanding LLM APIs — the foundations of calling AI models from Python.