Your customer data is scattered. Names are typed three different ways, half the company fields are blank, and the notes your sales team left after each call are trapped in free text that no report can read. CRM data integration is the work of pulling those records out of your customer relationship management tool, cleaning them up, and adding new structured information back in. When you connect that pipeline to an AI model, you can do something a spreadsheet never could: read every messy note, score every lead, and tag every account automatically.
This guide shows you how to build that pipeline in Python, even if you have never written a sync job before. You will pull contacts from a CRM over its API (the doorway a program uses to talk to a service), clean the fields, enrich each contact with an AI model, and write the results back. We use the official openai SDK for the AI calls and httpx for the CRM calls, both of which are modern, well-supported, and beginner-friendly. This is one of the core building blocks of Building AI-Powered Business Applications.
Who this is for and what you will build
This is for founders, marketers, and operations people who run a CRM (HubSpot, Pipedrive, Salesforce, Zoho, and the like) and want to stop doing data hygiene by hand. By the end you will have a single Python script that:
- Reads contacts from your CRM in safe, paged batches.
- Standardizes emails, phone numbers, and company names.
- Asks an AI model to classify each contact's industry and write a one-line summary.
- Writes those AI fields back onto each contact so your team sees them inside the CRM.
The same fetch → clean → enrich → write-back shape powers every project in this section, including Sync HubSpot Contacts with Python, Enrich CRM Leads with AI in Python, and Summarize Sales Calls to Your CRM with Python. Learn it once here and those guides will feel like small variations rather than new puzzles.
It helps to picture why each stage exists. Fetching is about getting the data out reliably, in batches that respect the API's limits. Cleaning is about making the data consistent, so that "Acme Inc.", "acme inc", and "ACME" all become one company instead of three. Enriching is where the AI earns its keep, reading text a rule-based script could never parse and turning it into a tidy label or summary. Writing back closes the loop, putting that intelligence where your team already works instead of in some separate spreadsheet nobody opens. Skip any one of those stages and the pipeline stops being useful: clean data with no enrichment is just tidy data, and enrichment that never gets written back is intelligence trapped in a terminal window.
Prerequisites
You need Python 3.10 or newer. Check your version with python --version. If you have not set Python up yet, follow Setting Up Python for AI first, and ideally work inside a Python virtual environment so this project's packages stay isolated from the rest of your system.
Install the four packages this guide uses:
pip install openai httpx python-dotenv pandas
- openai is the official SDK for calling AI models.
- httpx is a modern HTTP client we use to talk to the CRM's API.
- python-dotenv loads your secret keys from a file instead of hardcoding them.
- pandas is a table library that makes cleaning rows of data quick.
Now create a file named .env in your project folder to hold your credentials. A credential is just a secret password your code uses to prove it is allowed to access a service.
OPENAI_API_KEY=sk-your-openai-key-here
CRM_API_TOKEN=your-crm-private-app-token
CRM_BASE_URL=https://api.hubapi.com
Important: add .env to your .gitignore file so these secrets never get committed to version control. If you skip this, your keys can leak the moment you push to GitHub. One line does it:
echo ".env" >> .gitignore
If your AI keys are new to you, Understanding LLM APIs explains where they come from, how billing works, and how to read the errors a model returns.
One more decision before you write any code: what data should ever leave your CRM? A good rule is to send the model only the fields it needs to do its job. The industry classifier needs the company name and a few notes. It does not need email addresses, phone numbers, deal values, or anything else that identifies a real person. Stripping those out before the AI call is both safer and cheaper, because you pay for every token (roughly, every word fragment) you send. We will keep this principle in mind through every step: clean locally, send the minimum, and write the result back. Some AI providers, including OpenAI, state that by default they do not train on data sent through their API, but the only data that can never leak is the data you never send.
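As a rough illustration of the "send the minimum" rule, the sketch below filters a contact down to an allow-list before anything is placed in a prompt. The names AI_ALLOWED_FIELDS and fields_for_ai, and the 400-character cap, are assumptions made for this example, not part of any SDK.

```python
# Hypothetical allow-list: the only fields the classifier is permitted to see.
AI_ALLOWED_FIELDS = ("company", "notes")


def fields_for_ai(contact: dict, max_note_chars: int = 400) -> dict:
    """Return a copy of the contact holding only allow-listed, trimmed fields."""
    safe = {name: str(contact.get(name) or "").strip() for name in AI_ALLOWED_FIELDS}
    # Trim long notes so a pasted transcript cannot inflate the prompt.
    safe["notes"] = safe["notes"][:max_note_chars]
    return safe


contact = {
    "id": "101",
    "email": "jane@example.com",
    "phone": "5550100",
    "company": "Acme Inc.",
    "notes": "Asked about annual pricing for a 40-seat team.",
}
print(fields_for_ai(contact))  # only 'company' and 'notes' survive
```

An allow-list is used rather than a block-list so that a field added to the CRM later stays private unless you opt it in. Notes are free text, so they can still mention a person by name; the filter limits which fields are sent, not what those fields contain.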
Step 1 — Load credentials and create your clients
Every script starts by reading your secrets and creating two "clients" — small objects that hold the connection details so you do not repeat them on every call. We load the .env file with python-dotenv, then build an OpenAI client for AI calls and an httpx.Client for CRM calls.
import os
import httpx
from dotenv import load_dotenv
from openai import OpenAI
load_dotenv() # reads the .env file into environment variables
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
CRM_API_TOKEN = os.environ["CRM_API_TOKEN"]
CRM_BASE_URL = os.environ["CRM_BASE_URL"]
# The AI client. It picks up OPENAI_API_KEY automatically, but we pass it to be explicit.
ai = OpenAI(api_key=OPENAI_API_KEY)
# The CRM client. timeout means "give up after 30 seconds" so the script never hangs forever.
crm = httpx.Client(
    base_url=CRM_BASE_URL,
    headers={"Authorization": f"Bearer {CRM_API_TOKEN}"},
    timeout=30.0,
)
Using os.environ["KEY"] (with square brackets) instead of os.getenv("KEY") means the script stops with a clear error if a key is missing, rather than silently sending a blank token and failing later with a confusing message. This "fail fast" habit saves you from the most common beginner trap, where a typo in a key name produces a vague 401 error twenty lines later instead of an obvious one on the first line.
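If you would rather see every missing key in one message instead of one KeyError at a time, a small helper can check them together. This is a sketch only: require_env is a hypothetical name, and the environ argument exists just so the demo can run against a plain dict.

```python
import os


def require_env(names: list[str], environ=os.environ) -> dict[str, str]:
    """Return the requested variables, or raise one error naming every missing key."""
    # Treat unset and empty values the same: both would send a blank credential.
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Demo with a plain dict standing in for os.environ.
fake_env = {"OPENAI_API_KEY": "sk-test", "CRM_BASE_URL": ""}
try:
    require_env(["OPENAI_API_KEY", "CRM_API_TOKEN", "CRM_BASE_URL"], environ=fake_env)
except RuntimeError as err:
    print(err)
```

The behavior is the same fail-fast idea as the square-bracket lookup, with the added check that an empty value counts as missing.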
A quick word on the two clients. They look similar but do very different jobs. The OpenAI client knows how to format requests the AI model expects and how to read its replies — you never touch raw HTTP with it. The httpx.Client is more general: it talks to any web API, and we point it at your CRM. Creating each client once and reusing it (rather than rebuilding it on every loop) keeps the underlying network connection alive, which makes a sync of thousands of contacts noticeably faster. Think of the client as a phone line you open once and keep talking on, instead of redialing for every sentence.
Step 2 — Pull contacts from the CRM in pages
CRMs do not hand you all your contacts at once. They return a "page" of records plus a pointer to the next page. This is called pagination, and respecting it is how you sync 50,000 contacts without running out of memory or tripping a rate limit (the cap on how many requests you can make per minute).
The loop below keeps asking for the next page until the CRM stops sending a paging.next.after cursor.
def fetch_contacts(page_size: int = 100) -> list[dict]:
    """Pull every contact from the CRM, one page at a time."""
    contacts: list[dict] = []
    after: str | None = None
    while True:
        params = {
            "limit": page_size,
            "properties": "email,phone,company,notes_last_contacted",
        }
        if after:
            params["after"] = after
        response = crm.get("/crm/v3/objects/contacts", params=params)
        response.raise_for_status()  # turn HTTP errors (401, 429, 500) into Python exceptions
        data = response.json()
        contacts.extend(data.get("results", []))
        # The CRM tells us where the next page starts; if it is missing, we are done.
        after = data.get("paging", {}).get("next", {}).get("after")
        if not after:
            break
    return contacts
raise_for_status() is your safety net: if the CRM returns a 401 (bad token) or 429 (too many requests), the script raises an exception instead of quietly storing an error message as if it were data. Without it, a failed request would hand back a small JSON error blob, your loop would treat that blob as a "contact," and you would not notice the problem until the cleaning step produced nonsense.
Notice the properties parameter. Most CRMs return only a handful of default fields unless you ask for more by name, so we list exactly the four we want: email, phone, company, and the last-contacted notes. Requesting only what you need keeps each response small and fast, and it doubles as a privacy control — fields you never fetch can never accidentally end up in an AI prompt. The cursor pattern (paging.next.after) is the other thing worth understanding. Rather than asking for "page 2" by number, the CRM hands you an opaque token that means "start right after the last record you saw." This is more reliable than page numbers when records are being added or deleted mid-sync, because it never skips or repeats a row. For a deeper look at one CRM's exact endpoints and property names, see Sync HubSpot Contacts with Python.
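To see the cursor pattern without touching a real CRM, the sketch below runs the same loop against a tiny in-memory stand-in. The iter_records generator and fake_get_page are illustrative names; the only thing borrowed from the guide is the paging.next.after shape of the response.

```python
from typing import Callable, Iterator, Optional


def iter_records(get_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield records one at a time, following the cursor until it disappears."""
    after = None
    while True:
        data = get_page(after)
        yield from data.get("results", [])
        after = data.get("paging", {}).get("next", {}).get("after")
        if not after:
            return


# A fake three-record "CRM" served two records per page, for demonstration only.
FAKE_DB = [{"id": "1"}, {"id": "2"}, {"id": "3"}]


def fake_get_page(after: Optional[str], limit: int = 2) -> dict:
    start = int(after) if after else 0
    page = {"results": FAKE_DB[start:start + limit]}
    if start + limit < len(FAKE_DB):
        # Only include a cursor when more records remain.
        page["paging"] = {"next": {"after": str(start + limit)}}
    return page


print([record["id"] for record in iter_records(fake_get_page)])
```

Writing the loop as a generator is a design choice, not a requirement: it lets a caller process contacts as they arrive instead of holding the whole list in memory, which can matter for very large syncs.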
Step 3 — Clean the data, then enrich each contact with AI
Raw CRM data is messy: JANE@COMPANY.COM , phone numbers with dashes and spaces, company names in mixed case. Clean these first so the AI sees consistent input and your downstream reports group correctly. We use pandas for the cleaning because it handles whole columns in one line.
import pandas as pd
def clean_contacts(raw: list[dict]) -> pd.DataFrame:
    """Flatten the CRM payload and normalize the messy fields."""
    rows = [
        {
            "id": c["id"],
            "email": (c["properties"].get("email") or "").lower().strip(),
            "phone": (c["properties"].get("phone") or ""),
            "company": (c["properties"].get("company") or "").strip().title(),
            "notes": (c["properties"].get("notes_last_contacted") or "").strip(),
        }
        for c in raw
    ]
    # Naming the columns keeps the lines below from failing when the CRM returns no contacts.
    df = pd.DataFrame(rows, columns=["id", "email", "phone", "company", "notes"])
    df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)  # keep digits only
    df = df[df["email"] != ""]  # drop contacts with no email
    return df.drop_duplicates(subset=["email"]).reset_index(drop=True)
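One caveat about the company column: title-casing alone still leaves "Acme Inc.", "Acme Inc", and "Acme" as three different values. If you want them to group as one company, a separate comparison key is one option. The sketch below is an assumption-laden example: company_key and the LEGAL_SUFFIXES list are hypothetical, and the suffix list is deliberately short.

```python
import re

# Hypothetical list of legal suffixes to ignore; extend it for your own data.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "co", "gmbh"}


def company_key(name: str) -> str:
    """Build a comparison key: lowercase, no punctuation, no trailing legal suffix."""
    # Replace anything that is not a lowercase letter, digit, or space with a space.
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while words and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    return " ".join(words)


for raw in ["Acme Inc.", "acme inc", "ACME"]:
    print(repr(raw), "->", repr(company_key(raw)))
```

Keep the key in its own column for grouping and leave the display name as it is, since a key like this is lossy: a company whose entire name is a suffix would collapse to an empty string.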
Now the enrichment. For each contact we ask the AI model to read the company name and notes, then return a structured answer: a guessed industry and a one-line summary. We force the model to reply as JSON (a strict text format programs can parse) using response_format, so we never have to guess at free-form text.
import json
def enrich_contact(company: str, notes: str) -> dict:
    """Ask the AI model to classify and summarize one contact."""
    prompt = (
        f"Company: {company or 'unknown'}\n"
        f"Notes: {notes or 'none'}\n\n"
        "Return JSON with two keys: 'industry' (a short label like 'SaaS' or "
        "'Retail') and 'summary' (one sentence, under 20 words)."
    )
    response = ai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You label business contacts. Reply only with JSON."},
            {"role": "user", "content": prompt},
        ],
        response_format={"type": "json_object"},
        temperature=0,  # 0 = consistent, repeatable answers
    )
    return json.loads(response.choices[0].message.content)
We use gpt-4o-mini here because classification and short summaries do not need an expensive model, and temperature=0 makes the output stable, so the same contact nearly always gets the same label. Temperature controls how much randomness the model adds when it chooses each word: at 0 it strongly favors the most likely choice, which is what you want for data fields that should be consistent across a report. Small variations between runs are still possible, so treat the labels as repeatable rather than guaranteed identical. The system message sets the model's role and reminds it to reply with JSON only, while the user message carries the specific contact. Keeping these two messages separate is a small prompt-engineering habit that makes the model's behavior far more predictable than cramming everything into one block of text.
Why force JSON at all? Because the alternative is parsing free-form sentences, which breaks the moment the model decides to be chatty. By setting response_format={"type": "json_object"} and asking for named keys, you get back something Python can read with a single json.loads call, every time. If you want to score leads by buying intent or extract more fields, Enrich CRM Leads with AI in Python builds directly on this step.
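JSON mode is about syntax, so it may still be worth checking that the reply has the two keys you asked for before writing anything to the CRM. The sketch below is one defensive way to do that; parse_enrichment and the "Unknown" fallback label are assumptions for illustration, not part of the openai SDK.

```python
import json


def parse_enrichment(raw) -> dict:
    """Parse the model's reply and guarantee both expected keys are strings."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # TypeError covers a reply whose content is None.
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {
        "industry": str(data.get("industry") or "Unknown").strip(),
        "summary": str(data.get("summary") or "").strip(),
    }


print(parse_enrichment('{"industry": "SaaS", "summary": "Evaluating annual plans."}'))
print(parse_enrichment('{"sector": "Retail"}'))  # valid JSON, wrong keys
print(parse_enrichment("not json"))
```

Falling back to a neutral label instead of raising keeps one odd reply from stopping a long sync. If you would rather stop and inspect, raise inside the except branch instead.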
Step 4 — Write the enriched data back to the CRM
Enrichment is only useful if your team can see it inside the CRM. We send the AI's industry and summary back to custom properties on each contact with a PATCH request — the HTTP verb that means "update part of an existing record." Make sure those custom properties exist in your CRM first, or the write will be rejected.
def write_back(contact_id: str, enrichment: dict) -> None:
    """Update one contact with the AI-generated fields."""
    payload = {
        "properties": {
            "ai_industry": enrichment.get("industry", ""),
            "ai_summary": enrichment.get("summary", ""),
        }
    }
    response = crm.patch(f"/crm/v3/objects/contacts/{contact_id}", json=payload)
    response.raise_for_status()
That is the full loop. A couple of production notes before you run it at scale. First, write-backs are the one step that changes your live data, so test against a handful of contacts before turning it loose on your whole database — a df.head(5) while you experiment is cheap insurance. Second, PATCH updates only the fields you name, leaving everything else on the contact untouched; that is exactly the behavior you want, because you are adding intelligence, not overwriting your sales team's work. The next section assembles these four functions into one script you can run today.
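One way to make the "test on a handful first" advice hard to forget is a dry-run switch that is on by default. The sketch below is an assumption about how you might wire it: run_sync is a hypothetical wrapper, and the enrich and write arguments are stand-ins for enrich_contact and write_back so the demo runs without any network access.

```python
from itertools import islice
from typing import Optional


def run_sync(rows, enrich, write, limit: Optional[int] = 5, dry_run: bool = True):
    """Enrich up to `limit` rows; call `write` only when dry_run is False."""
    planned = []
    for row in islice(rows, limit):  # limit=None means "all rows"
        enrichment = enrich(row["company"], row["notes"])
        planned.append((row["id"], enrichment))
        if dry_run:
            print(f"[dry run] would update {row['id']}: {enrichment}")
        else:
            write(row["id"], enrichment)
    return planned


# Stubs standing in for the real AI and CRM calls.
rows = [{"id": str(i), "company": f"Company {i}", "notes": ""} for i in range(1, 8)]
written = []


def stub_enrich(company, notes):
    return {"industry": "Unknown", "summary": company}


def stub_write(contact_id, enrichment):
    written.append(contact_id)


run_sync(rows, stub_enrich, stub_write)                          # prints five planned updates
run_sync(rows, stub_enrich, stub_write, limit=2, dry_run=False)  # writes two
print(written)
```

With the pandas DataFrame from the cleaning step, df.to_dict("records") produces rows in the shape this function expects. Making the safe mode the default means a live write requires an explicit dry_run=False.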
Parameter reference
These are the settings you will most often adjust as you adapt the pipeline.
| Parameter | Type | Default | Effect |
|---|---|---|---|
| page_size | int | 100 | How many contacts each CRM page returns. Lower it if you hit memory or rate limits. |
| model | str | "gpt-4o-mini" | Which AI model enriches each contact. Bigger models cost more but reason better. |
| temperature | float | 0 | Randomness of AI output. 0 gives repeatable labels; raise toward 1 for varied wording. |
| response_format | dict | {"type": "json_object"} | Forces the model to return parseable JSON instead of free text. |
| timeout | float | 30.0 | Seconds httpx waits before giving up on a slow CRM response. |
| properties | str | "email,phone,..." | Comma-separated CRM fields to fetch. Request only what you need to keep payloads small. |
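If you find yourself changing these values in several places, one option is to gather them into a single settings object. The sketch below is only a suggestion: SyncConfig is a hypothetical class, and its defaults simply mirror the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncConfig:
    """Illustrative container for the pipeline settings listed in the table."""
    page_size: int = 100
    model: str = "gpt-4o-mini"
    temperature: float = 0.0
    timeout: float = 30.0
    properties: tuple[str, ...] = ("email", "phone", "company", "notes_last_contacted")

    @property
    def properties_param(self) -> str:
        """Comma-separated form used for the `properties` query parameter."""
        return ",".join(self.properties)


config = SyncConfig(page_size=50)
print(config.page_size, config.properties_param)
```

Marking the dataclass frozen prevents a setting from being changed by accident halfway through a sync.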
Troubleshooting
- KeyError: 'OPENAI_API_KEY'. Your .env file was not found or the key name is misspelled. Cause: the script runs from a different folder than the .env file, or load_dotenv() runs after you read the variable. Fix: call load_dotenv() at the very top and run the script from the folder containing .env.
- httpx.HTTPStatusError: 401 Unauthorized. The CRM rejected your token. Cause: an expired, revoked, or wrong token, or a missing Bearer prefix. Fix: regenerate the token in your CRM's private-app settings and confirm the header reads Authorization: Bearer <token>. The same logic for AI keys is covered in Fix the 401 Unauthorized Error in OpenAI Python.
- httpx.HTTPStatusError: 429 Too Many Requests. You sent requests faster than the API allows. Cause: looping with no pause between calls. Fix: add a short time.sleep(0.5) between AI calls and wrap network calls in retry-with-backoff. See Fix the 429 Rate-Limit Error in Python.
- json.decoder.JSONDecodeError. The AI reply was not valid JSON. Cause: response_format was omitted, so the model wrapped its answer in prose. Fix: keep response_format={"type": "json_object"} and instruct the model to reply with JSON only. See Fix JSONDecodeError with AI API Responses in Python.
- 400 Bad Request on write-back. The CRM rejected the update. Cause: the custom property (ai_industry or ai_summary) does not exist yet. Fix: create the properties in your CRM's settings before running the write step, matching the exact internal names.
- AI cost or context errors on long notes. A very long notes field can blow past the model's input limit or run up your bill. Cause: sending entire call transcripts unfiltered. Fix: truncate notes to the first few hundred characters, or summarize them first. See Fix the Context-Length-Exceeded Error in Python.
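The "retry-with-backoff" fix for 429 errors can be sketched as a small wrapper. Everything here is illustrative: with_backoff is a hypothetical helper, the delays and the 25% jitter are arbitrary starting points, and the sleep argument exists so the demo can run instantly.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    retries: int = 4,
    base_delay: float = 1.0,
    retry_on: tuple = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying with exponentially growing, jittered pauses."""
    for attempt in range(retries):
        try:
            return call()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            # 1s, 2s, 4s... plus up to 25% jitter so clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.25))
    raise ValueError("retries must be at least 1")


# Demo: a call that fails twice, then succeeds. Pauses are recorded, not slept.
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated 429")
    return "ok"


pauses = []
result = with_backoff(flaky, retry_on=(ConnectionError,), sleep=pauses.append)
print(result, "after", attempts["count"], "attempts")
```

In the real script you would likely retry only on 429 and 5xx responses, which means inspecting the status code before deciding; retrying a 401 just repeats the same rejection.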
Full worked example
Save this as crm_ai_sync.py, fill in your .env, create the ai_industry and ai_summary custom properties in your CRM, then run python crm_ai_sync.py. It fetches, cleans, enriches, and writes back, with retries and a pause to stay under rate limits.
import os
import json
import time
import httpx
import pandas as pd
from dotenv import load_dotenv
from openai import OpenAI
load_dotenv() # load secrets; remember .env must be in .gitignore
ai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
crm = httpx.Client(
    base_url=os.environ["CRM_BASE_URL"],
    headers={"Authorization": f"Bearer {os.environ['CRM_API_TOKEN']}"},
    timeout=30.0,
)


def fetch_contacts(page_size: int = 100) -> list[dict]:
    """Pull every contact from the CRM, one page at a time."""
    contacts, after = [], None
    while True:
        params = {"limit": page_size, "properties": "email,phone,company,notes_last_contacted"}
        if after:
            params["after"] = after
        resp = crm.get("/crm/v3/objects/contacts", params=params)
        resp.raise_for_status()
        data = resp.json()
        contacts.extend(data.get("results", []))
        after = data.get("paging", {}).get("next", {}).get("after")
        if not after:
            return contacts


def clean_contacts(raw: list[dict]) -> pd.DataFrame:
    """Flatten the CRM payload, normalize fields, and cap notes at 400 characters."""
    rows = [{
        "id": c["id"],
        "email": (c["properties"].get("email") or "").lower().strip(),
        "company": (c["properties"].get("company") or "").strip().title(),
        "notes": (c["properties"].get("notes_last_contacted") or "").strip()[:400],
    } for c in raw]
    # Naming the columns keeps the filter below from failing when there are no contacts.
    df = pd.DataFrame(rows, columns=["id", "email", "company", "notes"])
    df = df[df["email"] != ""]
    return df.drop_duplicates(subset=["email"]).reset_index(drop=True)


def enrich_contact(company: str, notes: str, retries: int = 3) -> dict:
    """Ask the AI model to classify and summarize one contact, retrying on failure."""
    prompt = (f"Company: {company or 'unknown'}\nNotes: {notes or 'none'}\n\n"
              "Return JSON with 'industry' (short label) and 'summary' (one sentence under 20 words).")
    for attempt in range(retries):
        try:
            resp = ai.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "system", "content": "You label business contacts. Reply only with JSON."},
                          {"role": "user", "content": prompt}],
                response_format={"type": "json_object"},
                temperature=0,
            )
            return json.loads(resp.choices[0].message.content)
        except Exception:  # back off and retry on transient failures
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1s, then 2s, before trying again


def write_back(contact_id: str, enrichment: dict) -> None:
    """Update one contact with the AI-generated fields."""
    payload = {"properties": {"ai_industry": enrichment.get("industry", ""),
                              "ai_summary": enrichment.get("summary", "")}}
    crm.patch(f"/crm/v3/objects/contacts/{contact_id}", json=payload).raise_for_status()


if __name__ == "__main__":
    df = clean_contacts(fetch_contacts())
    print(f"Enriching {len(df)} contacts...")
    for row in df.itertuples():
        enrichment = enrich_contact(row.company, row.notes)
        write_back(row.id, enrichment)
        time.sleep(0.5)  # gentle pause to respect rate limits
    print("Done. Check the ai_industry and ai_summary fields in your CRM.")
Next steps
You now have a working pipeline. From here, deepen one stage at a time:
- Master one CRM's exact API with Sync HubSpot Contacts with Python.
- Go beyond labels into lead scoring with Enrich CRM Leads with AI in Python.
- Turn recorded calls into CRM notes with Summarize Sales Calls to Your CRM with Python.
- Put this behind a chat interface so your team can ask questions of the data with Custom AI Chatbot Development, or package it as a product with SaaS MVP with Python and AI.