This guide shows you how to turn a messy CSV file into a clean, model-ready dataset with Pandas in under fifteen minutes. You will load a raw file, fix missing values, drop duplicates, set the right column types, normalize text, and export a file you can hand straight to an AI step. Replace the example column names with your own headers, run the script, and you get a predictable dataset every time.
Pandas is the standard Python library for working with table-shaped data (rows and columns, like a spreadsheet). A DataFrame is the in-memory table Pandas builds from your file. If those two terms are new, that is fine, the steps below explain each line as you go.
Raw exports cause more AI problems than most people expect. Inconsistent casing turns Acme and acme into two different things. Trailing spaces hide in text fields. Empty cells trigger API validation errors. A clean file fixes all of that before the data ever reaches a model, which is why data hygiene is a foundation skill covered across Data Cleaning for AI.
Prerequisites
This guide assumes you already have Python 3.10 or newer and a working virtual environment. If you do not, set one up first with Create a Python Virtual Environment for AI, then return here.
Install the one dependency you need:
pip install pandas
You also need a CSV file to clean. The examples below use a file named input.csv with columns called company, prompt_text, category, and price. Swap these for your real headers as you follow along. To see your actual column names, run a quick check first:
import pandas as pd
df = pd.read_csv("input.csv")
print(df.columns.tolist())
print(df.shape) # (rows, columns)
df.shape prints a pair like (1200, 4), meaning 1,200 rows and 4 columns. Note that starting number, you will compare it against the cleaned count at the end to confirm nothing went missing by accident.
Step 1: Load the CSV and inspect it
Always look at the data before you change it. Loading and inspecting takes seconds and saves you from cleaning the wrong column.
import pandas as pd
# Load the raw file. utf-8-sig safely strips the hidden BOM
# marker that Excel adds to the start of exported files.
df = pd.read_csv("input.csv", encoding="utf-8-sig")
# Inspect structure before changing anything.
print(df.info()) # column names, non-null counts, and types
print(df.head()) # the first five rows
df.info() is the most useful single command here. It lists every column, how many non-null (non-empty) values it holds, and the type Pandas guessed for it. If a column you expect to be numbers shows up as object (Pandas's label for text), that is an early warning that the column contains stray text you will fix in Step 3.
A BOM (byte order mark) is an invisible character some programs write at the start of a file. Reading with encoding="utf-8-sig" removes it so it does not contaminate your first column header.
Step 2: Fix missing values
Missing values are the most common cause of AI API errors, because most endpoints reject empty or null text. Handle required fields and optional fields differently.
# Required fields: drop any row missing the text you will send
# to the model or the label you group by. A row with no prompt
# text is unusable, so remove it.
df = df.dropna(subset=["prompt_text", "category"])
# Optional fields: keep the row but fill the gap with a clear
# placeholder so the value is explicit, not silently empty.
df["company"] = df["company"].fillna("unknown")
# Confirm there are no remaining gaps in the required columns.
remaining_gaps = df[["prompt_text", "category"]].isnull().sum().sum()
print(f"Remaining gaps in required columns: {remaining_gaps}")
The difference matters. dropna deletes rows, so you only want it where a missing value makes the row worthless. fillna keeps the row and replaces the gap, which preserves data you can still use. Print the gap count to confirm the required columns are now fully populated.
Step 3: Remove duplicates and fix types
Duplicate rows inflate costs (you pay to process the same text twice) and skew any counts or analysis. Wrong types cause crashes the moment you call a text method on a number.
# Drop rows that are exact duplicates across every column.
before = len(df)
df = df.drop_duplicates()
print(f"Removed {before - len(df)} duplicate rows")
# Cast each column to the type your AI step expects.
# Text columns must be strings before any .str operation.
df["prompt_text"] = df["prompt_text"].astype(str)
df["category"] = df["category"].astype(str)
# Numeric columns: convert text to numbers, turning any
# unparseable value (like "N/A") into a true missing value.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
drop_duplicates() with no arguments removes rows that match across all columns. If two rows should count as duplicates based on one key column only, pass subset=["prompt_text"] to compare just that field.
pd.to_numeric(..., errors="coerce") is the safe way to convert text to numbers. The errors="coerce" setting turns anything it cannot parse into NaN (Pandas's missing-value marker) instead of crashing. After this line you can fill or drop those new gaps the same way you did in Step 2.
Step 4: Normalize text fields
Normalizing means forcing text into one consistent shape so identical values actually match. This is the step that most improves prompt accuracy and embedding quality.
# Build the cleaning into one readable chain per column.
df["prompt_text"] = (
df["prompt_text"]
.str.strip() # remove leading/trailing spaces
.str.replace(r"\s+", " ", regex=True) # collapse runs of whitespace
.str.replace(r"[\r\n]+", " ", regex=True) # flatten line breaks
)
# Lowercase category labels so "Sales", "sales", and "SALES"
# all become one value you can group and filter on.
df["category"] = df["category"].str.strip().str.lower()
Each method in the chain does one job. .str.strip() trims the outer spaces. .str.replace(r"\s+", " ", regex=True) collapses double and triple spaces into single ones. The line-break replacement flattens hidden \r and \n characters that break CSV and JSON payloads. Lowercasing labels is what lets later grouping treat Sales and sales as the same category.
Be deliberate about which columns you lowercase. Lowercasing a category label is helpful; lowercasing a sentence you plan to show a user later may not be. Apply it only where consistent matching matters more than the original casing.
Step 5: Validate and export a clean CSV
Before you trust the file, run three quick checks, then write it to a new name so your raw data stays intact.
# 1. No gaps left in required columns.
assert df[["prompt_text", "category"]].isnull().sum().sum() == 0
# 2. Row count dropped only as much as expected.
print(f"Final rows: {len(df)}")
# 3. Spot-check a few cleaned rows by eye.
print(df.sample(min(5, len(df))))
# Export to a NEW file in plain UTF-8. index=False stops Pandas
# from writing an extra unnamed row-number column.
df.to_csv("clean_output.csv", index=False, encoding="utf-8")
print("Saved clean_output.csv, ready for your AI step.")
Writing to clean_output.csv instead of overwriting input.csv means you can re-run with different settings if you spot a problem. The cleaned file is now ready to feed into an AI workflow, whether you are sending each row to a model or building embeddings from it. For where that data goes next, see Understanding LLM APIs.
Key-parameter quick reference
| Parameter | Method | Default | Effect |
|---|---|---|---|
encoding | pd.read_csv / to_csv | platform default | Set to "utf-8-sig" to strip Excel's BOM on read; "utf-8" on write keeps accents and emoji intact. |
subset | dropna / drop_duplicates | all columns | Limits the check to the listed columns, so you only act on the fields that matter. |
errors | pd.to_numeric | "raise" | Set to "coerce" to turn unparseable values into NaN instead of crashing the script. |
index | to_csv | True | Set to False to avoid writing an extra unnamed row-number column to your output file. |
Troubleshooting
UnicodeDecodeError: 'utf-8' codec can't decode byte— Cause: the file is not UTF-8, often it is Latin-1 from an older system. Fix: pass the matching encoding, for examplepd.read_csv("input.csv", encoding="latin-1").AttributeError: Can only use .str accessor with string values— Cause: you called a.strmethod on a column that holds numbers or mixed types. Fix: cast it first withdf["col"] = df["col"].astype(str)before the normalization chain.KeyError: 'prompt_text'— Cause: the column name in your code does not match the file, often due to a trailing space or different casing in the header. Fix: runprint(df.columns.tolist())and copy the exact name, or normalize headers withdf.columns = df.columns.str.strip().- Cleaned file shows garbled symbols in Excel — Cause: Excel expects a BOM to read UTF-8 correctly. Fix: write with
df.to_csv("clean_output.csv", index=False, encoding="utf-8-sig")so Excel renders accents and emoji properly.
When to use this vs. alternatives
- Use this Pandas script when your data fits in memory (up to a few hundred thousand rows on a normal laptop) and you want full control over each cleaning rule. This is the right default for almost every creator, marketer, or founder project.
- Reach for a database query instead when the file is too large to load at once or already lives in a SQL system. Filtering and deduplicating in SQL before exporting a smaller, cleaner CSV avoids loading the whole thing into Python.
- Skip dedicated cleaning code when the work is genuinely one-off and tiny, say a dozen rows you can fix by hand in a spreadsheet. A script pays off when you will repeat the job or need the result to be exactly reproducible. If this cleaning is part of a recurring pipeline, wrap it in a scheduled run as shown in Automating Repetitive Tasks with Python.
Back to Data Cleaning for AI.
Related guides
- Data Cleaning for AI — the main guide for preparing data for AI, with the full set of preprocessing techniques.
- Automating Repetitive Tasks with Python — turn this one-off clean-up into a script that runs on a schedule.
- Understanding LLM APIs — where your cleaned data goes next when you send it to a model.
- Create a Python Virtual Environment for AI — set up the isolated Python environment these scripts assume.