Fundamentals

Data Cleaning for AI: A Step-by-Step Python Guide

You exported a spreadsheet of customer feedback, fed it straight into an AI model, and got back nonsense: half the rows summarized blank cells, the costs were higher than expected, and a few records crashed with a type error. The model was not broken. The data was. Every AI tool, from a simple summarizer to a full machine-learning pipeline, is only as reliable as the numbers and text you hand it.

Data cleaning is the unglamorous work that decides whether your AI project succeeds. It means taking a messy export full of blank cells, duplicate rows, mixed-up date formats, and inconsistent text, then turning it into a tidy table where every column means exactly one thing and every value is the type you expect. This guide walks you through that process in plain Python, using pandas (the standard Python library for working with tables of data). You will load a messy file, fix the four problems that wreck almost every dataset, normalize text so an AI model can read it, and export a clean result you can trust.

This is a core skill in the Python AI Fundamentals for Non-Developers track. You do not need a data-science degree to follow along. If you can run a Python script and read a spreadsheet, you can clean data well. By the end you will have a reusable cleaning script and a clear checklist you can apply to any future dataset.

The data cleaning pipeline Raw messy data flows through four cleaning stages and exits as a clean dataset ready for an AI model. Raw messy data blanks, dupes 1. Load & audit 2. Fix gaps, dupes, types 3. Normalize text 4. Validate & export Clean data for AI model
Every dataset travels the same path: load and audit, fix structure, normalize text, then validate and export.

Who needs this and what it solves

If you are a marketer pulling campaign exports, a founder merging spreadsheets from two tools, a creator collecting audience feedback, or a student preparing a dataset for a class project, this guide is for you. The task is always the same: you have a file that a human can roughly read but a machine cannot trust, and you need to make it consistent enough that an AI model produces stable, repeatable results.

Skipping this step does not throw an obvious error. Instead it leaks bad data quietly. A model trained on duplicate rows over-weights those examples. An API call on a blank cell wastes tokens and returns empty summaries. A column that is secretly text instead of a number breaks every calculation downstream. Clean data prevents all of this before it starts.

The good news is that the vast majority of dirty data comes down to just four recurring problems: missing values, duplicate rows, wrong data types, and inconsistent text. Once you know how to spot and fix those four, you can clean almost any tabular file you will ever meet. The rest of this guide walks through each one in order, using a single small dataset so you can see exactly what changes at every step. Treat the workflow as a checklist you run top to bottom rather than a set of tricks to memorize.

Prerequisites

You need Python 3.10 or higher. Python 3.9 reached end-of-life in October 2025, so upgrade if you are still on it. If your machine is not set up yet, work through Setting Up Python for AI first to install Python and create a virtual environment.

Install the libraries you need. Pandas does the heavy lifting; the others handle Excel files, Parquet export, and reading credentials from a file:

pip install pandas openpyxl pyarrow python-dotenv

Most cleaning needs no API keys at all. But if your pipeline later pulls data from a service or sends records to an AI API, store the keys in a .env file (a plain text file holding secret values) rather than pasting them into your code:

OPENAI_API_KEY=sk-your-key-here
DATA_SOURCE_URL=https://example.com/export

Load those values at the top of your script with python-dotenv:

import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in your project folder
api_key = os.environ.get("OPENAI_API_KEY")

Important: add .env to your .gitignore file so you never commit secrets to version control. A single line — .env — in .gitignore is enough to keep your keys off GitHub.

To follow the steps with real numbers, save this small messy file as raw_feedback.csv. It deliberately contains every problem this guide fixes:

customer_id,name,signup_date,plan_price,feedback
101,  Ada Lovelace ,2026-01-04,29.99,"<p>Great tool!</p>   Saved me hours."
102,Grace Hopper,2026/01/05,29.99,"Confusing onboarding,  but support helped."
102,Grace Hopper,2026/01/05,29.99,"Confusing onboarding,  but support helped."
103,Alan Turing,,,"  loved   it!!! "
104,KATHERINE JOHNSON,2026-01-07,unknown,

Step 1: Load messy data with pandas

Start every cleaning job with an audit. You cannot fix problems you have not seen. Read the file into a DataFrame (pandas' name for a table of rows and columns) and run three quick checks: the shape and column types, a summary of the numbers, and a count of missing values per column.

import pandas as pd

# Load the raw file. encoding and on_bad_lines guard against common surprises.
df = pd.read_csv("raw_feedback.csv", encoding="utf-8", on_bad_lines="warn")

print("Shape (rows, columns):", df.shape)
print("\nColumn types:\n", df.dtypes)
print("\nMissing values per column:\n", df.isnull().sum())
print("\nDuplicate rows:", df.duplicated().sum())
print("\nFirst rows:\n", df.head())

Running this against the sample file tells you a lot. You will see five rows, one full duplicate, a missing signup_date and plan_price on row 103, a missing feedback on row 104, and a plan_price column that pandas read as text (object) instead of a number, because one cell says unknown. The dates are also inconsistent: some use dashes, some use slashes.

Write these findings down before you change anything. That note becomes your baseline. After cleaning, you will check the result against it to confirm you fixed what you meant to and nothing more. A baseline turns cleaning from guesswork into something you can verify: if you started with five rows and one duplicate, you should end with four unique rows, and you can prove it.

Two small habits make this audit far more useful. First, look at the actual values, not just the counts — df.head() and df.sample(5) often reveal a stray symbol or an unexpected format that a summary hides. Second, archive the raw file before you touch it. Copy raw_feedback.csv somewhere safe so you can always re-run cleaning from scratch when you discover a bug a week later. If you are working specifically with spreadsheet exports, the companion guide Cleaning CSV Data with Pandas for AI goes deeper on encoding quirks, delimiters, and multi-sheet Excel files.

Step 2: Handle missing values, duplicates, and types

Now fix the structure. Tackle the three problems in a sensible order: remove duplicates first so you are not imputing values into rows you will delete anyway, then handle missing values column by column, then force each column into the right type.

Remove duplicates. Exact duplicate rows almost always come from a bad export or a join gone wrong. Drop them and reset the row numbers:

df = df.drop_duplicates().reset_index(drop=True)

Handle missing values per column. There is no single rule. Decide column by column based on what the column means:

# Drop rows missing a critical identifier — these records are unusable.
df = df.dropna(subset=["customer_id"])

# Fill missing prices with the median of the known prices.
# First turn non-numeric text like "unknown" into a real missing value.
df["plan_price"] = pd.to_numeric(df["plan_price"], errors="coerce")
df["plan_price"] = df["plan_price"].fillna(df["plan_price"].median())

# Fill missing feedback with an explicit placeholder, not a blank.
df["feedback"] = df["feedback"].fillna("no feedback provided")

The errors="coerce" flag is the key trick here. It tells pandas to convert anything it cannot read as a number — like the word unknown — into NaN (pandas' marker for a missing value) instead of crashing. You then fill those gaps with the median, which is more robust to outliers than the average.

Fix data types. A column stored as the wrong type breaks everything downstream. Convert the ID to a clean integer string, the date to a real date, and the price to a number:

# IDs are identifiers, not math — keep them as strings without decimals.
df["customer_id"] = df["customer_id"].astype(int).astype(str)

# Parse mixed date formats into real dates; bad dates become NaT (missing).
df["signup_date"] = pd.to_datetime(df["signup_date"], format="mixed", errors="coerce")

The format="mixed" option lets pandas figure out each date individually, so 2026-01-04 and 2026/01/05 both parse correctly. After this step, run your audit again and confirm the duplicate is gone, no critical IDs are missing, and plan_price is now a real number.

Step 3: Normalize text for AI

Text is where AI projects quietly lose accuracy and money. The same name written as Ada Lovelace, KATHERINE JOHNSON, and Grace Hopper looks like three different styles to a model. Feedback wrapped in HTML tags like <p> wastes tokens and confuses prompts. Normalizing means putting text into one predictable shape.

Start with the simple column-wide cleanups pandas gives you for free:

# Strip leading/trailing spaces and standardize the name column.
df["name"] = df["name"].str.strip().str.title()

.str.strip() removes the padding spaces and .str.title() turns KATHERINE JOHNSON into Katherine Johnson so casing is consistent. For the free-text feedback you need more control, so write a small function and apply it to the whole column:

import re


def clean_text(value: str) -> str:
    """Return a normalized, model-friendly version of a text value."""
    if not isinstance(value, str):
        return ""
    text = re.sub(r"<[^>]+>", " ", value)          # remove HTML tags
    text = re.sub(r"[^\w\s.,!?'-]", " ", text)      # drop stray symbols
    text = re.sub(r"([.!?])\1+", r"\1", text)       # collapse "!!!" to "!"
    text = re.sub(r"\s+", " ", text).strip()        # collapse whitespace
    return text


df["feedback"] = df["feedback"].apply(clean_text)

Each line targets one problem. The first removes HTML so <p>Great tool!</p> becomes Great tool!. The second drops symbols that carry no meaning while keeping useful punctuation. The third collapses runs of repeated punctuation like !!! down to a single mark. The last squeezes every block of spaces, tabs, and newlines into a single space and trims the ends.

Notice what this function does not do: it does not lowercase the feedback or strip out all punctuation. For modern AI models, casing and punctuation often carry meaning — cheap and CHEAP! signal different emotions. Over-cleaning is a real risk. Clean enough to be consistent, not so much that you erase signal. Once your text is normalized this cleanly, it slots straight into prompts; see Understanding LLM APIs for how to format those records into requests and manage token limits.

Step 4: Export the clean dataset

Before you save, validate. A short set of assertions catches mistakes immediately instead of letting a quiet bug travel into your model:

assert df["customer_id"].is_unique, "Duplicate customer IDs remain"
assert df["plan_price"].notnull().all(), "Some prices are still missing"
assert df["plan_price"].between(0, 1000).all(), "A price is outside the expected range"
print("Validation passed:", len(df), "clean rows")

If every assertion passes, export. Choose the format that fits where the data is going. Never overwrite your raw file — keep the original so you can re-run cleaning if you find a bug later.

# CSV: human-readable, opens in any spreadsheet.
df.to_csv("clean_feedback.csv", index=False)

# Parquet: faster to reload and remembers data types exactly.
df.to_parquet("clean_feedback.parquet", index=False)

# JSON lines: one record per line, ideal for feeding an AI API row by row.
df.to_json("clean_feedback.jsonl", orient="records", lines=True)

CSV is best for sharing and quick inspection. Parquet is best when you will reload the data often, because it preserves the data types you worked so hard to fix. Newline-delimited JSON (.jsonl) is best when each row becomes one AI API call. Now your dataset is consistent, typed, and validated — ready for any model.

Parameter reference

The cleaning functions above all take optional arguments that change their behavior. These are the ones you will reach for most often.

ParameterTypeDefaultEffect
pd.read_csv(encoding=...)str"utf-8"Character encoding of the file. Use "latin-1" or "cp1252" for older Excel exports that fail to load.
pd.read_csv(on_bad_lines=...)str"error"What to do with malformed rows. "warn" skips and reports them; "skip" skips silently.
drop_duplicates(subset=...)listNone (all columns)Which columns define a duplicate. Pass ["customer_id"] to dedupe on ID alone.
dropna(subset=...)listNone (any column)Drop a row only if these specific columns are missing, instead of any column.
fillna(value=...)scalarrequiredThe value used to replace missing entries. Common picks: a median, a mode, or a placeholder string.
pd.to_numeric(errors=...)str"raise""coerce" turns unparseable text into NaN instead of crashing the script.
pd.to_datetime(format=...)strinferredSet "mixed" to parse rows with different date formats individually.
to_csv(index=...)boolTrueSet False to avoid writing pandas' row numbers as an extra unnamed column.

Troubleshooting

These are the errors you will most likely hit, with the exact message, the cause, and a one-line fix.

  1. UnicodeDecodeError: 'utf-8' codec can't decode byte 0x... — The file is not saved in UTF-8, common with exports from older Windows tools. Reload with the matching encoding: pd.read_csv("file.csv", encoding="latin-1").
  2. ValueError: Unable to parse string "unknown" at position 4 — You called pd.to_numeric() on a column containing non-numeric text. Add errors="coerce" so the bad value becomes NaN: pd.to_numeric(df["plan_price"], errors="coerce").
  3. KeyError: 'customer_id' — The column name does not match exactly, often because of a trailing space or different casing in the header. Run print(df.columns.tolist()) to see the real names, then strip them with df.columns = df.columns.str.strip().
  4. SettingWithCopyWarning: A value is trying to be set on a copy of a slice — You are editing a filtered slice of the DataFrame, not the original. Assign back to a full column (df["col"] = ...) or call .copy() when you create the subset.
  5. TypeError: 'NoneType' object is not subscriptable inside your text function — A cell holds a missing value, not a string, so string operations fail. Guard the function with if not isinstance(value, str): return "" as shown in Step 3.
  6. ValueError: time data '2026/01/05' doesn't match format — Your dates use more than one format. Replace a fixed format string with format="mixed" so pandas parses each value on its own.

Worked example: a complete cleaning script

This script ties every step together. Save it as clean_dataset.py and run it with python clean_dataset.py against the raw_feedback.csv sample. It loads, audits, fixes structure, normalizes text, validates, and exports in one pass — exactly the reusable pipeline you want for any future dataset.

import re
import pandas as pd

RAW_FILE = "raw_feedback.csv"
CLEAN_FILE = "clean_feedback.csv"


def clean_text(value: str) -> str:
    """Normalize a free-text field into a consistent, model-friendly form."""
    if not isinstance(value, str):
        return "no feedback provided"
    text = re.sub(r"<[^>]+>", " ", value)        # strip HTML tags
    text = re.sub(r"[^\w\s.,!?'-]", " ", text)    # drop stray symbols
    text = re.sub(r"([.!?])\1+", r"\1", text)     # collapse repeated punctuation
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace
    return text or "no feedback provided"


def clean_dataset(path: str) -> pd.DataFrame:
    df = pd.read_csv(path, encoding="utf-8", on_bad_lines="warn")
    print(f"Loaded {len(df)} rows, {df.duplicated().sum()} duplicates found")

    df = df.drop_duplicates().reset_index(drop=True)        # remove duplicates
    df = df.dropna(subset=["customer_id"])                  # require an ID

    df["plan_price"] = pd.to_numeric(df["plan_price"], errors="coerce")
    df["plan_price"] = df["plan_price"].fillna(df["plan_price"].median())

    df["customer_id"] = df["customer_id"].astype(int).astype(str)
    df["signup_date"] = pd.to_datetime(df["signup_date"], format="mixed", errors="coerce")
    df["name"] = df["name"].str.strip().str.title()
    df["feedback"] = df["feedback"].apply(clean_text)
    return df


if __name__ == "__main__":
    df = clean_dataset(RAW_FILE)

    assert df["customer_id"].is_unique, "Duplicate customer IDs remain"
    assert df["plan_price"].notnull().all(), "Missing prices remain"

    df.to_csv(CLEAN_FILE, index=False)
    print(f"Saved {len(df)} clean rows to {CLEAN_FILE}")

Next steps

You now have a repeatable way to turn any messy export into a model-ready dataset. To go further:

  1. Work through Cleaning CSV Data with Pandas for AI to handle trickier real-world CSV problems like wrong delimiters, multi-line cells, and broken encodings.
  2. Turn your cleaning script into a hands-off job by reading Automating Repetitive Tasks with Python, so new files get cleaned on a schedule without you touching them.
  3. Feed your clean records into a model by studying Understanding LLM APIs, where you will format the data into requests and manage token limits and costs.

Back to Data Cleaning for AI.