Data Cleaning Techniques in Python: 21 Practical Examples and Code
This article provides a comprehensive guide to data cleaning in Python, covering common data issues, methods for handling missing values, duplicates, categorical inconsistencies, and text normalization, illustrated with 21 detailed code examples using pandas and matplotlib.
Data cleaning is the process of identifying and correcting errors and inconsistencies in datasets to enable reliable analysis and reporting.
Key characteristics of clean data include accuracy, completeness, consistency, integrity, timeliness, uniformity, and validity.
The article explores four broad topics with practical Python examples:
Common data problems such as type mismatches, range violations, and duplicate records.
Text and categorical data issues, including inconsistent categories and relationship constraints.
Advanced data problems like cross‑domain validation and overall integrity.
Record linking techniques for merging and reconciling datasets.
Example 1 – Type conversion
<code>import pandas as pd

# Load the CSV file and preview the first two rows
sales = pd.read_csv('sales.csv')
sales.head(2)
# Inspect the data type of each column
sales.dtypes</code>
When a numeric column is stored as a string (e.g., '$1000'), the article shows how to strip the dollar sign and convert the column to int before aggregation.
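As a quick illustration of why the conversion matters, the sketch below uses a small hypothetical frame (the three revenue values are invented for the example) and checks whether pandas treats the column as numeric before and after the conversion.

```python
import pandas as pd

# Hypothetical data: revenue stored as text with a leading dollar sign
sales = pd.DataFrame({"Revenue": ["$1000", "$250", "$75"]})

# A text column is not numeric, so numeric aggregation is not meaningful yet
numeric_before = pd.api.types.is_numeric_dtype(sales["Revenue"])

# Strip the symbol and convert to integers
sales["Revenue"] = sales["Revenue"].str.strip("$").astype("int")
numeric_after = pd.api.types.is_numeric_dtype(sales["Revenue"])

# The sum is now an arithmetic total
total = sales["Revenue"].sum()
print(numeric_before, numeric_after, total)
```

If the column can contain values that are not valid numbers, `astype('int')` raises an error, which is often the desired signal that more cleaning is needed.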
<code># Remove the leading '$', then cast the column to integers
sales['Revenue'] = sales['Revenue'].str.strip('$')
sales['Revenue'] = sales['Revenue'].astype('int')</code>
Example 2 – Categorical encoding errors
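<code># 0 = Never married 1 = Married 2 = Separated 3 = Divorced</code>
When marital status is stored as integer codes, describe() reports a mean and standard deviation that have no meaning. The article demonstrates converting the column to a categorical type and re‑running the descriptive statistics.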
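A minimal sketch of the difference, assuming a hypothetical six-row survey that uses the integer codes listed above:

```python
import pandas as pd

# Hypothetical survey answers encoded as integer codes 0-3
df = pd.DataFrame({"marriage_status": [0, 1, 1, 3, 2, 1]})

# As integers, describe() reports mean, std and quartiles,
# none of which are meaningful for category codes
numeric_summary = df["marriage_status"].describe()

# As a categorical column, describe() reports count, unique, top and freq
df["marriage_status"] = df["marriage_status"].astype("category")
categorical_summary = df["marriage_status"].describe()

print(numeric_summary.index.tolist())
print(categorical_summary.index.tolist())
```

The categorical summary answers the questions that make sense for this column: how many distinct statuses there are and which one is most common.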
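<code>df["marriage_status"] = df["marriage_status"].astype('category')
# Describe the column directly: DataFrame.describe() skips categorical
# columns by default when numeric columns are present
df["marriage_status"].describe()</code>
Example 3 – Out‑of‑range values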
<code># Show movies with a rating above 5
movies[movies['avg_rating'] > 5]</code>
Out‑of‑range values are either filtered out with boolean indexing or capped with .loc assignment.
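Both options can be sketched on a small hypothetical frame. The column `movie_name`, the sample ratings, and the upper limit of 5 are assumptions made for the example, and `Series.clip` is used here as an alternative way to cap values, not as the article's method.

```python
import pandas as pd

# Hypothetical ratings with two entries above the assumed maximum of 5
movies = pd.DataFrame({
    "movie_name": ["A", "B", "C", "D"],
    "avg_rating": [4.5, 6.0, 3.2, 7.1],
})

# Option 1: drop the offending rows with a boolean mask
filtered = movies[movies["avg_rating"] <= 5]

# Option 2: cap values at the upper limit, leaving the original frame intact
capped = movies.assign(avg_rating=movies["avg_rating"].clip(upper=5))

print(filtered)
print(capped)
```

Dropping rows is safer when the out-of-range values are likely data-entry errors; capping keeps the row count stable when only the magnitude is suspect.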
<code># Cap any avg_rating above 5 at 5
movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5</code>
Example 4 – Future dates
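<code>import pandas as pd

# Parse string dates first, then keep only sign-ups dated before today
user_signups['subscription_date'] = pd.to_datetime(user_signups['subscription_date'])
today_date = pd.Timestamp.today().normalize()
user_signups = user_signups[user_signups['subscription_date'] < today_date]</code>
Dates stored as strings are converted to datetime objects before filtering, because a string column cannot be compared reliably with a date.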
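A self-contained sketch of this step follows. The `user_id` column and the sample dates are hypothetical, and the future date is generated relative to the current day so the example behaves the same whenever it is run.

```python
import pandas as pd

# A date 30 days from now, formatted as a string like the other entries
future = (pd.Timestamp.today() + pd.Timedelta(days=30)).strftime("%Y-%m-%d")

# Hypothetical sign-up data with dates stored as strings
user_signups = pd.DataFrame({
    "user_id": [1, 2, 3],
    "subscription_date": ["2020-01-15", "2021-06-30", future],
})

# Parse the strings into datetime values
user_signups["subscription_date"] = pd.to_datetime(user_signups["subscription_date"])

# Midnight at the start of the current day
today = pd.Timestamp.today().normalize()

# Keep rows dated before today; the future-dated row is dropped
past_signups = user_signups[user_signups["subscription_date"] < today]
print(past_signups)
```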
Duplicate detection
<code># Flag rows that repeat an earlier row across all columns
duplicates = height_weight.duplicated()
print(duplicates)</code>
The article explains how to use duplicated() with the subset and keep parameters, and how to remove duplicates with drop_duplicates().
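The effect of `subset` and `keep` is easiest to see on a small example. The frame below is hypothetical: its first two rows are exact duplicates, and the third repeats the same person with a different weight.

```python
import pandas as pd

# Hypothetical records for illustration
height_weight = pd.DataFrame({
    "first_name": ["Ann", "Ann", "Ann", "Bob"],
    "last_name": ["Lee", "Lee", "Lee", "Ray"],
    "height": [170, 170, 170, 182],
    "weight": [60, 60, 62, 80],
})

# Default: compare all columns and flag every repeat after the first occurrence
exact = height_weight.duplicated()

# subset limits the comparison to the named columns;
# keep=False flags every member of a duplicate group
by_name = height_weight.duplicated(subset=["first_name", "last_name"], keep=False)

# keep="last" retains the final row of each group when dropping
deduped = height_weight.drop_duplicates(subset=["first_name", "last_name"], keep="last")

print(exact.tolist())
print(by_name.tolist())
print(deduped)
```

Using `keep=False` is handy for inspection, since it shows every conflicting row side by side before deciding which one to keep.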
<code>height_weight.drop_duplicates(inplace=True)</code>
Category consistency
<code># Identify values that do not appear in the reference categories
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
print(inconsistent_categories)</code>
Inconsistent rows are isolated and then either corrected or removed using boolean indexing.
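A minimal end-to-end sketch of this check, assuming a hypothetical reference table of valid blood types and a study table with one invalid entry (the names and the value "Z+" are invented for the example):

```python
import pandas as pd

# Hypothetical reference table of valid categories
categories = pd.DataFrame({
    "blood_type": ["A+", "A-", "B+", "B-", "O+", "O-", "AB+", "AB-"],
})

# Hypothetical study data containing one invalid value
study_data = pd.DataFrame({
    "name": ["Beth", "Ignatius", "Paul", "Helen"],
    "blood_type": ["B-", "A-", "O+", "Z+"],
})

# Values present in the data but absent from the reference table
inconsistent_categories = set(study_data["blood_type"]).difference(categories["blood_type"])

# Boolean mask marking the rows that hold those values
inconsistent_rows = study_data["blood_type"].isin(inconsistent_categories)

# Isolate the problem rows for review, and keep the valid ones
inconsistent_data = study_data[inconsistent_rows]
consistent_data = study_data[~inconsistent_rows]

print(inconsistent_categories)
print(inconsistent_data)
```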
<code>inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
consistent_data = study_data[~inconsistent_rows]</code>
Text data cleaning
<code># Standardize case
demographics['marriage_status'] = demographics['marriage_status'].str.upper()
# Strip leading and trailing whitespace (assign back to the column, not the frame)
demographics['marriage_status'] = demographics['marriage_status'].str.strip()</code>
Phone numbers are normalized by replacing '+' with '00', removing dashes, and setting short entries to NaN.
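These three steps can be sketched on hypothetical sample numbers as follows. Passing `regex=False` is a deliberate choice here: it makes pandas treat `+` as a literal character, whereas interpreting it as a regular expression would fail because `+` is a quantifier.

```python
import numpy as np
import pandas as pd

# Hypothetical phone numbers in mixed formats
phones = pd.DataFrame({
    "Phone number": ["+1-202-555-0143", "001-202-555-0178", "4138"],
})

# Replace the international prefix '+' with '00' as a literal substitution
phones["Phone number"] = phones["Phone number"].str.replace("+", "00", regex=False)

# Remove dashes
phones["Phone number"] = phones["Phone number"].str.replace("-", "", regex=False)

# Entries shorter than 10 characters are treated as invalid and set to NaN
too_short = phones["Phone number"].str.len() < 10
phones.loc[too_short, "Phone number"] = np.nan

print(phones)
```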
<code>import numpy as np

# Replace '+' with '00' (regex=False treats '+' as a literal character)
phones["Phone number"] = phones["Phone number"].str.replace("+", "00", regex=False)
# Remove dashes
phones["Phone number"] = phones["Phone number"].str.replace("-", "", regex=False)
# Set numbers with fewer than 10 characters to NaN
phones.loc[phones['Phone number'].str.len() < 10, "Phone number"] = np.nan</code>
Overall, the guide provides a step‑by‑step workflow for cleaning raw data and preparing it for reliable analysis.
Python Programming Learning Circle