Data Cleaning Techniques in Python: 21 Practical Examples and Code
This tutorial explains data cleaning concepts and key quality dimensions, then demonstrates 21 practical Python examples, including regex phone cleaning, temperature conversion, missing-value detection, visualization with missingno, and record linkage using fuzzy matching, with clear code snippets and step-by-step guidance for reliable data analysis.
Data cleaning is the process of identifying and correcting errors and inconsistencies in datasets to enable reliable analysis, giving analysts clearer insight into business operations and improving organizational efficiency.
Key quality dimensions include accuracy, completeness, consistency, integrity, timeliness, uniformity, and validity, each measuring different aspects of data cleanliness.
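Several of these dimensions can be measured directly with pandas. The sketch below is illustrative only: the table, column names, and thresholds are hypothetical, and it checks just three of the dimensions (completeness, integrity via uniqueness, and validity).

```python
import pandas as pd

# Hypothetical customer table used to illustrate simple quality checks
customers = pd.DataFrame({
    'customer_id': [1, 2, 2, 4],
    'email': ['a@x.com', None, 'b@x.com', 'not-an-email'],
})

# Completeness: share of non-missing values per column
completeness = customers.notna().mean()

# Integrity (uniqueness): identifiers that appear more than once
duplicate_ids = customers['customer_id'].duplicated().sum()

# Validity: values must match an expected pattern
valid_email = customers['email'].str.contains(r'^[^@\s]+@[^@\s]+\.[^@\s]+$', na=False)

print(completeness)            # customer_id 1.0, email 0.75
print(duplicate_ids)           # 1 duplicated id
print(valid_email.tolist())    # [True, False, True, False]
```

Profiling a dataset against these dimensions first helps decide which of the cleaning steps below are actually needed.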
Example 11 – Phone number cleaning with regex: Using pandas string replacement to strip non‑digit characters from a phone column.
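The effect of the replacement is easiest to see on made-up numbers in mixed formats; this self-contained sketch (sample values hypothetical) previews what the snippet below does:

```python
import pandas as pd

# Hypothetical sample numbers in mixed formats
phones = pd.DataFrame({'Phone number': ['+1 (415) 555-0137',
                                        '415.555.0199',
                                        '415 555 0172']})

# \D+ matches runs of non-digit characters; regex=True is required in recent pandas
cleaned = phones['Phone number'].str.replace(r'\D+', '', regex=True)
print(cleaned.tolist())  # ['14155550137', '4155550199', '4155550172']
```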
```python
# Replace non-digit characters with nothing (regex=True is required in recent pandas)
phones['Phone number'] = phones['Phone number'].str.replace(r'\D+', '', regex=True)
phones.head()
```
Example 12 – Temperature conversion: Load a temperature CSV, visualize with matplotlib, and convert Fahrenheit values above 40 °F to Celsius.
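The loading step might look like the following; the CSV contents here are a stand-in (an in-memory buffer rather than a file, so the sketch runs as-is), and `parse_dates` turns the `Date` column into datetimes:

```python
import io
import pandas as pd

# Stand-in for the tutorial's temperature CSV (values hypothetical)
csv_data = io.StringIO(
    "Date,Temperature\n"
    "2019-03-01,14.0\n"
    "2019-03-02,68.0\n"
    "2019-03-03,16.5\n"
)
temperatures = pd.read_csv(csv_data, parse_dates=['Date'])
print(temperatures.dtypes)  # Date is datetime64, Temperature is float64
```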
```python
# Import matplotlib
import matplotlib.pyplot as plt

# Create scatter plot
plt.scatter(x='Date', y='Temperature', data=temperatures)

# Create title, xlabel and ylabel
plt.title('Temperature in Celsius March 2019 - NYC')
plt.xlabel('Dates')
plt.ylabel('Temperature in Celsius')

# Show plot
plt.show()
```
Conversion logic:
```python
# Select Fahrenheit readings (above 40) and convert them to Celsius
temp_fah = temperatures.loc[temperatures['Temperature'] > 40, 'Temperature']
temp_cels = (temp_fah - 32) * (5 / 9)
temperatures.loc[temperatures['Temperature'] > 40, 'Temperature'] = temp_cels

# Assert conversion is correct
assert temperatures['Temperature'].max() < 40
```
Missing data handling: Detect missing values with `isna()`, summarize with `isna().sum()`, and visualize patterns using the missingno library.
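These calls can be exercised on a toy frame; the column names and readings below are hypothetical, with gaps inserted deliberately:

```python
import numpy as np
import pandas as pd

# Toy air-quality readings with deliberate gaps
airquality = pd.DataFrame({
    'CO2': [350.0, np.nan, 410.0, np.nan, 380.0],
    'NO2': [21.0, 18.5, np.nan, 19.2, 20.1],
})
print(airquality.isna().sum())  # CO2: 2 missing, NO2: 1 missing
```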
```python
# Return a Boolean mask of missing values
airquality.isna()

# Get a per-column summary of missingness
airquality.isna().sum()

# Visualize missingness
import missingno as msno
import matplotlib.pyplot as plt

msno.matrix(airquality)
plt.show()
```
Simple strategies for handling missing data include dropping rows (`dropna()`), imputing with statistical values (`fillna()`), or using advanced algorithms and machine-learning models.
```python
# Drop rows with missing CO2
airquality_dropped = airquality.dropna(subset=['CO2'])

# Impute missing CO2 with the column mean
co2_mean = airquality['CO2'].mean()
airquality_imputed = airquality.fillna({'CO2': co2_mean})
```
Record linkage and fuzzy matching: Use string similarity (Levenshtein distance) via the thefuzz package and the recordlinkage library to deduplicate and merge datasets without a common unique identifier.
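The similarity scores used below rest on Levenshtein (edit) distance: the minimum number of single-character insertions, deletions, and substitutions turning one string into another. A minimal pure-Python sketch of the metric, for intuition only (thefuzz's own scoring adds normalization on top):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

print(levenshtein('Reeding', 'Reading'))  # 1: one substitution apart
```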
```python
# Compare two strings
from thefuzz import fuzz
fuzz.WRatio('Reeding', 'Reading')

# Record linkage example
import recordlinkage

# Generate candidate pairs, blocking on state to limit comparisons
indexer = recordlinkage.Index()
indexer.block('state')
pairs = indexer.index(census_A, census_B)

# Score each candidate pair on exact and fuzzy column matches
compare_cl = recordlinkage.Compare()
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('state', 'state', label='state')
compare_cl.string('surname', 'surname', threshold=0.85, label='surname')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
potential_matches = compare_cl.compute(pairs, census_A, census_B)
```
After scoring potential matches, filter by a threshold, extract duplicate indices, and append non-duplicate rows to create a unified dataset.
```python
# Identify duplicates: pairs matching on at least three criteria
matches = potential_matches[potential_matches.sum(axis=1) >= 3]
duplicate_rows = matches.index.get_level_values(1)

# Separate new rows
census_B_new = census_B[~census_B.index.isin(duplicate_rows)]

# Merge datasets (DataFrame.append was removed in pandas 2.0; use pd.concat)
full_census = pd.concat([census_A, census_B_new])
```
These examples illustrate a comprehensive approach to data cleaning, from basic preprocessing to advanced record linkage, enabling analysts to prepare high-quality data for accurate insights.
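As a closing check, the final merge step can be run end to end on toy frames. All names and indices here are hypothetical, and the duplicate index is assumed to have been flagged by the linkage step above; note again that `DataFrame.append` was removed in pandas 2.0, so `pd.concat` is the replacement:

```python
import pandas as pd

# Hypothetical mini-censuses that share no common unique identifier
census_A = pd.DataFrame({'surname': ['smith', 'jones']}, index=['a1', 'a2'])
census_B = pd.DataFrame({'surname': ['smyth', 'brown']}, index=['b1', 'b2'])

# Suppose record linkage flagged b1 as a duplicate of a record in census_A
duplicate_rows = pd.Index(['b1'])
census_B_new = census_B[~census_B.index.isin(duplicate_rows)]

# Merge: keep all of census_A plus the genuinely new rows of census_B
full_census = pd.concat([census_A, census_B_new])
print(full_census['surname'].tolist())  # ['smith', 'jones', 'brown']
```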