Practical Python Data Cleaning Functions
This article presents a collection of straightforward yet practical Python functions for common data cleaning tasks: dropping columns, changing data types, converting categorical variables, checking for missing values, removing unwanted characters, trimming whitespace, conditional concatenation, and converting string timestamps. They are designed to streamline preprocessing in data analysis projects.
Data cleaning is often time‑consuming and tedious, yet it is essential for reliable analysis. This article gathers a set of simple, reusable Python functions that address common cleaning scenarios, allowing readers to apply them directly to their datasets.
1. Drop multiple columns
def drop_multiple_col(col_names_list, df):
    '''
    AIM    -> Drop multiple columns based on their column names
    INPUT  -> List of column names, df
    OUTPUT -> updated df with dropped columns
    ------
    '''
    # inplace=True modifies the DataFrame that was passed in
    df.drop(col_names_list, axis=1, inplace=True)
    return df
This function removes a list of specified columns from a DataFrame using df.drop. Because it passes inplace=True, the caller's DataFrame is modified directly and then returned.
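When the original DataFrame needs to stay intact, a non-mutating variant can be useful. The following is a minimal sketch, not the article's function: the name drop_columns and the sample data are hypothetical, and errors="ignore" is an added choice so that column names missing from the DataFrame are skipped rather than raising KeyError.

```python
import pandas as pd

def drop_columns(df, col_names_list):
    # Hypothetical helper: returns a new DataFrame and leaves df untouched.
    # errors="ignore" skips names that are not present in df.
    return df.drop(columns=col_names_list, errors="ignore")

# Illustrative data only
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
trimmed = drop_columns(df, ["b", "not_a_column"])

print(list(trimmed.columns))
print(list(df.columns))
```

The trade-off is memory: returning a new object keeps the input safe for reuse, while the in-place version avoids holding two DataFrames at once.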
2. Change data types to save memory
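def change_dtypes(col_int, col_float, df):
    '''
    AIM    -> Changing dtypes to save memory
    INPUT  -> List of column names (int, float), df
    OUTPUT -> updated df with smaller memory
    ------
    '''
    df[col_int] = df[col_int].astype('int32')
    df[col_float] = df[col_float].astype('float32')
    return df
Converts integer and float columns to 32-bit types, which halves their memory footprint when the columns start out as 64-bit. Check first that the values fit: int32 holds integers up to 2,147,483,647 in magnitude, and float32 keeps roughly seven significant decimal digits.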
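To see whether a conversion like this pays off, the memory footprint can be measured before and after. This is a small sketch under assumed data: the column names count and price are made up, and the pd.to_numeric call at the end is an alternative to the article's approach that lets pandas choose the smallest integer type that fits the values.

```python
import pandas as pd

# Hypothetical sample data; dtypes are set explicitly to 64-bit
df = pd.DataFrame({
    "count": pd.Series([1, 2, 3], dtype="int64"),
    "price": pd.Series([1.5, 2.5, 3.5], dtype="float64"),
})
before = int(df.memory_usage(deep=True).sum())

# Same conversion as in the function: 64-bit -> 32-bit
df["count"] = df["count"].astype("int32")
df["price"] = df["price"].astype("float32")
after = int(df.memory_usage(deep=True).sum())

# Alternative: let pandas pick the smallest integer dtype that holds the data
smallest = pd.to_numeric(pd.Series([1, 2, 3]), downcast="integer")

print(before, after, smallest.dtype)
```

On a three-row frame the saving is only a few bytes; the effect becomes noticeable on columns with millions of rows.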
3. Convert categorical variables to numeric
def convert_cat2num(df):
    # Convert categorical variable to numerical variable
    num_encode = {
        'col_1': {'YES': 1, 'NO': 0},
        'col_2': {'WON': 1, 'LOSE': 0, 'DRAW': 0}
    }
    df.replace(num_encode, inplace=True)
Maps the specified categorical values to numbers, which many machine-learning models require. The function modifies df in place and returns nothing.
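One thing worth checking with this kind of encoding is whether every category was covered by the mapping. The sketch below uses Series.map instead of the replace call in the function, purely as an illustration: map turns any value missing from the dictionary into NaN, which makes unexpected categories easy to spot. The sample values, including "MAYBE", are invented.

```python
import pandas as pd

# Hypothetical data with one category the mapping does not cover
df = pd.DataFrame({"col_1": ["YES", "NO", "YES", "MAYBE"]})
mapping = {"YES": 1, "NO": 0}

# Values absent from the mapping become NaN
encoded = df["col_1"].map(mapping)

# List the original labels that were not encoded
unmapped = df.loc[encoded.isna(), "col_1"].unique().tolist()

print(encoded.tolist())
print(unmapped)
```

With replace, an uncovered label such as "MAYBE" would be left in the column as a string, so the column would silently stay non-numeric.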
4. Check missing data
def check_missing_data(df):
    # check for any missing data in the df (display in descending order)
    return df.isnull().sum().sort_values(ascending=False)
Returns a Series showing the count of missing values per column, sorted from most to least.
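Raw counts are easier to judge next to the share of rows they represent. As a small extension, assuming made-up sample data, the same isnull().sum() result can be combined with a percentage column:

```python
import numpy as np
import pandas as pd

# Illustrative data only
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1, 2, 3, 4],
    "c": ["x", None, "z", "w"],
})

missing_count = df.isnull().sum()
summary = pd.DataFrame({
    "missing": missing_count,
    "percent": missing_count / len(df) * 100,  # share of rows that are missing
}).sort_values("missing", ascending=False)

print(summary)
```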
5. Remove unwanted characters from a column
def remove_col_str(df):
    # remove newline characters from column - col_1
    df['col_1'] = df['col_1'].replace('\n', '', regex=True)
    # remove all the characters after &# (including &#) for column - col_1
    df['col_1'] = df['col_1'].replace(' &#.*', '', regex=True)
Removes newline characters and everything from " &#" onward in a string column. The result is assigned back to the column rather than using inplace=True on a column selection, which recent pandas versions warn will not update the DataFrame.
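The effect of the two patterns is easiest to see on a few sample strings. This sketch uses the .str.replace accessor, which is a different call from the Series.replace used in the function but applies the same two substitutions; the sample strings are invented.

```python
import pandas as pd

# Illustrative strings: one with a newline, one with an HTML entity tail
s = pd.Series(["line one\nline two", "Tom &#38; Jerry", "clean"])

# Literal replacement of the newline character (no regex needed)
no_newlines = s.str.replace("\n", "", regex=False)

# Regex: a space, then "&#", then everything up to the end of the string
cleaned = no_newlines.str.replace(r" &#.*", "", regex=True)

print(cleaned.tolist())
```

Note that the pattern starts with a space, so an "&#" that directly follows a word with no space before it is left alone.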
6. Trim leading whitespace
def remove_col_white_space(df, col):
    # remove white space at the beginning of string
    df[col] = df[col].str.lstrip()
Strips leading whitespace from the column named by col.
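For comparison, here is a short sketch of how lstrip differs from strip on the same data. The column name and values are hypothetical.

```python
import pandas as pd

# Illustrative data with leading and trailing spaces
df = pd.DataFrame({"name": ["  alice", "bob  ", "  carol  "]})

left_only = df["name"].str.lstrip()  # removes leading whitespace only
both_sides = df["name"].str.strip()  # removes leading and trailing whitespace

print(left_only.tolist())
print(both_sides.tolist())
```

If trailing spaces also cause trouble, for example when joining on a key column, strip is usually the safer choice.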
7. Conditional concatenation of two string columns
def concat_col_str_condition(df):
    # concat 2 columns with strings if the last 3 letters of the first column are 'pil'
    mask = df['col_1'].str.endswith('pil', na=False)
    col_new = df[mask]['col_1'] + df[mask]['col_2']
    # replace every 'pil' in the combined string with a single space
    col_new = col_new.replace('pil', ' ', regex=True)
    return col_new
Combines the two columns for the rows where the first ends with "pil", then replaces each occurrence of "pil" in the combined string with a space. The result is a Series that contains only the matching rows.
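A worked example makes the row selection and the replacement easier to follow. The sketch below repeats the same steps on invented data, using .loc for the selection and .str.replace for the substitution; both are stylistic substitutions rather than the article's exact calls.

```python
import pandas as pd

# Illustrative data; None shows how na=False treats missing values
df = pd.DataFrame({
    "col_1": ["lapil", "beta", None, "pupil"],
    "col_2": ["X", "Y", "W", "Z"],
})

# True only where col_1 ends with 'pil'; missing values count as False
mask = df["col_1"].str.endswith("pil", na=False)

# Concatenate the two columns for the matching rows only
combined = df.loc[mask, "col_1"] + df.loc[mask, "col_2"]

# Replace each literal 'pil' with a single space
combined = combined.str.replace("pil", " ", regex=False)

print(combined.to_dict())
```

The original row labels are preserved, so the result can be written back with df.loc[mask, ...] if a new column is wanted.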
8. Convert string timestamps to datetime objects
import pandas as pd
def convert_str_datetime(df):
    '''
    AIM    -> Convert datetime(String) to datetime(format we want)
    INPUT  -> df
    OUTPUT -> updated df with new datetime format
    ------
    '''
    df.insert(loc=2, column='timestamp', value=pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f'))
Creates a new timestamp column with proper datetime values parsed from the string column transdate. The column is inserted at position 2, so the DataFrame needs at least two existing columns, and every string must match the given format.
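Real data often contains a few strings that do not match the expected format. As a hedged sketch with invented values, passing errors="coerce" to pd.to_datetime turns such entries into NaT instead of raising an exception; this is an optional addition, not part of the function above.

```python
import pandas as pd

# Illustrative data: one valid timestamp string and one malformed entry
df = pd.DataFrame({"transdate": ["2021-03-04 10:15:30.250000", "not a date"]})

parsed = pd.to_datetime(
    df["transdate"],
    format="%Y-%m-%d %H:%M:%S.%f",
    errors="coerce",  # unparseable strings become NaT
)

print(parsed)
print(int(parsed.isna().sum()), "value(s) could not be parsed")
```

Counting the NaT values afterwards gives a quick check on how many rows need attention before the column is used for time-based analysis.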
These snippets form a lightweight toolbox that can be readily incorporated into any data‑analysis workflow, helping to accelerate preprocessing and improve data quality.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
