Practical Python Data Cleaning Functions

This article presents a collection of straightforward yet practical Python functions for data cleaning tasks—including dropping columns, changing data types, converting categorical variables, handling missing values, removing unwanted characters, trimming whitespace, conditional concatenation, and converting string timestamps—designed to streamline preprocessing in data analysis projects.

Python Programming Learning Circle

Data cleaning is often time‑consuming and tedious, yet it is essential for reliable analysis. This article gathers a set of simple, reusable Python functions that address common cleaning scenarios, allowing readers to apply them directly to their datasets.

1. Drop multiple columns

def drop_multiple_col(col_names_list, df):
    '''
    AIM    -> Drop multiple columns based on their column names

    INPUT  -> List of column names, df
    OUTPUT -> updated df with dropped columns
    ------
    '''
    df.drop(col_names_list, axis=1, inplace=True)
    return df

This function removes a list of specified columns from a DataFrame using df.drop.
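As a quick usage sketch, the function can be exercised on a small made-up DataFrame. The column names (`id`, `tmp_a`, `tmp_b`) are hypothetical and only serve the illustration. Because the function uses `inplace=True`, the DataFrame passed in is modified as well as returned.

```python
import pandas as pd

def drop_multiple_col(col_names_list, df):
    # Same behaviour as the function above: drop in place, then return df.
    df.drop(col_names_list, axis=1, inplace=True)
    return df

# Hypothetical sample data for illustration only.
df = pd.DataFrame({'id': [1, 2], 'tmp_a': [0, 0], 'tmp_b': [9, 9]})
df = drop_multiple_col(['tmp_a', 'tmp_b'], df)

remaining = list(df.columns)
print(remaining)
```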

2. Change data types to save memory

def change_dtypes(col_int, col_float, df):
    '''
    AIM    -> Changing dtypes to save memory

    INPUT  -> List of column names (int, float), df
    OUTPUT -> updated df with smaller memory
    ------
    '''
    df[col_int] = df[col_int].astype('int32')
    df[col_float] = df[col_float].astype('float32')
    return df

Converts integer and float columns to lower‑precision types, reducing memory usage.
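The saving can be checked with `DataFrame.memory_usage`. The sketch below assumes a tiny DataFrame with hypothetical columns `count` and `price` and compares the byte totals before and after the conversion. Note that `int32` cannot hold values outside roughly ±2.1 billion and `float32` keeps only about seven significant digits, so the conversion is only safe when the data fits those limits.

```python
import pandas as pd

def change_dtypes(col_int, col_float, df):
    # Downcast the listed columns and return the updated DataFrame.
    df[col_int] = df[col_int].astype('int32')
    df[col_float] = df[col_float].astype('float32')
    return df

# Hypothetical sample data, forced to 64-bit types as a starting point.
df = pd.DataFrame({'count': [1, 2, 3], 'price': [1.5, 2.5, 3.5]})
df = df.astype({'count': 'int64', 'price': 'float64'})

before = df.memory_usage(index=False).sum()  # bytes used by the columns
df = change_dtypes(['count'], ['price'], df)
after = df.memory_usage(index=False).sum()

print(before, after)
```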

3. Convert categorical variables to numeric

def convert_cat2num(df):
    # Convert categorical variable to numerical variable
    num_encode = {
        'col_1': {'YES': 1, 'NO': 0},
        'col_2': {'WON': 1, 'LOSE': 0, 'DRAW': 0}
    }
    df.replace(num_encode, inplace=True)

Maps specified categorical values to numbers, which is required by many machine‑learning models.
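One alternative worth sketching uses `Series.map` in place of `DataFrame.replace`. This is a different technique from the function above, not the article's own method. With `map`, any value missing from the mapping becomes NaN, so unexpected categories surface as missing data. The name `convert_cat2num_map` and the sample data are assumptions made for the example.

```python
import pandas as pd

# Hypothetical mapping and column names, mirroring the function above.
NUM_ENCODE = {
    'col_1': {'YES': 1, 'NO': 0},
    'col_2': {'WON': 1, 'LOSE': 0, 'DRAW': 0},
}

def convert_cat2num_map(df, num_encode=NUM_ENCODE):
    # Work on a copy so the caller's DataFrame is left untouched.
    out = df.copy()
    for col, mapping in num_encode.items():
        # Values not present in `mapping` become NaN.
        out[col] = out[col].map(mapping)
    return out

df = pd.DataFrame({
    'col_1': ['YES', 'NO', 'YES'],
    'col_2': ['WON', 'DRAW', 'LOSE'],
})
encoded = convert_cat2num_map(df)
print(encoded)
```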

4. Check missing data

def check_missing_data(df):
    # check for any missing data in the df (display in descending order)
    return df.isnull().sum().sort_values(ascending=False)

Returns a Series showing the count of missing values per column, sorted from most to least.
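A small usage sketch, assuming made-up columns `a`, `b`, and `c`, shows the shape of the result. The `missing_pct` line is an optional addition that is not part of the function above: it reports the share of missing values per column instead of the raw count.

```python
import pandas as pd

def check_missing_data(df):
    # Count missing values per column, largest count first.
    return df.isnull().sum().sort_values(ascending=False)

# Hypothetical sample data: 'b' has two gaps, 'a' has one, 'c' has none.
df = pd.DataFrame({
    'a': [1.0, None, 3.0],
    'b': [None, None, 'x'],
    'c': [1, 2, 3],
})

missing = check_missing_data(df)
# Optional extra: fraction of missing values per column.
missing_pct = df.isnull().mean().sort_values(ascending=False)

print(missing)
print(missing_pct)
```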

5. Remove unwanted characters from a column

def remove_col_str(df):
    # remove a portion of string in a dataframe column - col_1
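    df['col_1'] = df['col_1'].replace('\n', '', regex=True)

    # remove all the characters after &# (including &#) for column - col_1
    # assign the result back: calling replace(inplace=True) on a selected
    # column is not guaranteed to update the original DataFrame
    df['col_1'] = df['col_1'].replace(' &#.*', '', regex=True)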

Handles newline characters and trailing patterns such as "&#..." in string columns.
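The two substitutions can be tried on a few made-up strings. In this sketch the results are assigned back to the column instead of relying on `inplace=True`, and the function returns the DataFrame. The sample values are hypothetical.

```python
import pandas as pd

def remove_col_str(df):
    # Remove newline characters anywhere in the string.
    df['col_1'] = df['col_1'].replace('\n', '', regex=True)
    # Remove ' &#' and everything that follows it.
    df['col_1'] = df['col_1'].replace(' &#.*', '', regex=True)
    return df

# Hypothetical sample data for illustration only.
df = pd.DataFrame({'col_1': ['line one\nline two', 'price &#36;10', 'clean']})
df = remove_col_str(df)

cleaned = df['col_1'].tolist()
print(cleaned)
```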

6. Trim leading whitespace

def remove_col_white_space(df, col):
    # remove white space at the beginning of each string in column `col`
    df[col] = df[col].str.lstrip()
    return df

Strips leading spaces from the specified column.
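The sketch below assumes the function takes the column name as a second parameter (called `col` here) and runs it on a hypothetical `name` column. Only leading whitespace is removed, so trailing spaces remain. `str.strip()` would remove both ends.

```python
import pandas as pd

def remove_col_white_space(df, col):
    # Strip leading whitespace (spaces, tabs, newlines) from each string.
    df[col] = df[col].str.lstrip()
    return df

# Hypothetical sample data for illustration only.
df = pd.DataFrame({'name': ['  alice', '\tbob ', 'carol']})
df = remove_col_white_space(df, 'name')

names = df['name'].tolist()
print(names)  # the trailing space after 'bob' is kept
```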

7. Conditional concatenation of two string columns

def concat_col_str_condition(df):
    # concat 2 columns with strings if the last 3 letters of the first column are 'pil'
    mask = df['col_1'].str.endswith('pil', na=False)
    col_new = df.loc[mask, 'col_1'] + df.loc[mask, 'col_2']
    # replace every 'pil' in the combined string with a single space
    col_new = col_new.replace('pil', ' ', regex=True)
    return col_new

Combines the two columns for rows where the first ends with "pil", then replaces every "pil" in the combined string with a space.
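A minimal sketch of this behaviour, assuming the function returns the combined Series, is shown on made-up values. Only rows whose `col_1` ends with "pil" appear in the result, and missing values in `col_1` are skipped because of `na=False`.

```python
import pandas as pd

def concat_col_str_condition(df):
    # Select rows where col_1 ends with 'pil'; missing values count as no match.
    mask = df['col_1'].str.endswith('pil', na=False)
    col_new = df.loc[mask, 'col_1'] + df.loc[mask, 'col_2']
    # Replace every literal 'pil' in the combined string with a space.
    return col_new.str.replace('pil', ' ', regex=False)

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    'col_1': ['abcpil', 'xyz', None],
    'col_2': ['def', 'uvw', 'rst'],
})
combined = concat_col_str_condition(df)
print(combined)
```

The result keeps the original row labels, so it can be assigned back to those rows of the DataFrame if needed.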

8. Convert string timestamps to datetime objects

import pandas as pd

def convert_str_datetime(df):
    '''
    AIM    -> Convert datetime(String) to datetime(format we want)

    INPUT  -> df
    OUTPUT -> updated df with new datetime format
    ------
    '''
    # parse the string column 'transdate' and insert the result as the third column
    df.insert(loc=2, column='timestamp', value=pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f'))
    return df

Creates a new column with proper datetime values from a string‑formatted timestamp column.
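The following usage sketch assumes a DataFrame with a string column named `transdate`, as in the function above, plus two hypothetical columns `id` and `amount`. Because `loc=2` is used, the DataFrame needs at least two columns before the new one can be inserted at that position.

```python
import pandas as pd

def convert_str_datetime(df):
    # Parse 'transdate' strings and insert the result as the third column.
    parsed = pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f')
    df.insert(loc=2, column='timestamp', value=parsed)
    return df

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    'id': [1, 2],
    'amount': [10.0, 20.0],
    'transdate': ['2019-01-01 12:30:45.123456', '2019-01-02 08:00:00.000000'],
})
df = convert_str_datetime(df)

columns = list(df.columns)
first = df['timestamp'].iloc[0]
print(columns)
print(first)
```

Strings that do not match the format raise an error by default. Passing `errors='coerce'` to `pd.to_datetime` would turn them into `NaT` instead.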

These snippets form a lightweight toolbox that can be readily incorporated into any data‑analysis workflow, helping to accelerate preprocessing and improve data quality.
