Data Cleaning and Preprocessing for HR Attrition Dataset Using Pandas
This tutorial demonstrates how to download, read, explore, visualize, and preprocess the HR attrition dataset with pandas, covering tasks such as duplicate removal, missing‑value handling, categorical encoding, normalization, and conditional column updates to prepare the data for machine‑learning modeling.
The article uses a Kaggle HR attrition CSV file to illustrate common data‑cleaning steps before applying machine‑learning models.
First, the dataset is downloaded from a GitHub link and loaded into a pandas DataFrame: after import numpy as np and import pandas as pd, the file is read with hr = pd.read_csv("HR.csv"). The article emphasizes checking whether the first column is a row index and whether the first row is a header containing feature names, since both determine how read_csv should be called.
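The index and header checks can be sketched as follows. This is a minimal example, not the article's own code: it reads a hypothetical three-row stand-in for HR.csv from an in-memory string so that it runs without the file, and every column name other than sales, salary, and Work_accident is an assumption.

```python
import io

import pandas as pd

# Hypothetical stand-in for HR.csv; the real file has more rows and columns.
csv_text = """satisfaction_level,number_project,Work_accident,left,sales,salary
0.38,2,0,1,sales,low
0.80,5,0,1,sales,medium
0.11,7,1,0,technical,high
"""

# header=0 treats the first row as column names (this is the default).
# index_col=None keeps every column as data; pass index_col=0 instead
# when the first column of the file is a row index rather than a feature.
hr = pd.read_csv(io.StringIO(csv_text), header=0, index_col=None)

print(hr.columns.tolist())  # feature names taken from the header row
print(hr.index.tolist())    # default integer index generated by pandas
```

If the file had no header row, header=None together with an explicit names=[...] list would be the usual alternative.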
Basic exploratory commands such as hr.shape, hr.head(), hr.info(), and hr.describe() are shown to understand the size, sample rows, data types, and summary statistics.
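A short sketch of these exploratory calls on a small hypothetical frame (the values and the number_project and satisfaction_level column names are illustrative assumptions, not the real data):

```python
import pandas as pd

# Hypothetical four-row sample standing in for the HR data.
hr = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72],
    "number_project": [2, 5, 7, 2],
    "salary": ["low", "medium", "high", "low"],
})

n_rows, n_cols = hr.shape   # (rows, columns) as a tuple
print(n_rows, n_cols)
print(hr.head(2))           # first two rows; head() defaults to five
hr.info()                   # dtypes and non-null counts, printed to stdout
summary = hr.describe()     # count, mean, std, min, quartiles, max
print(summary)
```

By default describe() summarizes only the numeric columns, so salary does not appear in the result; hr.describe(include="all") would add it.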
Simple visualizations are produced with hr.hist(), which draws one histogram per numeric column; the grid parameter toggles grid lines and figsize sets the figure size in inches.
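A minimal sketch of the histogram call, assuming matplotlib is installed. The data and column names are hypothetical, and the Agg backend is selected only so that the script also runs on machines without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook

import pandas as pd

# Hypothetical numeric sample standing in for the HR data.
hr = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.37, 0.41],
    "number_project": [2, 5, 7, 5, 2, 2],
})

# grid=False hides the grid lines; figsize is (width, height) in inches.
axes = hr.hist(grid=False, figsize=(8, 4), bins=5)

# hist() returns an array of matplotlib Axes, one per plotted column.
for ax in axes.ravel():
    print(ax.get_title())
```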
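Duplicate rows are identified with hr.duplicated(keep="last") and removed via hr = hr.drop_duplicates(). The keep argument chooses which occurrence counts as the original ("first", "last", or False to flag every copy), and subset restricts the comparison to the listed columns.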
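The effect of keep and subset is easiest to see on a tiny frame. The sketch below uses hypothetical values in which rows 0 and 2 are exact copies:

```python
import pandas as pd

# Hypothetical sample: rows 0 and 2 are identical.
hr = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.38, 0.11],
    "salary": ["low", "medium", "low", "low"],
})

# keep="last" flags every repeated row except its final occurrence.
mask_last = hr.duplicated(keep="last")

# The default keep="first" flags every repeated row except the first one.
mask_first = hr.duplicated()

# subset compares only the listed columns, so rows sharing a salary match.
mask_salary = hr.duplicated(subset=["salary"])

# drop_duplicates() keeps the first occurrence by default.
deduped = hr.drop_duplicates().reset_index(drop=True)

print(mask_last.tolist())
print(mask_first.tolist())
print(mask_salary.tolist())
print(len(deduped))
```

Resetting the index after dropping rows is optional; it only avoids gaps in the row labels.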
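Missing‑value strategies are presented (mean, median, zero filling) using hr.fillna(hr.mean()), hr.fillna(hr.median()), and hr.fillna(0), though the example dataset has no missing entries. In pandas 2.0 and later, hr.mean() and hr.median() raise a TypeError when the frame still contains text columns such as salary, so pass numeric_only=True in that case.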
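Because the example dataset has no gaps, the three strategies are sketched here on a hypothetical frame with missing values inserted by hand. The numeric_only=True argument is an addition to the calls quoted above so that the text column does not break the aggregation.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing numeric entries.
hr = pd.DataFrame({
    "satisfaction_level": [0.2, np.nan, 0.4, 0.9],
    "number_project": [2.0, 4.0, np.nan, 6.0],
    "salary": ["low", "medium", "low", "high"],
})

print(hr.isna().sum())  # missing-value count per column

# Each numeric column is filled with its own mean or median;
# fillna aligns the resulting Series on column names.
mean_filled = hr.fillna(hr.mean(numeric_only=True))
median_filled = hr.fillna(hr.median(numeric_only=True))

# Zero filling replaces every missing entry with 0.
zero_filled = hr.fillna(0)
```

Median filling is less sensitive to outliers than mean filling; zero filling is only sensible when 0 is a meaningful value for the column.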
Categorical features (e.g., "sales" and "salary") are one‑hot encoded with categorical_features = ['sales', 'salary'], hr_cat = pd.get_dummies(hr[categorical_features]), followed by dropping the original columns and concatenating the new dummy variables.
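The encode, drop, and concatenate sequence can be sketched on a hypothetical three-row frame as follows (satisfaction_level and all values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical sample with the two categorical columns named in the text.
hr = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "sales": ["sales", "technical", "sales"],
    "salary": ["low", "medium", "high"],
})

categorical_features = ["sales", "salary"]

# One indicator column per category, named <column>_<category>.
hr_cat = pd.get_dummies(hr[categorical_features])

# Drop the original text columns and append the indicator columns.
hr = pd.concat([hr.drop(columns=categorical_features), hr_cat], axis=1)

print(hr.columns.tolist())
```

For linear models, passing drop_first=True to get_dummies removes one redundant indicator per feature and avoids perfectly collinear columns.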
Normalization is applied using scikit‑learn: Z‑score standardization with from sklearn.preprocessing import StandardScaler and ss = StandardScaler() on selected numeric columns, and Min‑Max scaling with from sklearn.preprocessing import MinMaxScaler and ss = MinMaxScaler().
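A minimal sketch of both scalers on hypothetical numeric columns. The column names and the separate variable name mm for the Min‑Max scaler are assumptions made here for readability; the text above reuses ss for both.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric sample.
hr = pd.DataFrame({
    "number_project": [2.0, 4.0, 6.0],
    "monthly_hours": [150.0, 200.0, 250.0],
})
numeric_cols = ["number_project", "monthly_hours"]

# Z-score standardization: each column gets mean 0 and unit variance
# (StandardScaler uses the population standard deviation, ddof=0).
ss = StandardScaler()
hr_std = hr.copy()
hr_std[numeric_cols] = ss.fit_transform(hr[numeric_cols])

# Min-Max scaling: each column is mapped linearly onto [0, 1].
mm = MinMaxScaler()
hr_mm = hr.copy()
hr_mm[numeric_cols] = mm.fit_transform(hr[numeric_cols])

print(hr_std)
print(hr_mm)
```

When the data is later split into training and test sets, the scaler is normally fitted on the training portion only and then applied to the test portion with transform().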
Functions are provided to drop columns or rows whose non‑missing data proportion falls below a cutoff, illustrated by def drop_col(df, cutoff=0.4): ... and def drop_row(df, cutoff=0.8): ....
Finally, a conditional column update example shows how to set Work_accident values to True or False based on a threshold:
hr['Work_accident'] = hr['Work_accident'].apply(lambda x: False if x < 2 else True)
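The same update can be sketched on hypothetical values, alongside a vectorized comparison that gives the same result for non‑missing data without calling a Python function per row. The sample values 0 to 3 are an assumption chosen to show both sides of the threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical values on both sides of the threshold of 2.
hr = pd.DataFrame({"Work_accident": [0, 1, 2, 3]})

# Row-by-row version, as quoted in the text.
via_apply = hr["Work_accident"].apply(lambda x: False if x < 2 else True)

# Vectorized equivalent: the comparison already yields booleans.
via_compare = hr["Work_accident"] >= 2

# np.where is convenient when the two outcomes are not simply True/False.
via_where = np.where(hr["Work_accident"] < 2, False, True)

hr["Work_accident"] = via_compare
print(hr["Work_accident"].tolist())
```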
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
