Data Preprocessing: Standardization, Normalization, and Missing Value Imputation with Python
This tutorial demonstrates how to perform essential data preprocessing techniques—including standardization, min‑max normalization, and various missing‑value imputation methods—using pandas and scikit‑learn in Python, providing code examples and explanations to help you prepare datasets for machine‑learning models.
Goal: Learn data preprocessing techniques.
Learning Content: Standardization, min-max normalization, and missing-value filling.
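Code Example:
1. Import required libraries
    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler
2. Create an example dataset
    # Columns: 姓名 = name, 年龄 = age, 收入 = income, 身高 = height
    data = {
        '姓名': ['张三', '李四', '王五', '赵六', '孙七'],
        '年龄': [25, 30, 22, 35, 28],
        '收入': [5000, 7000, 6000, 8000, 6500],
        '身高': [170, 175, 165, 180, 172],
    }
    df = pd.DataFrame(data)
    print(f"Example dataset:\n{df}")
3. Standardization using StandardScaler
    numeric_cols = ['年龄', '收入', '身高']
    # Work on a copy so the original values stay available for the later steps.
    df_standardized = df.copy()
    df_standardized[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
    print(f"Dataset after standardization:\n{df_standardized}")
4. Normalization using MinMaxScaler
    # Rescale each numeric column to the range [0, 1].
    df_normalized = df.copy()
    df_normalized[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])
    print(f"Dataset after normalization:\n{df_normalized}")
5. Check missing values
    missing_values = df.isnull().sum()
    print(f"Number of missing values per column:\n{missing_values}")
6. Generate dataset with missing values
    np.random.seed(0)
    # Cast to float first, because an integer column cannot hold NaN.
    df['收入'] = df['收入'].astype(float)
    # Use .loc rather than chained indexing so the assignment reliably modifies df.
    # randint may draw the same row twice, so at most two values become NaN.
    df.loc[np.random.randint(0, len(df), 2), '收入'] = np.nan
    print(f"Dataset with missing values:\n{df}")
7. Fill missing values
    # Each strategy is applied to a fresh copy of df. Once one fill has replaced
    # the NaN values, a later fill on the same column would have nothing to do.
    # Mean imputation
    df_mean = df.copy()
    df_mean['收入'] = df['收入'].fillna(df['收入'].mean())
    print(f"Dataset after mean imputation:\n{df_mean}")
    # Median imputation
    df_median = df.copy()
    df_median['收入'] = df['收入'].fillna(df['收入'].median())
    print(f"Dataset after median imputation:\n{df_median}")
    # Forward fill (cannot fill a missing value in the first row)
    df_ffill = df.copy()
    df_ffill['收入'] = df['收入'].ffill()
    print(f"Dataset after forward fill:\n{df_ffill}")
    # Backward fill (cannot fill a missing value in the last row)
    df_bfill = df.copy()
    df_bfill['收入'] = df['收入'].bfill()
    print(f"Dataset after backward fill:\n{df_bfill}")
Practice: Apply the above steps to a dataset of your own to perform standardization, normalization, and missing-value imputation.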
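As a starting point for the practice, it can help to see what the two scalers compute. The sketch below is an illustration, not part of the original tutorial code: it recomputes standardization and min-max normalization by hand for the age values from the example dataset and compares the results with scikit-learn. The variable names (ages, z_manual, mm_manual, and so on) are chosen here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Ages from the example dataset, shaped as a single-feature column.
ages = np.array([25, 30, 22, 35, 28], dtype=float).reshape(-1, 1)

# Standardization by hand: subtract the mean, divide by the standard deviation.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also the default of ndarray.std().
z_manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)
z_sklearn = StandardScaler().fit_transform(ages)

# Min-max normalization by hand: the minimum maps to 0 and the maximum to 1.
mm_manual = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))
mm_sklearn = MinMaxScaler().fit_transform(ages)

print(np.allclose(z_manual, z_sklearn))    # True
print(np.allclose(mm_manual, mm_sklearn))  # True
print(mm_manual.ravel())                   # youngest person -> 0.0, oldest -> 1.0
```

Standardized values have mean 0 and unit variance but no fixed bounds, whereas min-max values always fall between 0 and 1. That difference is usually what decides which scaler suits a given model.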
Summary: After completing the exercise, you should be able to bring features onto a comparable scale with standardization, rescale them to a fixed range with min-max normalization, and handle missing entries with mean, median, forward-fill, or backward-fill imputation. These steps often improve model performance in machine-learning projects.
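The mean and median strategies can also be expressed with scikit-learn rather than pandas. The following is a minimal sketch under stated assumptions: SimpleImputer from sklearn.impute is not used in the original tutorial, and the income values below are hypothetical, chosen so that the mean and the median differ.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical income column with two missing entries (illustrative values).
incomes = pd.DataFrame({'income': [np.nan, 7000.0, 6000.0, 9500.0, np.nan]})

# strategy='mean' fills with the column mean, strategy='median' with the median.
mean_filled = SimpleImputer(strategy='mean').fit_transform(incomes)
median_filled = SimpleImputer(strategy='median').fit_transform(incomes)

# The pandas equivalent of mean imputation, for comparison.
pandas_mean = incomes['income'].fillna(incomes['income'].mean()).to_numpy()

print(mean_filled.ravel())    # missing entries become 7500.0
print(median_filled.ravel())  # missing entries become 7000.0
print(np.allclose(mean_filled.ravel(), pandas_mean))  # True
```

One practical difference is that a fitted imputer remembers the fill value, so the same value can later be applied to new rows with transform.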
