Feature Selection and Feature Engineering with Python (Filter, Wrapper, and Embedded Methods)
This tutorial teaches how to perform feature selection using filter, wrapper, and embedded methods and how to construct new features such as interaction, non‑linear, binned, and binary features with Python's pandas and scikit‑learn libraries.
Goal: Learn feature selection and feature construction techniques.
Learning Content: Filter, wrapper, and embedded feature selection methods; various feature construction strategies.
Code Example:
```python
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
```
```python
# Create an example dataset: 5 informative and 5 redundant features
X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)
feature_names = [f'feature{i+1}' for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)
df['label'] = y
print(f"Example dataset:\n{df.head()}")
```
```python
# Filter method: SelectKBest with the ANOVA F-statistic (f_classif)
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)
selected_features = [feature_names[i] for i in selector.get_support(indices=True)]
print(f"Features selected by SelectKBest: {selected_features}")
```
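Before fixing `k=5`, it can help to look at the raw scores that `f_classif` assigns to each feature and rank them. A minimal sketch reusing the same synthetic dataset (the variable names `scores` and `p_values` are illustrative):

```python
from sklearn.feature_selection import f_classif
from sklearn.datasets import make_classification

# Same synthetic setup as in the tutorial
X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)
feature_names = [f'feature{i+1}' for i in range(X.shape[1])]

# f_classif returns one ANOVA F-statistic and one p-value per feature
scores, p_values = f_classif(X, y)
ranking = sorted(zip(feature_names, scores), key=lambda t: t[1], reverse=True)
for name, score in ranking:
    print(f"{name}: F={score:.2f}")
```

A sharp drop in the ranked scores is a reasonable cue for where to set `k`.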
```python
# Wrapper method: Recursive Feature Elimination (RFE)
model = LogisticRegression(max_iter=1000)
selector = RFE(model, n_features_to_select=5, step=1)
selector.fit(X, y)
# get_support(indices=True) returns the integer indices of the kept features
selected_features = [feature_names[i] for i in selector.get_support(indices=True)]
print(f"Features selected by RFE: {selected_features}")
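If you would rather not fix `n_features_to_select` in advance, scikit-learn's `RFECV` runs the same elimination loop inside cross-validation and keeps the feature count with the best mean CV score. A sketch under the same synthetic setup:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)

# RFECV eliminates features step by step and scores each subset with 5-fold CV
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
selector.fit(X, y)
print(f"Optimal number of features: {selector.n_features_}")
```

This is slower than plain RFE because it refits the model once per fold per subset size.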
```python
# Embedded method: Random Forest feature importance
model = RandomForestClassifier(random_state=42)
model.fit(X, y)
importances = model.feature_importances_
indices = np.argsort(importances)[-5:]  # indices of the top 5 features (ascending)
selected_features = [feature_names[i] for i in indices]
print(f"Features selected by the random forest: {selected_features}")
```
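`SelectFromModel`, which is imported at the top but not used above, wraps this importance-thresholding step into a reusable transformer. A sketch with a `'median'` threshold (which keeps roughly the top half of the features) on the same synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)
feature_names = [f'feature{i+1}' for i in range(X.shape[1])]

# Keep features whose importance is at least the median importance
selector = SelectFromModel(RandomForestClassifier(random_state=42),
                           threshold='median')
selector.fit(X, y)
selected = [feature_names[i] for i in selector.get_support(indices=True)]
print(f"Features selected by SelectFromModel: {selected}")
```

Unlike the manual `argsort` approach, `SelectFromModel` can be dropped into a `Pipeline` and refit on new data.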
```python
# Feature construction examples
# Interaction feature: product of feature1 and feature2
df['feature1_x_feature2'] = df['feature1'] * df['feature2']
# Interaction feature: sum of feature1 and feature3
df['feature1_plus_feature3'] = df['feature1'] + df['feature3']
# Non-linear feature: square of feature2
df['feature2_squared'] = df['feature2'] ** 2
# Binned feature: bin feature1 (values outside the bin edges become NaN)
bins = [0, 0.5, 1, 1.5, 2]
labels = ['low', 'mid-low', 'mid-high', 'high']
df['feature1_binned'] = pd.cut(df['feature1'], bins=bins, labels=labels)
# Binary feature: binarize feature3 against its mean
df['feature3_binarized'] = (df['feature3'] > df['feature3'].mean()).astype(int)
print(f"Dataset with constructed features:\n{df.head()}")
```
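One follow-up step the binned feature usually needs: `pd.cut` produces a categorical column, which most scikit-learn models cannot consume directly, so it is typically one-hot encoded. A sketch on a small hypothetical frame standing in for the dataset above:

```python
import pandas as pd

# Hypothetical values standing in for feature1, one per bin
df = pd.DataFrame({'feature1': [0.2, 0.7, 1.2, 1.8]})
bins = [0, 0.5, 1, 1.5, 2]
labels = ['low', 'mid-low', 'mid-high', 'high']
df['feature1_binned'] = pd.cut(df['feature1'], bins=bins, labels=labels)

# One-hot encode the categorical bins into indicator columns
dummies = pd.get_dummies(df['feature1_binned'], prefix='feature1')
df = pd.concat([df, dummies], axis=1)
print(df)
```

Each row now carries exactly one indicator column set to 1, e.g. `feature1_low` for values in (0, 0.5].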
Practice: Apply the above feature selection and construction steps to any dataset.
Summary: After completing the exercises, you should be able to select important features using filter, wrapper, and embedded methods and enrich your dataset by creating interaction, non-linear, binned, and binary features.
Test Development Learning Exchange