Mastering Feature Selection: From Filters to Embedded Methods in Python
This article explains why feature selection is crucial for machine learning, outlines the general workflow, compares filter, wrapper, embedded, and synthesis approaches, and provides practical Python examples—including Pearson correlation, chi‑square tests, mutual information, variance selection, recursive elimination, L1 regularization, and PCA—complete with code snippets and visualizations.
General Process of Feature Selection
Feature selection involves generating candidate subsets, defining an evaluation function, setting a stopping criterion, and validating the chosen subset on a validation set.
Generate subsets: search for feature subsets to feed the evaluation function.
Define evaluation function: assess the quality of a feature subset.
Set stopping criterion: usually a threshold on the evaluation score.
Validate: test the selected subset on a validation dataset.
Enumerating all subsets is infeasible when the number of features is large, so heuristics and experience are needed.
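The blow-up is easy to see: n features yield 2^n - 1 non-empty subsets, so exhaustive search is out of reach even for modest n. A quick check:

```python
# Number of non-empty feature subsets grows as 2**n - 1
for n in (10, 20, 30):
    print(f'{n} features -> {2**n - 1:,} candidate subsets')
# 30 features already gives over a billion candidate subsets
```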
Common Feature Selection Methods
Feature selection methods are commonly grouped into three major categories, with feature synthesis (dimensionality reduction) as a closely related fourth approach:
Filter: score each feature by divergence or by correlation with the target, then select by a threshold or a fixed number of features.
Wrapper: use an evaluation function to iteratively add or remove features.
Embedded: train a model to obtain feature-importance coefficients and select features accordingly.
Feature synthesis: dimensionality-reduction techniques such as Principal Component Analysis (PCA) combine the original features into new ones.
Filter Method
The filter approach evaluates each feature independently, computes its information with respect to the target, ranks the results, and selects the top‑k features.
Evaluation Methods
Correlation coefficients (e.g., Pearson, Kendall).
Chi‑square test.
Mutual information and maximal information coefficient.
Distance correlation.
Variance selection (discard low‑variance features).
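For instance, Kendall's tau (mentioned above) is rank-based, so it also detects monotonic but non-linear relationships; the data below are made up for illustration:

```python
from scipy.stats import kendalltau

# y grows roughly like exp(x): monotonic, but far from linear
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.7, 7.4, 20.1, 54.6, 148.4, 403.4, 1096.6, 2981.0]

tau, p = kendalltau(x, y)
print(tau)  # 1.0 -- perfect monotonic agreement despite the non-linearity
```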
Pearson Correlation Coefficient
Pearson correlation measures linear relationship between a feature and the response, ranging from –1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear correlation.
<code>from scipy.stats import pearsonr
x1 = [51,80,95,19,73,84,65,30,1,35,13,61,36,65,57,40,15,73,58,62]
# feature 1 data
x2 = [7.0,27.5,23.0,32.0,15.5,44.0,10.5,29.5,36.0,47.5,27.0,28.5,26.5,41.5,12.5,0.5,19.0,48.5,0.5,24.0]
# feature 2 data
y = [14,64,54,72,36,92,24,62,72,95,55,64,60,84,33,5,40,99,2,48]
print(pearsonr(x1, y))
print(pearsonr(x2, y))
</code>Results:
For x1: PearsonRResult(statistic=0.0209, pvalue=0.9304), i.e. almost no linear correlation.
For x2: PearsonRResult(statistic=0.9939, pvalue=1.05e-18), i.e. a very strong linear correlation.
Scatter plots illustrate the difference:
Pearson correlation only captures linear relationships; non‑linear dependencies may yield a near‑zero coefficient. For example, a quadratic relationship has Pearson = 0 but is clearly dependent.
<code>import matplotlib.pyplot as plt

x3 = range(-10, 11)
y2 = [x**2 for x in x3]  # y = x^2: clearly dependent, yet Pearson = 0 by symmetry
plt.scatter(x3, y2, label='Pearson=0.0')
plt.legend()
plt.tight_layout()
plt.savefig('images/feature0102.png')
plt.show()
</code>Chi‑Square Test
The chi‑square test compares observed frequencies with expected frequencies under the assumption of independence.
Formulate null hypothesis (features are independent) and alternative hypothesis (features are dependent).
Compute expected frequencies based on marginal totals.
Calculate the chi‑square statistic.
Determine degrees of freedom (for a 2×2 table, df = 1).
Define the rejection region.
Compare the p-value with the significance level to decide whether to reject the null hypothesis.
Example: A study of 500 corrupt officials and 590 honest officials examined whether economic honesty is independent of lifespan. The chi‑square test shows a highly significant association.
<code>import pandas as pd
import scipy.stats as stats
df = pd.DataFrame([[348,152],[93,497]], index=['Corrupt','Honest'], columns=['Short','Long'])
chi2, p, dof, expected = stats.chi2_contingency(df)
print('Chi2Result(statistic={}, pvalue={}, dof={}, expected={})'.format(chi2, p, dof, expected))
</code>Result: chi‑square statistic ≈ 323.40, p‑value ≈ 2.63e‑72 → reject the null hypothesis; economic honesty and lifespan are related.
Feature Selection with Scikit‑Learn
Using SelectKBest with the chi‑square score on the Iris dataset:
<code>from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
iris = load_iris()
X, y = iris.data, iris.target
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new[:5])
</code>The two selected features are petal length and petal width.
Mutual Information and Maximal Information Coefficient
Mutual information quantifies the dependence between two variables. A value of zero indicates independence, while larger values indicate stronger relationships.
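A sketch of estimating it with scikit-learn's mutual_info_classif on the Iris data (random_state is fixed because the estimator is stochastic):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

iris = load_iris()
X, y = iris.data, iris.target

# Estimated mutual information between each feature and the class label
mi = mutual_info_classif(X, y, random_state=0)
for name, score in zip(iris.feature_names, mi):
    print(f'{name}: {score:.3f}')
# The petal measurements carry more information than the sepal ones
```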
Variance Selection Method
Features with variance below a chosen threshold are discarded; normalization may be applied beforehand to handle differing scales.
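A minimal sketch with scikit-learn's VarianceThreshold; the toy matrix and the 0.5 threshold are made up for illustration:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Column 0 is nearly constant and column 1 varies only slightly;
# column 2 has large variance
X = np.array([[0, 2.1, 10],
              [0, 1.9, 20],
              [0, 2.0, 30],
              [1, 1.8, 40]])

selector = VarianceThreshold(threshold=0.5)
X_reduced = selector.fit_transform(X)
print(selector.variances_)   # per-column (population) variances
print(X_reduced.shape)       # (4, 1) -- only the high-variance column survives
```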
Wrapper Method
Wrapper methods evaluate subsets by training a model on each candidate set and selecting the subset that yields the best performance. Common strategies include forward selection, backward elimination, and recursive feature elimination (RFE).
<code>from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
# Recursively drop the least important feature until only five remain
selector = RFE(estimator=RandomForestClassifier(), n_features_to_select=5)
X_rfe = selector.fit_transform(X, y)
print(X_rfe.shape)  # (569, 5)
</code>The RFE process retains the five most important features.
<code>array([[1.471e-01, 2.538e+01, 1.846e+02, 2.019e+03, 2.654e-01],
       [7.017e-02, 2.499e+01, 1.588e+02, 1.956e+03, 1.860e-01],
       ...])
</code>Embedded Method
Penalty‑Based Feature Selection
L1 regularization (Lasso) yields sparse solutions and can be used for feature selection. Logistic regression with an L1 penalty selects features automatically.
<code>from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
# The L1 penalty drives some coefficients to exactly zero; those features are dropped
selector = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear', C=0.25))
feature_new = selector.fit_transform(X, y)
print(feature_new.shape)
</code>Resulting shape: (569, 7) – seven features remain after elimination. Adjusting the regularization parameter C changes the number of selected features.
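The claim about C can be checked directly; a sketch sweeping a few arbitrarily chosen values on the same breast-cancer data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Smaller C means a stronger L1 penalty, hence fewer surviving features
counts = {}
for C in (0.01, 0.1, 1.0):
    model = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    counts[C] = SelectFromModel(model).fit_transform(X, y).shape[1]
    print(f'C={C}: {counts[C]} features selected')
```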
Feature Synthesis – PCA
Principal Component Analysis reduces dimensionality by projecting data onto orthogonal components that capture maximal variance.
Standardize the data and compute the covariance matrix.
Obtain eigenvalues and eigenvectors of the covariance matrix.
Select the top‑k principal components based on cumulative explained variance.
Project the original data onto the selected components.
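Before turning to scikit-learn, the four steps can be sketched directly with NumPy (the random matrix is stand-in data for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# 1. Center the data and form the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 2. Eigendecomposition of the (symmetric) covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Keep the top-k components by cumulative explained variance
k = 2
explained = eigvals / eigvals.sum()
print('cumulative variance of top 2:', explained[:k].sum())

# 4. Project the centered data onto the selected components
X_proj = Xc @ eigvecs[:, :k]
print(X_proj.shape)  # (100, 2)
```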
<code>from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
cancer = load_breast_cancer()
X = cancer.data
pca = PCA(n_components=5)
pca.fit(X)
print('Explained variance:', pca.explained_variance_)
print('Explained variance ratio:', pca.explained_variance_ratio_)
</code>The first component alone explains about 98% of the variance.
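That 98% figure reflects feature scale rather than signal: the PCA above was fit on raw data, where large-valued features such as mean area dominate the covariance. A sketch of running PCA after standardization, as step 1 recommends (the variable names here are my own):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale each feature to zero mean and unit variance before PCA
X_raw = load_breast_cancer().data
X_std = StandardScaler().fit_transform(X_raw)

pca_std = PCA(n_components=5).fit(X_std)
print(pca_std.explained_variance_ratio_)
# After scaling, the first component explains roughly 44% rather than 98%
```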
<code>import matplotlib.pyplot as plt
plt.plot(range(1, 6), pca.explained_variance_ratio_, label='Explained variance ratio')
plt.plot(range(1, 6), pca.explained_variance_ratio_.cumsum(), label='Cumulative ratio')
plt.legend()
plt.tight_layout()
plt.savefig('images/feature0104.png')
plt.show()
</code>This concludes the overview of feature selection techniques.
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".