Feature Selection: Reducing Input Variables for Predictive Modeling

This article explains the purpose and types of feature selection, compares supervised and unsupervised approaches as well as wrapper, filter, and embedded methods, discusses how to choose statistical metrics based on variable types, and provides scikit-learn code examples for regression and classification tasks.


Feature selection is the process of reducing the number of input variables when developing predictive models, aiming to lower computational cost and sometimes improve model performance.

Statistical filter methods evaluate each predictor’s relationship with the target using metrics such as Pearson correlation, Spearman rank, ANOVA F, chi‑square, and mutual information. The choice of metric depends on the data types of input and output variables (numeric, categorical, boolean, ordinal, nominal).
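As a minimal illustration of how such a univariate score is computed, the snippet below uses SciPy's pearsonr and spearmanr to score a single numeric feature against a numeric target; the synthetic data and relationship are only for demonstration.

# illustrative only: scoring one numeric feature against a numeric target
from scipy.stats import pearsonr, spearmanr
import numpy as np
rng = np.random.default_rng(1)
x = rng.normal(size=100)             # candidate input feature
y = 2.0 * x + rng.normal(size=100)   # numeric target related to x
corr, _ = pearsonr(x, y)             # measures the linear relationship
rank_corr, _ = spearmanr(x, y)       # measures the monotonic (rank) relationship
print(corr, rank_corr)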

Feature selection is primarily focused on removing non‑informative or redundant predictors from the model.

Two main categories of feature selection techniques are supervised and unsupervised. Unsupervised methods ignore the outcome (e.g., removing redundant variables by correlation), while supervised methods use the target (e.g., discarding irrelevant variables).

An important distinction to be made in feature selection is that of supervised and unsupervised methods. When the outcome is ignored during the elimination of predictors, the technique is unsupervised.
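A minimal sketch of the unsupervised case, assuming the features live in a pandas DataFrame and using an illustrative correlation cutoff of 0.9, would drop one member of each highly correlated pair without ever consulting the target:

# unsupervised redundancy removal via pairwise correlation (cutoff is illustrative)
import numpy as np
import pandas as pd

def drop_correlated(df, cutoff=0.9):
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
    return df.drop(columns=to_drop)

# example usage on random numeric data with one deliberately redundant column
df = pd.DataFrame(np.random.default_rng(1).normal(size=(100, 5)))
df[5] = 0.95 * df[0] + 0.05 * df[1]
print(drop_correlated(df).shape)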

Supervised methods can be further divided into wrapper, filter, and embedded approaches. Wrapper methods create many models with different subsets of features and select the subset that yields the best performance, independent of variable type. Filter methods score each feature using statistical measures and select those that pass a threshold. Embedded methods perform feature selection automatically during model training (e.g., lasso, decision‑tree ensembles, MARS).

Wrapper methods evaluate multiple models using procedures that add and/or remove predictors to find the optimal combination that maximizes model performance.
Filter methods evaluate the relevance of the predictors outside of the predictive models and subsequently model only the predictors that pass some criterion.
… some models contain built‑in feature selection, meaning that the model will only include predictors that help maximize accuracy.
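As an illustrative sketch (not taken from the original article), scikit-learn's RFE demonstrates the wrapper style and a lasso model the embedded style; the dataset sizes and alpha value below are arbitrary choices.

# wrapper-style selection: recursively eliminate features using a linear model
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, Lasso
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, random_state=1)
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)
print(X_rfe.shape)

# embedded selection: lasso shrinks coefficients of unhelpful features to zero
lasso = Lasso(alpha=1.0).fit(X, y)
print((lasso.coef_ != 0).sum())  # number of features the model effectively kept

Wrapper searches are more expensive than filters because every candidate subset requires refitting the model, which is why filter methods are often preferred as a first pass.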

Statistical metrics are typically univariate, evaluating one predictor at a time, which can lead to selecting redundant but important predictors and cause collinearity problems.

Most of these techniques are univariate, meaning that they evaluate each predictor in isolation. In this case, the existence of correlated predictors makes it possible to select important, but redundant, predictors. The obvious consequences of this issue are that too many predictors are chosen and, as a result, collinearity problems arise.

Common metrics for different variable‑type combinations include:

Numeric input & numeric output: Pearson correlation, Spearman rank.

Numeric input & categorical output: ANOVA F, Kendall tau.

Categorical input & categorical output: chi‑square, mutual information.

Scikit‑learn provides implementations for these metrics (e.g., f_regression(), f_classif(), chi2(), mutual_info_classif(), mutual_info_regression()) and utilities such as SelectKBest and SelectPercentile to apply filter‑based selection.
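For instance, a short sketch pairing SelectPercentile with mutual information (the percentile value below is an arbitrary choice) looks like this:

# mutual information with percentile-based selection (parameters are illustrative)
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, mutual_info_classif
X, y = make_classification(n_samples=100, n_features=20, n_informative=2, random_state=1)
fs = SelectPercentile(score_func=mutual_info_classif, percentile=10)
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)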

Example 1 – Regression feature selection (numeric input, numeric output):

# pearson's correlation feature selection for numeric input and numeric output
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
# generate a synthetic regression dataset with 10 informative features
X, y = make_regression(n_samples=100, n_features=100, n_informative=10)
# define the filter: keep the 10 features with the highest f_regression scores
fs = SelectKBest(score_func=f_regression, k=10)
# score the features and reduce X to the selected columns
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)

The code creates a synthetic regression dataset, computes Pearson‑based scores with f_regression, selects the top 10 features, and prints the resulting shape (100, 10).
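If the individual scores are of interest, the fitted selector from the example above exposes them through its scores_ attribute:

# inspect the per-feature scores after fitting (continues the example above)
for i, score in enumerate(fs.scores_):
    print('feature %d: %.3f' % (i, score))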

Example 2 – Classification feature selection (numeric input, categorical output):

# ANOVA feature selection for numeric input and categorical output
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
# generate a synthetic classification dataset with 2 informative features
X, y = make_classification(n_samples=100, n_features=20, n_informative=2)
# define the filter: keep the 2 features with the highest ANOVA F scores
fs = SelectKBest(score_func=f_classif, k=2)
# score the features and reduce X to the selected columns
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)

This creates a synthetic classification dataset, uses ANOVA F scores via f_classif, selects the top 2 features, and outputs shape (100, 2).
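To map the kept columns back to their positions in the original feature matrix, the fitted selector's get_support() method can be used:

# indices of the columns kept by the selector (continues the example above)
print(fs.get_support(indices=True))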

When applying filter‑based selection, practitioners should consider data type compatibility, possible transformations (e.g., binning numeric variables for categorical tests), and assumptions of each statistical test (e.g., Pearson assumes linearity and normality).
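As a sketch of the binning idea (the bin count, encoding, and dataset below are illustrative assumptions), a numeric feature matrix can be discretized with KBinsDiscretizer before applying a categorical test such as chi-square:

# bin numeric inputs so a categorical test (chi-square) can be applied
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import KBinsDiscretizer
X, y = make_classification(n_samples=100, n_features=20, n_informative=2, random_state=1)
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X)   # non-negative ordinal bin codes, as chi2 requires
fs = SelectKBest(score_func=chi2, k=2)
X_selected = fs.fit_transform(X_binned, y)
print(X_selected.shape)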

Further sections will review additional statistical measures for filter‑based feature selection and discuss tips for handling different variable types.
