Essential Feature Selection Techniques for Machine Learning

This article explains why feature selection is crucial for building robust machine‑learning models and walks through popular filter, wrapper, and embedded methods—including information gain, chi‑square, LASSO, random‑forest importance, and PCA—providing code examples and practical guidance.


Datasets often contain many features, but not all contribute to a model’s predictive power; unnecessary features can lower accuracy, increase complexity, and reduce generalization. Feature selection is therefore a necessary step in constructing effective machine‑learning models.

Filter Methods

Filter methods score each feature with a statistical measure, independently of any learning algorithm. They are computationally cheap and effective at removing irrelevant and duplicated features, but because most of them evaluate features one at a time they do not account for multicollinearity among the features that remain.

Common filter techniques demonstrated include:

Information Gain – measures the reduction in entropy provided by a feature. Example code:

# Information gain: mutual information between each feature and the target
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_classif
importances = mutual_info_classif(X, Y)
feat_importances = pd.Series(importances, index=dataframe.columns[:-1])
feat_importances.plot(kind='barh', color='teal')
plt.show()

Chi‑square Test – assesses the relationship between categorical features and the target. Example code:

from sklearn.feature_selection import SelectKBest, chi2
# chi2 requires non-negative (count-like) features, hence the integer cast
X_cat = X.astype(int)
chi2_features = SelectKBest(chi2, k=3)
X_kbest_features = chi2_features.fit_transform(X_cat, Y)
print('Original feature number:', X_cat.shape[1])
print('Reduced feature number:', X_kbest_features.shape[1])

Fisher Score – ranks features by their Fisher discriminant value. Example code:

# Fisher score from the scikit-feature (skfeature) package
import pandas as pd
import matplotlib.pyplot as plt
from skfeature.function.similarity_based import fisher_score
scores = fisher_score.fisher_score(X, Y)
feat_importances = pd.Series(scores, index=dataframe.columns[:-1])
feat_importances.plot(kind='barh', color='teal')
plt.show()

Correlation Coefficient – uses Pearson correlation to identify linear relationships with the target. Example code:

import seaborn as sns
import matplotlib.pyplot as plt
cor = dataframe.corr()        # pairwise Pearson correlations
sns.heatmap(cor, annot=True)
plt.show()

Variance Threshold – removes features with variance below a specified threshold (default removes zero‑variance features). Example code:

from sklearn.feature_selection import VarianceThreshold
v_threshold = VarianceThreshold(threshold=0)   # threshold=0 removes constant features
v_threshold.fit(X)
selected = v_threshold.get_support()            # boolean mask of the retained features

Mean Absolute Deviation (MAD) – similar to variance but based on absolute deviations. Example code:

import numpy as np
import matplotlib.pyplot as plt
# mean absolute deviation of each feature from its column mean
mean_abs_diff = np.sum(np.abs(X - np.mean(X, axis=0)), axis=0) / X.shape[0]
plt.bar(np.arange(X.shape[1]), mean_abs_diff, color='teal')
plt.show()

Dispersion Ratio – ratio of arithmetic mean to geometric mean; higher values indicate stronger relevance. Example code:

X = X + 1                                    # shift so every value is positive (geometric mean needs > 0)
am = np.mean(X, axis=0)                      # arithmetic mean of each feature
gm = np.exp(np.mean(np.log(X), axis=0))      # geometric mean, computed in log space to avoid overflow
disp_ratio = am / gm                         # AM/GM >= 1; larger values mean more dispersion
plt.bar(np.arange(X.shape[1]), disp_ratio, color='teal')
plt.show()

Wrapper Methods

Wrapper methods search the space of feature subsets by training and evaluating a model on each candidate subset. They generally find better-performing subsets than filter methods but are far more computationally expensive.

Techniques covered include:

Forward Feature Selection – starts with an empty set and adds the feature that most improves the model at each iteration.

# Forward Feature Selection
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='lbfgs', class_weight='balanced', random_state=42, n_jobs=-1, max_iter=500)
ffs = SequentialFeatureSelector(lr, k_features='best', forward=True, n_jobs=-1)
ffs.fit(X, Y)
features = list(map(int, ffs.k_feature_names_))   # selected column indices
lr.fit(x_train[features], y_train)

Backward Feature Elimination – starts with all features and removes the least important one iteratively.

# Backward Feature Elimination
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector
lr = LogisticRegression(solver='lbfgs', class_weight='balanced', random_state=42, n_jobs=-1, max_iter=500)
lr.fit(X, Y)
bfs = SequentialFeatureSelector(lr, k_features='best', forward=False, n_jobs=-1)
bfs.fit(X, Y)
features = list(map(int, bfs.k_feature_names_))   # selected column indices
lr.fit(x_train[features], y_train)

Bidirectional Elimination – combines forward addition and backward removal to converge on a single solution.
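
The original article gives no snippet for this step. A minimal sketch follows using the floating variant of mlxtend's SequentialFeatureSelector (forward=True, floating=True), which alternates forward additions with backward removals; the synthetic data from make_classification is only a placeholder, not the article's dataset.

# Bidirectional (floating) feature selection: illustrative sketch
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
X_demo, Y_demo = make_classification(n_samples=200, n_features=10, random_state=42)  # placeholder data
lr = LogisticRegression(solver='lbfgs', max_iter=500, random_state=42)
# floating=True lets the search drop previously added features,
# combining forward addition with backward removal in one procedure
bds = SequentialFeatureSelector(lr, k_features='best', forward=True, floating=True, scoring='accuracy', cv=3, n_jobs=-1)
bds.fit(X_demo, Y_demo)
print('Selected feature indices:', bds.k_feature_idx_)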

Exhaustive Feature Selection – evaluates every possible subset of features, which becomes very slow as the number of features grows. Example code:

# Exhaustive Feature Selection
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.ensemble import RandomForestClassifier
efs = ExhaustiveFeatureSelector(RandomForestClassifier(),
                                min_features=4,
                                max_features=8,
                                scoring='roc_auc',
                                cv=2)
efs = efs.fit(X, Y)
selected_features = x_train.columns[list(efs.best_idx_)]   # names of the best-scoring subset
print(selected_features)
print('Best score:', efs.best_score_)

Recursive Feature Elimination (RFE) – repeatedly removes the least important features based on model‑derived importance.

# Recursive Feature Elimination
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='lbfgs', max_iter=500, random_state=42)
rfe = RFE(lr, n_features_to_select=7)
rfe.fit(x_train, y_train)
y_pred = rfe.predict(x_train)

Embedded Methods

Embedded methods integrate feature selection into the model training process, combining advantages of filter and wrapper approaches.

Regularization‑based techniques discussed:

LASSO (L1) – adds an L1 penalty, driving some coefficients to zero, thus performing feature selection.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
# the L1 penalty drives some coefficients exactly to zero
logistic = LogisticRegression(C=1, penalty='l1', solver='liblinear', random_state=7).fit(X, Y)
model = SelectFromModel(logistic, prefit=True)
X_new = model.transform(X)   # keeps only the features with non-zero coefficients

RIDGE (L2) – penalizes large coefficients but does not zero them out; useful for reducing multicollinearity but not for feature reduction.
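
The article shows no code here; the short sketch below, on synthetic placeholder data, simply contrasts L2 with the L1 model above: the L2 penalty shrinks coefficients but, unlike L1, almost never sets them exactly to zero.

# RIDGE (L2) vs LASSO (L1): L2 does not produce exact zeros (illustrative sketch)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
X_demo, Y_demo = make_classification(n_samples=200, n_features=20, random_state=7)  # placeholder data
l2_model = LogisticRegression(C=1, penalty='l2', solver='liblinear', random_state=7).fit(X_demo, Y_demo)
l1_model = LogisticRegression(C=1, penalty='l1', solver='liblinear', random_state=7).fit(X_demo, Y_demo)
print('Zero coefficients with L2:', int(np.sum(l2_model.coef_ == 0)))   # typically 0
print('Zero coefficients with L1:', int(np.sum(l1_model.coef_ == 0)))   # typically several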

Elastic Net – combines L1 and L2 penalties; the mixing parameter α controls the balance (α=1 → LASSO, α=0 → RIDGE). Cross‑validation can tune α.
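
No snippet accompanies this point in the article; a minimal sketch follows. In scikit-learn the mixing parameter described above corresponds to the l1_ratio argument (1.0 behaves like LASSO, 0.0 like RIDGE), and it can be tuned by cross-validation, for example a grid search over l1_ratio. The synthetic data is a placeholder.

# Elastic Net (L1 + L2) feature selection: illustrative sketch
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.datasets import make_classification
X_demo, Y_demo = make_classification(n_samples=200, n_features=20, random_state=7)  # placeholder data
# l1_ratio is the mixing parameter: 1.0 -> pure L1 (LASSO), 0.0 -> pure L2 (RIDGE)
enet = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, C=1, max_iter=5000, random_state=7).fit(X_demo, Y_demo)
model = SelectFromModel(enet, prefit=True)
X_new = model.transform(X_demo)
print('Original features:', X_demo.shape[1], '-> kept:', X_new.shape[1])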

Tree‑based embedded methods such as Random Forest and Gradient Boosting compute feature importance based on impurity reduction (Gini or entropy). Example code:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import matplotlib.pyplot as plt
model = RandomForestClassifier(n_estimators=340)
model.fit(X, Y)
importances = model.feature_importances_   # impurity-based importance per feature
final_df = pd.DataFrame({'Features': pd.DataFrame(X).columns, 'Importances': importances})
final_df = final_df.sort_values('Importances')
final_df.plot.bar(x='Features', y='Importances', color='teal')
plt.show()

Practical Considerations

Beyond algorithmic methods, the article highlights seven practical ways to choose optimal features:

Domain Knowledge – leveraging expert insight to select relevant variables (e.g., vehicle year, model, and license type for car‑price prediction).

Handling Missing Values – removing or imputing features with excessive missing data; illustrated with the Titanic dataset where the “cabin” feature is dropped.

Correlation with Target – using Pearson, Spearman, or Kendall coefficients to identify features strongly linked to the label (e.g., sex, Pclass, fare in Titanic).

Inter‑Feature Correlation – detecting multicollinearity that can degrade model performance (e.g., high correlation between Pclass and fare).

Principal Component Analysis (PCA) – reduces dimensionality while preserving variance; the article shows that 15 components retain 90 % of variance on the Ionosphere dataset (see the sketch after this list).

Forward Feature Selection – iteratively adds features that improve validation performance until a stopping criterion is met.

Feature Importance Scores – derived from many scikit‑learn models to rank variables for robust training.
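
For the PCA step mentioned above, a minimal sketch follows: plot the cumulative explained variance and keep the smallest number of components that reaches the target (about 90 % in the article's Ionosphere example). The synthetic data below is only a stand-in with roughly the same shape as Ionosphere, not the actual dataset.

# PCA: choose the number of components that retains ~90% of the variance (sketch)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
X_demo, _ = make_classification(n_samples=350, n_features=34, random_state=0)   # placeholder data
X_scaled = StandardScaler().fit_transform(X_demo)        # PCA is sensitive to feature scale
pca = PCA().fit(X_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cum_var >= 0.90) + 1            # smallest k reaching 90% of the variance
print('Components needed for 90% variance:', n_components)
plt.plot(np.arange(1, len(cum_var) + 1), cum_var, marker='o')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()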

Conclusion

Without feature selection, model development is incomplete, because unnecessary features can impair performance. The article surveys filter, wrapper, and embedded approaches and closes with seven practical guidelines for choosing features, illustrating the methods with code snippets and visualizations.
