11 Powerful Feature Selection Techniques Every Data Scientist Should Master
This guide walks through a comprehensive set of feature‑selection strategies—from removing unused or missing columns to handling multicollinearity, low‑variance features, and using PCA—complete with Python code examples and visualizations to help you build leaner, more interpretable machine‑learning models.
Too many features increase model complexity and over‑fitting, while too few lead to under‑fitting. Feature selection aims to keep the model just complex enough to generalize well while remaining easy to train, maintain, and interpret.
Feature selection means retaining some features and discarding others. This article outlines several feature‑selection strategies:
Delete unused columns
Delete columns with missing values
Remove irrelevant features
Drop low‑variance features
Handle multicollinearity
Use feature coefficients (beta values)
Apply p‑values for statistical significance
Calculate Variance Inflation Factor (VIF)
Select features based on importance (tree‑based models)
Automatic selection with scikit‑learn
Principal Component Analysis (PCA)
1. Delete Unused Columns
The simplest strategy is intuition: if you know a column (e.g., ID, FirstName) will never be used, drop it. In the demo dataset no such columns exist, so none are removed.
2. Delete Columns with Missing Values
Most ML algorithms cannot handle missing values directly. If a large share of a column's entries is missing, dropping the column entirely is often simpler than imputing it.
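How many missing entries count as "many" is a judgment call. The following is a minimal sketch, assuming a hypothetical cutoff of 40% missing and a small made-up frame (`toy`, `max_missing`, and the values are illustrative, not taken from the demo dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
toy = pd.DataFrame({
    "price": [13495, 16500, 13950, 17450, 15250],
    "normalized-losses": [np.nan, np.nan, 164, np.nan, 158],
    "horsepower": [111, 111, np.nan, 115, 110],
})

max_missing = 0.40  # assumed cutoff: drop columns with more than 40% missing

# isnull().mean() gives the fraction of missing entries per column
missing_frac = toy.isnull().mean()
cols_to_drop = missing_frac[missing_frac > max_missing].index.tolist()
toy_clean = toy.drop(columns=cols_to_drop)

print(missing_frac)
print("dropped:", cols_to_drop)
```

Columns that fall under the cutoff but still contain gaps (here `horsepower`) would need imputation or row removal before modeling.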
# total null values per column
df.isnull().sum()
3. Irrelevant Features
A feature that shows no relationship with the target adds little predictive value. For numeric features, the correlation with the target can be visualized with a bar plot.
# correlation between target and features
# numeric_only=True skips text columns, which pandas 2.0+ would otherwise reject
(df.corr(numeric_only=True).loc['price'].plot(kind='barh', figsize=(4,10)))
Features such as peak‑rpm, compression‑ratio, stroke, bore, and symboling show almost no correlation with price and can be removed. A correlation threshold (e.g., 0.2) can be applied programmatically:
# drop numeric features whose absolute correlation with price is below 0.2
corr = df.corr(numeric_only=True).loc['price'].abs()
corr = corr[corr < 0.2]
cols_to_drop = corr.index.tolist()
df = df.drop(cols_to_drop, axis=1)
4. Low‑Variance Features
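Check the variance of numeric features and drop those whose variance is close to zero, since a nearly constant column carries little information. Variance depends on a feature's units, so compare features on a common scale before applying a single cutoff.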
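One way to automate this check is scikit-learn's `VarianceThreshold`. The sketch below is illustrative only: the toy columns and the 0.01 cutoff are assumptions, and the features are min-max scaled first so that one cutoff applies to all of them.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "spread_out": rng.uniform(0, 100, size=200),
    "nearly_constant": np.where(np.arange(200) == 0, 1.0, 0.0),  # a single 1, rest 0
    "constant": np.full(200, 3.0),
})

# Rescale every feature to [0, 1] so their variances are comparable
scaled = MinMaxScaler().fit_transform(toy)

selector = VarianceThreshold(threshold=0.01)  # assumed cutoff
selector.fit(scaled)
kept = toy.columns[selector.get_support()].tolist()

print(dict(zip(toy.columns, selector.variances_.round(4))))
print("kept:", kept)
```

Only `spread_out` survives here; the other two columns are constant or nearly so after scaling.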
# variance of numeric features
(df.select_dtypes(include='number').var().astype('str'))
The feature bore has very low variance but is kept for demonstration.
df['bore'].describe()
5. Multicollinearity
When two features are highly correlated, they introduce multicollinearity. For example, engine size and horsepower are strongly related. A heatmap can reveal such relationships.
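Strongly related pairs can also be listed numerically rather than read off a plot. The sketch below is a hedged illustration on synthetic data: the column names echo the demo, but the values, the 0.80 cutoff, and the rule "drop the later column of each pair" are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
engine_size = rng.normal(130, 40, n)
toy = pd.DataFrame({
    "engine-size": engine_size,
    "horsepower": 0.8 * engine_size + rng.normal(0, 5, n),  # built to track engine-size
    "stroke": rng.normal(3.2, 0.3, n),                      # independent of the others
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is skipped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

high_pairs = [
    (row, col, round(upper.loc[row, col], 3))
    for col in upper.columns
    for row in upper.index
    if upper.loc[row, col] > 0.80  # NaN cells compare as False
]
# For each highly correlated pair, mark the later column as a drop candidate
to_drop = [col for col in upper.columns if (upper[col] > 0.80).any()]

print(high_pairs)
print("candidates to drop:", to_drop)
```

Which member of a pair to drop is a modeling decision; this sketch simply keeps whichever column comes first.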
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={'figure.figsize':(16,10)})
sns.heatmap(df.corr(numeric_only=True), annot=True, linewidths=.5, center=0, cbar=False, cmap="PiYG")
plt.show()
Features with an absolute correlation above 0.80 can be removed manually or programmatically:
# drop correlated features
df = df.drop(['length', 'width', 'curb-weight', 'engine-size', 'city-mpg'], axis=1)
Variance Inflation Factor (VIF) can also be used to detect multicollinearity.
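import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
num = df.select_dtypes(include='number').dropna()  # VIF needs a numeric matrix without NaNs
vif = pd.Series([variance_inflation_factor(num.values, i) for i in range(num.shape[1])], index=num.columns)
6. Feature Coefficients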
For regression tasks, the magnitude of each coefficient (beta value) indicates that feature's contribution, provided the features are on a comparable scale (for example, standardized). After fitting a linear model, the coefficients can be visualized and features with near‑zero coefficients filtered out.
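Coefficient magnitudes are only comparable when features share a scale. The sketch below illustrates this on synthetic data; the column names, values, and target formula are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
X_toy = pd.DataFrame({
    "curb-weight": rng.normal(2500, 500, n),  # large numeric scale
    "bore": rng.normal(3.3, 0.3, n),          # small numeric scale
})
# Synthetic target: curb-weight drives most of the variation in y
y_toy = 10 * X_toy["curb-weight"] + 1000 * X_toy["bore"] + rng.normal(0, 100, n)

raw = LinearRegression().fit(X_toy, y_toy)

# Standardize to zero mean and unit variance, then refit
X_std = StandardScaler().fit_transform(X_toy)
std = LinearRegression().fit(X_std, y_toy)

coef_table = pd.DataFrame(
    {"raw_coeff": raw.coef_, "standardized_coeff": std.coef_},
    index=X_toy.columns,
)
print(coef_table)
```

On the raw data, `bore` has the larger coefficient only because its units are small; after standardization the ranking flips and `curb-weight` dominates.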
# feature coefficients
coeffs = model.coef_
index = X_train.columns.tolist()
(pd.DataFrame(coeffs, index=index, columns=['coeff']).sort_values(by='coeff')
.plot(kind='barh', figsize=(4,10)))
# filter near‑zero coefficient features
temp = pd.DataFrame(coeffs, index=index, columns=['coeff']).sort_values(by='coeff')
temp = temp[temp['coeff'].abs() > 1]  # keep features with |coeff| > 1
cols_coeff = temp.index.tolist()
X_train = X_train[cols_coeff]
X_test = X_test[cols_coeff]
7. p‑Values
In regression, p‑values assess whether a predictor is statistically significant. Using statsmodels:
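import statsmodels.api as sm
# sm.OLS does not add an intercept on its own
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.summary())
Features with non‑significant p‑values (commonly p > 0.05) can be removed one at a time, refitting after each removal and monitoring adjusted R².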
8. Variance Inflation Factor (VIF)
VIF quantifies how strongly each feature is explained by the other features. Rough guidelines: VIF = 1 (no correlation), 1 to 5 (moderate), above 5 (high). The demo uses a more lenient cutoff and drops features with VIF > 10.
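Dropping by VIF is usually done one feature at a time, because removing a column changes the VIF of the rest. The sketch below assumes synthetic data and a hypothetical helper, `vif_table`, which computes VIF as the diagonal of the inverse correlation matrix (equivalent to 1 / (1 - R²) from regressing each feature on the others with an intercept). It should agree with statsmodels' `variance_inflation_factor` when the design matrix includes a constant column.

```python
import numpy as np
import pandas as pd

def vif_table(frame: pd.DataFrame) -> pd.Series:
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = frame.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=frame.columns)

rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x4": rng.normal(size=n),
})
# x3 is almost an exact linear combination of x1 and x2
toy["x3"] = toy["x1"] + toy["x2"] + rng.normal(scale=0.05, size=n)

threshold = 10  # same cutoff as the text above
remaining = toy.copy()
while remaining.shape[1] > 1:
    vif = vif_table(remaining)
    if vif.max() <= threshold:
        break
    remaining = remaining.drop(columns=vif.idxmax())  # drop the worst offender, then recompute

print(vif_table(toy).round(1))
print("kept:", remaining.columns.tolist())
```

A single removal is enough here: once one column from the collinear group is gone, the VIF of the others falls back to roughly 1.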
# calculate VIF
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)
9. Feature‑Importance‑Based Selection
Tree‑based models expose a feature_importances_ attribute after fitting. A random forest can be trained and the importances visualized.
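Importances can also drive the selection itself. The sketch below is a hedged example on synthetic classification data; keeping features whose importance exceeds the mean is an assumed rule of thumb, not something prescribed by the article.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 3 informative columns first (f0, f1, f2); the rest are noise
X_arr, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)

# Assumed rule: keep features with above-average importance
kept = importances[importances > importances.mean()].index.tolist()

print(importances.sort_values(ascending=False).round(3))
print("kept:", kept)
```

Impurity-based importances sum to 1, so the mean is simply 1 divided by the number of features.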
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)
importances = model.feature_importances_
(pd.DataFrame(importances, X.columns, columns=['importance']).sort_values(by='importance', ascending=True).plot(kind='barh', figsize=(4,10)))
Standard deviation across trees can be added as error bars.
import numpy as np
# spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in model.estimators_], axis=0)
feat_with_importance = pd.Series(importances, X.columns)
fig, ax = plt.subplots(figsize=(12,5))
feat_with_importance.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances")
ax.set_ylabel("Mean decrease in impurity")
10. Automatic Feature Selection with Scikit‑Learn
Scikit‑learn offers several feature selectors:
SelectKBest / chi‑square
SelectPercentile
SelectFromModel (e.g., L1‑regularized LinearSVC)
SequentialFeatureSelector (forward/backward)
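Of the selectors listed above, the chi‑square scorer accepts only non‑negative feature values. The sketch below shows one way to satisfy that, by min‑max scaling inside a pipeline; the synthetic data and `k=3` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
has_negative = bool((X_toy < 0).any())  # chi2 would reject this input as-is

# Scale to [0, 1] first, then score features with chi2 and keep the top 3
pipe = make_pipeline(MinMaxScaler(), SelectKBest(chi2, k=3))
X_best = pipe.fit_transform(X_toy, y_toy)
mask = pipe.named_steps["selectkbest"].get_support()

print("input has negative values:", has_negative)
print("selected column indices:", np.flatnonzero(mask).tolist())
```

For continuous features, scorers such as `f_classif` or `mutual_info_classif` are alternatives that do not need this rescaling.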
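from sklearn.feature_selection import SelectKBest, SelectPercentile, SelectFromModel, SequentialFeatureSelector, chi2
# select K best features (chi2 requires non-negative feature values)
X_best = SelectKBest(chi2, k=10).fit_transform(X, y)
# keep top 75% of features
X_top = SelectPercentile(chi2, percentile=75).fit_transform(X, y)
# L1‑regularized LinearSVC + SelectFromModel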
from sklearn.svm import LinearSVC
model = LinearSVC(penalty='l1', C=0.002, dual=False)
model.fit(X, y)
selector = SelectFromModel(estimator=model, prefit=True)
X_new = selector.transform(X)
feature_names = np.array(X.columns)
selected = feature_names[selector.get_support()]
# backward sequential selection with RandomForest
model = RandomForestClassifier(n_estimators=100, random_state=0)
selector = SequentialFeatureSelector(estimator=model, n_features_to_select=10, direction='backward', cv=2)
selector.fit(X, y)  # only the fitted support mask is needed here
feature_names = np.array(X.columns)
selected = feature_names[selector.get_support()]
11. Principal Component Analysis (PCA)
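PCA reduces dimensionality by projecting data onto orthogonal components that capture most of the variance. Strictly speaking it is feature extraction rather than selection: each component is a linear combination of the original features.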
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()  # assuming standardization; PCA is sensitive to feature scale
X_scaled = scaler.fit_transform(X)
pca = PCA()
pca.fit(X_scaled)
evr = pca.explained_variance_ratio_
plt.figure(figsize=(12,5))
# x-axis starts at 1 so it reads as a component count
plt.plot(range(1, len(evr) + 1), evr.cumsum(), marker='o', linestyle='--')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
In the demo, 20 components explain over 80% of the variance, so the model can be trained on these 20 principal components.
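Instead of reading the component count off the plot, scikit‑learn's `PCA` can pick it from a target variance share. The sketch below is a minimal example on synthetic data; the 0.80 target mirrors the text, while the data and the explicit `svd_solver="full"` setting are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_toy, _ = make_classification(
    n_samples=300, n_features=30, n_informative=10, n_redundant=10, random_state=0
)
X_scaled = StandardScaler().fit_transform(X_toy)

# A float n_components in (0, 1) keeps the fewest components that reach that variance share
pca = PCA(n_components=0.80, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print("components kept:", pca.n_components_)
print("variance explained:", round(pca.explained_variance_ratio_.sum(), 3))
```

The transformed matrix `X_reduced` can then be passed to a model in place of the original features.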
Summary
This guide provides a useful overview of various feature‑selection techniques. Before fitting a model, you can drop columns with many missing values, irrelevant or highly collinear features, and apply dimensionality reduction with PCA. After a baseline model is built, you can further prune features using coefficients, p‑values, VIF, and importance scores. While you won’t use every strategy in a single project, these methods give you a solid toolbox for creating efficient, interpretable models.
Source: DeepHub IMBA
