
Feature Selection Techniques for the Kaggle Mushroom Classification Dataset Using Python

This tutorial explains why and how to reduce the number of features in the Kaggle Mushroom Classification dataset with Python, covering preprocessing, various feature‑selection methods (filter, wrapper, embedded), code examples, model training, performance impact, and visualisation of results.

DataFunTalk

According to Forbes, about 2.5 quintillion bytes of data are generated every day, and before any statistical analysis this raw data must be pre‑processed. This article is a plain‑language guide to reducing the feature count of the Kaggle Mushroom Classification dataset with Python.

Reducing the number of features used during statistical analysis can improve model accuracy, lower over‑fitting risk, speed up training, enhance data visualisation, and increase model interpretability.

Feature‑selection methods fall into three categories: (1) filter methods that select subsets based on statistical measures such as Pearson correlation; (2) wrapper methods that evaluate subsets with a machine‑learning model (e.g., forward/backward/recursive feature elimination); and (3) embedded methods that rank features during model training.

Data loading and preprocessing:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv('mushrooms.csv')  # the Kaggle Mushroom Classification dataset
X = df.drop(['class'], axis=1)
Y = df['class']
# one-hot encode the categorical features and label-encode the target
X = pd.get_dummies(X, prefix_sep='_')
Y = LabelEncoder().fit_transform(Y)
X2 = StandardScaler().fit_transform(X)
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X2, Y, test_size=0.30, random_state=101)

Training a RandomForestClassifier on all features yields about 2.2 seconds of training time and 100 % accuracy. Feature importance is plotted, showing the top seven features.
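This baseline step can be sketched as follows. The sketch uses synthetic data from scikit-learn's make_classification as a stand-in for the one-hot-encoded mushroom features, and fewer trees (100 rather than the article's 700) to keep it quick; it ranks features by importance rather than plotting them.

```python
# Baseline: train a random forest on all features, time it, and rank
# feature importances. Synthetic data stands in for the mushroom dummies.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X2, Y = make_classification(n_samples=1000, n_features=20, random_state=101)
X_Train, X_Test, Y_Train, Y_Test = train_test_split(
    X2, Y, test_size=0.30, random_state=101)

start = time.process_time()
trainedforest = RandomForestClassifier(n_estimators=100).fit(X_Train, Y_Train)
print("Training time:", time.process_time() - start)
print("Accuracy:", trainedforest.score(X_Test, Y_Test))

# Rank features by importance and keep the top seven for inspection.
top7 = np.argsort(trainedforest.feature_importances_)[::-1][:7]
print("Top 7 feature indices:", top7)
```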

Using only the three most important features reduces training time by half while decreasing accuracy by only 0.03 %.
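A sketch of the reduced-feature run, again on synthetic stand-in data with 100 trees: take the three highest-importance columns from a full-feature forest and retrain on only those.

```python
# Retrain on just the three most important features and compare accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X2, Y = make_classification(n_samples=1000, n_features=20, random_state=101)
X_Train, X_Test, Y_Train, Y_Test = train_test_split(
    X2, Y, test_size=0.30, random_state=101)

full = RandomForestClassifier(n_estimators=100, random_state=101).fit(X_Train, Y_Train)
top3 = np.argsort(full.feature_importances_)[::-1][:3]

# Slice both splits down to the selected columns before retraining.
small = RandomForestClassifier(n_estimators=100, random_state=101).fit(
    X_Train[:, top3], Y_Train)
print("Accuracy with 3 features:", small.score(X_Test[:, top3], Y_Test))
```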

A decision‑tree visualisation further confirms that the top features are the most influential for classification.
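One way to sketch this inspection, on synthetic stand-in data: fit a shallow decision tree and dump its structure with sklearn.tree.export_text (sklearn.tree.plot_tree gives the same view graphically).

```python
# Fit a shallow decision tree and print its split structure as text.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, Y = make_classification(n_samples=500, n_features=5, random_state=101)
tree = DecisionTreeClassifier(max_depth=3, random_state=101).fit(X, Y)

# The text dump shows which features dominate the top splits.
rules = export_text(tree, feature_names=[f"f{i}" for i in range(5)])
print(rules)
```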

Recursive Feature Elimination (RFE):

import time
from sklearn.feature_selection import RFE
model = RandomForestClassifier(n_estimators=700)
# keep the 4 strongest features by recursively dropping the weakest
rfe = RFE(model, n_features_to_select=4)
start = time.process_time()
rfe = rfe.fit(X_Train, Y_Train)
print(time.process_time() - start)
# score() reduces X_Test to the selected features internally
print("Overall Accuracy using RFE:", rfe.score(X_Test, Y_Test))

SelectFromModel with an ExtraTreesClassifier:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import confusion_matrix, classification_report
model = ExtraTreesClassifier()
start = time.process_time()
model = model.fit(X_Train, Y_Train)
# wrap the fitted estimator so transform() keeps only the important features
model = SelectFromModel(model, prefit=True)
print(time.process_time() - start)
Selected_X = model.transform(X_Train)
trainedforest = RandomForestClassifier(n_estimators=700).fit(Selected_X, Y_Train)
Selected_X_Test = model.transform(X_Test)
predictionforest = trainedforest.predict(Selected_X_Test)
print(confusion_matrix(Y_Test, predictionforest))
print(classification_report(Y_Test, predictionforest))

Correlation‑matrix analysis selects features whose absolute correlation with the target exceeds 0.5:

Numeric_df = pd.DataFrame(X)
Numeric_df['Y'] = Y
corr = Numeric_df.corr()
corr_y = abs(corr["Y"])
highest_corr = corr_y[corr_y > 0.5]
print(highest_corr.sort_values(ascending=True))
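The end-to-end idea, selecting by correlation and then retraining, can be sketched on hand-built synthetic data where one column ("strong", a name invented for this example) is correlated with the target by construction and the rest are pure noise:

```python
# Filter-style selection: keep columns whose |correlation| with the
# target exceeds 0.5, then train only on those. Synthetic data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(101)
Y = rng.integers(0, 2, 1000)
Numeric_df = pd.DataFrame(rng.normal(size=(1000, 4)),
                          columns=[f"noise{i}" for i in range(4)])
Numeric_df["strong"] = Y + rng.normal(0, 0.5, 1000)  # correlated by design
Numeric_df["Y"] = Y

corr_y = Numeric_df.corr()["Y"].abs()
# drop the target's self-correlation (always 1) from the selection
selected = corr_y[corr_y > 0.5].index.drop("Y")
print("Selected:", list(selected))

forest = RandomForestClassifier(n_estimators=100, random_state=101).fit(
    Numeric_df[selected], Y)
print("Train accuracy on selected features:", forest.score(Numeric_df[selected], Y))
```

Only the deliberately correlated column survives the threshold; the noise columns are filtered out without ever fitting a model to them.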

Univariate feature selection using chi2 (SelectKBest) after scaling:

from sklearn import preprocessing
from sklearn.feature_selection import SelectKBest, chi2

# chi2 requires non-negative inputs, hence the [0, 1] scaling
min_max_scaler = preprocessing.MinMaxScaler()
Scaled_X = min_max_scaler.fit_transform(X2)
X_new = SelectKBest(chi2, k=2).fit_transform(Scaled_X, Y)
X_Train3, X_Test3, Y_Train3, Y_Test3 = train_test_split(X_new, Y, test_size=0.30, random_state=101)
trainedforest = RandomForestClassifier(n_estimators=700).fit(X_Train3, Y_Train3)
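The same pipeline can be run end to end on synthetic stand-in data (100 trees rather than 700, to keep it quick), with the held-out accuracy printed so the k=2 result can be compared against the full-feature baseline:

```python
# Univariate chi2 selection end to end: scale to [0, 1] (chi2 needs
# non-negative inputs), keep the 2 best-scoring features, retrain.
from sklearn import preprocessing
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

Xarr, Y = make_classification(n_samples=1000, n_features=10, random_state=101)
Scaled_X = preprocessing.MinMaxScaler().fit_transform(Xarr)
X_new = SelectKBest(chi2, k=2).fit_transform(Scaled_X, Y)

X_Train3, X_Test3, Y_Train3, Y_Test3 = train_test_split(
    X_new, Y, test_size=0.30, random_state=101)
trainedforest = RandomForestClassifier(n_estimators=100).fit(X_Train3, Y_Train3)
print("chi2 (k=2) accuracy:", trainedforest.score(X_Test3, Y_Test3))
```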

Lasso regression (LassoCV) is used as a regularisation method to shrink coefficients of irrelevant features to zero:

from sklearn.linear_model import LassoCV
regr = LassoCV(cv=5, random_state=101)
regr.fit(X_Train, Y_Train)
print("LassoCV Best Alpha Scored:", regr.alpha_)
print("LassoCV Model Accuracy:", regr.score(X_Test, Y_Test))
# one coefficient per one-hot column, so index by all of X's columns
model_coef = pd.Series(regr.coef_, index=X.columns)
print("Variables Eliminated:", sum(model_coef == 0))
print("Variables Kept:", sum(model_coef != 0))

Plots of feature importance from the Lasso model highlight the most influential variables, aiding model interpretability.
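A sketch of that inspection step, on synthetic stand-in data: list which coefficients LassoCV zeroed out and rank the survivors by magnitude (in the article, the Series index would be the one-hot mushroom column names and the ranking would be bar-plotted).

```python
# Rank the features LassoCV kept, by absolute coefficient size.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV

Xarr, Y = make_classification(n_samples=500, n_features=10, n_informative=3,
                              random_state=101)
cols = [f"f{i}" for i in range(10)]  # placeholder column names

regr = LassoCV(cv=5, random_state=101).fit(Xarr, Y)
model_coef = pd.Series(regr.coef_, index=cols)
print("Variables eliminated:", int((model_coef == 0).sum()))

# Survivors, largest absolute coefficient first.
survivors = model_coef[model_coef != 0].abs().sort_values(ascending=False)
print(survivors)
# survivors.plot(kind="barh")  # uncomment to reproduce the article's plot
```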

The article concludes that a combination of filter, wrapper, and embedded techniques can effectively reduce dimensionality while preserving, or even improving, predictive performance on the mushroom classification task.

Tags: machine learning, Python, data preprocessing, feature selection, random forest, scikit-learn, Mushroom dataset
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
