
Feature Selection Techniques for the Kaggle Mushroom Classification Dataset Using Python

This tutorial explains why and how to reduce the number of features in the Kaggle Mushroom Classification dataset with Python, covering preprocessing, various feature‑selection methods (filter, wrapper, embedded), code examples, model training, performance impact, and visualisation of results.

DataFunTalk

According to Forbes, about 2.5 quintillion bytes of data are generated every day, and before any statistical analysis this raw data must be pre‑processed. This article is a plain‑language guide to reducing the feature count of the Kaggle Mushroom Classification dataset with Python.

Reducing the number of features used during statistical analysis can improve model accuracy, lower over‑fitting risk, speed up training, enhance data visualisation, and increase model interpretability.

Feature‑selection methods fall into three categories: (1) filter methods that select subsets based on statistical measures such as Pearson correlation; (2) wrapper methods that evaluate subsets with a machine‑learning model (e.g., forward/backward/recursive feature elimination); and (3) embedded methods that rank features during model training.

Data loading and preprocessing:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv('mushrooms.csv')  # the Kaggle Mushroom Classification dataset
X = df.drop(['class'], axis=1)
Y = df['class']
# one-hot encode the categorical features and label-encode the target
X = pd.get_dummies(X, prefix_sep='_')
Y = LabelEncoder().fit_transform(Y)
X2 = StandardScaler().fit_transform(X)
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X2, Y, test_size=0.30, random_state=101)

Training a RandomForestClassifier on all features yields about 2.2 seconds of training time and 100 % accuracy. Feature importance is plotted, showing the top seven features.
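This baseline step can be sketched as follows. The sketch uses synthetic data from scikit-learn's make_classification as a stand-in for the one-hot-encoded mushroom features, and fewer trees (100 rather than the article's 700) to keep it quick; it ranks features by importance rather than plotting them.

```python
# Baseline: train a random forest on all features, time it, and rank
# feature importances. Synthetic data stands in for the mushroom dummies.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X2, Y = make_classification(n_samples=1000, n_features=20, random_state=101)
X_Train, X_Test, Y_Train, Y_Test = train_test_split(
    X2, Y, test_size=0.30, random_state=101)

start = time.process_time()
trainedforest = RandomForestClassifier(n_estimators=100).fit(X_Train, Y_Train)
print("Training time:", time.process_time() - start)
print("Accuracy:", trainedforest.score(X_Test, Y_Test))

# Rank features by importance and keep the top seven for inspection.
top7 = np.argsort(trainedforest.feature_importances_)[::-1][:7]
print("Top 7 feature indices:", top7)
```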

Using only the three most important features reduces training time by half while decreasing accuracy by only 0.03 %.
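A sketch of the reduced-feature run, again on synthetic stand-in data with 100 trees: take the three highest-importance columns from a full-feature forest and retrain on only those.

```python
# Retrain on just the three most important features and compare accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X2, Y = make_classification(n_samples=1000, n_features=20, random_state=101)
X_Train, X_Test, Y_Train, Y_Test = train_test_split(
    X2, Y, test_size=0.30, random_state=101)

full = RandomForestClassifier(n_estimators=100, random_state=101).fit(X_Train, Y_Train)
top3 = np.argsort(full.feature_importances_)[::-1][:3]

# Slice both splits down to the selected columns before retraining.
small = RandomForestClassifier(n_estimators=100, random_state=101).fit(
    X_Train[:, top3], Y_Train)
print("Accuracy with 3 features:", small.score(X_Test[:, top3], Y_Test))
```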

A decision‑tree visualisation further confirms that the top features are the most influential for classification.
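One way to sketch this inspection, on synthetic stand-in data: fit a shallow decision tree and dump its structure with sklearn.tree.export_text (sklearn.tree.plot_tree gives the same view graphically).

```python
# Fit a shallow decision tree and print its split structure as text.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, Y = make_classification(n_samples=500, n_features=5, random_state=101)
tree = DecisionTreeClassifier(max_depth=3, random_state=101).fit(X, Y)

# The text dump shows which features dominate the top splits.
rules = export_text(tree, feature_names=[f"f{i}" for i in range(5)])
print(rules)
```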

Recursive Feature Elimination (RFE):

import time
from sklearn.feature_selection import RFE
model = RandomForestClassifier(n_estimators=700)
# keep the 4 strongest features by recursively dropping the weakest
rfe = RFE(model, n_features_to_select=4)
start = time.process_time()
rfe = rfe.fit(X_Train, Y_Train)
print(time.process_time() - start)
# score() reduces X_Test to the selected features internally
print("Overall Accuracy using RFE:", rfe.score(X_Test, Y_Test))

SelectFromModel with an ExtraTreesClassifier:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import confusion_matrix, classification_report
model = ExtraTreesClassifier()
start = time.process_time()
model = model.fit(X_Train, Y_Train)
# wrap the fitted estimator so transform() keeps only the important features
model = SelectFromModel(model, prefit=True)
print(time.process_time() - start)
Selected_X = model.transform(X_Train)
trainedforest = RandomForestClassifier(n_estimators=700).fit(Selected_X, Y_Train)
Selected_X_Test = model.transform(X_Test)
predictionforest = trainedforest.predict(Selected_X_Test)
print(confusion_matrix(Y_Test, predictionforest))
print(classification_report(Y_Test, predictionforest))

Correlation‑matrix analysis selects features whose absolute correlation with the target exceeds 0.5:

Numeric_df = pd.DataFrame(X)
Numeric_df['Y'] = Y
corr = Numeric_df.corr()
corr_y = abs(corr["Y"])
highest_corr = corr_y[corr_y > 0.5]
print(highest_corr.sort_values(ascending=True))
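The end-to-end idea, selecting by correlation and then retraining, can be sketched on hand-built synthetic data where one column ("strong", a name invented for this example) is correlated with the target by construction and the rest are pure noise:

```python
# Filter-style selection: keep columns whose |correlation| with the
# target exceeds 0.5, then train only on those. Synthetic data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(101)
Y = rng.integers(0, 2, 1000)
Numeric_df = pd.DataFrame(rng.normal(size=(1000, 4)),
                          columns=[f"noise{i}" for i in range(4)])
Numeric_df["strong"] = Y + rng.normal(0, 0.5, 1000)  # correlated by design
Numeric_df["Y"] = Y

corr_y = Numeric_df.corr()["Y"].abs()
# drop the target's self-correlation (always 1) from the selection
selected = corr_y[corr_y > 0.5].index.drop("Y")
print("Selected:", list(selected))

forest = RandomForestClassifier(n_estimators=100, random_state=101).fit(
    Numeric_df[selected], Y)
print("Train accuracy on selected features:", forest.score(Numeric_df[selected], Y))
```

Only the deliberately correlated column survives the threshold; the noise columns are filtered out without ever fitting a model to them.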

Univariate feature selection using chi2 (SelectKBest) after scaling:

from sklearn import preprocessing
from sklearn.feature_selection import SelectKBest, chi2

# chi2 requires non-negative inputs, hence the [0, 1] scaling
min_max_scaler = preprocessing.MinMaxScaler()
Scaled_X = min_max_scaler.fit_transform(X2)
X_new = SelectKBest(chi2, k=2).fit_transform(Scaled_X, Y)
X_Train3, X_Test3, Y_Train3, Y_Test3 = train_test_split(X_new, Y, test_size=0.30, random_state=101)
trainedforest = RandomForestClassifier(n_estimators=700).fit(X_Train3, Y_Train3)
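The same pipeline can be run end to end on synthetic stand-in data (100 trees rather than 700, to keep it quick), with the held-out accuracy printed so the k=2 result can be compared against the full-feature baseline:

```python
# Univariate chi2 selection end to end: scale to [0, 1] (chi2 needs
# non-negative inputs), keep the 2 best-scoring features, retrain.
from sklearn import preprocessing
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

Xarr, Y = make_classification(n_samples=1000, n_features=10, random_state=101)
Scaled_X = preprocessing.MinMaxScaler().fit_transform(Xarr)
X_new = SelectKBest(chi2, k=2).fit_transform(Scaled_X, Y)

X_Train3, X_Test3, Y_Train3, Y_Test3 = train_test_split(
    X_new, Y, test_size=0.30, random_state=101)
trainedforest = RandomForestClassifier(n_estimators=100).fit(X_Train3, Y_Train3)
print("chi2 (k=2) accuracy:", trainedforest.score(X_Test3, Y_Test3))
```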

Lasso regression (LassoCV) is used as a regularisation method to shrink coefficients of irrelevant features to zero:

from sklearn.linear_model import LassoCV
regr = LassoCV(cv=5, random_state=101)
regr.fit(X_Train, Y_Train)
print("LassoCV Best Alpha Scored:", regr.alpha_)
print("LassoCV Model Accuracy:", regr.score(X_Test, Y_Test))
# one coefficient per one-hot column, so index by all of X's columns
model_coef = pd.Series(regr.coef_, index=X.columns)
print("Variables Eliminated:", sum(model_coef == 0))
print("Variables Kept:", sum(model_coef != 0))

Plots of feature importance from the Lasso model highlight the most influential variables, aiding model interpretability.
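A sketch of that inspection step, on synthetic stand-in data: list which coefficients LassoCV zeroed out and rank the survivors by magnitude (in the article, the Series index would be the one-hot mushroom column names and the ranking would be bar-plotted).

```python
# Rank the features LassoCV kept, by absolute coefficient size.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV

Xarr, Y = make_classification(n_samples=500, n_features=10, n_informative=3,
                              random_state=101)
cols = [f"f{i}" for i in range(10)]  # placeholder column names

regr = LassoCV(cv=5, random_state=101).fit(Xarr, Y)
model_coef = pd.Series(regr.coef_, index=cols)
print("Variables eliminated:", int((model_coef == 0).sum()))

# Survivors, largest absolute coefficient first.
survivors = model_coef[model_coef != 0].abs().sort_values(ascending=False)
print(survivors)
# survivors.plot(kind="barh")  # uncomment to reproduce the article's plot
```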

The article concludes that a combination of filter, wrapper, and embedded techniques can effectively reduce dimensionality while preserving, or even improving, predictive performance on the mushroom classification task.

Tags: machine learning, Python, data preprocessing, feature selection, random forest, scikit-learn, Mushroom dataset
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
