Comparative Analysis and Optimization of Machine Learning Models on the UCI Census Income Dataset
This article walks through a complete machine‑learning workflow on the UCI Census Income dataset, covering data exploration, preprocessing (including log‑transformation and scaling), model training with Naïve Bayes, Decision Tree and SVM, performance evaluation, hyper‑parameter tuning via grid search, feature importance analysis, and feature selection, providing code snippets and visualizations.
1. Data Exploration
The dataset is loaded from the UCI Machine Learning Repository and the first few rows are displayed to understand the mix of continuous (e.g., age) and categorical (e.g., workclass) features.
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from time import time
from IPython.display import display
import visuals as vs
%matplotlib inline
data = pd.read_csv("census.csv")
display(data.head(n=3))
Basic statistics reveal 45,222 records, with 24.78% of individuals earning more than $50,000.
# Count records and income distribution
n_records = data.shape[0]
n_at_most_50k, n_greater_50k = data.income.value_counts()
greater_percent = np.true_divide(n_greater_50k, n_records) * 100
print("Total number of records: {}".format(n_records))
print("Individuals making more than $50,000: {}".format(n_greater_50k))
print("Individuals making at most $50,000: {}".format(n_at_most_50k))
print("Percentage of individuals making more than $50,000: {:.2f}%".format(greater_percent))
1.1 Initial Data Exploration
The target variable income is binary (">50K" vs "<=50K").
2. Data Preparation
2.1 Transform Skewed Numerical Features
Features capital-gain and capital-loss exhibit heavy skew; a log transformation (adding 1 to avoid log(0)) is applied.
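A minimal sketch of why the +1 offset is needed and how strongly the transform compresses a skewed column. The sample values are invented for illustration and are not taken from the dataset; np.log1p is NumPy's built-in equivalent of np.log(x + 1).

```python
import numpy as np

# Hypothetical capital-gain values: mostly zero with a few very large entries
capital_gain = np.array([0.0, 0.0, 0.0, 2174.0, 14084.0, 99999.0])

# log(x + 1) keeps zeros at zero instead of producing -inf
log_transformed = np.log(capital_gain + 1)

# np.log1p computes the same quantity and is more accurate for small x
assert np.allclose(log_transformed, np.log1p(capital_gain))

# The value range shrinks from about 100,000 to about 11.5,
# so the outliers dominate the feature far less
print(capital_gain.max() - capital_gain.min())
print(log_transformed.max() - log_transformed.min())
```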
# Split features and target
income_raw = data['income']
features_raw = data.drop('income', axis=1)
vs.distribution(data) # original distribution
# Log‑transform skewed features
skewed = ['capital-gain', 'capital-loss']
features_log_transformed = pd.DataFrame(data=features_raw)
features_log_transformed[skewed] = features_raw[skewed].apply(lambda x: np.log(x + 1))
vs.distribution(features_log_transformed, transformed=True)
2.2 Continuous Feature Normalization
All continuous features are scaled to the [0, 1] range using MinMaxScaler.
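The scaling itself is the per-column formula (x - min) / (max - min). The sketch below applies that formula by hand with NumPy on made-up ages; MinMaxScaler, with its default feature_range of (0, 1), performs the same computation for each column.

```python
import numpy as np

# Hypothetical ages, used only to illustrate the formula
age = np.array([17.0, 25.0, 39.0, 50.0, 90.0])

# Min-max scaling: the smallest value maps to 0, the largest to 1
age_scaled = (age - age.min()) / (age.max() - age.min())

print(age_scaled)
```

Scaling changes only the range of each feature, not the shape of its distribution, which is why the log transform is applied first.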
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
features_log_minmax_transform = pd.DataFrame(data=features_log_transformed)
features_log_minmax_transform[numerical] = scaler.fit_transform(features_log_transformed[numerical])
display(features_log_minmax_transform.head(n=5))
vs.distribution(features_log_minmax_transform)
2.3 Data Preprocessing (One‑Hot Encoding)
Categorical attributes are converted to dummy variables with pd.get_dummies. The binary target is encoded as 0/1 using LabelEncoder.
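A toy illustration of both encodings, assuming a three-row frame invented for this sketch. The explicit mapping for the target is shown as an alternative to LabelEncoder; it gives the same 0/1 labels here because LabelEncoder sorts the class labels and "<=50K" sorts before ">50K".

```python
import pandas as pd

# Hypothetical three-row frame with one numeric and one categorical column
toy = pd.DataFrame({
    "age": [0.30, 0.45, 0.12],
    "workclass": ["Private", "State-gov", "Private"],
})

# Each category becomes its own indicator column; numeric columns pass through
encoded = pd.get_dummies(toy)
print(list(encoded.columns))

# Explicit 0/1 mapping for the binary target
income_raw_toy = pd.Series(["<=50K", ">50K", "<=50K"])
income_toy = income_raw_toy.map({"<=50K": 0, ">50K": 1})
print(income_toy.tolist())
```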
from sklearn.preprocessing import LabelEncoder
features_final = pd.get_dummies(features_log_minmax_transform)
encoder = LabelEncoder()
income = encoder.fit_transform(income_raw)
print("{} total features after one-hot encoding.".format(len(features_final.columns)))
2.4 Dataset Splitting
The processed data is split into training (80%) and testing (20%) sets.
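As a quick sanity check on the sizes to expect, the arithmetic can be sketched as follows. It assumes the test count is rounded up and the remainder goes to training, which is how scikit-learn's train_test_split is understood to behave when only test_size is given.

```python
import math

n_records = 45222   # record count reported in section 1
test_size = 0.2

# Assumption: test count is rounded up, training gets the rest
n_test = math.ceil(test_size * n_records)
n_train = n_records - n_test

print(n_train, n_test)
```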
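# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split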
X_train, X_test, y_train, y_test = train_test_split(features_final, income, test_size=0.2, random_state=0)
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
3. Model Performance Evaluation
3.1 Metrics and Naïve Classifier
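Accuracy and the F‑beta score (β = 0.5, which weights precision more heavily than recall) are used as evaluation metrics. A naïve predictor that always predicts ">50K" yields an accuracy of 0.2478, the share of positive records in the data.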
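To make the role of β concrete, here is a small sketch of the F‑beta formula applied to a predictor that labels every record ">50K". The helper name f_beta is introduced only for this illustration; the precision value is the 24.78% positive share reported earlier.

```python
def f_beta(precision, recall, beta):
    """F-beta score computed from precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Always predicting ">50K": every positive is found (recall = 1),
# and precision equals the share of positives in the data
precision, recall = 0.2478, 1.0

f_half = f_beta(precision, recall, beta=0.5)  # emphasises precision
f_two = f_beta(precision, recall, beta=2.0)   # emphasises recall

print(round(f_half, 4), round(f_two, 4))
```

Because this predictor has perfect recall but poor precision, it scores much lower with β = 0.5 than it would with β = 2, which is the behaviour wanted when false positives are the costlier error.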
# Naïve predictor metrics
TP = np.sum(income)
FP = len(income) - TP
accuracy = np.true_divide(TP, TP + FP)
recall = 1
precision = accuracy
fscore = (1 + 0.5**2) * (precision * recall) / ((0.5**2 * precision) + recall)
print("Naive Predictor: [Accuracy score: {:.4f}, F-score: {:.4f}]".format(accuracy, fscore))
3.2 Model Selection
Three supervised classifiers are considered: Gaussian Naïve Bayes, Decision Tree, and Support Vector Machine.
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
clf_A = GaussianNB()
clf_B = DecisionTreeClassifier(random_state=0)
clf_C = SVC(kernel='rbf')
3.3 Training and Prediction Pipeline
A helper function train_predict trains a learner on a specified sample size, records training and prediction times, and returns accuracy and F‑beta scores for both training and test sets.
from sklearn.metrics import fbeta_score, accuracy_score
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
    '''
    inputs:
       - learner: model to train
       - sample_size: number of training samples to use
       - X_train, y_train: training data
       - X_test, y_test: testing data
    '''
    results = {}
    # Fit on the first sample_size training rows and record the training time
    start = time()
    learner = learner.fit(X_train[:sample_size], y_train[:sample_size])
    end = time()
    results['train_time'] = end - start
    # Predict on the test set and on the first 300 training rows, timing both
    start = time()
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    end = time()
    results['pred_time'] = end - start
    # Accuracy and F-beta (beta = 0.5) on the training subset and the test set
    results['acc_train'] = accuracy_score(y_train[:300], predictions_train)
    results['acc_test'] = accuracy_score(y_test, predictions_test)
    results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta=0.5)
    results['f_test'] = fbeta_score(y_test, predictions_test, beta=0.5)
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
    return results
3.4 Initial Model Evaluation
The three models are evaluated on 1 %, 10 %, and 100 % of the training data.
# Determine sample sizes
samples_100 = len(y_train)
samples_10 = int(len(y_train) * 0.1)
samples_1 = int(len(y_train) * 0.01)
results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = train_predict(clf, samples, X_train, y_train, X_test, y_test)
# accuracy and fscore are the naive-predictor baselines computed in section 3.1
vs.evaluate(results, accuracy, fscore) # visualisation function from the original notebook
Observations: Decision Tree trains quickly and attains high training accuracy but overfits; SVM has a much longer training time but generalises better on the test set; Naïve Bayes is the fastest but yields lower accuracy.
4. Improving Results
4.1 Selecting the Best Model
Based on runtime, accuracy, and generalisation, the Decision Tree is chosen as the preferred model.
4.2 Model Hyperparameter Tuning
Grid search is employed to optimise max_depth and criterion for the Decision Tree using an F‑beta scorer.
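Grid search evaluates every combination of the listed values, so the cost grows with the product of the list lengths. The sketch below enumerates the grid used here in plain Python; the fold count n_folds is an assumption for illustration, since the article does not state which cross-validation setting is in effect.

```python
from itertools import product

parameters = {"max_depth": [2, 3, 4, 5, 6], "criterion": ["gini", "entropy"]}

# Every combination of the listed values is one candidate model
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[n] for n in names))]

n_folds = 5  # assumed fold count, not stated in the article
print(len(candidates), "candidates,",
      len(candidates) * n_folds, "cross-validation fits")
```

With the default refit=True, GridSearchCV additionally refits the best candidate once on the full training set, which is the model exposed as best_estimator_.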
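# GridSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed in scikit-learn 0.20)
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score
clf = DecisionTreeClassifier(random_state=0)
parameters = {'max_depth': [2, 3, 4, 5, 6], 'criterion': ['gini', 'entropy']}
scorer = make_scorer(fbeta_score, beta=0.5)
# scoring must be passed by keyword in current scikit-learn releases
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)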
grid_fit = grid_obj.fit(X_train, y_train)
best_clf = grid_fit.best_estimator_
# Compare unoptimized vs optimized model
predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta=0.5)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta=0.5)))
The optimized Decision Tree improves test accuracy from 0.8186 to 0.8523 and F‑beta from 0.6279 to 0.7224.
5. Feature Importance
5.1 Extracting Feature Importance
Algorithms that expose a .feature_importances_ attribute (e.g., AdaBoost) are used to rank features.
from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier()
model.fit(X_train, y_train)
importances = model.feature_importances_
vs.feature_plot(importances, X_train, y_train)
The plot shows the top five most influential features.
6. Feature Selection
Only the five highest‑importance features are retained, and the best model (the tuned Decision Tree) is retrained on this reduced set.
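The selection relies on np.argsort, which returns indices in ascending order, so the result is reversed before taking the first five. A toy sketch with invented feature names and importances (not the values from the AdaBoost model):

```python
import numpy as np

# Hypothetical importances for six made-up feature names
feature_names = np.array(["f0", "f1", "f2", "f3", "f4", "f5"])
toy_importances = np.array([0.05, 0.30, 0.10, 0.02, 0.33, 0.20])

# argsort is ascending, so reverse it and keep the first k indices
k = 5
top_idx = np.argsort(toy_importances)[::-1][:k]

print(feature_names[top_idx])
print(toy_importances[top_idx].sum())
```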
from sklearn.base import clone
X_train_reduced = X_train[X_train.columns.values[(np.argsort(importances)[::-1])[:5]]]
X_test_reduced = X_test[X_test.columns.values[(np.argsort(importances)[::-1])[:5]]]
clf = clone(best_clf).fit(X_train_reduced, y_train)
reduced_predictions = clf.predict(X_test_reduced)
print("Final Model trained on full data\n------")
print("Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta=0.5)))
print("\nFinal Model trained on reduced data\n------")
print("Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, reduced_predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, reduced_predictions, beta=0.5)))
Using only five features slightly reduces performance (accuracy ≈ 0.828, F‑beta ≈ 0.659) but greatly improves efficiency.
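To put the trade-off in numbers, the reported scores for the tuned tree on all features and on the five-feature subset can be compared directly. The figures below are the ones quoted in this article; the reduced-set values are approximate.

```python
# Scores reported in the article: tuned tree on all features vs. top five
full = {"accuracy": 0.8523, "f_beta": 0.7224}
reduced = {"accuracy": 0.828, "f_beta": 0.659}

# Absolute drop per metric when moving to the reduced feature set
drops = {metric: full[metric] - reduced[metric] for metric in full}

for metric, drop in drops.items():
    relative = 100 * drop / full[metric]
    print("{}: -{:.3f} absolute ({:.1f}% relative)".format(metric, drop, relative))
```

The F‑beta score falls noticeably more than accuracy does, so whether the reduction is acceptable depends on how much the precision-weighted metric matters for the application.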
Conclusion
The end‑to‑end workflow demonstrates how to explore, preprocess, evaluate, tune, and simplify machine‑learning models for a binary income‑prediction task, highlighting the trade‑offs between model complexity, runtime, and predictive performance.
