Comparative Analysis and Optimization of Machine Learning Models on the UCI Census Income Dataset

This article walks through a complete machine‑learning workflow on the UCI Census Income dataset, covering data exploration, preprocessing (including log‑transformation and scaling), model training with Naïve Bayes, Decision Tree and SVM, performance evaluation, hyper‑parameter tuning via grid search, feature importance analysis, and feature selection, providing code snippets and visualizations.

Architecture Digest

Feb 14, 2018

1. Data Exploration

The dataset is loaded from the UCI Machine Learning Repository and the first few rows are displayed to understand the mix of continuous (e.g., age) and categorical (e.g., workclass) features.

# Import libraries necessary for this project
import numpy as np
import pandas as pd
from time import time
from IPython.display import display
import visuals as vs
%matplotlib inline

data = pd.read_csv("census.csv")
display(data.head(n=3))

Basic statistics reveal 45,222 records, with 24.78% of individuals earning more than $50,000.

# Count records and income distribution
n_records = data.shape[0]
n_at_most_50k, n_greater_50k = data.income.value_counts()
greater_percent = np.true_divide(n_greater_50k, n_records) * 100
print("Total number of records: {}".format(n_records))
print("Individuals making more than $50,000: {}".format(n_greater_50k))
print("Individuals making at most $50,000: {}".format(n_at_most_50k))
print("Percentage of individuals making more than $50,000: {:.2f}%".format(greater_percent))

1.1 Initial Data Exploration

The target variable income is binary (">50K" vs "<=50K").

2. Data Preparation

2.1 Transform Skewed Numerical Features

Features capital-gain and capital-loss exhibit heavy skew; a log transformation (adding 1 to avoid log(0)) is applied.

# Split features and target
income_raw = data['income']
features_raw = data.drop('income', axis=1)
vs.distribution(data)  # original distribution

# Log‑transform skewed features
skewed = ['capital-gain', 'capital-loss']
features_log_transformed = pd.DataFrame(data=features_raw)
features_log_transformed[skewed] = features_raw[skewed].apply(lambda x: np.log(x + 1))
vs.distribution(features_log_transformed, transformed=True)
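
As a rough illustration of why the log transform helps, the sketch below applies it to a small, made-up array shaped like a right-skewed feature (the numbers are invented, not taken from the dataset). It uses np.log1p, NumPy's built-in for log(1 + x), which should give the same result as the lambda above.

```python
import numpy as np

# Hypothetical values: mostly zeros with a few very large entries,
# mimicking a heavily right-skewed feature such as capital-gain.
values = np.array([0.0, 0.0, 0.0, 500.0, 2000.0, 99999.0])

# log1p(x) computes log(1 + x), so zero inputs map to zero instead of -inf.
transformed = np.log1p(values)

# Compare how far the largest value sits from the second largest,
# before and after the transform.
ratio_before = values.max() / values[-2]
ratio_after = transformed.max() / transformed[-2]
print(ratio_before, ratio_after)
```

The gap between the extreme value and the rest shrinks considerably after the transform, which is the effect the distribution plots are meant to show.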

2.2 Continuous Feature Normalization

All continuous features are scaled to the [0, 1] range using MinMaxScaler.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
features_log_minmax_transform = pd.DataFrame(data=features_log_transformed)
features_log_minmax_transform[numerical] = scaler.fit_transform(features_log_transformed[numerical])
display(features_log_minmax_transform.head(n=5))
vs.distribution(features_log_minmax_transform)
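
The rescaling itself is a simple linear map, x' = (x - min) / (max - min). A minimal sketch of that formula follows; the helper name min_max_scale and the sample ages are hypothetical, and the handling of a constant column is an assumption made for this example rather than a statement about MinMaxScaler.

```python
import numpy as np

def min_max_scale(column):
    """Rescale a 1-D array linearly so its minimum maps to 0 and its maximum to 1."""
    column = np.asarray(column, dtype=float)
    col_min, col_max = column.min(), column.max()
    if col_max == col_min:
        # Constant column: there is no range to spread over, so return zeros
        # (a choice made for this sketch).
        return np.zeros_like(column)
    return (column - col_min) / (col_max - col_min)

ages = np.array([17, 30, 45, 90])  # made-up ages
scaled = min_max_scale(ages)
print(scaled)
```

Because the map is linear, the shape of each feature's distribution is unchanged; only its range is.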

2.3 Data Preprocessing (One‑Hot Encoding)

Categorical attributes are converted to dummy variables with pd.get_dummies. The binary target is encoded as 0/1 using LabelEncoder.

from sklearn.preprocessing import LabelEncoder
features_final = pd.get_dummies(features_log_minmax_transform)
encoder = LabelEncoder()
income = encoder.fit_transform(income_raw)
print("{} total features after one-hot encoding.".format(len(features_final.columns)))
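
To make the encoding step concrete, here is a small sketch on a made-up three-row frame (the values are illustrative only). It shows how get_dummies expands a text column, and it writes out the 0/1 target mapping explicitly; LabelEncoder should assign the same codes because it numbers the class labels in sorted order.

```python
import pandas as pd

# Tiny made-up frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
})
toy_income = pd.Series(["<=50K", ">50K", "<=50K"])

# get_dummies leaves numeric columns alone and expands each category
# of a text column into its own indicator column.
toy_encoded = pd.get_dummies(toy)
print(list(toy_encoded.columns))

# Explicit 0/1 mapping for the binary target.
toy_target = toy_income.map({"<=50K": 0, ">50K": 1})
print(toy_target.tolist())
```

Each categorical column contributes one new column per distinct value, which is why the feature count grows after encoding.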

2.4 Dataset Splitting

The processed data is split into training (80%) and testing (20%) sets.

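# train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split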
X_train, X_test, y_train, y_test = train_test_split(features_final, income, test_size=0.2, random_state=0)
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
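
Since only about a quarter of the records are positive, one optional variation is to stratify the split so both sets keep the same class ratio. The sketch below is not what the original notebook does; it uses synthetic stand-in data and the stratify argument of train_test_split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, a quarter of them positive.
rng = np.random.RandomState(0)
X_demo = rng.rand(1000, 3)
y_demo = np.array([1] * 250 + [0] * 750)

# stratify=y_demo keeps the positive rate the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape[0], X_te.shape[0])
print(y_tr.mean(), y_te.mean())
```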

3. Model Performance Evaluation

3.1 Metrics and Naïve Classifier

Accuracy and the F‑beta score (β = 0.5, which weights precision more heavily than recall) are used as evaluation metrics. A naïve predictor that always predicts ">50K" yields an accuracy of 0.2478, equal to the share of positive records.

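# Naïve predictor metrics: every record is predicted as ">50K" (label 1)
TP = np.sum(income)        # true positives: records that really are ">50K"
FP = len(income) - TP      # false positives: all remaining records
accuracy = np.true_divide(TP, TP + FP)
recall = 1                 # no positive is missed, so there are no false negatives
precision = accuracy       # TP / (TP + FP), identical to accuracy for this predictor
fscore = (1 + 0.5**2) * (precision * recall) / ((0.5**2 * precision) + recall)
print("Naive Predictor: [Accuracy score: {:.4f}, F-score: {:.4f}]".format(accuracy, fscore))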
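
The F‑beta formula used here is F_β = (1 + β²) · precision · recall / (β² · precision + recall). A minimal sketch of it as a reusable function follows; the name f_beta is hypothetical, and the 0.2478 input is the positive rate reported in the data exploration step.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score from precision and recall; beta < 1 favours precision."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # convention chosen for this sketch when both inputs are zero
    return (1 + beta ** 2) * precision * recall / denominator

# Always predicting the positive class: recall is 1 and precision equals
# the positive rate (24.78%).
naive_f = f_beta(precision=0.2478, recall=1.0)
print("{:.4f}".format(naive_f))
```

With β = 0.5 the score stays close to the low precision even though recall is perfect, which is why this baseline is easy to beat.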

3.2 Model Selection

Three supervised classifiers are considered: Gaussian Naïve Bayes, Decision Tree, and Support Vector Machine.

from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

clf_A = GaussianNB()
clf_B = DecisionTreeClassifier(random_state=0)
clf_C = SVC(kernel='rbf')

3.3 Training and Prediction Pipeline

A helper function train_predict trains a learner on a specified sample size, records training and prediction times, and returns accuracy and F‑beta scores for both training and test sets.

from sklearn.metrics import fbeta_score, accuracy_score

def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
    '''
    inputs:
        - learner: model to train
        - sample_size: number of training samples to use
        - X_train, y_train: training data
        - X_test, y_test: testing data
    '''
    results = {}
    start = time()
    learner = learner.fit(X_train[:sample_size], y_train[:sample_size])
    end = time()
    results['train_time'] = end - start
    start = time()
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    end = time()
    results['pred_time'] = end - start
    results['acc_train'] = accuracy_score(y_train[:300], predictions_train)
    results['acc_test'] = accuracy_score(y_test, predictions_test)
    results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta=0.5)
    results['f_test'] = fbeta_score(y_test, predictions_test, beta=0.5)
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
    return results

3.4 Initial Model Evaluation

The three models are evaluated on 1 %, 10 %, and 100 % of the training data.

# Determine sample sizes
samples_100 = len(y_train)
samples_10 = int(len(y_train) * 0.1)
samples_1 = int(len(y_train) * 0.01)

results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = train_predict(clf, samples, X_train, y_train, X_test, y_test)

vs.evaluate(results, accuracy, fscore)  # visualisation function from the original notebook

Observations: the Decision Tree trains quickly and attains high training accuracy but overfits; the SVM takes much longer to train but generalises better on the test set; Naïve Bayes is the fastest of the three but yields lower accuracy.

4. Improving Results

4.1 Selecting the Best Model

Weighing runtime against accuracy, the Decision Tree is chosen as the preferred model; its tendency to overfit is addressed by restricting max_depth during tuning.

4.2 Model Hyperparameter Tuning

Grid search is employed to optimise max_depth and criterion for the Decision Tree using an F‑beta scorer.

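# GridSearchCV lives in sklearn.model_selection; sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.model_selection import GridSearchCV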
from sklearn.metrics import make_scorer, fbeta_score

clf = DecisionTreeClassifier(random_state=0)
parameters = {'max_depth': (2,3,4,5,6), 'criterion': ['gini','entropy']}
scorer = make_scorer(fbeta_score, beta=0.5)
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)  # pass the scorer by keyword
grid_fit = grid_obj.fit(X_train, y_train)
best_clf = grid_fit.best_estimator_

# Compare unoptimized vs optimized model
predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta=0.5)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta=0.5)))

The optimized Decision Tree improves test accuracy from 0.8186 to 0.8523 and F‑beta from 0.6279 to 0.7224.
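
To see which combination the search settles on, the fitted object can be inspected. The sketch below runs the same parameter grid on synthetic data from make_classification (an assumption made to keep the example self-contained; cv=3 is also chosen here only for speed), so its output says nothing about the census results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the census features.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {"max_depth": [2, 3, 4, 5, 6], "criterion": ["gini", "entropy"]}
scorer = make_scorer(fbeta_score, beta=0.5)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, scoring=scorer, cv=3
)
search.fit(X_demo, y_demo)

# best_params_ holds the winning combination; best_score_ is its mean
# cross-validated F-beta, not a test-set score.
print(search.best_params_)
print(round(search.best_score_, 4))
print(len(search.cv_results_["params"]))  # number of combinations tried
```

The grid has 5 depths and 2 criteria, so ten candidate models are cross-validated before the best one is refitted on the full training data.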

5. Feature Importance

5.1 Extracting Feature Importance

An ensemble model that exposes a .feature_importances_ attribute (AdaBoost here) is fitted on the training data, and its importances are used to rank the features.

from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier()
model.fit(X_train, y_train)
importances = model.feature_importances_
vs.feature_plot(importances, X_train, y_train)

The plot shows the top five most influential features.

6. Feature Selection

Only the five highest‑importance features are retained, and the best model (the tuned Decision Tree) is retrained on this reduced set.

from sklearn.base import clone
# Names of the five most important features, highest importance first
top_features = X_train.columns.values[(np.argsort(importances)[::-1])[:5]]
X_train_reduced = X_train[top_features]
X_test_reduced = X_test[top_features]
clf = clone(best_clf).fit(X_train_reduced, y_train)
reduced_predictions = clf.predict(X_test_reduced)
print("Final Model trained on full data\n------")
print("Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta=0.5)))
print("\nFinal Model trained on reduced data\n------")
print("Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, reduced_predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, reduced_predictions, beta=0.5)))

Using only five features slightly reduces performance (accuracy ≈ 0.828, F‑beta ≈ 0.659) but shrinks the input the model must process, which shortens training and prediction.
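
The column selection relies on np.argsort, which sorts in ascending order, so the index array is reversed before the top entries are taken. A small sketch with made-up importances and hypothetical column names (f0 to f5) shows the mechanics:

```python
import numpy as np
import pandas as pd

# Made-up importances for six hypothetical feature columns.
columns = ["f0", "f1", "f2", "f3", "f4", "f5"]
demo_importances = np.array([0.05, 0.30, 0.10, 0.25, 0.20, 0.10])
X_demo = pd.DataFrame(np.zeros((2, 6)), columns=columns)

# argsort sorts ascending, so reverse it and keep the first k indices.
k = 3
top_idx = np.argsort(demo_importances)[::-1][:k]
top_columns = X_demo.columns[top_idx]

# Index the frame by the selected names to get the reduced feature set.
X_demo_reduced = X_demo[top_columns]
print(list(top_columns))
```

Applying the same column names to both the training and test frames keeps the two reduced sets aligned.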

Conclusion

The end‑to‑end workflow demonstrates how to explore, preprocess, evaluate, tune, and simplify machine‑learning models for a binary income‑prediction task, highlighting the trade‑offs between model complexity, runtime, and predictive performance.
