Calculating Common Classification Evaluation Metrics Using Confusion Matrix with sklearn, TensorFlow, and Manual Methods
This tutorial explains how to compute accuracy, precision, recall, F1‑score, and ROC‑AUC from a confusion matrix using sklearn, TensorFlow, and hand‑crafted Python code, illustrating each metric with example data and visualizations.
Classification Evaluation Metrics
Continuing from the previous article on confusion‑matrix visualization, this article demonstrates how to compute common evaluation metrics—accuracy, precision, recall, F1‑score, and ROC‑AUC—using three approaches: sklearn, TensorFlow, and manual calculations based on a hand‑crafted confusion matrix.
Imports
import numpy as np
import pandas as pd
import sklearn.metrics
import tensorflow as tf
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc, accuracy_score, RocCurveDisplay
Accuracy
Accuracy measures the proportion of correctly classified samples among all samples; it can be misleading on imbalanced data.
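As a quick illustration of that caveat, consider a made-up 9-to-1 label split (the labels below are invented for demonstration): a degenerate classifier that always predicts the majority class still scores 90% accuracy.

```python
import sklearn.metrics

# Made-up imbalanced labels: 9 negatives, 1 positive.
y_true = [0] * 9 + [1]
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 10

acc = sklearn.metrics.accuracy_score(y_true=y_true, y_pred=y_pred)
print(acc)  # 0.9, despite never detecting the positive class
```

This is why precision and recall, covered next, are usually reported alongside accuracy on imbalanced data.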
sklearn.metrics.accuracy_score
# set prediction results
pred = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
# set true labels
true = [0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 4, 5]
accuracy = sklearn.metrics.accuracy_score(y_true=true, y_pred=pred)
print(accuracy)
Output: 0.8888888888888888
tf.keras.metrics.Accuracy
accuracy = tf.keras.metrics.Accuracy()
accuracy.update_state(y_true=true, y_pred=pred)
print(accuracy.result().numpy())
Output: 0.8888889
Precision
Precision evaluates, for a specific class, how many predicted positives are truly positive.
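The definition can be checked by hand against the example lists used throughout this article; this sketch computes precision for class 4 (chosen because it is the only class with false positives here) directly as TP / (TP + FP):

```python
# The same example lists used throughout this article.
pred = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
true = [0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 4, 5]

cls = 4
tp = sum(1 for p, t in zip(pred, true) if p == cls and t == cls)  # 1
fp = sum(1 for p, t in zip(pred, true) if p == cls and t != cls)  # 2
precision_cls = tp / (tp + fp)
print(precision_cls)  # 1/3: only one of the three "class 4" predictions is right
```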
sklearn.metrics.precision_score
precision = sklearn.metrics.precision_score(y_true=true, y_pred=pred, average='macro')
print(precision)
Output: 0.8888888888888888
tf.keras.metrics.Precision
precision = tf.keras.metrics.Precision()
precision.update_state(y_true=tf.one_hot(true, 6), y_pred=tf.one_hot(pred, 6))
print(precision.result().numpy())
Output: 0.8888889
Recall
Recall (sensitivity) measures, for a specific class, how many actual positives are correctly identified.
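Analogously, recall for one class can be verified by hand as TP / (TP + FN); class 1 is used here since it is the only class with missed samples in the example data:

```python
# The same example lists used throughout this article.
pred = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
true = [0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 4, 5]

cls = 1
tp = sum(1 for p, t in zip(pred, true) if p == cls and t == cls)  # 3
fn = sum(1 for p, t in zip(pred, true) if p != cls and t == cls)  # 2
recall_cls = tp / (tp + fn)
print(recall_cls)  # 0.6: two true class-1 samples were predicted as class 4
```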
sklearn.metrics.recall_score
recall = sklearn.metrics.recall_score(y_true=true, y_pred=pred, average='macro')
print(recall)
Output: 0.9333333333333333
tf.keras.metrics.Recall
recall = tf.keras.metrics.Recall()
recall.update_state(y_true=tf.one_hot(true, 6), y_pred=tf.one_hot(pred, 6))
print(recall.result().numpy())
Output: 0.8888889
Note that tf.keras.metrics.Recall aggregates true positives and false negatives across all classes (effectively a micro average), so its result can differ from sklearn's macro‑averaged recall_score.
F1‑Score
F1‑score is the harmonic mean of precision and recall, providing a single measure of a test's accuracy.
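The harmonic mean can be worked through for class 4 of the running example, where precision is 1/3 and recall is 1: F1 lands at 0.5, much closer to the weaker metric than the arithmetic mean (2/3) would be.

```python
# F1 as the harmonic mean of precision and recall,
# shown for class 4 of the running example.
p, r = 1 / 3, 1.0
f1 = 2 * p * r / (p + r)
print(f1)  # 0.5
```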
sklearn.metrics.f1_score
f1 = sklearn.metrics.f1_score(y_true=true, y_pred=pred, average='macro')
print(f1)
Output: 0.875
tf.keras.metrics.F1Score
f1 = tf.keras.metrics.F1Score(average='macro')
f1.update_state(y_true=tf.one_hot(true, 6), y_pred=tf.one_hot(pred, 6))
print(f1.result().numpy())
Output: 0.875
Manual Calculations from a Hand‑Crafted Confusion Matrix
The article also shows how to build a confusion matrix with sklearn.metrics.confusion_matrix, extract TP, TN, FP, and FN for each class, and compute the metrics manually.
Construct Confusion Matrix
cm = sklearn.metrics.confusion_matrix(y_true=true, y_pred=pred)
print(cm)
# total samples
total = np.sum(cm)
# sum of the diagonal (correct predictions); line / total gives the overall accuracy
line = np.sum([cm[i, i] for i in range(len(cm))])
classes_list = []
for i in range(len(cm)):
    TP = cm[i, i]                  # predicted class i and truly class i
    FP = np.sum(cm[:, i]) - TP     # predicted class i, but true class differs
    FN = np.sum(cm[i, :]) - TP     # truly class i, but predicted differently
    TN = total - TP - FP - FN      # all remaining samples
    classes_list.append({i: {'tp': TP, 'tn': TN, 'fp': FP, 'fn': FN}})
print(classes_list)
Using the extracted values, precision, recall, and F1 can be computed for each class and then averaged.
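One way to finish the manual calculation is sketched below: per-class precision, recall, and F1 are taken straight from the confusion matrix and then macro-averaged, which should reproduce the sklearn results above.

```python
import numpy as np
import sklearn.metrics

# The same example lists used throughout this article.
pred = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
true = [0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 1, 5, 0, 1, 2, 3, 4, 5]
cm = sklearn.metrics.confusion_matrix(y_true=true, y_pred=pred)

precisions, recalls, f1s = [], [], []
for i in range(len(cm)):
    tp = cm[i, i]
    fp = cm[:, i].sum() - tp   # predicted i, true class differs
    fn = cm[i, :].sum() - tp   # true class i, predicted differently
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    precisions.append(p)
    recalls.append(r)
    f1s.append(2 * p * r / (p + r))

print(np.mean(precisions))  # matches precision_score(average='macro')
print(np.mean(recalls))     # matches recall_score(average='macro')
print(np.mean(f1s))         # 0.875, matching f1_score(average='macro')
```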
ROC‑AUC Curve
ROC‑AUC evaluates the trade‑off between true‑positive rate and false‑positive rate across thresholds. The article generates a synthetic multi‑class dataset with make_classification , trains a One‑Vs‑Rest logistic regression model, and plots ROC curves for each class.
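Before the multi‑class setup, a minimal binary example may help build intuition (the labels and scores below are made up): roc_curve sweeps the decision threshold across the scores, and auc integrates the resulting curve.

```python
from sklearn.metrics import roc_curve, auc

# Made-up binary labels and classifier scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# roc_curve computes (FPR, TPR) pairs at each distinct threshold ...
fpr, tpr, thresholds = roc_curve(y_true, y_score)
# ... and auc integrates the curve with the trapezoidal rule.
area = auc(fpr, tpr)
print(area)  # 0.75 for these scores
```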
Data Generation and Model Training
n_classes = 6
x, y = make_classification(n_samples=1000, n_features=32, n_informative=16, n_classes=n_classes, class_sep=2)
y = label_binarize(y, classes=range(n_classes))
train_x, valid_x, train_y, valid_y = train_test_split(x, y, test_size=0.3)
model = OneVsRestClassifier(LogisticRegression())
output = model.fit(train_x, train_y).decision_function(valid_x)
pred = model.predict(valid_x)
print('Accuracy:', accuracy_score(valid_y, pred))
The ROC curve is plotted by computing fpr, tpr, and auc for each class and using RocCurveDisplay to visualize them.
Plotting ROC Curves
fpr = {}
tpr = {}
roc_auc = {}
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(valid_y[:, i], output[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], '--', linewidth=2)  # chance-level diagonal
for i in range(n_classes):
    display = RocCurveDisplay(fpr=fpr[i], tpr=tpr[i], roc_auc=roc_auc[i], estimator_name=f'Class {i}')
    display.plot(ax=ax)
plt.title('ROC-AUC')
plt.show()
The resulting plot shows each class's ROC curve and its AUC value, illustrating model performance across thresholds.
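If a single summary number is wanted, the per-class AUC values can simply be macro-averaged. The sketch below repeats the same kind of pipeline end to end; the random_state values are additions here, purely for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc

n_classes = 6
x, y = make_classification(n_samples=1000, n_features=32, n_informative=16,
                           n_classes=n_classes, class_sep=2, random_state=0)
y = label_binarize(y, classes=range(n_classes))
train_x, valid_x, train_y, valid_y = train_test_split(x, y, test_size=0.3,
                                                      random_state=0)

# One-vs-rest logistic regression, as in the snippet above.
model = OneVsRestClassifier(LogisticRegression()).fit(train_x, train_y)
output = model.decision_function(valid_x)

# Mean of the six per-class AUCs: a single macro summary of the curves.
macro_auc = np.mean([auc(*roc_curve(valid_y[:, i], output[:, i])[:2])
                     for i in range(n_classes)])
print(macro_auc)  # close to 1.0 when the classes are well separated
```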