
Master Essential Data Visualization Techniques for Data Science

This article presents a comprehensive collection of practical data visualization methods—including KS plots, SHAP explanations, Q‑Q plots, cumulative variance, Gini vs Entropy, bias‑variance tradeoff, ROC and precision‑recall curves, and elbow analysis—each illustrated with Python code and clear explanations to help analysts and non‑experts quickly interpret complex datasets.

In data science, data visualization is a core tool that enables deep understanding, exploration, and explanation of complex datasets, helping both experts and non‑experts quickly grasp key trends and patterns.

KS Plot

The Kolmogorov–Smirnov (KS) plot overlays two cumulative distribution functions (CDFs); the KS statistic is the largest vertical gap between them, measuring how far apart the two distributions are.

<code>import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Generate random data and a normal distribution for demonstration purposes
np.random.seed(0)
data = np.random.normal(0, 1, 1000)
theoretical = np.sort(np.random.normal(0, 1, 1000))

# Compute KS statistic
ks_statistic, p_value = stats.ks_2samp(data, theoretical)

# Plot KS Plot
plt.figure(figsize=(10, 6))
plt.plot(np.sort(data), np.linspace(0, 1, len(data), endpoint=False), label='Data CDF')
plt.plot(theoretical, np.linspace(0, 1, len(theoretical), endpoint=False), label='Theoretical CDF', linestyle='--')
plt.title(f"KS Plot (KS Statistic = {ks_statistic:.2f})")
plt.legend()
plt.xlabel("Value")
plt.ylabel("CDF")
plt.grid(True)
plt.show()</code>

This KS plot shows the empirical CDF of the data (blue solid line) against the theoretical normal CDF (orange dashed line). The KS statistic (~0.03) quantifies the maximum vertical difference between the two distributions and is useful for testing goodness‑of‑fit.
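Beyond the two-sample comparison above, scipy also provides a one-sample KS test directly against a named theoretical distribution. A minimal sketch on the same kind of simulated data, using `stats.kstest`:

```python
import numpy as np
import scipy.stats as stats

np.random.seed(0)
sample = np.random.normal(0, 1, 1000)

# One-sample KS test of the sample against the standard normal CDF
statistic, p_value = stats.kstest(sample, "norm")
print(f"KS statistic = {statistic:.4f}, p-value = {p_value:.4f}")
```

A large p-value means the sample is consistent with the normal distribution, so normality cannot be rejected.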

SHAP

To demonstrate SHAP explanations, a standard dataset and an XGBoost model are used.

<code>import xgboost
import shap

# train an XGBoost model
# (note: shap.datasets.boston() was removed in recent shap releases;
#  shap.datasets.california() is a near drop-in replacement, with different feature names)
X, y = shap.datasets.boston()
model = xgboost.XGBRegressor().fit(X, y)

# explain the model's predictions using SHAP
explainer = shap.Explainer(model)
shap_values = explainer(X)

# visualize the first prediction's explanation
shap.plots.waterfall(shap_values[0])
</code>

The waterfall plot shows how each feature pushes the model output from the base value (average prediction) toward the final prediction; red features increase the prediction, blue features decrease it.

By taking many such per‑sample force plots, rotating them 90 degrees, and stacking them horizontally, we get a global, interactive view of the dataset's explanations.

<code># Visualize all training set predictions
shap.plots.force(shap_values)
</code>

To understand the effect of a single feature, a dependence plot can be drawn. For example, the SHAP values for the "RM" feature (average number of rooms) are plotted against the feature values, colored by another feature ("RAD") to highlight interactions.

<code># Create a dependence scatter plot for a single feature
shap.plots.scatter(shap_values[:, "RM"], color=shap_values)
</code>

A summary beeswarm plot shows the distribution of SHAP values for all features across all samples, revealing which features have the strongest impact.

<code># Summarize the impact of all features
shap.plots.beeswarm(shap_values)
</code>

Q‑Q Plot

<code># Generate a Q-Q plot using the same random data as the KS plot
plt.figure(figsize=(10, 6))
stats.probplot(data, dist="norm", plot=plt)
plt.title("QQ Plot for Randomly Generated Normal Data")
plt.grid(True)
plt.show()
</code>

The Q‑Q plot compares the quantiles of the generated data with those of a theoretical normal distribution; points closely follow the red reference line, indicating the data are approximately normal.

Cumulative Explained Variance Plot

<code>from sklearn.decomposition import PCA

# Use PCA on the Boston dataset
pca = PCA()
X_pca = pca.fit_transform(X)

# Calculate cumulative explained variance
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Plot Cumulative Explained Variance Plot
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cumulative_variance)+1), cumulative_variance, marker='o', linestyle='--')
plt.title("Cumulative Explained Variance Plot")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.grid(True)
plt.show()
</code>

The plot shows how cumulative explained variance grows as principal components are added; reading off where the curve crosses a threshold such as 80% guides the choice of how many components to retain.
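To turn the curve into a concrete component count, one can take the smallest number of components whose cumulative variance crosses the threshold. A self-contained sketch, using the iris data as a stand-in feature matrix and an assumed 80% threshold:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# iris stands in here for any feature matrix
X = load_iris().data

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching 80% cumulative variance
n_components = int(np.argmax(cumulative >= 0.80)) + 1
print(n_components)
```

scikit-learn can also do this selection directly: `PCA(n_components=0.80)` keeps just enough components to explain 80% of the variance.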

Gini Impurity vs. Entropy

These two metrics are commonly used as split criteria in decision‑tree algorithms. Both reach 0 for pure nodes and attain their maximum at a probability of 0.5 for binary classification.

<code># Calculate Gini Impurity and Entropy for a range of probability values
probabilities = np.linspace(0, 1, 100)
gini = [1 - p**2 - (1-p)**2 for p in probabilities]
entropy = [-p*np.log2(p) - (1-p)*np.log2(1-p) if p != 0 and p != 1 else 0 for p in probabilities]

# Plot Gini vs Entropy
plt.figure(figsize=(10, 6))
plt.plot(probabilities, gini, label='Gini Impurity', color='blue')
plt.plot(probabilities, entropy, label='Entropy', color='red', linestyle='--')
plt.title("Gini Impurity vs. Entropy")
plt.xlabel("Probability")
plt.ylabel("Impurity/Entropy Value")
plt.legend()
plt.grid(True)
plt.show()
</code>

Both curves peak at probability 0.5, indicating maximum impurity. Gini is slightly cheaper to compute (no logarithm), and the two criteria usually produce very similar trees, so the choice between them rarely matters much in practice.
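One way to see that the choice rarely matters is to train the same decision tree under both criteria and compare cross-validated accuracy. A small sketch on the iris dataset (not from the original article):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Cross-validated accuracy with each split criterion
scores = {}
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=42)
    scores[criterion] = cross_val_score(tree, X, y, cv=5).mean()
print(scores)
```

On this dataset both criteria land within a few percentage points of each other.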

Bias‑Variance Tradeoff

The tradeoff explains how total error decomposes into bias, variance, and irreducible error. Increasing model complexity typically reduces bias but raises variance.

<code># Simulating the Bias-Variance Tradeoff

# Model complexity range (for the sake of this example)
complexity = np.linspace(0, 1, 200)

# Simulated bias and variance values (just for visualization)
bias_squared = (1 - complexity)**2.5
variance = complexity**2.5

# Total error is the sum of bias^2, variance, and some irreducible error
total_error = bias_squared + variance + 0.2

# Plotting the Bias-Variance tradeoff
plt.figure(figsize=(10, 6))
plt.plot(complexity, bias_squared, label='Bias^2', color='blue')
plt.plot(complexity, variance, label='Variance', color='red')
plt.plot(complexity, total_error, label='Total Error', color='green')
plt.xlabel('Model Complexity')
plt.ylabel('Error')
plt.title('Bias-Variance Tradeoff')
plt.legend()
plt.grid(True)
plt.show()
</code>

The green curve (total error) reaches a minimum at an intermediate complexity, illustrating the optimal balance point.
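The optimum can also be read off numerically from the same simulated curves by taking the argmin of the total error:

```python
import numpy as np

# Same simulated curves as in the plot above
complexity = np.linspace(0, 1, 200)
bias_squared = (1 - complexity) ** 2.5
variance = complexity ** 2.5
total_error = bias_squared + variance + 0.2

# Locate the minimum of the simulated total-error curve
best = complexity[np.argmin(total_error)]
print(f"optimal complexity = {best:.2f}")
```

Because the simulated bias and variance curves are mirror images, the minimum sits at intermediate complexity, near 0.5.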

ROC Curve

The Receiver Operating Characteristic (ROC) curve evaluates classification performance by plotting the true‑positive rate against the false‑positive rate.

<code>from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

# Load iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Binarize the output labels for multi‑class ROC curve
y_iris_bin = label_binarize(y_iris, classes=[0, 1, 2])
n_classes = y_iris_bin.shape[1]

# Split the data
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(X_iris, y_iris_bin, test_size=0.5, random_state=42)

# Train a One‑vs‑Rest logistic regression model
clf_iris = OneVsRestClassifier(LogisticRegression(max_iter=10000))
y_score_iris = clf_iris.fit(X_train_iris, y_train_iris).decision_function(X_test_iris)

# Compute ROC curve for each class
fpr_iris = {}
tpr_iris = {}
roc_auc_iris = {}
for i in range(n_classes):
    fpr_iris[i], tpr_iris[i], _ = roc_curve(y_test_iris[:, i], y_score_iris[:, i])
    roc_auc_iris[i] = auc(fpr_iris[i], tpr_iris[i])

# Plot the ROC curve for each class
plt.figure(figsize=(10, 6))
colors = ['blue', 'red', 'green']
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr_iris[i], tpr_iris[i], color=color,
             label=f'ROC curve for class {i} (area = {roc_auc_iris[i]:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('Receiver Operating Characteristic (ROC) Curve for Iris Dataset')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
</code>

Each curve corresponds to one iris class; the area under the curve (AUC) quantifies performance, with values closer to 1 indicating better discrimination.
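A single summary number across all three classes can be obtained by micro-averaging, which pools every (sample, class) pair into one ROC computation. A self-contained sketch that rebuilds the same one-vs-rest setup and calls `roc_auc_score` with `average="micro"`:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

# Rebuild the one-vs-rest iris setup from the ROC example
X, y = load_iris(return_X_y=True)
y_bin = label_binarize(y, classes=[0, 1, 2])
X_train, X_test, y_train, y_test = train_test_split(
    X, y_bin, test_size=0.5, random_state=42)

clf = OneVsRestClassifier(LogisticRegression(max_iter=10000))
y_score = clf.fit(X_train, y_train).decision_function(X_test)

# Micro-averaging pools all (sample, class) pairs into one ROC computation
micro_auc = roc_auc_score(y_test, y_score, average="micro")
print(f"Micro-average AUC: {micro_auc:.3f}")
```

Macro-averaging (`average="macro"`) instead averages the per-class AUCs, weighting each class equally regardless of size.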

Precision‑Recall Curve

Precision‑Recall curves are especially useful for imbalanced datasets, focusing on the positive class.

<code>from sklearn.metrics import precision_recall_curve, average_precision_score

precision_iris = {}
recall_iris = {}
average_precision_iris = {}
for i in range(n_classes):
    precision_iris[i], recall_iris[i], _ = precision_recall_curve(y_test_iris[:, i], y_score_iris[:, i])
    average_precision_iris[i] = average_precision_score(y_test_iris[:, i], y_score_iris[:, i])

plt.figure(figsize=(10, 6))
colors = ['blue', 'red', 'green']
for i, color in zip(range(n_classes), colors):
    plt.plot(recall_iris[i], precision_iris[i], color=color,
             label=f'Precision‑Recall curve for class {i} (average precision = {average_precision_iris[i]:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision‑Recall Curve for Iris Dataset')
plt.legend(loc="upper right")
plt.grid(True)
plt.show()
</code>

The curves illustrate the trade‑off between precision and recall for each class; higher average precision indicates better performance on imbalanced data.

Elbow Curve

The elbow method helps determine the optimal number of clusters for K‑Means.

<code>from sklearn.cluster import KMeans

# Compute the sum of squared distances for different numbers of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=42)
    kmeans.fit(X_iris)
    wcss.append(kmeans.inertia_)

# Plot the Elbow curve
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Curve for KMeans Clustering on Iris Dataset')
plt.xlabel('Number of clusters')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.grid(True)
plt.show()
</code>

The plot shows a clear “elbow” around 2–3 clusters, suggesting that this range balances compactness and simplicity.
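The silhouette score is a common complement to the elbow method: it peaks at the best-separated clustering rather than relying on spotting a visual kink. A short sketch on the same iris features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data

# Silhouette score for each candidate cluster count
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On the raw iris features the silhouette score favors two clusters, since two of the three species overlap heavily in feature space.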

Other Common Visualization Techniques

Additional useful plots include histograms, box plots, scatter plots, stacked bar charts, heatmaps, radar/spider charts, geographic maps, time‑series charts, violin plots, pair plots, treemaps, donut charts, and word clouds. Each serves a specific purpose for revealing data distribution, relationships, or hierarchical structure.

<code>import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

np.random.seed(42)
data = np.random.randn(1000)
data2 = np.random.randn(1000) + 5
time_series_data = np.cumsum(np.random.randn(1000))
categories = np.random.choice(['A', 'B', 'C'], size=1000)
categories_2 = np.random.choice(['D', 'E', 'F'], size=1000)

df = pd.DataFrame({
    'data': data,
    'data2': data2,
    'time': np.arange(1000),
    'categories': categories,
    'categories_2': categories_2
})

fig, axs = plt.subplots(3, 3, figsize=(15, 15))
# Histogram
axs[0, 0].hist(data, bins=30, color='skyblue', edgecolor='black')
axs[0, 0].set_title('Histogram')
# Boxplot
sns.boxplot(data=[data, data2], ax=axs[0, 1])
axs[0, 1].set_title('Box Plot')
# Scatter Plot
axs[0, 2].scatter(data, data2, alpha=0.6)
axs[0, 2].set_title('Scatter Plot')
# Stacked Bar Chart: one categorical variable broken down by another
pd.crosstab(df['categories'], df['categories_2']).plot(kind='bar', stacked=True, ax=axs[1, 0])
axs[1, 0].set_title('Stacked Bar Chart')
# Heatmap (numeric columns only)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', ax=axs[1, 1])
axs[1, 1].set_title('Heatmap')
# Radar/Spider Chart (needs a polar projection, so the grid axes is replaced)
from math import pi
df_group = df.groupby('categories')['data'].mean()
categories_list = list(df_group.index)
N = len(categories_list)
values = df_group.values.tolist()
values += values[:1]
angles = [n / float(N) * 2 * pi for n in range(N)]
angles += angles[:1]
fig.delaxes(axs[1, 2])
ax_radar = fig.add_subplot(3, 3, 6, projection='polar')
ax_radar.plot(angles, values, linewidth=2, linestyle='solid')
ax_radar.fill(angles, values, alpha=0.4)
ax_radar.set_xticks(angles[:-1])
ax_radar.set_xticklabels(categories_list)
ax_radar.set_title('Radar/Spider Chart')
# Time Series
axs[2, 0].plot(time_series_data)
axs[2, 0].set_title('Time Series')
# Violin Plot
sns.violinplot(x='categories', y='data', data=df, ax=axs[2, 1])
axs[2, 1].set_title('Violin Plot')
# Remove empty subplot
fig.delaxes(axs[2, 2])
plt.tight_layout()
plt.show()
</code>

Effective data visualization is the cornerstone of data science; selecting the appropriate visual method enhances interpretation and communication of analytical findings.


Tags: machine learning, Python, statistics, data visualization, plotting

Written by Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
