Master Data Visualization: Core Concepts, Chart Selection, and Python Code Samples
This guide explains what data visualization is and why it matters, then covers how to choose a chart type, preprocess data, design effective visuals, and select Python tools. It closes with code examples for pie, donut, bar, histogram, box, bubble, scatter, and deviation charts.
What is Data Visualization
Data visualization converts numerical data into visual forms that humans can perceive quickly, enabling rapid identification of patterns, trends, and anomalies. Common visual forms include charts, maps, graphs, dashboards, heatmaps, word clouds, and animated visuals.
Why Data Visualization Matters
Visualization accelerates comprehension because people generally take in visual patterns faster than tables of numbers. It also reveals hidden patterns in large datasets, supports rational decision-making, and serves as a widely shared communication language. Anscombe's Quartet illustrates the point: four datasets with nearly identical summary statistics (means, variances, correlations) look very different when plotted, underscoring the need for visual inspection.
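As a minimal sketch of the Anscombe's Quartet point, the snippet below hard-codes the four classic datasets and computes their summary statistics with the standard library. The helper names `pearson_r`, `summarize`, and `plot_quartet` are illustrative choices, not part of any library.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's Quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def summarize(x, y):
    """Return (mean x, mean y, sample variance of x, Pearson r), rounded to 2 places."""
    return (round(mean(x), 2), round(mean(y), 2),
            round(variance(x), 2), round(pearson_r(x, y), 2))


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, stats)


def plot_quartet():
    """Draw the four datasets side by side (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 4, figsize=(14, 3.5), sharex=True, sharey=True)
    for ax, (name, (x, y)) in zip(axes, quartet.items()):
        ax.scatter(x, y)
        ax.set_title(f"Dataset {name}")
    plt.show()
```

All four datasets produce the same rounded summary, so the numbers alone cannot tell them apart. Calling `plot_quartet()` shows the difference: a linear cloud, a curve, a line with one outlier, and a vertical stack with one extreme point.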
How to Create Effective Visualizations
1. Choose the Right Chart Type
Visualization goals fall into five categories: comparison, distribution, relationship/trend, composition, and deviation. Recommended chart types:
Comparison: bar, grouped bar, stacked bar.
Distribution: histogram, kernel density plot, box plot, violin plot.
Relationship/Trend: scatter, bubble, line, area, heatmap.
Composition: pie, donut, stacked bar, treemap, radar.
Deviation: error bars, residual plots.
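The goal-to-chart mapping above can be kept as a small lookup table. This is only a sketch: `CHARTS_BY_GOAL` and `recommend_charts` are hypothetical names introduced here, and the entries simply mirror the list above.

```python
# Lookup table mirroring the goal-to-chart list above.
CHARTS_BY_GOAL = {
    "comparison": ["bar", "grouped bar", "stacked bar"],
    "distribution": ["histogram", "kernel density plot", "box plot", "violin plot"],
    "relationship/trend": ["scatter", "bubble", "line", "area", "heatmap"],
    "composition": ["pie", "donut", "stacked bar", "treemap", "radar"],
    "deviation": ["error bars", "residual plots"],
}


def recommend_charts(goal):
    """Return the candidate chart types for a goal (case-insensitive)."""
    key = goal.strip().lower()
    if key not in CHARTS_BY_GOAL:
        raise ValueError(f"Unknown goal {goal!r}; expected one of {sorted(CHARTS_BY_GOAL)}")
    return CHARTS_BY_GOAL[key]


print(recommend_charts("Distribution"))
```

The lookup only narrows the candidates. The final choice still depends on the data and the audience.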
2. Data Pre‑Processing
Before plotting, clean and transform data: merge or sample to simplify structure, apply dimensionality reduction, perform feature selection or generation, and discretize or transform attributes for clearer numeric representation.
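Three of these steps (sampling, transforming, and discretizing) can be sketched with NumPy as follows. The synthetic `incomes` column, the sample size, and the choice of five bins are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Synthetic right-skewed data standing in for a real column (an assumption).
incomes = rng.lognormal(mean=10, sigma=1, size=100_000)

# 1. Sample: plot a random subset instead of every row.
sample = rng.choice(incomes, size=1_000, replace=False)

# 2. Transform: a log scale compresses the long right tail.
log_sample = np.log10(sample)

# 3. Discretize: assign each value to one of five equal-width bins.
edges = np.linspace(log_sample.min(), log_sample.max(), num=6)
bin_index = np.digitize(log_sample, edges[1:-1])  # inner edges only, so indices run 0..4
counts = np.bincount(bin_index, minlength=5)

print(counts, counts.sum())
```

The per-bin `counts` can then be passed to a bar chart, while `log_sample` suits a histogram or box plot.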
3. Design and Interaction
Balance accuracy with aesthetics. Use color contrast to highlight key information and add interactive elements (hover, zoom) when possible.
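One way to apply color contrast is to mute every bar except the one that matters. The sketch below assumes hypothetical regional sales figures; `highlight_colors` and `plot_highlighted` are illustrative helper names.

```python
categories = ["North", "South", "East", "West"]  # hypothetical labels
sales = [120, 95, 180, 140]                       # hypothetical values


def highlight_colors(values, base="lightgray", accent="crimson"):
    """Return one color per value, using the accent color only for the maximum."""
    top = max(values)
    return [accent if v == top else base for v in values]


colors = highlight_colors(sales)


def plot_highlighted():
    """Draw the bar chart with the largest bar highlighted (requires matplotlib)."""
    import matplotlib.pyplot as plt

    plt.bar(categories, sales, color=colors)
    plt.title("Sales by Region (largest highlighted)")
    plt.ylabel("Sales")
    plt.show()
```

Calling `plot_highlighted()` draws the chart. The neutral gray keeps the other bars readable without competing with the accent color.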
4. Tool Selection (Python)
Popular libraries:
Matplotlib/Seaborn: foundational, suitable for research and teaching.
Plotly/Bokeh: enable interactive visualizations for sharing.
ECharts, D3.js: JavaScript front-end libraries for large-scale interactive visualizations (ECharts can be used from Python through pyecharts).
Code Examples
Composition Charts
Pie Chart
import matplotlib.pyplot as plt
labels = ['A', 'B', 'C', 'D']
sizes = [25, 30, 20, 25]
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90,
        colors=['red', 'lightgreen', 'lightblue', 'yellow'])
plt.title('Composition - Pie Chart')
plt.show()
Donut Chart
import matplotlib.pyplot as plt
labels = ['Category A', 'Category B', 'Category C', 'Category D']
sizes = [25, 30, 20, 25]
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90,
        colors=['red', 'green', 'blue', 'yellow'],
        wedgeprops=dict(width=0.3))  # width < 1 leaves the hollow center
plt.title('Donut Chart - Data Composition')
plt.show()
Stacked Bar Chart
import matplotlib.pyplot as plt
import numpy as np
categories = ['Category A', 'Category B', 'Category C']
values1 = [25, 30, 20]
values2 = [10, 15, 25]
bar_width = 0.5
index = np.arange(len(categories))
plt.bar(index, values1, width=bar_width, label='Group 1', color='red')
plt.bar(index, values2, width=bar_width, bottom=values1,
        label='Group 2', color='orange')  # bottom= stacks Group 2 on Group 1
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Stacked Bar Chart - Data Composition')
plt.xticks(index, categories)
plt.legend()
plt.show()
Distribution Charts
Histogram with Kernel Density Estimate
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randn(1000)
plt.hist(data, bins=30, density=True, color='skyblue', edgecolor='black')
sns.kdeplot(data, color='red')
mean_value = np.mean(data)
std_dev = np.std(data)
plt.axvline(mean_value, color='green', linestyle='dashed',
            label=f'Mean: {mean_value:.2f}')
# Mark one standard deviation on either side of the mean.
plt.axvline(mean_value + std_dev, color='orange', linestyle='dashed',
            label=f'Mean ± 1 Std Dev ({std_dev:.2f})')
plt.axvline(mean_value - std_dev, color='orange', linestyle='dashed')
plt.title('Histogram and KDE with Mean and Std Dev')
plt.xlabel('Values')
plt.ylabel('Density')
plt.legend()
plt.show()
Box Plot
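A box plot draws the first quartile, median, and third quartile as the box, and flags points beyond the whiskers as outliers. The sketch below computes those numbers by hand with NumPy, assuming the common 1.5 × IQR whisker rule (the default `whis=1.5` in Matplotlib). The sample values are made up for illustration.

```python
import numpy as np

# Small hand-made sample with one obvious outlier (illustrative values).
sample = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 30])

# Quartiles with NumPy's default linear interpolation.
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Points outside these fences are drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(q1, median, q3, iqr, outliers)
```

Here only the value 30 falls outside the fences, so it is the single point a box plot of this sample would mark separately.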
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randn(1000)
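plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.boxplot(x=data, color='skyblue')  # horizontal orientation
plt.title('Horizontal')
plt.subplot(1, 2, 2)
sns.boxplot(y=data, color='skyblue')  # vertical orientation
plt.title('Vertical')
plt.suptitle('Box Plot - Data Distribution')  # one title for the whole figure
plt.show()
Comparison Charts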
Grouped Horizontal Bar Chart
import matplotlib.pyplot as plt
import numpy as np
categories = ['Category A', 'Category B', 'Category C']
values1 = [4, 7, 3]
values2 = [2, 5, 8]
bar_height = 0.35
index = np.arange(len(categories))
plt.barh(index, values1, height=bar_height, label='Group 1', color='blue')
plt.barh(index + bar_height, values2, height=bar_height,
         label='Group 2', color='orange')  # offset so the groups sit side by side
plt.ylabel('Categories')
plt.xlabel('Values')
plt.title('Grouped Horizontal Bar Chart')
plt.yticks(index + bar_height/2, categories)
plt.legend()
plt.show()
Grouped Vertical Bar Chart
import matplotlib.pyplot as plt
import numpy as np
categories = ['Category A', 'Category B', 'Category C']
values1 = [4, 7, 3]
values2 = [2, 5, 8]
bar_width = 0.35
index = np.arange(len(categories))
plt.bar(index, values1, width=bar_width, label='Group 1', color='blue')
plt.bar(index + bar_width, values2, width=bar_width,
        label='Group 2', color='orange')  # offset so the groups sit side by side
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Grouped Vertical Bar Chart')
plt.xticks(index + bar_width/2, categories)
plt.legend()
plt.show()
Relationship & Trend Charts
Bubble Chart
import matplotlib.pyplot as plt
import numpy as np
x = np.random.rand(30)
y = np.random.rand(30)
sizes = np.random.rand(30) * 1000
plt.scatter(x, y, s=sizes, alpha=0.7, c='skyblue', edgecolors='black')
plt.title('Bubble Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
Scatter & Line Chart (Trend Over Time)
import matplotlib.pyplot as plt
import numpy as np
time = np.arange(0, 10, 0.1)
data = np.sin(time) + 0.2 * np.random.randn(len(time))
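plt.scatter(time, data, color='green', label='Observations')
plt.plot(time, data, color='blue', alpha=0.5, label='Observed series')
plt.plot(time, np.sin(time), color='red', label='Underlying trend: sin(t)')
plt.title('Scatter and Line Plot - Trend Over Time')
plt.xlabel('Time')
plt.ylabel('Values')
plt.legend()
plt.show()
Deviation Charts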
Error Bar Plot
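The `error` values in the example that follows are hard-coded. In practice they are often derived from repeated measurements. One common choice, sketched here under the assumption of three hypothetical replicates per point, is the standard error of the mean.

```python
import numpy as np

# Three hypothetical repeated measurements at each of five x positions.
measurements = np.array([
    [1.8, 2.1, 2.1],
    [2.9, 3.2, 2.9],
    [5.1, 4.8, 5.1],
    [3.7, 4.3, 4.0],
    [5.5, 6.5, 6.0],
])

# Plot the per-position mean as the marker.
y = measurements.mean(axis=1)

# Standard error of the mean: sample std (ddof=1) divided by sqrt(n).
n = measurements.shape[1]
sem = measurements.std(axis=1, ddof=1) / np.sqrt(n)

print(y, sem)
```

The resulting `y` and `sem` arrays could be passed to `plt.errorbar(x, y, yerr=sem, ...)` in place of fixed values.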
import matplotlib.pyplot as plt
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 5, 4, 6])
error = np.array([0.5, 0.3, 0.2, 0.4, 0.6])
plt.errorbar(x, y, yerr=error, fmt='o', color='blue', capsize=5)
plt.title('Error Bar Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
Residual Plot (Regression)
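A residual is the observed value minus the value predicted by the fitted model. As a rough sketch of what a residual plot displays, the snippet below fits a straight line with `np.polyfit` and computes the residuals by hand. The seeded generator and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable
x = rng.random(100)
y = 2 * x + 1 + 0.1 * rng.standard_normal(100)

# Ordinary least squares line; for deg=1 polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

# Residual = observed value minus fitted value.
fitted = slope * x + intercept
residuals = y - fitted

print(round(slope, 2), round(intercept, 2))
```

For a well-specified linear model the residuals should scatter around zero with no visible trend against `x`, which is what the plot is meant to check.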
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
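x = np.random.rand(100)
y = 2 * x + 1 + 0.1 * np.random.randn(100)
fig, (ax_fit, ax_res) = plt.subplots(1, 2, figsize=(12, 5))
# Left panel: the data with the fitted regression line.
sns.regplot(x=x, y=y, ci=None, line_kws={'color': 'red'}, ax=ax_fit)
ax_fit.set_title('Linear Fit')
ax_fit.set_xlabel('X-axis')
ax_fit.set_ylabel('Y-axis')
# Right panel: residplot fits the regression itself, so pass the raw y values.
# lowess=True adds a smoothed trend line and requires statsmodels.
sns.residplot(x=x, y=y, lowess=True, color='blue', ax=ax_res)
ax_res.set_title('Residual Plot')
ax_res.set_xlabel('X-axis')
ax_res.set_ylabel('Residuals')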
plt.show()
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.