Top 10 Essential Python Libraries for Data Analysis with Code Examples
This article introduces ten practical Python libraries for data analysis: Pandas and NumPy for data manipulation; Matplotlib, Seaborn, Plotly, and Bokeh for visualization; Scikit‑learn and Prophet for machine learning and forecasting; and Dask and PySpark for big‑data processing. Each is illustrated with a concise code snippet.
In the field of Python data analysis, mastering core libraries can dramatically boost productivity. This guide selects ten highly useful libraries and provides code examples that walk through the full workflow from data handling to machine learning.
1. Pandas: The all‑rounder for structured data processing
Pandas excels at handling tabular data, offering efficient cleaning and transformation capabilities.
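To try the cleaning and transformation steps without a data file, here is a minimal sketch on a small in-memory table. The column names (age, register_date) and every value are made-up assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for a real customer table
df_demo = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "register_date": ["2023-01-05", "2023-02-10", "2023-02-28", "2023-03-15"],
})

# Cleaning: fill the missing age with the median of the observed ages
median_age = df_demo["age"].median()
df_demo["age"] = df_demo["age"].fillna(median_age)

# Transformation: parse the date strings, then derive a month column
df_demo["register_date"] = pd.to_datetime(df_demo["register_date"])
df_demo["register_month"] = df_demo["register_date"].dt.month

print(df_demo)
```

Assigning the result of fillna back to the column, rather than calling it with inplace=True on a column selection, avoids the chained-assignment pitfalls that recent pandas versions warn about.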
<code># Read an Excel file and handle missing values
import pandas as pd
df = pd.read_excel('customer_data.xlsx')
df['age'] = df['age'].fillna(df['age'].median())  # Fill missing ages with the median
df['register_date'] = pd.to_datetime(df['register_date'])  # Convert date strings to datetime</code>
2. NumPy: The acceleration engine for multi‑dimensional array operations
NumPy provides high‑performance numerical computation, suitable for large‑scale data.
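Much of that performance comes from vectorized operations, which apply to whole arrays without an explicit Python loop. The sketch below illustrates this with broadcasting and axis-wise reductions; the sales figures and commission rates are hypothetical.

```python
import numpy as np

# Hypothetical monthly sales: rows are 2 salespeople, columns are 3 months
sales_demo = np.array([[1200, 1500, 800],
                       [2000, 900, 1100]])

# One commission rate per salesperson, shaped (2, 1) so it broadcasts across the columns
rates = np.array([[0.05], [0.10]])

commission_demo = sales_demo * rates        # element-wise product, no explicit loop
per_person_total = sales_demo.sum(axis=1)   # sum across months for each salesperson
per_month_mean = sales_demo.mean(axis=0)    # average across salespeople for each month

print(commission_demo)
print(per_person_total, per_month_mean)
```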
<code>import numpy as np
sales = np.array([1200, 1500, 800, 2000])
commission = sales * 0.05  # 5% commission on each sale
total = np.sum(sales)  # Total sales: 5500</code>
3. Matplotlib: The Swiss‑army knife for basic charting
Matplotlib can quickly generate line charts, scatter plots, and other basic visualizations.
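As an illustration of the line and scatter charts mentioned here, the following sketch overlays both on one set of axes and saves the result to a file. The monthly figures and the output filename are assumptions for illustration; the Agg backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical monthly figures, used only for illustration
months = [1, 2, 3, 4, 5, 6]
revenue = [120, 135, 128, 150, 162, 158]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(months, revenue, color="#1f77b4", label="Trend")  # line chart
ax.scatter(months, revenue, color="#ff7f0e", zorder=3, label="Monthly value")  # scatter overlay
ax.set_xlabel("Month")
ax.set_ylabel("Revenue")
ax.set_title("Monthly Revenue")
ax.legend()
fig.savefig("monthly_revenue.png")  # write to a file instead of opening a window
```

In an interactive session, drop the matplotlib.use line and call plt.show() instead of savefig.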
<code>import matplotlib.pyplot as plt
products = ['A', 'B', 'C']
sales = [120, 150, 90]
plt.bar(products, sales, color=['#1f77b4', '#ff7f0e', '#2ca02c'])
plt.title('Product Sales Comparison')
plt.show()</code>
4. Seaborn: The stylish choice for statistical visualizations
Built on Matplotlib, Seaborn produces more attractive statistical charts.
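<code>import matplotlib.pyplot as plt
import seaborn as sns
corr_matrix = df.corr(numeric_only=True)  # Correlate numeric columns only; non-numeric columns are skipped
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()</code>
5. Plotly: The dynamic expert for interactive charts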
Plotly supports interactive visualizations, ideal for dynamic reports.
<code>import plotly.express as px
# Assumes df['state'] holds two-letter US state abbreviations (e.g. 'CA')
fig = px.choropleth(df, locations='state', locationmode='USA-states', scope='usa',
                    color='sales',
                    hover_data=['city', 'revenue'],
                    color_continuous_scale='Viridis')
fig.show()</code>
6. Scikit‑learn: The Swiss‑army knife for machine‑learning preprocessing
Scikit‑learn offers tools for data preprocessing and model training.
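To show preprocessing and model training working together, here is a minimal sketch that chains a scaler and a linear regression in one pipeline. The data is synthetic and the linear relationship between price, advertising, and sales is an assumption invented for the example, not a result from real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): sales depend linearly on
# price and advertising spend, plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(low=[5, 0], high=[50, 1000], size=(200, 2))  # columns: price, advertising
y = 500 - 4.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The pipeline fits the scaler on the training data only, then trains the model
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)  # R^2 on the held-out data
print(f"Held-out R^2: {r2:.3f}")
```

Wrapping the scaler in a pipeline keeps the test set out of the scaling statistics, which is easy to get wrong when fit_transform is called on the full dataset before splitting.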
<code>from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[['price', 'advertising']])</code>
7. Dask: The parallel pioneer for distributed computing
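Dask handles datasets too large for memory by splitting them into partitions and evaluating operations lazily: nothing is computed until compute() is called. The same code can run on a single machine or across a cluster.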
<code>import dask.dataframe as dd
ddf = dd.read_csv('large_sales.csv')
average = ddf.groupby('category')['sales'].mean().compute()</code>
8. PySpark: The distributed engine for big‑data analysis
PySpark is the Python API for Apache Spark and suits datasets large enough to need distributed processing across a cluster.
<code>from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SalesAnalysis").getOrCreate()
df_spark = spark.read.csv('sales_data.csv', header=True, inferSchema=True)
df_spark.orderBy(df_spark['sales'].desc()).show(5)</code>
9. Bokeh: The lightweight option for interactive visualizations
Bokeh creates interactive charts that integrate well with web applications.
<code>from bokeh.plotting import figure, show
p = figure(title="Sales vs. Price", x_axis_label='Price', y_axis_label='Sales')
# scatter() draws circle markers by default; circle() with a size argument is deprecated in recent Bokeh releases
p.scatter(df['price'], df['sales'], size=10, color='blue', alpha=0.5)
show(p)</code>
10. Prophet: The powerhouse for time‑series forecasting
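Prophet fits an additive time‑series model with trend and seasonality components and returns forecasts with uncertainty intervals. It expects a DataFrame with a date column named ds and a value column named y.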
<code>from prophet import Prophet
df_prophet = df[['register_date', 'sales']].rename(columns={'register_date':'ds','sales':'y'})
model = Prophet()
model.fit(df_prophet)
future = model.make_future_dataframe(periods=365)
forecast = model.predict(future)
model.plot(forecast)</code>