Comprehensive Guide to Python Data Science Libraries with Code Examples
This article presents a concise tutorial on essential Python data science libraries, covering data cleaning with Pandas, numerical analysis with NumPy and SciPy, visualization with Matplotlib and Seaborn, machine learning with scikit‑learn, NLP with NLTK and spaCy, time‑series modeling, image processing, database access, and parallel computing, each illustrated with ready‑to‑run code examples.
1. Data Cleaning and Preprocessing – Pandas is a widely used data manipulation library that provides powerful data structures such as DataFrame and functions for handling missing values, transformations, and aggregations.
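Besides the missing-value handling shown below, the transformations and aggregations mentioned above can be sketched with groupby/agg; the data and column names here are made up for illustration:

```python
import pandas as pd

# Illustrative sales data; the column names are hypothetical
sales = pd.DataFrame({
    'region': ['north', 'north', 'south'],
    'revenue': [100, 150, 200],
})

# Aggregate revenue per region
totals = sales.groupby('region')['revenue'].sum()
print(totals)
```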
import pandas as pd
# Create a simple DataFrame
data = {'A': [1, 2, None], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
# Handle missing values
df.fillna(0, inplace=True)
2. Data Analysis and Statistics – NumPy offers multi‑dimensional array objects and a suite of mathematical functions, forming the foundation for numerical computing. SciPy builds on NumPy to provide advanced scientific tools such as optimization and linear algebra.
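The snippets below cover basic NumPy and SciPy statistics; as a sketch of the optimization tools mentioned above, `scipy.optimize.minimize_scalar` can find the minimum of a one‑dimensional function:

```python
from scipy.optimize import minimize_scalar

# Minimize f(x) = (x - 2)^2, whose minimum lies at x = 2
result = minimize_scalar(lambda x: (x - 2) ** 2)
print(result.x)  # close to 2.0
```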
import numpy as np
# Create an array and perform some basic operations
arr = np.array([1, 2, 3])
mean_value = np.mean(arr)
from scipy import stats
# Compute descriptive statistics for a dataset
data = [1, 2, 2, 3, 4]
mode = stats.mode(data)
3. Data Visualization – Matplotlib is the most popular Python plotting library for static, animated, and interactive visualizations. Seaborn provides a higher‑level interface focused on statistical graphics.
import matplotlib.pyplot as plt
# Plot a simple line chart
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.show()
import seaborn as sns
# Load a sample dataset and plot a correlation heatmap
tips = sns.load_dataset("tips")
sns.heatmap(tips.corr(numeric_only=True), annot=True)
plt.show()
4. Machine Learning and Predictive Modeling – Scikit‑learn is a widely adopted machine‑learning library supporting classification, regression, clustering, feature selection, and model evaluation.
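The regression snippet below shows the basic fit workflow; classification and model evaluation follow the same pattern, sketched here on scikit‑learn's built‑in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a classifier and evaluate it on held-out data
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```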
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Define a small synthetic dataset (X: features, y: targets)
X = [[i] for i in range(10)]
y = [2 * i for i in range(10)]
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
5. Text Processing and Natural Language Processing (NLP) – NLTK provides a comprehensive toolkit for tokenization, tagging, parsing, and semantic analysis. spaCy is an industrial‑strength NLP library optimized for performance and extensibility.
import nltk
from nltk.tokenize import word_tokenize
# Tokenize the text (requires the tokenizer data: nltk.download('punkt'))
text = "Hello, how are you?"
tokens = word_tokenize(text)
print(tokens)
import spacy
# Requires the English model: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_)
6. Time‑Series Analysis – Statsmodels focuses on statistical modeling and time‑series analysis, offering models such as ARIMA and VAR.
from statsmodels.tsa.arima.model import ARIMA
# Fit an ARIMA model; `data` is a 1-D sequence of observations (e.g. a pandas Series)
model = ARIMA(endog=data, order=(5, 1, 0))
results = model.fit()
7. Image Processing – OpenCV is an open‑source computer‑vision library widely used for image manipulation and video capture.
import cv2
# Read an image file
img = cv2.imread('image.jpg')
# Display the image in a window
cv2.imshow('Image', img)
cv2.waitKey(0)
cv2.destroyAllWindows()
8. Database Connection and SQL Queries – The built‑in sqlite3 module enables direct interaction with SQLite databases. SQLAlchemy provides an ORM layer that maps Python objects to database tables, simplifying database operations.
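The snippet below runs a plain query; when values come from user input, sqlite3's `?` placeholders avoid SQL injection. A self‑contained sketch using an in‑memory database (the table and column names are made up):

```python
import sqlite3

# An in-memory database keeps this sketch self-contained
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute("CREATE TABLE users (name TEXT, age INTEGER)")
cursor.execute("INSERT INTO users VALUES (?, ?)", ("alice", 30))
conn.commit()

# Parameterized SELECT: values are bound, never string-formatted into the SQL
cursor.execute("SELECT age FROM users WHERE name = ?", ("alice",))
row = cursor.fetchone()
print(row[0])
conn.close()
```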
import sqlite3
conn = sqlite3.connect('example.db')
cursor = conn.cursor()
# Execute a query
cursor.execute("SELECT * FROM table_name")
rows = cursor.fetchall()
for row in rows:
    print(row)
conn.close()
from sqlalchemy import create_engine, Table, MetaData
engine = create_engine('sqlite:///example.db')
metadata = MetaData()
table = Table('table_name', metadata, autoload_with=engine)
with engine.connect() as connection:
    result = connection.execute(table.select())
    for row in result:
        print(row)
9. Parallel and Distributed Computing – The multiprocessing module offers cross‑platform multi‑process programming. Dask extends existing Python code to run in parallel, scaling from a single machine to large clusters.
from multiprocessing import Pool
def f(x):
    return x * x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))
import dask.dataframe as dd
# Read a large CSV file lazily
df = dd.read_csv('large_file.csv')
result = df.groupby('column').sum().compute()
Test Development Learning Exchange