Comprehensive Pandas Cheat Sheet: Common Operations, Data Cleaning, and Visualization
This article presents a thorough collection of frequently used pandas commands for importing data, preprocessing, cleaning, transforming, analyzing, and visualizing datasets, complete with code examples and explanations to help Python developers perform efficient data analysis tasks.
Import dependencies – shows how to import pandas, NumPy, matplotlib, seaborn, SQLAlchemy, and configure display settings for Chinese characters and Retina screens.
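# Import modules
import pymysql
import pandas as pd
import numpy as np
import time
# Database
from sqlalchemy import create_engine
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import pyecharts
# IPython/Jupyter magics: render figures inline and at Retina resolution
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Fonts that can render Chinese labels. Assigning the key twice keeps only the
# last value, so list both as fallbacks: matplotlib uses the first one it finds.
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei']
# Render the minus sign correctly when a CJK font is active
plt.rcParams['axes.unicode_minus'] = False
# Silence library warnings
import warnings
warnings.filterwarnings("ignore")
Algorithm‑related dependencies – imports common scikit‑learn utilities for scaling, clustering, regression, and model evaluation.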
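As a rough illustration of how these scikit‑learn utilities fit together, here is a minimal sketch on synthetic data. The data, the cluster count of 3, and the 25% test split are assumptions chosen for the example; none of them come from the original article.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (assumption for illustration): y is an exact linear function of X
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(60, 2))
y = 3 * X[:, 0] + 2 * X[:, 1]

# Rescale every feature to the [0, 1] range
X_scaled = MinMaxScaler().fit_transform(X)

# Cluster the scaled points into 3 groups
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Hold out 25% of the rows, fit a linear model, and score it on the held-out part
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"test MSE: {mse:.3e}")
```

Because the target here is exactly linear in the features, the test error is close to zero; real data would not behave this cleanly.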
# Min-max normalization
from sklearn.preprocessing import MinMaxScaler
# K-means clustering
from sklearn.cluster import KMeans
# DBSCAN clustering
from sklearn.cluster import DBSCAN
# Linear regression
from sklearn.linear_model import LinearRegression
# Logistic regression
from sklearn.linear_model import LogisticRegression
# Gaussian naive Bayes
from sklearn.naive_bayes import GaussianNB
# Train/test split
from sklearn.model_selection import train_test_split
# Evaluation metrics and reports
from sklearn import metrics
from sklearn.metrics import classification_report, mean_squared_error
Data acquisition – demonstrates creating a SQLAlchemy engine and reading a query result into a DataFrame.
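For readers without a MySQL server at hand, the same read-a-query-into-a-DataFrame pattern can be tried with an in-memory SQLite database from the standard library. This is a sketch under stated assumptions: SQLite stands in for MySQL, and the table `table_stats` and its rows are made up for the example.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the MySQL server (assumption for illustration)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_stats (table_name TEXT, table_rows INTEGER)")
conn.executemany(
    "INSERT INTO table_stats VALUES (?, ?)",
    [("log_login", 120), ("log_error", 450), ("users", 30)],
)
conn.commit()

# Filter by name prefix and list the largest tables first.
# The prefix is passed as a bound parameter, so no % escaping is needed.
query = (
    "SELECT table_name, table_rows FROM table_stats "
    "WHERE table_name LIKE ? ORDER BY table_rows DESC"
)
df_result = pd.read_sql_query(query, conn, params=("log%",))
conn.close()

print(df_result)
```

Passing the pattern through `params` rather than formatting it into the SQL string also avoids quoting mistakes when the prefix comes from user input.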
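# The password and host appear obscured as "[email protected]" in the source; substitute your own password@host
engine = create_engine('mysql+pymysql://root:[email protected]:3306/ry?charset=utf8')
# Qualify the table as information_schema.tables rather than sending "use information_schema;"
# through engine.execute(), which was removed in SQLAlchemy 2.0
# The doubled %% is a literal % escaped for the driver's %-style parameter formatting
result_query_sql = "SELECT table_name, table_rows FROM information_schema.tables WHERE table_name LIKE 'log%%' ORDER BY table_rows DESC;"
df_result = pd.read_sql(result_query_sql, engine)
DataFrame creation and manipulation – converting lists, dictionaries, and NumPy arrays to DataFrames, renaming columns, adding new columns, and type conversion.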
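The operations just listed can be tried end to end on a small frame. The sketch below uses made-up values and hypothetical column names (`id`, `pay`, `level`); only the `salary` column and the bin edges echo the article's snippet.

```python
import numpy as np
import pandas as pd

# From a list: one column, then a second column added by assignment
df_list = pd.DataFrame([0.2, 0.8, 0.5], columns=["pred"])
df_list["actual"] = [0, 1, 1]

# From a NumPy array: column names must be supplied explicitly
arr = np.array([[1, 3000], [2, 12000], [3, 40000]])
df = pd.DataFrame(arr, columns=["id", "pay"])

# Rename a column (returns a new DataFrame)
df = df.rename(columns={"pay": "salary"})

# Add a column by binning salary into labeled, right-closed intervals
bins = [0, 5000, 20000, 50000]
df["level"] = pd.cut(df["salary"], bins, labels=["low", "medium", "high"])

# Type conversion with astype
df["salary"] = df["salary"].astype(float)
df["id"] = df["id"].astype(str)

print(df.dtypes)
```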
# List to DataFrame (pred and test_target are defined elsewhere)
df_result = pd.DataFrame(pred, columns=['pred'])
df_result['actual'] = test_target
# Dict to DataFrame
df_test = pd.DataFrame({'A': [0.587221, 0.135673, 0.135673, 0.135673, 0.135673],
                        'B': ['a', 'b', 'c', 'd', 'e'],
                        'C': [1, 2, 3, 4, 5]})
# Rename a column ('本体油位' is the original Chinese column name)
data_scaled = data_scaled.rename(columns={'本体油位': 'OILLV'})
# Add a column: bin salary into labeled categories
bins = [0, 5000, 20000, 50000]
group_names = ['low', 'medium', 'high']
df['categories'] = pd.cut(df['salary'], bins, labels=group_names)
Missing‑value handling – checking for nulls, counting missing values per column, and filling them with mode or median.
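A small self-contained sketch of that workflow follows. The column names loosely mirror the heart-disease columns used in the article's snippet, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (assumption for illustration) with one missing value per column type
df = pd.DataFrame({
    "Thal": ["normal", "fixed", None, "normal", "reversable"],
    "Chol": [233.0, np.nan, 250.0, 204.0, 1000.0],
    "Age": [63, 67, 41, 56, 57],
})

# Is anything missing at all, and how much per column?
has_null = df.isnull().values.any()
null_counts = df.isnull().sum()

# Categorical column: fill with the most frequent value (the mode)
df["Thal"] = df["Thal"].fillna(df["Thal"].mode(dropna=True)[0])

# Float columns: fill with the median, which is less sensitive to outliers than the mean
for col in df.select_dtypes(include="float").columns:
    df[col] = df[col].fillna(df[col].median())

print(null_counts)
print(df)
```

Assigning the result of `fillna` back to the column, rather than calling it with `inplace=True` on a column selection, keeps the update explicit and avoids chained-assignment pitfalls in recent pandas versions.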
# Check whether any value is missing
has_null = df.isnull().values.any()
# Count missing values per column
null_counts = df.isnull().sum()
# Fill with the mode (assign back instead of calling fillna(inplace=True) on a column selection)
heart_df['Thal'] = heart_df['Thal'].fillna(heart_df['Thal'].mode(dropna=True)[0])
# Fill float columns with the median
for col in heart_df_encoded.columns:
    if heart_df_encoded[col].dtype == 'float':
        heart_df_encoded[col] = heart_df_encoded[col].fillna(heart_df_encoded[col].median())
One‑hot encoding, value replacement, and column deletion – using pd.get_dummies, replace, and drop.
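These three steps can be combined on a toy frame as sketched below. The `ChestPain` column and all values are assumptions for the example; `AHD` and `Unnamed: 0` echo the names in the article's snippet.

```python
import pandas as pd

# Toy data (assumption for illustration)
df = pd.DataFrame({
    "ChestPain": ["typical", "asymptomatic", "typical"],
    "AHD": ["No", "Yes", "Yes"],
    "Unnamed: 0": [0, 1, 2],  # leftover index column, as often produced by CSV round-trips
})

# Drop the columns that carry no information
df = df.drop(columns=["Unnamed: 0"])

# Replace labels with numbers using a nested {column: {old: new}} mapping
df = df.replace({"AHD": {"No": 0, "Yes": 1}})

# One-hot encode the remaining categorical column
df_encoded = pd.get_dummies(df, columns=["ChestPain"])

print(df_encoded)
```

Restricting `get_dummies` with `columns=` makes it explicit which columns are expanded; without it, every object or categorical column is encoded.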
# One-hot encode categorical columns
df_encoded = pd.get_dummies(df_data)
# Map category labels to numbers
num_encode = {'AHD': {'No': 0, 'Yes': 1}}
heart_df.replace(num_encode, inplace=True)
# Drop columns
df_jj2.drop(['coll_time', 'polar', 'conn_type', 'phase', 'id', 'Unnamed: 0'], axis=1, inplace=True)
Data selection, filtering, sorting, grouping, and aggregation – covering iloc, loc, boolean indexing, sort_values, groupby, pivot_table, and agg patterns.
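To see each of these patterns produce a concrete result, here is a minimal sketch on a five-row frame. The data and the `city` column are invented for the example; `positionId`, `salary`, and `score` echo the names in the article's pivot-table call.

```python
import pandas as pd

# Toy data (assumption for illustration)
df = pd.DataFrame({
    "positionId": [1, 1, 2, 2, 3],
    "city": ["A", "B", "A", "B", "A"],
    "salary": [10, 20, 30, 50, 40],
    "score": [0.55, 0.90, 0.65, 0.30, 0.60],
})

# iloc selects by position, loc selects by label
third_row = df.iloc[2]
first_salary = df.loc[0, "salary"]

# Boolean indexing: combine conditions with & and wrap each one in parentheses
filtered = df[(df["score"] > 0.5) & (df["score"] < 0.7)]

# Sort by city ascending, then salary descending
ordered = df.sort_values(["city", "salary"], ascending=[True, False])

# Group and aggregate: one statistic, then several named statistics at once
city_mean = df.groupby("city")["salary"].mean()
city_stats = df.groupby("city").agg(
    salary_max=("salary", "max"),
    score_mean=("score", "mean"),
)

# Pivot table: mean of each value column per positionId (mean is the default aggfunc)
pivot = pd.pivot_table(df, values=["salary", "score"], index="positionId")

print(pivot)
```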
# Select the 33rd row (iloc is zero-based)
row33 = df.iloc[32]
# Boolean filtering
filtered = df[(df['col'] > 0.5) & (df['col'] < 0.7)]
# Sort by col1 ascending, then col2 descending
df.sort_values(['col1', 'col2'], ascending=[True, False], inplace=True)
# Group and aggregate
group_mean = df.groupby('col1')['col2'].mean()
# Pivot table (the default aggregation is the mean)
pivot = pd.pivot_table(df, values=['salary', 'score'], index='positionId')
Statistical summary and correlation – using describe, mean, std, and corr to explore data distributions.
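A short sketch on a frame that mixes numeric and text columns shows what each call returns. The data is invented so that `y` is perfectly correlated with `x` and `z` is perfectly anti-correlated with it.

```python
import pandas as pd

# Toy data (assumption for illustration): y = 2x, z decreases as x increases
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, 2.0, 1.0],
    "label": list("abcd"),
})

# describe() summarizes numeric columns by default (count, mean, std, quartiles, ...)
summary = df.describe()

# numeric_only=True skips the text column instead of raising an error
means = df.mean(numeric_only=True)
stds = df.std(numeric_only=True)

# Pairwise Pearson correlation between numeric columns
corr = df.corr(numeric_only=True)

print(summary)
print(corr)
```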
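df.describe()
# numeric_only=True skips non-numeric columns (needed for mixed frames since pandas 2.0)
df.mean(numeric_only=True)
df.std(numeric_only=True)
df.corr(numeric_only=True)
Visualization – line plot, scatter plot, horizontal bar chart, and heatmap with matplotlib and seaborn.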
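The four chart types can be drawn side by side in one figure, as in the sketch below. It makes several assumptions: the data is random, the off-screen `Agg` backend is selected so the script runs without a display, and the heatmap is drawn with matplotlib's `imshow` rather than seaborn's `heatmap` to keep the example to two plotting dependencies.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random-walk data (assumption for illustration)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(30, 3)).cumsum(axis=0), columns=["a", "b", "c"])

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Line plot: one line per column
df.plot(ax=axes[0, 0], legend=True)
axes[0, 0].set_title("Line plot")

# Scatter plot of two columns
axes[0, 1].scatter(df["a"], df["b"], c="red", marker="o", label="a vs b")
axes[0, 1].set_xlabel("a")
axes[0, 1].set_ylabel("b")
axes[0, 1].legend(loc="upper left")
axes[0, 1].set_title("Scatter plot")

# Horizontal bar chart of the 10 largest values in one column
df["a"].nlargest(10).plot(kind="barh", ax=axes[1, 0])
axes[1, 0].set_title("Top 10 values of a")

# Heatmap of the correlation matrix
corr = df.corr()
im = axes[1, 1].imshow(corr.to_numpy(), cmap="RdYlGn", vmin=-1, vmax=1)
axes[1, 1].set_xticks(range(len(corr.columns)))
axes[1, 1].set_xticklabels(corr.columns)
axes[1, 1].set_yticks(range(len(corr.index)))
axes[1, 1].set_yticklabels(corr.index)
axes[1, 1].set_title("Correlation heatmap")
fig.colorbar(im, ax=axes[1, 1])

fig.tight_layout()
```

In a notebook the figure renders inline; in a script, `fig.savefig(...)` writes it to a file of your choosing.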
# Line plot
fig, ax = plt.subplots()
df.plot(legend=True, ax=ax)
plt.legend(loc=1)
plt.show()
# Scatter plot (iloc gives positional column access; df[:, 0] only works on NumPy arrays)
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c='red', marker='o', label='label0')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(loc=2)
plt.show()
# Horizontal bar chart of the 10 largest values (df must be a Series here; DataFrame.nlargest also needs a columns argument)
df.nlargest(10).plot(kind='barh')
# Heatmap of the correlation matrix
df_corr = combine.corr()
plt.figure(figsize=(20, 20))
sns.heatmap(df_corr, annot=True, cmap='RdYlGn')
String manipulation functions – a catalog of 66 frequently used pandas string methods such as cat, contains, startswith, endswith, count, get, len, upper, lower, pad, repeat, slice_replace, replace, split, strip, findall, extract, and extractall, each illustrated with concise examples.
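A handful of the listed methods are sketched below through the `.str` accessor. The sample strings are invented for the example, and this covers only a small subset of the catalog.

```python
import pandas as pd

# Sample strings (assumption for illustration)
s = pd.Series(["  Apple-10 ", "banana-25", "Cherry-7"])

clean = s.str.strip()                      # remove surrounding whitespace
lower = clean.str.lower()                  # lowercase every element
has_an = lower.str.contains("an")          # substring test, element-wise
starts_b = lower.str.startswith("b")       # prefix test
lengths = clean.str.len()                  # string length per element

# split with expand=True returns a DataFrame, one column per piece
parts = clean.str.split("-", expand=True)

# extract pulls the first regex group; findall returns every match as a list
numbers = clean.str.extract(r"(\d+)", expand=False).astype(int)
digits = clean.str.findall(r"\d")

# pad to a fixed width, replace a literal substring, join all elements
padded = numbers.astype(str).str.pad(3, side="left", fillchar="0")
replaced = clean.str.replace("-", ":", regex=False)
joined = clean.str.cat(sep="|")

print(parts)
print(padded.tolist())
```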
The original article also includes several illustrative images and a link to the source page for further reading.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
