Top 25 Pandas Tricks for DataFrame Manipulation and Analysis
This tutorial showcases a comprehensive set of pandas techniques—including reading data from the clipboard, random sampling, multi‑condition filtering, handling missing values, string splitting, list expansion, multi‑function aggregation, slicing, descriptive statistics, categorical conversion, DataFrame styling, and profiling—to efficiently explore and transform DataFrames in Python.
The article presents a collection of practical pandas tricks for working with DataFrames, covering data import, sampling, filtering, missing‑value handling, aggregation, reshaping, styling, and profiling.
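Data can be quickly loaded from the clipboard using pd.read_clipboard(), a top-level pandas function rather than a DataFrame method. Because the result depends on whatever was last copied, this approach is not recommended for reproducible pipelines.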
To split a DataFrame into two random subsets, draw a 75% sample with sample = df.sample(frac=0.75) and obtain the remaining 25% of rows with df.drop(sample.index). This relies on the index labels being unique.
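A minimal sketch of the split described above. The column name and the random_state seed are illustrative assumptions, not part of the original tip; the seed only makes the sample repeatable.

```python
import pandas as pd

# Hypothetical data; the default RangeIndex guarantees unique labels.
df = pd.DataFrame({"value": range(8)})

# Draw 75% of the rows at random (random_state is an added assumption
# so that repeated runs pick the same rows).
train = df.sample(frac=0.75, random_state=0)

# The remaining 25%: drop the sampled index labels from the original.
test = df.drop(train.index)

print(len(train), len(test))
```

If the index contains duplicate labels, drop() would remove every row sharing a sampled label, so resetting the index first is the safer choice in that case.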
Multiple categorical filters can be expressed with boolean logic or more cleanly with df[df['genre'].isin(['Action', 'Drama', 'Western'])]. The inverse filter uses the tilde operator: df[~df['genre'].isin([...])].
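The isin filter and its inverse can be sketched as follows. The movies frame and its titles are made-up sample data.

```python
import pandas as pd

# Hypothetical sample data.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Action", "Comedy", "Drama", "Horror"],
})

wanted = ["Action", "Drama", "Western"]

# Boolean mask: True where the genre is one of the wanted values.
mask = movies["genre"].isin(wanted)

selected = movies[mask]   # rows whose genre is in the list
others = movies[~mask]    # the inverse selection via the tilde operator

print(selected)
print(others)
```

Storing the mask in a variable avoids recomputing isin when both the selection and its complement are needed.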
Identify the most frequent categories with top_counts = df['genre'].value_counts().nlargest(3), then keep only the rows in those categories with df[df['genre'].isin(top_counts.index)].
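As a small illustration of the top-category filter, assuming a made-up genre column with clearly different frequencies (no ties):

```python
import pandas as pd

# Hypothetical data: Action x4, Drama x3, Comedy x2, Horror x1.
df = pd.DataFrame({
    "genre": ["Action"] * 4 + ["Drama"] * 3 + ["Comedy"] * 2 + ["Horror"],
})

# The three most frequent genres; the index holds the category names.
top_counts = df["genre"].value_counts().nlargest(3)

# Keep only the rows that belong to one of those genres.
filtered = df[df["genre"].isin(top_counts.index)]

print(top_counts)
print(len(filtered))
```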
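Missing values are counted per column with df.isna().sum(); rows containing them can be removed with df.dropna(), and columns with df.dropna(axis='columns'). To keep only the columns in which at least 90% of the values are present, set a threshold on the column axis, e.g., df.dropna(thresh=int(len(df)*0.9), axis='columns').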
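A hedged sketch of the missing-value workflow, using an invented three-column frame in which one column is almost complete and another is mostly empty:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" is complete, "b" has 1 gap, "c" has 8 gaps.
df = pd.DataFrame({
    "a": np.arange(10, dtype=float),
    "b": [1.0, np.nan] + [2.0] * 8,
    "c": [1.0, 2.0] + [np.nan] * 8,
})

# Number of missing values in each column.
missing = df.isna().sum()

# Keep columns that have at least ~90% non-missing values.
# thresh counts NON-missing entries, and axis="columns" applies it per column.
mostly_complete = df.dropna(thresh=int(len(df) * 0.9), axis="columns")

print(missing)
print(list(mostly_complete.columns))
```

Without axis="columns", the same thresh value would be compared against the number of non-missing values in each row, which is rarely what a row-count-based threshold intends.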
String columns can be split into multiple columns using df['name'].str.split(' ', expand=True), optionally assigning the result back to the original DataFrame.
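For illustration, the split-and-assign step might look like this. The names and the target column labels (first, middle, last) are assumptions.

```python
import pandas as pd

# Hypothetical data: every name has exactly three space-separated parts.
df = pd.DataFrame({"name": ["John Arthur Doe", "Jane Ann Smith"]})

# expand=True returns a DataFrame with one column per split piece.
parts = df["name"].str.split(" ", expand=True)

# Assign the pieces back to the original DataFrame as new columns.
df[["first", "middle", "last"]] = parts

print(df)
```

When the strings contain differing numbers of pieces, the shorter rows are padded with missing values, so the assignment above only fits data with a consistent shape.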
List‑like Series can be expanded into a DataFrame via df['list_col'].apply(pd.Series) and concatenated with the original using pd.concat([df, new_df], axis=1).
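A short sketch of the list expansion, with invented data and hypothetical output column names x, y, z:

```python
import pandas as pd

# Hypothetical data: each cell of list_col holds a list of equal length.
df = pd.DataFrame({
    "id": ["a", "b"],
    "list_col": [[1, 2, 3], [4, 5, 6]],
})

# Turn each list into a row of a new DataFrame (one column per element).
new_df = df["list_col"].apply(pd.Series)
new_df.columns = ["x", "y", "z"]  # illustrative names

# Place the new columns next to the original ones.
wide = pd.concat([df, new_df], axis=1)

print(wide)
```

apply(pd.Series) is convenient but slow on large frames; it is shown here because it is the approach the text describes.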
Aggregate multiple functions with df.groupby('order_id')['item_price'].agg(['sum', 'count']). To retain the original shape after aggregation, use df.groupby('order_id')['item_price'].transform('sum') and assign the result to a new column.
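The difference between agg and transform can be seen in a small example. The order data and the added column names (order_total, share) are assumptions.

```python
import pandas as pd

# Hypothetical order lines.
orders = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 2],
    "item_price": [2.5, 3.0, 1.0, 4.0, 5.0],
})

# agg: one row per order, one column per aggregation function.
summary = orders.groupby("order_id")["item_price"].agg(["sum", "count"])

# transform: same length as the original, so it can become a new column.
orders["order_total"] = orders.groupby("order_id")["item_price"].transform("sum")

# With the total on every row, per-row ratios are straightforward.
orders["share"] = orders["item_price"] / orders["order_total"]

print(summary)
print(orders)
```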
Row and column slicing is performed with df.loc[row_slice, col_slice], and descriptive statistics can be limited to a five‑number summary using df.describe().loc[['min','25%','50%','75%','max']].
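A minimal sketch of both operations on an invented frame with string index labels. Note that label slices passed to loc include both endpoints.

```python
import pandas as pd

# Hypothetical data with labelled rows.
df = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50], "c": list("vwxyz")},
    index=list("pqrst"),
)

# Rows "q" through "s" and columns "a" through "b", endpoints included.
subset = df.loc["q":"s", "a":"b"]

# describe() covers the numeric columns; keep only the five-number summary.
five_num = df.describe().loc[["min", "25%", "50%", "75%", "max"]]

print(subset)
print(five_num)
```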
Continuous variables can be binned into categories with pd.cut(df['Age'], bins=[0,18,25,99], labels=['child','young adult','adult']), producing an ordered categorical Series.
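The binning call can be tried on a few made-up ages. By default pd.cut uses right-closed intervals, so the bins here are (0, 18], (18, 25], and (25, 99].

```python
import pandas as pd

# Hypothetical ages, including values that sit exactly on bin edges.
df = pd.DataFrame({"Age": [5, 18, 19, 25, 40]})

# Each bin includes its right edge: 18 is "child", 25 is "young adult".
df["age_group"] = pd.cut(
    df["Age"],
    bins=[0, 18, 25, 99],
    labels=["child", "young adult", "adult"],
)

print(df)
print(df["age_group"].cat.ordered)
```

Values outside the outermost edges (for example an age of 0 or 120 with these bins) would come back as missing.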
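DataFrames can be styled for better readability through the df.style accessor: format numbers with df.style.format({'Close': '${:,.2f}', 'Volume': '{:,}'}), highlight minima and maxima with highlight_min() and highlight_max(), apply color gradients with background_gradient(), or draw in-cell bar charts with df.style.bar(). These methods return a Styler, so they can be chained.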
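For rapid exploratory analysis, pandas_profiling.ProfileReport(df) builds an interactive HTML report with an overview, per‑variable statistics, correlations, missing‑value analysis, and sample rows. The package has since been renamed ydata-profiling, so recent installations import ProfileReport from ydata_profiling instead.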