Mastering Data Queries: Pandas vs SQL – A Step‑by‑Step Comparison
This tutorial walks data analysts through a side‑by‑side comparison of common data‑manipulation tasks using pandas in Python and SQL, covering everything from basic selects and filters to joins, aggregations, unions, ordering, case expressions, and data updates with clear code examples.
As a data analyst, the most common tools are SQL (MySQL, Hive, etc.) and Python's pandas library. This article provides a side‑by‑side comparison of typical data‑manipulation operations on a fictional order dataset, showing equivalent pandas and SQL commands.
Preparation
Use Jupyter Notebook for pandas and load the CSV file:
import pandas as pd
order_data = pd.read_csv('order.csv')For SQL, run the supplied SQL script to create the t_order table and insert the same data (e.g., with Navicat).
Basic Queries
1. View all rows
pandas: simply print the DataFrame order_data.
SQL:
select * from t_order;2. View the first N rows
pandas: order_data.head(10) SQL:
select * from t_order limit 10;Selecting Columns
To retrieve specific columns, pandas can use bracket notation, loc, or iloc. SQL uses a comma‑separated list of column names.
Distinct Values
Count distinct users who placed orders.
pandas: order_data['uid'].unique() (or order_data['uid'].nunique() for the count).
SQL: select distinct uid from t_order; and
select count(distinct uid) from t_order;Filtering Rows
Single condition
pandas: order_data[order_data['uid'] == 10003] SQL:
select * from t_order where uid = 10003;Multiple conditions (AND)
pandas:
order_data[(order_data['uid'] == 10003) & (order_data['amount'] > 50)]SQL:
select * from t_order where uid = 10003 and amount > 50;Multiple conditions (OR)
pandas:
order_data[(order_data['uid'] == 10003) | (order_data['amount'] > 50)]SQL:
select * from t_order where uid = 10003 or amount > 50;Handling NULL values:
# pandas – rows where uid is not null
order_data[order_data['uid'].notna()]
# pandas – rows where uid is null
order_data[order_data['uid'].isna()]SQL equivalents use where uid is not null and where uid is null.
Aggregation (GROUP BY)
Count orders per user:
pandas:
order_data.groupby('uid').size().reset_index(name='order_count')SQL:
select uid, count(*) as order_count from t_order group by uid;Multiple aggregations (order count and total amount):
order_data.groupby('uid').agg({'order_id':'count', 'amount':'sum'}).reset_index() select uid, count(*) as order_count, sum(amount) as total_amount from t_order group by uid;Rename result columns: pandas rename(columns={'order_count':'cnt'}), SQL as cnt.
Joins
Load a second CSV user.csv (table t_user) containing user nickname and age.
user_data = pd.read_csv('user.csv')Left join
pandas:
merged = pd.merge(order_data, user_data, on='uid', how='left')SQL:
select *
from t_order a left join t_user b on a.uid = b.uid;Other join types (inner, right, full) use the corresponding how parameter in pandas and the matching SQL keywords.
# pandas inner join
pd.merge(order_data, user_data, on='uid', how='inner') SELECT *
FROM t_order a
INNER JOIN t_user b ON a.uid = b.uid;Union
Combine two order tables ( order and order2) vertically.
pandas (no deduplication):
order_union = pd.concat([order_data, order_data2])SQL (no deduplication):
select * from t_order
union all
select * from t_order2;For deduplication, pandas adds .drop_duplicates() and SQL uses union (without all).
order_union = pd.concat([order_data, order_data2]).drop_duplicates() select * from t_order
union
select * from t_order2;Ordering
Sort by order count descending:
# pandas
result.sort_values('order_count', ascending=False, inplace=True) SELECT uid, COUNT(*) as order_count
FROM t_order
GROUP BY uid
ORDER BY order_count DESC;CASE WHEN
Bucket users by total amount:
# pandas using map (alternative: pd.cut)
order_data['amount_bucket'] = order_data['amount'].map(lambda x: '[0-300)' if x < 300 else '[300-600)' if x < 600 else '[600-900)') SELECT uid,
CASE
WHEN total_amount < 300 THEN '[0-300)'
WHEN total_amount < 600 THEN '[300-600)'
ELSE '[600-900)'
END AS amount_bucket
FROM (
SELECT uid, SUM(amount) AS total_amount
FROM t_order
GROUP BY uid
) sub;Update and Delete
Update ages less than 20 to 20:
# pandas
order_data.loc[order_data['age'] < 20, 'age'] = 20 UPDATE t_user SET age = 20 WHERE age < 20;Delete rows where age = 30:
# pandas – filter out rows
order_data = order_data[order_data['age'] != 30] DELETE FROM t_user WHERE age = 30;Drop a column:
# pandas
order_data.drop(columns=['age'], inplace=True) ALTER TABLE t_order DROP COLUMN age;Conclusion
The parallel examples demonstrate that most data‑manipulation tasks—selection, filtering, aggregation, joining, union, ordering, conditional logic, and data modification—can be expressed in both pandas and SQL with comparable syntax. Understanding the correspondence helps analysts switch between the two environments efficiently and choose the tool that best fits the data source and workflow.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
