Fundamentals 15 min read

Mastering Data Queries: Pandas vs SQL – A Step‑by‑Step Comparison

This tutorial walks data analysts through a side‑by‑side comparison of common data‑manipulation tasks using pandas in Python and SQL, covering everything from basic selects and filters to joins, aggregations, unions, ordering, case expressions, and data updates with clear code examples.

ITPUB
ITPUB
ITPUB
Mastering Data Queries: Pandas vs SQL – A Step‑by‑Step Comparison

As a data analyst, the most common tools are SQL (MySQL, Hive, etc.) and Python's pandas library. This article provides a side‑by‑side comparison of typical data‑manipulation operations on a fictional order dataset, showing equivalent pandas and SQL commands.

Preparation

Use Jupyter Notebook for pandas and load the CSV file:

import pandas as pd
order_data = pd.read_csv('order.csv')

For SQL, run the supplied SQL script to create the t_order table and insert the same data (e.g., with Navicat).

Basic Queries

1. View all rows

pandas: simply print the DataFrame order_data.

SQL:

select * from t_order;

2. View the first N rows

pandas: order_data.head(10) SQL:

select * from t_order limit 10;

Selecting Columns

To retrieve specific columns, pandas can use bracket notation, loc, or iloc. SQL uses a comma‑separated list of column names.

Distinct Values

Count distinct users who placed orders.

pandas: order_data['uid'].unique() (or order_data['uid'].nunique() for the count).

SQL: select distinct uid from t_order; and

select count(distinct uid) from t_order;

Filtering Rows

Single condition

pandas: order_data[order_data['uid'] == 10003] SQL:

select * from t_order where uid = 10003;

Multiple conditions (AND)

pandas:

order_data[(order_data['uid'] == 10003) & (order_data['amount'] > 50)]

SQL:

select * from t_order where uid = 10003 and amount > 50;

Multiple conditions (OR)

pandas:

order_data[(order_data['uid'] == 10003) | (order_data['amount'] > 50)]

SQL:

select * from t_order where uid = 10003 or amount > 50;

Handling NULL values:

# pandas – rows where uid is not null
order_data[order_data['uid'].notna()]
# pandas – rows where uid is null
order_data[order_data['uid'].isna()]

SQL equivalents use where uid is not null and where uid is null.

Aggregation (GROUP BY)

Count orders per user:

pandas:

order_data.groupby('uid').size().reset_index(name='order_count')

SQL:

select uid, count(*) as order_count from t_order group by uid;

Multiple aggregations (order count and total amount):

order_data.groupby('uid').agg({'order_id':'count', 'amount':'sum'}).reset_index()
select uid, count(*) as order_count, sum(amount) as total_amount from t_order group by uid;

Rename result columns: pandas rename(columns={'order_count':'cnt'}), SQL as cnt.

Joins

Load a second CSV user.csv (table t_user) containing user nickname and age.

user_data = pd.read_csv('user.csv')

Left join

pandas:

merged = pd.merge(order_data, user_data, on='uid', how='left')

SQL:

select *
from t_order a left join t_user b on a.uid = b.uid;

Other join types (inner, right, full) use the corresponding how parameter in pandas and the matching SQL keywords.

# pandas inner join
pd.merge(order_data, user_data, on='uid', how='inner')
SELECT *
FROM t_order a
INNER JOIN t_user b ON a.uid = b.uid;

Union

Combine two order tables ( order and order2) vertically.

pandas (no deduplication):

order_union = pd.concat([order_data, order_data2])

SQL (no deduplication):

select * from t_order
union all
select * from t_order2;

For deduplication, pandas adds .drop_duplicates() and SQL uses union (without all).

order_union = pd.concat([order_data, order_data2]).drop_duplicates()
select * from t_order
union
select * from t_order2;

Ordering

Sort by order count descending:

# pandas
result.sort_values('order_count', ascending=False, inplace=True)
SELECT uid, COUNT(*) as order_count
FROM t_order
GROUP BY uid
ORDER BY order_count DESC;

CASE WHEN

Bucket users by total amount:

# pandas using map (alternative: pd.cut)
order_data['amount_bucket'] = order_data['amount'].map(lambda x: '[0-300)' if x < 300 else '[300-600)' if x < 600 else '[600-900)')
SELECT uid,
       CASE
         WHEN total_amount < 300 THEN '[0-300)'
         WHEN total_amount < 600 THEN '[300-600)'
         ELSE '[600-900)'
       END AS amount_bucket
FROM (
  SELECT uid, SUM(amount) AS total_amount
  FROM t_order
  GROUP BY uid
) sub;

Update and Delete

Update ages less than 20 to 20:

# pandas
order_data.loc[order_data['age'] < 20, 'age'] = 20
UPDATE t_user SET age = 20 WHERE age < 20;

Delete rows where age = 30:

# pandas – filter out rows
order_data = order_data[order_data['age'] != 30]
DELETE FROM t_user WHERE age = 30;

Drop a column:

# pandas
order_data.drop(columns=['age'], inplace=True)
ALTER TABLE t_order DROP COLUMN age;

Conclusion

The parallel examples demonstrate that most data‑manipulation tasks—selection, filtering, aggregation, joining, union, ordering, conditional logic, and data modification—can be expressed in both pandas and SQL with comparable syntax. Understanding the correspondence helps analysts switch between the two environments efficiently and choose the tool that best fits the data source and workflow.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

SQLdata analysisTutorialComparisonpandas
ITPUB
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.