How to Remove Duplicate Rows in SQL Using DISTINCT, GROUP BY, and ROW_NUMBER
This guide explains three SQL techniques—DISTINCT, GROUP BY, and the ROW_NUMBER window function—for deduplicating records, compares their behavior and performance, and provides concrete query examples with a sample task table and additional test cases.
Problem Overview
When extracting data with SQL, duplicate rows often appear. For metrics such as unique visitors (UV) you need to remove these duplicates. The example uses a Task table with columns task_id, order_id and start_time. Because a task can correspond to multiple orders, task_id is not unique.
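To make the problem concrete, here is a minimal sketch that builds a small Task table in an in-memory SQLite database using Python's standard sqlite3 module. The rows, order numbers, and timestamps are hypothetical; the source only names the columns. Task 1 is linked to two orders, so its task_id appears twice and a plain row count overstates the number of tasks.

```python
import sqlite3

# Hypothetical sample data: only the column names come from the article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Task (task_id INTEGER, order_id INTEGER, start_time TEXT)")
conn.executemany(
    "INSERT INTO Task VALUES (?, ?, ?)",
    [
        (1, 101, "2024-01-01 09:00:00"),  # task 1, first order
        (1, 102, "2024-01-01 10:00:00"),  # task 1 again, second order
        (2, 201, "2024-01-02 08:30:00"),
    ],
)

# Without deduplication, task_id 1 is listed (and counted) twice.
task_ids = [row[0] for row in conn.execute("SELECT task_id FROM Task ORDER BY task_id")]
naive_count = conn.execute("SELECT COUNT(task_id) FROM Task").fetchone()[0]

print(task_ids)     # [1, 1, 2]
print(naive_count)  # 3, although only 2 distinct tasks exist
```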
1. DISTINCT
-- List all unique task_id values
SELECT DISTINCT task_id
FROM Task;
-- Count distinct task_id values
SELECT COUNT(DISTINCT task_id) AS task_num
FROM Task;
Characteristics
Removes duplicate rows across all selected columns.
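Can be slower than the alternatives on some engines; Hive, for example, has historically routed COUNT(DISTINCT ...) through a single reducer. Many optimizers run DISTINCT and GROUP BY with the same plan, so check the execution plan before assuming a difference.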
Commonly combined with COUNT to obtain the number of unique records.
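The first and third points can be checked with a short sketch, again assuming SQLite through Python's sqlite3 module and hypothetical rows. DISTINCT applies to the whole select list, so adding order_id changes which rows count as duplicates. The ORDER BY clauses are there only to make the output deterministic.

```python
import sqlite3

# Hypothetical rows; the first two are exact duplicates on (task_id, order_id).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Task (task_id INTEGER, order_id INTEGER, start_time TEXT)")
conn.executemany(
    "INSERT INTO Task VALUES (?, ?, ?)",
    [
        (1, 101, "2024-01-01 09:00:00"),
        (1, 101, "2024-01-01 09:00:00"),
        (1, 102, "2024-01-01 10:00:00"),
        (2, 201, "2024-01-02 08:30:00"),
    ],
)

# DISTINCT on one column: one row per task_id.
distinct_ids = [
    row[0]
    for row in conn.execute("SELECT DISTINCT task_id FROM Task ORDER BY task_id")
]

# DISTINCT on two columns: one row per (task_id, order_id) combination.
distinct_pairs = conn.execute(
    "SELECT DISTINCT task_id, order_id FROM Task ORDER BY task_id, order_id"
).fetchall()

# COUNT(DISTINCT ...) returns the number of unique task_id values.
task_num = conn.execute("SELECT COUNT(DISTINCT task_id) FROM Task").fetchone()[0]

print(distinct_ids)    # [1, 2]
print(distinct_pairs)  # [(1, 101), (1, 102), (2, 201)]
print(task_num)        # 2
```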
2. GROUP BY
-- List unique task_id values (NULL is treated as a value)
SELECT task_id
FROM Task
GROUP BY task_id;
-- Count distinct task_id using a sub‑query
SELECT COUNT(task_id) AS task_num
FROM (
SELECT task_id
FROM Task
GROUP BY task_id
) tmp;
Characteristics
Deduplicates only the columns listed in the GROUP BY clause.
Other columns in the SELECT list must be aggregated or omitted.
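As an illustration of the aggregation rule, the sketch below (SQLite via Python's sqlite3, hypothetical rows) collapses each task_id to one row and summarizes the other columns with COUNT and MIN. It also shows the NULL behavior mentioned in the query comment: GROUP BY keeps NULL as its own group, while COUNT(task_id) on the sub-query skips it.

```python
import sqlite3

# Hypothetical rows, including one with a NULL task_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Task (task_id INTEGER, order_id INTEGER, start_time TEXT)")
conn.executemany(
    "INSERT INTO Task VALUES (?, ?, ?)",
    [
        (1, 101, "2024-01-01 09:00:00"),
        (1, 102, "2024-01-01 10:00:00"),
        (2, 201, "2024-01-02 08:30:00"),
        (None, 301, "2024-01-03 12:00:00"),
    ],
)

# One row per task_id; the non-grouped columns are aggregated.
per_task = conn.execute(
    """
    SELECT task_id, COUNT(order_id) AS order_num, MIN(start_time) AS first_start
    FROM Task
    WHERE task_id IS NOT NULL
    GROUP BY task_id
    ORDER BY task_id
    """
).fetchall()

# GROUP BY produces three groups: 1, 2, and NULL.
group_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT task_id FROM Task GROUP BY task_id) tmp"
).fetchone()[0]

# COUNT(task_id) ignores the NULL group, matching COUNT(DISTINCT task_id).
task_num = conn.execute(
    "SELECT COUNT(task_id) FROM (SELECT task_id FROM Task GROUP BY task_id) tmp"
).fetchone()[0]

print(per_task)     # [(1, 2, '2024-01-01 09:00:00'), (2, 1, '2024-01-02 08:30:00')]
print(group_count)  # 3
print(task_num)     # 2
```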
3. ROW_NUMBER (Window Function)
-- Count unique task_id values by counting only the first row (rn = 1) in each task_id partition, ordered by start_time
SELECT COUNT(CASE WHEN rn = 1 THEN task_id END) AS task_num
FROM (
SELECT task_id,
ROW_NUMBER() OVER (PARTITION BY task_id ORDER BY start_time) AS rn
FROM Task
) tmp;
Characteristics
Requires a database that supports window functions (e.g., Hive, Oracle, PostgreSQL, SQL Server).
The PARTITION BY clause groups rows by task_id; ORDER BY determines the row order inside each group.
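Can be more efficient than COUNT(DISTINCT) on large datasets in distributed engines such as Hive, because the work is split across task_id partitions. It still sorts the rows within every partition, so the gain depends on the engine and the data.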
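Beyond counting, ROW_NUMBER can return the whole first row of each group, which DISTINCT and GROUP BY cannot do directly. The sketch below uses hypothetical rows and assumes an SQLite build with window-function support (version 3.25 or later), accessed through Python's sqlite3 module.

```python
import sqlite3

# Hypothetical rows; assumes the bundled SQLite is 3.25+ (window functions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Task (task_id INTEGER, order_id INTEGER, start_time TEXT)")
conn.executemany(
    "INSERT INTO Task VALUES (?, ?, ?)",
    [
        (1, 102, "2024-01-01 10:00:00"),
        (1, 101, "2024-01-01 09:00:00"),
        (2, 201, "2024-01-02 08:30:00"),
    ],
)

# Number the rows inside each task_id partition by start_time,
# then keep only the earliest row (rn = 1) with all of its columns.
first_rows = conn.execute(
    """
    SELECT task_id, order_id, start_time
    FROM (
        SELECT task_id, order_id, start_time,
               ROW_NUMBER() OVER (PARTITION BY task_id ORDER BY start_time) AS rn
        FROM Task
    ) tmp
    WHERE rn = 1
    ORDER BY task_id
    """
).fetchall()

# The counting form from the article: one rn = 1 row per task_id.
task_num = conn.execute(
    """
    SELECT COUNT(CASE WHEN rn = 1 THEN task_id END)
    FROM (
        SELECT task_id,
               ROW_NUMBER() OVER (PARTITION BY task_id ORDER BY start_time) AS rn
        FROM Task
    ) tmp
    """
).fetchone()[0]

print(first_rows)  # [(1, 101, '2024-01-01 09:00:00'), (2, 201, '2024-01-02 08:30:00')]
print(task_num)    # 2
```

If two rows in a partition share the same start_time, which one receives rn = 1 is not defined; adding a tie-breaker column such as order_id to the ORDER BY makes the choice repeatable.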
4. Comparison Using a Test Table
-- DISTINCT examples
SELECT DISTINCT user_id FROM Test; -- returns 1, 2
SELECT DISTINCT user_id, user_type FROM Test; -- returns (1,1), (1,2), (2,1)
-- GROUP BY examples
SELECT user_id FROM Test GROUP BY user_id; -- returns 1, 2
SELECT user_id, user_type FROM Test GROUP BY user_id, user_type; -- returns (1,1), (1,2), (2,1)
The examples illustrate that DISTINCT removes duplicate rows across all selected columns, while GROUP BY deduplicates only the columns listed after it. When the select list and the GROUP BY list name the same columns, as here, both forms return the same rows.
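The contents of the Test table are not shown, so the sketch below assumes four hypothetical rows chosen to be consistent with the results quoted in the comments. It runs the four queries in SQLite through Python's sqlite3 module, with ORDER BY added only so the output order is deterministic.

```python
import sqlite3

# Hypothetical Test rows chosen to match the results quoted in the article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Test (user_id INTEGER, user_type INTEGER)")
conn.executemany(
    "INSERT INTO Test VALUES (?, ?)",
    [(1, 1), (1, 1), (1, 2), (2, 1)],
)

def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

distinct_one = run("SELECT DISTINCT user_id FROM Test ORDER BY user_id")
distinct_two = run(
    "SELECT DISTINCT user_id, user_type FROM Test ORDER BY user_id, user_type"
)
group_one = run("SELECT user_id FROM Test GROUP BY user_id ORDER BY user_id")
group_two = run(
    "SELECT user_id, user_type FROM Test "
    "GROUP BY user_id, user_type ORDER BY user_id, user_type"
)

print(distinct_one)  # [(1,), (2,)]
print(distinct_two)  # [(1, 1), (1, 2), (2, 1)]
print(distinct_one == group_one, distinct_two == group_two)  # True True
```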
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)
