
How to Remove Duplicate Rows in SQL Using DISTINCT, GROUP BY, and ROW_NUMBER

This guide explains three SQL techniques for deduplicating records: DISTINCT, GROUP BY, and the ROW_NUMBER window function. It compares their behavior, notes where performance can differ, and provides concrete query examples based on a sample Task table and an additional Test table.

Liangxu Linux

Problem Overview

When extracting data with SQL, duplicate rows often appear. For metrics such as unique visitors (UV) you need to remove these duplicates. The example uses a Task table with columns task_id, order_id and start_time. Because a task can correspond to multiple orders, task_id is not unique.

Task table schema

1. DISTINCT

-- List all unique task_id values
SELECT DISTINCT task_id
FROM Task;

-- Count distinct task_id values
SELECT COUNT(DISTINCT task_id) AS task_num
FROM Task;

Characteristics

Removes duplicate rows across all selected columns.

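Performance depends on the engine. Many optimizers run SELECT DISTINCT and the equivalent GROUP BY with the same plan, but COUNT(DISTINCT ...) can be slow on large tables in some engines (for example, older Hive versions may route it through a single reducer).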

Commonly combined with COUNT to obtain the number of unique records.
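The behavior listed above can be checked with a small sketch. It uses Python's built-in sqlite3 module only as a convenient way to run the SQL; the sample rows, including the NULL task_id, are hypothetical and not taken from the original Task table.

```python
import sqlite3

# In-memory database with made-up rows (assumption: column types are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Task (task_id INTEGER, order_id INTEGER, start_time TEXT)")
conn.executemany(
    "INSERT INTO Task VALUES (?, ?, ?)",
    [
        (1, 101, "2024-01-01 09:00"),
        (1, 102, "2024-01-01 09:00"),
        (2, 201, "2024-01-02 10:00"),
        (None, 301, "2024-01-03 11:00"),
    ],
)

# DISTINCT on one column: one row per task_id, with NULL kept as a single row.
task_ids = conn.execute(
    "SELECT DISTINCT task_id FROM Task ORDER BY task_id"
).fetchall()

# DISTINCT on two columns: deduplicates on the (task_id, start_time) pair.
pairs = conn.execute(
    "SELECT DISTINCT task_id, start_time FROM Task ORDER BY task_id"
).fetchall()

# COUNT(DISTINCT ...) ignores NULL, so it reports one fewer than the row list above.
task_num = conn.execute("SELECT COUNT(DISTINCT task_id) FROM Task").fetchone()[0]

print(task_ids)
print(pairs)
print(task_num)
```

Note that SELECT DISTINCT returns the NULL task_id as a row while COUNT(DISTINCT task_id) does not count it, so the two can differ by one when the column is nullable.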

2. GROUP BY

-- List unique task_id values (all NULLs collapse into a single group)
SELECT task_id
FROM Task
GROUP BY task_id;

-- Count distinct task_id using a sub-query (COUNT(task_id) skips the NULL group)
SELECT COUNT(task_id) AS task_num
FROM (
    SELECT task_id
    FROM Task
    GROUP BY task_id
) tmp;

Characteristics

Deduplicates only the columns listed in the GROUP BY clause.

Other columns in the SELECT list must be aggregated or omitted.
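As a rough illustration of the two points above, the sketch below groups by task_id and aggregates the remaining columns. It runs the SQL through Python's sqlite3 module; the rows and the aliases order_num and first_start are hypothetical.

```python
import sqlite3

# In-memory database with made-up rows, including one NULL task_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Task (task_id INTEGER, order_id INTEGER, start_time TEXT)")
conn.executemany(
    "INSERT INTO Task VALUES (?, ?, ?)",
    [
        (1, 101, "2024-01-01 09:00"),
        (1, 102, "2024-01-01 09:30"),
        (2, 201, "2024-01-02 10:00"),
        (None, 301, "2024-01-03 11:00"),
    ],
)

# One output row per task_id; the other columns must be aggregated.
per_task = conn.execute(
    """
    SELECT task_id,
           COUNT(order_id) AS order_num,
           MIN(start_time) AS first_start
    FROM Task
    GROUP BY task_id
    ORDER BY task_id
    """
).fetchall()

# COUNT(*) counts every group, including the NULL group.
group_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT task_id FROM Task GROUP BY task_id) tmp"
).fetchone()[0]

# COUNT(task_id) skips the NULL group, matching COUNT(DISTINCT task_id).
non_null_count = conn.execute(
    "SELECT COUNT(task_id) FROM (SELECT task_id FROM Task GROUP BY task_id) tmp"
).fetchone()[0]

print(per_task)
print(group_count, non_null_count)
```

Choosing between COUNT(*) and COUNT(task_id) in the outer query decides whether the NULL group is included in the total.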

3. ROW_NUMBER (Window Function)

-- Count unique task_id values: number the rows within each task_id
-- (ordered by start_time) and count only the rows where rn = 1
SELECT COUNT(CASE WHEN rn = 1 THEN task_id END) AS task_num
FROM (
    SELECT task_id,
           ROW_NUMBER() OVER (PARTITION BY task_id ORDER BY start_time) AS rn
    FROM Task
) tmp;

Characteristics

Requires a database that supports window functions (e.g., Hive, Oracle, PostgreSQL, SQL Server).

The PARTITION BY clause groups rows by task_id; ORDER BY determines the row order inside each group.

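Its main advantage is flexibility rather than guaranteed speed: filtering on rn = 1 keeps one complete row per task_id, including columns outside the PARTITION BY list. The engine still has to partition and sort the table, so whether it outperforms DISTINCT or GROUP BY depends on the engine and the data; compare execution plans before relying on it.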
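A common use of ROW_NUMBER is keeping one whole row per key rather than just counting keys. The following is a minimal sketch, assuming hypothetical rows and using Python's sqlite3 module (window functions need SQLite 3.25 or later). The extra order_id tie-breaker in ORDER BY is an addition for this example so that the surviving row is deterministic when start_time values tie.

```python
import sqlite3

# In-memory database with made-up rows; task 2 has two orders with the same start_time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Task (task_id INTEGER, order_id INTEGER, start_time TEXT)")
conn.executemany(
    "INSERT INTO Task VALUES (?, ?, ?)",
    [
        (1, 102, "2024-01-01 09:30"),
        (1, 101, "2024-01-01 09:00"),
        (2, 201, "2024-01-02 10:00"),
        (2, 202, "2024-01-02 10:00"),
    ],
)

# Number rows inside each task_id by start_time, then keep only the first one.
earliest = conn.execute(
    """
    SELECT task_id, order_id, start_time
    FROM (
        SELECT task_id, order_id, start_time,
               ROW_NUMBER() OVER (
                   PARTITION BY task_id
                   ORDER BY start_time, order_id
               ) AS rn
        FROM Task
    ) tmp
    WHERE rn = 1
    ORDER BY task_id
    """
).fetchall()

print(earliest)
```

Without a tie-breaker, rows with equal start_time can be numbered in any order, so the row that receives rn = 1 may vary between runs or engines.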

4. Comparison Using a Test Table

-- DISTINCT examples
SELECT DISTINCT user_id FROM Test;                     -- returns 1, 2
SELECT DISTINCT user_id, user_type FROM Test;           -- returns (1,1), (1,2), (2,1)

-- GROUP BY examples
SELECT user_id FROM Test GROUP BY user_id;              -- returns 1, 2
SELECT user_id, user_type FROM Test GROUP BY user_id, user_type; -- returns (1,1), (1,2), (2,1)

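In these examples DISTINCT and GROUP BY return the same rows, because the GROUP BY list matches the select list. The practical difference is that DISTINCT always applies to every selected column, whereas GROUP BY deduplicates on the grouped columns and lets you aggregate the rest.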
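The contents of the Test table are not reproduced in this article, so the sketch below assumes a hypothetical set of rows chosen to be consistent with the results noted in the query comments. It runs the queries through Python's sqlite3 module and checks that DISTINCT and GROUP BY agree.

```python
import sqlite3

# Assumed rows: chosen so the distinct pairs are (1,1), (1,2), (2,1).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Test (user_id INTEGER, user_type INTEGER)")
conn.executemany(
    "INSERT INTO Test VALUES (?, ?)",
    [(1, 1), (1, 1), (1, 2), (2, 1), (2, 1)],
)

distinct_ids = conn.execute(
    "SELECT DISTINCT user_id FROM Test ORDER BY user_id"
).fetchall()

distinct_pairs = conn.execute(
    "SELECT DISTINCT user_id, user_type FROM Test ORDER BY user_id, user_type"
).fetchall()

grouped_pairs = conn.execute(
    """
    SELECT user_id, user_type
    FROM Test
    GROUP BY user_id, user_type
    ORDER BY user_id, user_type
    """
).fetchall()

print(distinct_ids)
print(distinct_pairs)
print(grouped_pairs == distinct_pairs)
```

The ORDER BY clauses are there only to make the output order predictable; neither DISTINCT nor GROUP BY guarantees an ordering on its own.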
