Techniques for Removing Duplicate Rows and Adding Unique Constraints in PostgreSQL
This article examines a common scenario in which duplicate rows prevent the creation of a unique constraint. It presents eight SQL-based methods, covering array functions, window functions, NOT IN, NOT EXISTS, combined IN/NOT IN, combined EXISTS/NOT EXISTS, generated per-group deletes, and a table-copy approach, along with test data, performance comparisons, and practical guidance for DBA operations.
Problem Statement
It is common to encounter a case where, after a database instance has been running for a while, a unique constraint (or index) needs to be added to one or more columns of a table.
When attempting to create the unique constraint, PostgreSQL aborts with an error such as ERROR: could not create unique index ..., followed by DETAIL: Key (col)=(value) is duplicated.
Therefore the duplicate rows must be identified and only one copy retained while the redundant rows are deleted.
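For example, with a table test whose info column contains duplicates (the table and column names used throughout this article; the constraint name uk_test_info is illustrative), the failing attempt looks like:

```sql
ALTER TABLE test ADD CONSTRAINT uk_test_info UNIQUE (info);
-- ERROR:  could not create unique index "uk_test_info"
-- DETAIL:  Key (info)=(...) is duplicated.
```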
Analysis
The key to deleting data is locating the rows that should be removed and the rows that should be kept.
Solution
Assume the business requirement is to keep, for each group of duplicate info values in the test table, the row with the smallest id.
Method 1 – Forward Thinking, using ARRAY
Leverage PostgreSQL's powerful array data type and its functions to locate rows to delete in a single pass.
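A sketch of this approach against the assumed test(id, info) table; the open-ended array slice [2:] requires PostgreSQL 9.6 or later:

```sql
-- per info group, collect ids in ascending order
-- and delete every id except the smallest
DELETE FROM test
WHERE id IN (
    SELECT unnest((array_agg(id ORDER BY id))[2:])
    FROM test
    GROUP BY info
    HAVING count(*) > 1
);
```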
Method 2 – Forward Thinking, using Window Functions
The same idea as Method 1, but demonstrated with PostgreSQL window functions.
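A sketch using row_number(), again assuming the test(id, info) table:

```sql
-- number rows within each info group by ascending id; rn = 1 is the keeper
DELETE FROM test
WHERE id IN (
    SELECT id
    FROM (SELECT id,
                 row_number() OVER (PARTITION BY info ORDER BY id) AS rn
          FROM test) t
    WHERE t.rn > 1
);
```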
Method 3 – Reverse Thinking, using NOT IN
Exclude the rows that should be kept, thereby locating the rows to delete.
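A sketch of the NOT IN form, assuming the test(id, info) table:

```sql
-- keep the smallest id of every info group; delete everything else
DELETE FROM test
WHERE id NOT IN (
    SELECT min(id)
    FROM test
    GROUP BY info
);
```

Note that NOT IN misbehaves when the sub-query can return NULL; that cannot happen here as long as id is NOT NULL.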
Method 4 – Reverse Thinking, using NOT EXISTS
Same logic as Method 3, expressed with a NOT EXISTS sub‑query.
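One way to render the same logic with NOT EXISTS (a sketch, assuming the test(id, info) table):

```sql
-- delete rows whose id is not the smallest id of their info group
DELETE FROM test a
WHERE NOT EXISTS (
    SELECT 1
    FROM (SELECT info, min(id) AS min_id FROM test GROUP BY info) b
    WHERE b.info = a.info
      AND b.min_id = a.id
);
```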
Method 5 – Combined Forward/Reverse, using IN and NOT IN
First locate the large set of groups that contain duplicates, then subtract the small set of rows that should be kept.
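A sketch of the two-condition form, assuming the test(id, info) table:

```sql
-- condition 1: restrict to info groups that actually contain duplicates
-- condition 2: within those groups, spare the row with the smallest id
DELETE FROM test
WHERE info IN (
    SELECT info FROM test GROUP BY info HAVING count(*) > 1
)
AND id NOT IN (
    SELECT min(id) FROM test GROUP BY info HAVING count(*) > 1
);
```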
Method 6 – Combined Forward/Reverse, using EXISTS and NOT EXISTS
Same concept as Method 5, expressed with EXISTS/NOT EXISTS.
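One possible rendering with correlated sub-queries (a sketch, assuming the test(id, info) table):

```sql
-- EXISTS: the row has at least one duplicate in its info group
-- NOT EXISTS: the row is not the group's keeper (smallest id)
DELETE FROM test a
WHERE EXISTS (
    SELECT 1 FROM test b
    WHERE b.info = a.info AND b.id <> a.id
)
AND NOT EXISTS (
    SELECT 1
    FROM (SELECT info, min(id) AS min_id FROM test GROUP BY info) c
    WHERE c.info = a.info AND c.min_id = a.id
);
```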
Method 7 – Generated Per‑Group SQL
Use a query to generate one short DELETE statement per duplicate group; each generated statement keeps the row with the smallest id in its group and deletes the rest, and the statements are then executed one by one.
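One way to generate the per-group statements with format() (a sketch; it assumes info is a text-like column so that %L quotes it correctly):

```sql
-- emit one short DELETE per duplicated info value;
-- run the generated statements one at a time (e.g. with psql's \gexec)
SELECT format('DELETE FROM test WHERE info = %L AND id <> %s;',
              info, min(id))
FROM test
GROUP BY info
HAVING count(*) > 1;
```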
Method 8 – Copy Data to a New Table
If a short write pause is acceptable, copy the distinct rows to a new table and then swap the table names. Wrapping these steps in a transaction guarantees the operation is atomic and avoids a window during which no table is available.
Note: For comparison purposes Method 8 keeps the id column; if id need not be retained, GROUP BY can be slightly faster than DISTINCT.
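A sketch of the copy-and-swap using PostgreSQL's DISTINCT ON (the names test_new and test_old are illustrative):

```sql
BEGIN;
-- DISTINCT ON keeps exactly one row per info: the one with the smallest id
CREATE TABLE test_new AS
    SELECT DISTINCT ON (info) id, info
    FROM test
    ORDER BY info, id;
ALTER TABLE test     RENAME TO test_old;
ALTER TABLE test_new RENAME TO test;
COMMIT;
-- recreate indexes and constraints on the new table, then drop test_old
```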
Test Data
Because query performance varies with data distribution, three test data sets were generated for comparison.
Data Generation
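The original data sets are not reproduced here; a generator along these lines (the table layout, row count, and value range are assumptions, not the article's original figures) yields a table with plenty of duplicated info values:

```sql
CREATE TABLE test (
    id   serial PRIMARY KEY,
    info text
);
-- one million rows drawn from roughly 100,000 distinct values
INSERT INTO test (info)
SELECT 'info_' || (random() * 100000)::int
FROM generate_series(1, 1000000);
```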
Data Distribution
Method Comparison
Analysis of the table shows:
• Forward‑thinking methods (1, 2) have average execution time that grows with the amount of duplicate data; when duplicates are few, Method 2 is recommended.
• Reverse‑thinking methods (3, 4) actually become faster as duplicate volume increases; when many duplicates exist, Method 4 is preferred.
• Combined approaches (5, 6) do not provide a performance advantage because they require two sub‑queries.
• Method 7 has the longest total execution time (it scales with the number of generated DELETE statements), but it minimizes impact on the DB instance by breaking a long lock on many rows into many very short locks on individual rows.
• In high‑load production environments, Method 7 is recommended.
• Method 8 involves extra steps and may be slower in production because the whole table, including all of its indexes, must be rebuilt. However, the rebuild also acts as full table maintenance: for tables with heavy UPDATE/DELETE activity it shrinks the total size (indexes included) and can yield noticeable performance gains.
Conclusion
Review the DELETE syntax:
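An abridged form of the synopsis in the PostgreSQL manual:

```sql
[ WITH [ RECURSIVE ] with_query [, ...] ]
DELETE FROM [ ONLY ] table_name [ * ] [ [ AS ] alias ]
    [ USING using_list ]
    [ WHERE condition | WHERE CURRENT OF cursor_name ]
    [ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] ]
```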
Other practical techniques include using WITH queries to build temporary tables, employing USING/using_list instead of sub‑queries, or encapsulating Method 1 in a stored procedure (not recommended for large transactions).
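For instance, the USING clause turns the dedup into a self-join with no sub-query (a sketch against the assumed test(id, info) table):

```sql
-- delete any row that has a same-info twin with a smaller id
DELETE FROM test a
USING test b
WHERE a.info = b.info
  AND a.id > b.id;
```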
The overall strategy remains the same: identify duplicate rows, keep the desired row per group, delete the rest, then create the unique index.
DBA Daily‑Operation Selection Principles
For high‑pressure online production databases (especially transaction systems), choose the method that has the smallest impact on production.
For lower‑pressure production or development instances, prefer the fastest‑executing single‑SQL method.
After duplicate rows are removed, the unique index can be created as follows:
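A sketch, reusing the names assumed earlier in this article:

```sql
-- CONCURRENTLY avoids blocking concurrent writes,
-- but cannot be run inside a transaction block
CREATE UNIQUE INDEX CONCURRENTLY uk_test_info ON test (info);
-- optionally promote the index to a named constraint (PostgreSQL 9.1+):
ALTER TABLE test ADD CONSTRAINT uk_test_info UNIQUE USING INDEX uk_test_info;
```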
This resolves the problem; readers are encouraged to share any more efficient or clever solutions.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.