Techniques for Removing Duplicate Rows and Adding Unique Constraints in PostgreSQL
This article examines a common scenario in which duplicate rows prevent the creation of a unique constraint. It presents eight SQL-based methods, covering array functions, window functions, NOT IN, NOT EXISTS, combined IN/NOT IN, combined EXISTS/NOT EXISTS, generated per-group deletes, and a table-copy approach, along with test data, performance comparisons, and practical guidance for DBA operations.
Problem Statement
It is common to encounter a case where, after a database instance has been running for a while, a unique constraint (or index) needs to be added to one or more columns of a table.
When attempting to create the unique constraint, PostgreSQL aborts with an error such as ERROR: could not create unique index ..., followed by DETAIL: Key (col)=(value) is duplicated.
Therefore the duplicate rows must be identified and only one copy retained while the redundant rows are deleted.
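For example, with a table test whose info column contains duplicates (the table and column names used throughout this article; the constraint name uk_test_info is illustrative), the failing attempt looks like:

```sql
ALTER TABLE test ADD CONSTRAINT uk_test_info UNIQUE (info);
-- ERROR:  could not create unique index "uk_test_info"
-- DETAIL:  Key (info)=(...) is duplicated.
```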
Analysis
The key to deleting data is locating the rows that should be removed and the rows that should be kept.
Solution
Assume the business requirement is to keep, for each group of duplicate info values in the test table, the row with the smallest id.
Method 1 – Forward Thinking, using ARRAY
Leverage PostgreSQL's powerful array data type and its functions to locate rows to delete in a single pass.
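A sketch of this approach against the assumed test(id, info) table; the open-ended array slice [2:] requires PostgreSQL 9.6 or later:

```sql
-- per info group, collect ids in ascending order
-- and delete every id except the smallest
DELETE FROM test
WHERE id IN (
    SELECT unnest((array_agg(id ORDER BY id))[2:])
    FROM test
    GROUP BY info
    HAVING count(*) > 1
);
```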
Method 2 – Forward Thinking, using Window Functions
The same idea as Method 1, but demonstrated with PostgreSQL window functions.
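A sketch using row_number(), again assuming the test(id, info) table:

```sql
-- number rows within each info group by ascending id; rn = 1 is the keeper
DELETE FROM test
WHERE id IN (
    SELECT id
    FROM (SELECT id,
                 row_number() OVER (PARTITION BY info ORDER BY id) AS rn
          FROM test) t
    WHERE t.rn > 1
);
```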
Method 3 – Reverse Thinking, using NOT IN
Exclude the rows that should be kept, thereby locating the rows to delete.
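A sketch of the NOT IN form, assuming the test(id, info) table:

```sql
-- keep the smallest id of every info group; delete everything else
DELETE FROM test
WHERE id NOT IN (
    SELECT min(id)
    FROM test
    GROUP BY info
);
```

Note that NOT IN misbehaves when the sub-query can return NULL; that cannot happen here as long as id is NOT NULL.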
Method 4 – Reverse Thinking, using NOT EXISTS
Same logic as Method 3, expressed with a NOT EXISTS sub‑query.
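One way to render the same logic with NOT EXISTS (a sketch, assuming the test(id, info) table):

```sql
-- delete rows whose id is not the smallest id of their info group
DELETE FROM test a
WHERE NOT EXISTS (
    SELECT 1
    FROM (SELECT info, min(id) AS min_id FROM test GROUP BY info) b
    WHERE b.info = a.info
      AND b.min_id = a.id
);
```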
Method 5 – Combined Forward/Reverse, using IN and NOT IN
First locate the large set of groups that contain duplicates, then subtract the small set of rows that should be kept.
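A sketch of the two-condition form, assuming the test(id, info) table:

```sql
-- condition 1: restrict to info groups that actually contain duplicates
-- condition 2: within those groups, spare the row with the smallest id
DELETE FROM test
WHERE info IN (
    SELECT info FROM test GROUP BY info HAVING count(*) > 1
)
AND id NOT IN (
    SELECT min(id) FROM test GROUP BY info HAVING count(*) > 1
);
```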
Method 6 – Combined Forward/Reverse, using EXISTS and NOT EXISTS
Same concept as Method 5, expressed with EXISTS/NOT EXISTS.
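One possible rendering with correlated sub-queries (a sketch, assuming the test(id, info) table):

```sql
-- EXISTS: the row has at least one duplicate in its info group
-- NOT EXISTS: the row is not the group's keeper (smallest id)
DELETE FROM test a
WHERE EXISTS (
    SELECT 1 FROM test b
    WHERE b.info = a.info AND b.id <> a.id
)
AND NOT EXISTS (
    SELECT 1
    FROM (SELECT info, min(id) AS min_id FROM test GROUP BY info) c
    WHERE c.info = a.info AND c.min_id = a.id
);
```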
Method 7 – Generated Per‑Group SQL
Use a query to generate one short DELETE statement per duplicate group; each generated statement keeps the row with the smallest id in its group and deletes the rest, and the statements are then executed one by one.
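One way to generate the per-group statements with format() (a sketch; it assumes info is a text-like column so that %L quotes it correctly):

```sql
-- emit one short DELETE per duplicated info value;
-- run the generated statements one at a time (e.g. with psql's \gexec)
SELECT format('DELETE FROM test WHERE info = %L AND id <> %s;',
              info, min(id))
FROM test
GROUP BY info
HAVING count(*) > 1;
```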
Method 8 – Copy Data to a New Table
If a short write pause is acceptable, copy the distinct rows to a new table and then swap the table names. Wrapping these steps in a transaction guarantees the operation is atomic and avoids a window during which no table is available.
Note: For comparison purposes Method 8 keeps the id column; if id need not be retained, GROUP BY can be slightly faster than DISTINCT.
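A sketch of the copy-and-swap using PostgreSQL's DISTINCT ON (the names test_new and test_old are illustrative):

```sql
BEGIN;
-- DISTINCT ON keeps exactly one row per info: the one with the smallest id
CREATE TABLE test_new AS
    SELECT DISTINCT ON (info) id, info
    FROM test
    ORDER BY info, id;
ALTER TABLE test     RENAME TO test_old;
ALTER TABLE test_new RENAME TO test;
COMMIT;
-- recreate indexes and constraints on the new table, then drop test_old
```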
Test Data
Because query performance varies with data distribution, three test data sets were generated for comparison.
Data Generation
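The original data sets are not reproduced here; a generator along these lines (the table layout, row count, and value range are assumptions, not the article's original figures) yields a table with plenty of duplicated info values:

```sql
CREATE TABLE test (
    id   serial PRIMARY KEY,
    info text
);
-- one million rows drawn from roughly 100,000 distinct values
INSERT INTO test (info)
SELECT 'info_' || (random() * 100000)::int
FROM generate_series(1, 1000000);
```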
Data Distribution
Method Comparison
Analysis of the table shows:
• Forward‑thinking methods (1, 2) have average execution time that grows with the amount of duplicate data; when duplicates are few, Method 2 is recommended.
• Reverse‑thinking methods (3, 4) actually become faster as duplicate volume increases; when many duplicates exist, Method 4 is preferred.
• Combined approaches (5, 6) do not provide a performance advantage because they require two sub‑queries.
• Method 7 has the longest total execution time (it scales with the number of generated DELETE statements), but it minimizes impact on the DB instance by breaking a long lock on many rows into many very short locks on individual rows.
• In high‑load production environments, Method 7 is recommended.
• Method 8 involves extra steps and may be slower in production because the whole table, including all of its indexes, must be rebuilt. However, the rebuild also acts as full table maintenance: for tables with heavy UPDATE/DELETE activity it shrinks the total size (indexes included) and can yield noticeable performance gains.
Conclusion
Review the DELETE syntax:
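An abridged form of the synopsis in the PostgreSQL manual:

```sql
[ WITH [ RECURSIVE ] with_query [, ...] ]
DELETE FROM [ ONLY ] table_name [ * ] [ [ AS ] alias ]
    [ USING using_list ]
    [ WHERE condition | WHERE CURRENT OF cursor_name ]
    [ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] ]
```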
Other practical techniques include using WITH queries to build temporary tables, employing USING/using_list instead of sub‑queries, or encapsulating Method 1 in a stored procedure (not recommended for large transactions).
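For instance, the USING clause turns the dedup into a self-join with no sub-query (a sketch against the assumed test(id, info) table):

```sql
-- delete any row that has a same-info twin with a smaller id
DELETE FROM test a
USING test b
WHERE a.info = b.info
  AND a.id > b.id;
```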
The overall strategy remains the same: identify duplicate rows, keep the desired row per group, delete the rest, then create the unique index.
DBA Daily‑Operation Selection Principles
For high‑pressure online production databases (especially transaction systems), choose the method that has the smallest impact on production.
For lower‑pressure production or development instances, prefer the fastest‑executing single‑SQL method.
After duplicate rows are removed, the unique index can be created as follows:
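A sketch, reusing the names assumed earlier in this article:

```sql
-- CONCURRENTLY avoids blocking concurrent writes,
-- but cannot be run inside a transaction block
CREATE UNIQUE INDEX CONCURRENTLY uk_test_info ON test (info);
-- optionally promote the index to a named constraint (PostgreSQL 9.1+):
ALTER TABLE test ADD CONSTRAINT uk_test_info UNIQUE USING INDEX uk_test_info;
```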
This resolves the problem; readers are encouraged to share any more efficient or clever solutions.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.