Deleting Duplicate Records in MySQL with DELETE JOIN and Subqueries
This article explains how to identify and remove duplicate rows in a MySQL table—keeping a single record per duplicate group—by using a self‑join DELETE statement or a nested subquery approach, complete with sample table creation and data insertion scripts.
Duplicate records in a database table are common due to data migration, bugs, or manual testing inserts, and they often violate business rules; the article shows how to delete the extra rows while preserving one instance using a single SQL statement in MySQL.
1. Preparation
Create a sample t_customer table and insert test data that includes duplicate entries based on cust_name , gender , and email fields.
CREATE TABLE `t_customer` ( `id` int(11) NOT NULL AUTO_INCREMENT COMMENT '主键', `cust_name` varchar(32) NOT NULL COMMENT '名字', `gender` varchar(10) NOT NULL COMMENT '性别', `email` varchar(32) NOT NULL COMMENT 'email', PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
INSERT INTO `t_customer` VALUES ('1','Tom','Male','[email protected]'), ('2','Jennifer','Female','[email protected]'), ('3','Tom','Male','[email protected]'), ('4','John','Male','[email protected]'), ('5','Jennifer','Female','[email protected]'), ('6','Tom','Male','[email protected]');
According to the rule that rows with identical cust_name , gender , and email are considered duplicates, the rows with id 1, 3, 6 and 2, 5 are duplicates.
2. Method 1 – DELETE JOIN
Use a self‑join to delete the rows with a smaller id , keeping the row with the largest id for each duplicate group:
DELETE c1 FROM t_customer c1 INNER JOIN t_customer c2 WHERE c1.id < c2.id AND c1.cust_name = c2.cust_name AND c1.gender = c2.gender AND c1.email = c2.email;
After execution, only one row per duplicate group remains.
3. Method 2 – Subquery
Use GROUP BY with HAVING COUNT(*) > 1 to find duplicate groups, then delete rows whose id is not the maximum within each group:
DELETE FROM t_customer WHERE (cust_name, gender, email) IN ( SELECT cust_name, gender, email FROM ( SELECT cust_name, gender, email FROM t_customer GROUP BY cust_name, gender, email HAVING COUNT(*) > 1 ) t1 ) AND id NOT IN ( SELECT id FROM ( SELECT MAX(id) AS id FROM t_customer GROUP BY cust_name, gender, email HAVING COUNT(*) > 1 ) t2 );
This approach is more complex due to nested subqueries, but achieves the same result.
When running such statements on a production database, always back up the data and test thoroughly to avoid accidental data loss.
Finally, the author asks readers to like, share, and follow the public account to support further content creation.
Recommended Reading:
One SQL solves the problem – interviewers impressed
MySQL table relationship updates
High‑frequency interview question: multithreaded ordered printing
Comprehensive guide to MySQL index loss scenarios
Design pattern usage tips
Full-Stack Internet Architecture
Introducing full-stack Internet architecture technologies centered on Java
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.