Databases 6 min read

Deleting Duplicate Records in MySQL with DELETE JOIN and Subqueries

This article explains how to identify and remove duplicate rows in a MySQL table—keeping a single record per duplicate group—by using a self‑join DELETE statement or a nested subquery approach, complete with sample table creation and data insertion scripts.

Full-Stack Internet Architecture

Aug 23, 2023

Deleting Duplicate Records in MySQL with DELETE JOIN and Subqueries

Duplicate records in a database table are common due to data migration, bugs, or manual testing inserts, and they often violate business rules; the article shows how to delete the extra rows while preserving one instance using a single SQL statement in MySQL.

1. Preparation

Create a sample t_customer table and insert test data that includes duplicate entries based on cust_name, gender, and email fields.

CREATE TABLE `t_customer` (
  `id` int(11) NOT NULL AUTO_INCREMENT COMMENT '主键',
  `cust_name` varchar(32) NOT NULL COMMENT '名字',
  `gender` varchar(10) NOT NULL COMMENT '性别',
  `email` varchar(32) NOT NULL COMMENT 'email',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

INSERT INTO `t_customer` VALUES
('1','Tom','Male','[email protected]'),
('2','Jennifer','Female','[email protected]'),
('3','Tom','Male','[email protected]'),
('4','John','Male','[email protected]'),
('5','Jennifer','Female','[email protected]'),
('6','Tom','Male','[email protected]');

According to the rule that rows with identical cust_name, gender, and email are considered duplicates, the rows with id 1, 3, 6 and 2, 5 are duplicates.

2. Method 1 – DELETE JOIN

Use a self‑join to delete the rows with a smaller id, keeping the row with the largest id for each duplicate group:

DELETE c1 FROM t_customer c1
INNER JOIN t_customer c2
WHERE c1.id < c2.id
  AND c1.cust_name = c2.cust_name
  AND c1.gender = c2.gender
  AND c1.email = c2.email;

After execution, only one row per duplicate group remains.

3. Method 2 – Subquery

Use GROUP BY with HAVING COUNT(*) > 1 to find duplicate groups, then delete rows whose id is not the maximum within each group:

DELETE FROM t_customer WHERE (cust_name, gender, email) IN (
  SELECT cust_name, gender, email FROM (
    SELECT cust_name, gender, email FROM t_customer
    GROUP BY cust_name, gender, email
    HAVING COUNT(*) > 1
  ) t1
) AND id NOT IN (
  SELECT id FROM (
    SELECT MAX(id) AS id FROM t_customer
    GROUP BY cust_name, gender, email
    HAVING COUNT(*) > 1
  ) t2
);

This approach is more complex due to nested subqueries, but achieves the same result.

When running such statements on a production database, always back up the data and test thoroughly to avoid accidental data loss.

Finally, the author asks readers to like, share, and follow the public account to support further content creation.

Full-Stack Internet Architecture

Introducing full-stack Internet architecture technologies centered on Java

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.