Databases 5 min read

Efficient Methods for Removing Duplicate Records in MySQL Tables

This article explains why a naïve Python‑based row‑deletion approach is slow for large MySQL tables and provides step‑by‑step SQL techniques—including identifying duplicate names, handling MySQL’s update‑from‑same‑table limitation, and deleting duplicates while preserving a single record per group—complete with executable code examples.

Laravel Tech Community

Mar 29, 2022

Efficient Methods for Removing Duplicate Records in MySQL Tables

The author encountered six tables with duplicate data, two of which contain over 960,000 and 300,000 rows respectively, and found a previously used Python script too slow (about one second per row, estimating eight hours for 20,000 duplicates).

After reflecting on the need for deeper problem‑solving, the article proceeds to detail the deduplication steps using pure SQL.

CREATE TABLE `animal` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(20) DEFAULT NULL,
  `age` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

INSERT INTO `pilipa_dds`.`student` (`id`,`name`,`age`) VALUES ('1','cat','12');
INSERT INTO `pilipa_dds`.`student` (`id`,`name`,`age`) VALUES ('2','dog','13');
INSERT INTO `pilipa_dds`.`student` (`id`,`name`,`age`) VALUES ('3','camel','25');
INSERT INTO `pilipa_dds`.`student` (`id`,`name`,`age`) VALUES ('4','cat','32');
INSERT INTO `pilipa_dds`.`student` (`id`,`name`,`age`) VALUES ('5','dog','42');

The goal is to remove rows with identical name values.

First, identify duplicate names:

SELECT name, COUNT(1)
FROM student
GROUP BY name
HAVING COUNT(1) > 1;

The result shows that cat and dog each appear twice.

Attempting a direct delete fails with MySQL error 1093 because the same table is used in the subquery:

DELETE FROM student
WHERE NAME IN (
  SELECT NAME
  FROM student
  GROUP BY NAME
  HAVING COUNT(1) > 1
);

MySQL does not allow updating a table while selecting from it. The workaround is to materialize the subquery as a derived table:

DELETE FROM student
WHERE NAME IN (
  SELECT t.NAME
  FROM (
    SELECT NAME
    FROM student
    GROUP BY NAME
    HAVING COUNT(1) > 1
  ) t
);

To keep one record per duplicate group, first select the minimum id for each name (the row to retain), then delete rows whose id is not in that set:

SELECT *
FROM student
WHERE id NOT IN (
  SELECT t.id
  FROM (
    SELECT MIN(id) AS id
    FROM student
    GROUP BY name
  ) t
);

Finally, replace the SELECT with DELETE to perform the actual removal:

DELETE FROM student
WHERE id NOT IN (
  SELECT t.id
  FROM (
    SELECT MIN(id) AS id
    FROM student
    GROUP BY name
  ) t
);

This approach runs very quickly even on tables with over 900,000 rows. The article concludes with a celebratory note.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

SQL mysql data cleaning duplicate removal

Written by

Laravel Tech Community

Specializing in Laravel development, we continuously publish fresh content and grow alongside the elegant, stable Laravel framework.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.