Removing Duplicate Data in MySQL and Keeping Only One Record
This article describes how to identify and delete duplicate rows in MySQL tables—first showing a naive Python‑based approach, then presenting efficient SQL queries that locate duplicate names, explain MySQL's limitation on deleting from the same table being queried, and provide a safe sub‑query solution to remove all duplicates while preserving a single representative row.
While working overtime to resolve an online database issue, the author discovered duplicate data across six tables, two of which contain over 960,000 and 300,000 rows respectively. A previously used Python script was reused to connect to the database, fetch duplicate rows, and delete them one by one.
The script proved extremely slow—about one second per row—so deleting roughly 20,000 duplicate rows would take around eight hours, prompting a reflection on the drawbacks of blindly reusing old solutions without reconsideration.
The author then outlines the detailed deduplication steps.
<code>CREATE TABLE `animal` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(20) DEFAULT NULL,
`age` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('1','cat','12');
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('2','dog','13');
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('3','camel','25');
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('4','cat','32');
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('5','dog','42');</code>The goal is to remove rows that have the same name value.
First, identify which names are duplicated:
<code>SELECT name, COUNT(1)
FROM student
GROUP BY name
HAVING COUNT(1) > 1;</code>The result shows that cat and dog each appear twice.
A generic delete statement such as:
<code>DELETE FROM student
WHERE name IN (
SELECT name FROM student GROUP BY name HAVING COUNT(1) > 1
);</code>fails with MySQL error 1093 because the target table is referenced in the sub‑query. MySQL does not allow updating (or deleting) a table while selecting from the same table in a nested query.
To work around this, the sub‑query is wrapped in an alias (a derived table) so that MySQL treats it as a separate source:
<code>DELETE FROM student
WHERE name IN (
SELECT t.name FROM (
SELECT name FROM student GROUP BY name HAVING COUNT(1) > 1
) t
);</code>To keep only one row per duplicate name, first select the smallest id for each name (these rows will be kept):
<code>SELECT * FROM student
WHERE id NOT IN (
SELECT t.id FROM (
SELECT MIN(id) AS id FROM student GROUP BY name
) t
);</code>Then replace the SELECT with DELETE to actually remove the unwanted duplicates:
<code>DELETE FROM student
WHERE id NOT IN (
SELECT t.id FROM (
SELECT MIN(id) AS id FROM student GROUP BY name
) t
);</code>This approach runs very quickly even on tables with hundreds of thousands of rows, and all duplicate data is removed while a single representative row for each name remains.
All done 👏👏👏👏~
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.