
Handling Duplicate Data in MySQL: Techniques and Examples

This article explains how to identify and remove various kinds of duplicate data in MySQL—including fully duplicated rows, records with duplicate non‑key columns, and unwanted whitespace inside fields—by using SQL statements, table cloning, OS utilities, and regular‑expression updates, with performance measurements for each method.

Aikesheng Open Source Community

Recently the author helped a client migrate data from DM (DaMeng) to MySQL and took the opportunity to discuss how to handle duplicate data in a database.

A database typically stores three kinds of data:

1) Data that has been strictly filtered (e.g., program-side validation, column CHECK constraints, triggers, stored procedures).
2) Raw data that has never been processed, which often includes garbage and duplicate records.
3) Duplicate data generated during SQL execution (e.g., NULLs introduced by outer joins).

This article focuses only on the second type: original duplicate data.

Scenario 1 – Completely duplicated rows (no primary key)

mysql-(ytt/3305)->show create table d1\G
*************************** 1. row ***************************
Table: d1
Create Table: CREATE TABLE `d1` (
`r1` int(11) DEFAULT NULL,
`r2` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

The table contains four million rows, three‑quarters of which are duplicates.

mysql-(ytt/3305)->select count(*) from d1 limit 2;
+----------+
| count(*) |
+----------+
|  4000000 |
+----------+
1 row in set (0.18 sec)

Distinct rows are only one million:

mysql-(ytt/3305)->select count(distinct r1,r2) from d1 ;
+-----------------------+
| count(distinct r1,r2) |
+-----------------------+
|               1000000 |
+-----------------------+
1 row in set (2.68 sec)

Deduplication can be done entirely inside MySQL:

mysql-(ytt/3305)->create table d2 like d1;
Query OK, 0 rows affected (0.01 sec)
mysql-(ytt/3305)->insert into d2 select distinct r1,r2 from d1;
Query OK, 1000000 rows affected (19.40 sec)
Records: 1000000  Duplicates: 0  Warnings: 0
mysql-(ytt/3305)->alter table d1 rename to d1_bak;
Query OK, 0 rows affected (0.00 sec)
mysql-(ytt/3305)->alter table d2 rename to d1;
Query OK, 0 rows affected (0.00 sec)
mysql-(ytt/3305)->drop table d1_bak;
Query OK, 0 rows affected (0.00 sec)
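
As a variant of the table-swap step, MySQL's RENAME TABLE accepts multiple renames in one statement and performs them atomically, so there is no window in which d1 does not exist. A sketch using the same table names as above:

```sql
-- Swap the deduplicated copy into place atomically,
-- then drop the old data in a separate step.
RENAME TABLE d1 TO d1_bak, d2 TO d1;
DROP TABLE d1_bak;
```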

The whole process took about 20 seconds. An alternative is to export the table, use OS tools, and re‑import:

mysql-(ytt/3305)->select * from d1 into outfile '/var/lib/mysql-files/d1.txt';
Query OK, 4000000 rows affected (1.84 sec)
root@ytt-pc:/var/lib/mysql-files# time cat d1.txt | sort -g | uniq > d1_uniq.txt
real    0m7.345s
user    0m7.528s
sys     0m0.272s
mysql-(ytt/3305)->truncate table d1;
Query OK, 0 rows affected (0.05 sec)
root@ytt-pc:/var/lib/mysql-files# mv d1_uniq.txt d1.txt
root@ytt-pc:/var/lib/mysql-files# time mysqlimport -uytt -pytt -P3305 -h 127.0.0.1 --use-threads=2 -vvv ytt /var/lib/mysql-files/d1.txt
mysqlimport: [Warning] Using a password on the command line interface can be insecure.
Connecting to 127.0.0.1
Selecting database ytt
Loading data from SERVER file: /var/lib/mysql-files/d1.txt into d1
ytt.d1: Records: 1000000  Deleted: 0  Skipped: 0  Warnings: 0
real    0m3.272s

Summing export (1.84 s), sort/uniq (7.3 s), and re-import (3.3 s), the OS-level approach took roughly 12.5 seconds, noticeably faster than the ~20-second pure-SQL method.
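
A minor variant: plain sort -u folds deduplication into the sort itself and compares whole lines, which is all that is needed when the goal is uniqueness rather than numeric order. A quick sketch on a synthetic file (file names are illustrative):

```shell
# Build a small file with duplicate two-column rows, then
# deduplicate whole lines in one step with sort -u.
printf '1\t1\n1\t1\n2\t2\n2\t2\n10\t3\n' > /tmp/dup_demo.txt
sort -u /tmp/dup_demo.txt > /tmp/dup_demo_uniq.txt
wc -l < /tmp/dup_demo_uniq.txt   # 3 distinct rows remain
```

Note that combining -u with -g would compare only the leading numeric key and could silently drop distinct rows that share a first column; plain sort -u compares the full line.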

Scenario 2 – Table with primary key but duplicate non‑key columns

mysql-(ytt/3305)->select * from d4 order by r1,r2 limit 5;
+---------+------+------+
| id      | r1   | r2   |
+---------+------+------+
|       1 |    1 |    1 |
| 3000001 |    1 |    1 |
| 2000001 |    1 |    1 |
| 1000001 |    1 |    1 |
|       2 |    2 |    2 |
+---------+------+------+

To keep only the row with the maximum id for each (r1,r2) pair, a single DELETE…JOIN statement is used:

mysql-(ytt/3305)->delete a from d4 a left join (select max(id) id from d4 group by r1, r2) b using(id) where b.id is null;
Query OK, 3000000 rows affected (23.29 sec)

After deletion the table contains one million distinct rows:

mysql-(ytt/3305)->select count(*) from d4;
+----------+
| count(*) |
+----------+
|  1000000 |
+----------+

A spot check confirms that only the row with the maximum id survives for each (r1, r2) pair.
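
On MySQL 8.0, the same cleanup can also be expressed with a window function instead of the GROUP BY subquery; a sketch assuming the same d4 schema (the derived table t is needed to avoid MySQL's restriction on selecting from the DELETE target):

```sql
-- Rank rows within each (r1, r2) group, highest id first,
-- and delete everything except the top-ranked row.
DELETE FROM d4
WHERE id IN (
    SELECT id FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY r1, r2 ORDER BY id DESC) AS rn
        FROM d4
    ) t
    WHERE t.rn > 1
);
```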

Scenario 3 – Duplicate characters inside field values (extra spaces, line breaks, tabs)

1) Trim leading/trailing whitespace using MySQL’s TRIM() function:

mysql-(ytt/3305)->update y11 set r1 = trim(r1), r2 = trim(r2);
Query OK, 5242880 rows affected (2 min 1.56 sec)

2) Collapse runs of internal whitespace to a single space with REGEXP_REPLACE() (available since MySQL 8.0):

mysql-(ytt/3305)->update y11 set r1 = regexp_replace(r1,'[[:space:]]+',' '), r2 = regexp_replace(r2,'[[:space:]]+',' ');
Query OK, 4194304 rows affected (1 min 32.05 sec)
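
The two passes can also be combined so that each row is rewritten only once; a sketch using the same y11 columns:

```sql
-- One pass: strip leading/trailing spaces, then squeeze
-- internal whitespace runs down to a single space.
UPDATE y11
SET r1 = regexp_replace(trim(r1), '[[:space:]]+', ' '),
    r2 = regexp_replace(trim(r2), '[[:space:]]+', ' ');
```

One caveat: TRIM() strips only space characters by default, so a leading tab or newline would be collapsed to a single leading space rather than removed outright.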

When MySQL's built-in functions are insufficient, the data can be exported, cleaned with Linux tools (e.g., sed -i 's/\s\+/ /g' y11.txt), and re-imported:

root@ytt-pc:/var/lib/mysql-files# time sed -i 's/\s\+/ /g' y11.txt
real    0m27.476s
mysql-(ytt/3305)->load data infile '/var/lib/mysql-files/y11.txt' into table y11 fields terminated by ',' enclosed by '"';
Query OK, 5242880 rows affected (30.25 sec)
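
The sed expression itself can be sanity-checked on a small sample before running it against the full export (paths are illustrative; \s and in-place -i as used here are GNU sed features, so BSD sed would need a different invocation):

```shell
# Collapse runs of whitespace (spaces/tabs) to a single space.
printf 'a  b\t\tc\n' > /tmp/ws_demo.txt
sed -i 's/\s\+/ /g' /tmp/ws_demo.txt
cat /tmp/ws_demo.txt   # a b c
```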

All three scenarios demonstrate practical ways to clean duplicate data in MySQL, choosing between pure‑SQL solutions and OS‑level utilities depending on data volume and performance requirements.

Hope these techniques help with everyday data‑cleaning tasks.

Tags: SQL, MySQL, database migration, data cleaning, data deduplication
Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.
