Why MySQL Auto‑Increment Can Leak Data and How Distributed IDs Offer a Safer Alternative
The article examines the 2012 GitHub data leak caused by a MySQL master-slave failover combined with reliance on auto_increment keys, explains why developers' expectations of uniqueness, monotonicity and continuity are unrealistic, and proposes semi-synchronous replication and distributed ID algorithms such as Snowflake as more reliable solutions.
GitHub Incident Overview
In September 2012 GitHub suffered a private‑data leak because the MySQL cluster’s primary node became overloaded, causing the heartbeat check to fail and an out‑of‑sync replica to be promoted as the new master.
The original design used an auto_increment column as the primary key. Because the promoted replica had not received the latest writes, its counter lagged behind the old master's, and it reissued IDs that had already been assigned.
Those IDs were also referenced by an external Redis cache, leading to inconsistencies between MySQL and Redis and ultimately exposing private user data to other accounts.
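The failure mode described above can be reproduced with a toy model. The sketch below is only an illustration under stated assumptions: the `Node` class, the user names, and the dictionary standing in for Redis are all hypothetical and do not reflect GitHub's actual schema or code.

```python
class Node:
    """Toy MySQL node: stores rows and an auto-increment counter."""

    def __init__(self):
        self.rows = {}      # id -> owner
        self.next_id = 1

    def insert(self, owner):
        new_id = self.next_id
        self.next_id += 1
        self.rows[new_id] = owner
        return new_id


def simulate_failover():
    primary, replica = Node(), Node()
    cache = {}  # external cache keyed by row id (stands in for Redis)

    # Five rows are written to the primary; only the first three replicate.
    for i, owner in enumerate(["alice", "bob", "carol", "dave", "erin"]):
        row_id = primary.insert(owner)
        cache[row_id] = f"private data of {owner}"
        if i < 3:
            replica.rows[row_id] = owner
            replica.next_id = row_id + 1

    # The lagging replica is promoted and hands out IDs starting from 4 again.
    new_primary = replica
    reused_id = new_primary.insert("mallory")

    # The cache still holds the entry written for the previous owner of that ID.
    return reused_id, cache[reused_id]
```

In this model the new row receives an ID that the cache already associates with a different user's data, which is the kind of mismatch the incident describes.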
Developers’ Naïve Expectations of Auto‑Increment
Auto‑increment appears to provide three desirable properties:
Uniqueness – a fundamental requirement for primary keys.
Monotonic increase – later rows receive larger IDs than earlier rows.
Continuous increase – the counter increments by exactly one each time.
These expectations stem from the efficiency of atomic counters, but reality often contradicts them.
Why Those Expectations Fail
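Monotonicity: Fetching a value from the auto_increment counter and writing the row to the redo log are separate, non-atomic steps carried out by concurrent threads. A transaction holding a larger ID can therefore be persisted before one holding a smaller ID, so the persisted values are not guaranteed to appear in increasing order.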
Continuity: Transactions can roll back, but the auto_increment counter does not revert, breaking the “continuous” assumption.
Consequently, only uniqueness can be guaranteed; the other two properties are illusory.
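Both effects can be seen in a small simulation. This is a sketch of the idea rather than of InnoDB internals: `begin` and `commit` are hypothetical helpers, and the counter is modeled with `itertools.count`.

```python
import itertools

counter = itertools.count(1)   # stands in for the auto_increment counter
committed = []                 # IDs in the order their rows are persisted


def begin():
    """Reserve the next ID; the counter never moves backwards."""
    return next(counter)


def commit(row_id):
    """Persist the row that holds this ID."""
    committed.append(row_id)


t1 = begin()   # transaction 1 reserves ID 1
t2 = begin()   # transaction 2 reserves ID 2
t3 = begin()   # transaction 3 reserves ID 3

commit(t3)     # transaction 3 happens to commit first
# transaction 2 rolls back here: ID 2 is never handed out again
commit(t1)     # transaction 1 commits after transaction 3

t4 = begin()   # the next transaction gets ID 4, not 2
commit(t4)
```

After this run `committed` is `[3, 1, 4]`: every ID is unique, but the commit order is not increasing and ID 2 is missing.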
Preventing Data Loss During Master‑Slave Switches
To avoid GitHub-style data loss, configure the cluster for semi-synchronous replication: the master waits for at least one slave to acknowledge each transaction before reporting the commit to the client, while the remaining slaves stay asynchronous.
Do not set all slaves to synchronous mode, as this would force every write to wait for all replicas, severely degrading performance and risking total outage if any slave crashes.
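The trade-off can be illustrated with a deliberately simplified model. The functions below are hypothetical and do not use any MySQL API. They assume the worst case for asynchronous replicas, namely that none of them has received any write at the moment of failover.

```python
def replicate_writes(entries, sync_replicas, total_replicas=3):
    """Return (primary_log, replica_logs) after applying `entries`.

    The first `sync_replicas` replicas acknowledge every write before the
    client is told it succeeded. The rest are asynchronous and, as a worst
    case, are assumed to have received nothing yet.
    """
    primary_log = []
    replica_logs = [[] for _ in range(total_replicas)]
    for entry in entries:
        primary_log.append(entry)
        for log in replica_logs[:sync_replicas]:
            log.append(entry)
    return primary_log, replica_logs


def promote(replica_logs):
    """Fail over to the most up-to-date replica."""
    return max(replica_logs, key=len)


writes = ["row-1", "row-2", "row-3"]

# Fully asynchronous: the promoted replica may have missed every write.
primary, replicas = replicate_writes(writes, sync_replicas=0)
lost_async = len(primary) - len(promote(replicas))

# One acknowledging replica: no acknowledged write is lost on failover.
primary, replicas = replicate_writes(writes, sync_replicas=1)
lost_semi_sync = len(primary) - len(promote(replicas))
```

In this model the asynchronous setup loses all three writes on failover, while a single acknowledging replica loses none, and each write waits for only one replica rather than all of them.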
Additional pitfalls to watch for:
MySQL REPLACE statements can desynchronize auto_increment values between master and slaves; avoid REPLACE on versions earlier than MySQL 8.0, where the bug is fixed.
Complex INSERT ... SELECT statements with binlog_format=STATEMENT may cause different index choices on master and slave, leading to divergent insertion order. Use binlog_format=ROW to prevent this.
Distributed ID Generation as a Better Choice
Many systems adopt distributed ID algorithms (e.g., Twitter’s Snowflake, Sonyflake) that embed timestamps and machine identifiers into a 64‑bit integer, ensuring global uniqueness without relying on database auto_increment.
Snowflake’s 64‑bit ID consists of:
1 sign bit (unused).
41 bits for a millisecond‑precision timestamp.
10 bits for a machine ID (supporting up to 1024 nodes).
12 bits for a sequence number (4096 IDs per millisecond per node).
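With this structure a single node can generate up to 4096 IDs per millisecond, or about 4.1 million per second, and a full deployment of 1024 nodes about 4.19 million per millisecond, which is sufficient for most applications. However, clock rollback must be handled to avoid duplicate IDs.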
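The bit layout above can be sketched as a small generator. This is a minimal illustration, not Twitter's implementation: the class name, the injectable `clock` parameter, and the custom epoch (2020-01-01 UTC) are assumptions made for the example, and a rollback is handled in the simplest possible way, by refusing to issue an ID.

```python
import threading
import time

MACHINE_BITS, SEQUENCE_BITS = 10, 12
MAX_MACHINE_ID = (1 << MACHINE_BITS) - 1    # 1023
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1     # 4095
EPOCH_MS = 1_577_836_800_000                # assumed custom epoch: 2020-01-01 UTC
IDS_PER_SECOND_PER_NODE = (MAX_SEQUENCE + 1) * 1000   # 4,096,000


class SnowflakeGenerator:
    def __init__(self, machine_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= machine_id <= MAX_MACHINE_ID:
            raise ValueError("machine_id out of range")
        self.machine_id = machine_id
        self.clock = clock          # returns the current time in milliseconds
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                # Clock moved backwards: refuse rather than risk a duplicate.
                raise RuntimeError("clock rollback detected")
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    # Sequence exhausted: spin until the next millisecond.
                    while now <= self.last_ms:
                        now = self.clock()
            else:
                self.sequence = 0
            self.last_ms = now
            return (((now - EPOCH_MS) << (MACHINE_BITS + SEQUENCE_BITS))
                    | (self.machine_id << SEQUENCE_BITS)
                    | self.sequence)


def decode(snowflake_id):
    """Split an ID back into (timestamp_ms, machine_id, sequence)."""
    sequence = snowflake_id & MAX_SEQUENCE
    machine_id = (snowflake_id >> SEQUENCE_BITS) & MAX_MACHINE_ID
    timestamp_ms = (snowflake_id >> (MACHINE_BITS + SEQUENCE_BITS)) + EPOCH_MS
    return timestamp_ms, machine_id, sequence
```

A caller would create one generator per node, for example `SnowflakeGenerator(machine_id=7)`, and call `next_id()` for each new row. Raising on rollback is only one option; waiting until the clock catches up is another common choice.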
MongoDB's ObjectId follows a similar principle, using a 12-byte structure that combines a timestamp, a per-process random value, and an incrementing counter.
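For comparison, the 12 bytes of an ObjectId can be split by hand. The sketch below assumes the current layout (4-byte timestamp in seconds, 5-byte random value, 3-byte counter) and uses a hypothetical helper, `parse_object_id`, rather than any MongoDB driver function.

```python
def parse_object_id(hex_string):
    """Split a 24-character hex ObjectId into its three fields."""
    raw = bytes.fromhex(hex_string)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes")
    return {
        "timestamp": int.from_bytes(raw[0:4], "big"),   # seconds since the Unix epoch
        "random": raw[4:9].hex(),                       # per-process random value
        "counter": int.from_bytes(raw[9:12], "big"),    # incrementing counter
    }
```

Calling `parse_object_id("507f1f77bcf86cd799439011")` returns the three fields as a dictionary, with the timestamp coming first just as in Snowflake.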
If GitHub had used a distributed ID as the primary key, a master‑slave switch would not cause ID collisions, preventing the data‑leak scenario.
Practical Guidance
For ordinary developers, auto_increment remains a convenient choice. For architects designing large‑scale, highly available systems, understanding and possibly adopting distributed ID schemes is advisable.
For reference, a minimal table that uses an auto_increment primary key:
CREATE TABLE `test` (
  -- integer display widths such as int(16) are deprecated in recent MySQL versions
  `id` INT NOT NULL AUTO_INCREMENT,
  `name` CHAR(10) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
