Mastering Distributed Unique IDs: From Auto‑Increment to Snowflake
This article surveys various distributed unique ID generation techniques—including Oracle auto‑increment, Redis batch allocation, UUID versions, Hibernate and MongoDB strategies, and Twitter’s Snowflake—explaining their structures, trade‑offs, and practical implementation considerations for reliable, collision‑free identifiers.
Distributed unique IDs are used everywhere, from business object IDs to trace IDs in logs; this article summarizes a wide range of generation algorithms.
1. ID Generators
The earliest unique ID many encounter is Oracle's auto‑increment ID, which provides quasi‑sequential numbers. For performance, a client fetches a batch (e.g., 20 IDs) at a time. Sina Weibo’s Tim uses Redis in the same way, incrementing a key to obtain a batch; multiple data centers can reserve high‑order bits to differentiate them. A single 64‑bit long is sufficient when the architecture tolerates the added Redis complexity, and batching is essential to avoid a remote call per ID.
2. UUID
2.1 Overview
Universally Unique Identifier (UUID) is defined by an RFC, consists of 128 bits (32 hexadecimal characters with hyphens). Its layout includes a timestamp + version (60 bits + 4 bits), a clock sequence and reserved field (13 bits + 3 bits), and a node identifier (48 bits). Example: f81d4fae-7dec-11d0-a765-00a0c91e6bf6.
UUID has several versions; those usable for trace IDs are:
Version 1: time‑based algorithm.
Version 4: random‑based algorithm.
Version 4
Version 4 is the most aggressive approach, used by the JDK. It ignores the original meaning of most bits, fills them with random data generated by SecureRandom, and stores the 16 random bytes in two long s. The JVM should be started with -Djava.security.egd=file:/dev/./urandom to avoid blocking.
Version 1
Version 1 strictly follows the RFC: a 60‑bit timestamp (100‑nanosecond units since 1582‑10‑15, lasting about 3655 years), a 48‑bit node identifier (usually a MAC address, or a hash of host information), and a 16‑bit sequence to avoid collisions when the node or clock changes. In practice, strict Version 1 implementations are rare because they do not handle multiple processes on the same machine or concurrent identical timestamps.
3. Version 1 Variant – Hibernate
Hibernate’s CustomVersionOneStrategy.java addresses the two problems of the original Version 1:
Timestamp: 6 bytes (48 bits) in milliseconds since 1970, lasting about 8925 years.
Sequence: 2 bytes (16 bits), wraps to zero after reaching 65535.
Machine identifier: 4 bytes (32 bits) derived from the IPv4 address (or the first 4 bytes of IPv6).
Process identifier: 4 bytes (32 bits) obtained by shifting the current timestamp right 8 bits.
The combination of machine and process identifiers forms an almost‑constant 64‑bit long; only the second long changes.
4. Version 1 Variant – MongoDB
MongoDB’s ObjectId consists of:
Timestamp: 4 bytes (32 bits) in seconds since 1970, lasting 136 years.
Increment: 3 bytes (24 bits) – a counter that starts from a random value and increments, never resetting each second.
Machine identifier: 3 bytes (24 bits) – a hash of all MAC addresses or a random value if no NIC is present.
Process identifier: 2 bytes (16 bits) – obtained from JMX or a hash of the process name.
The total size is 12 bytes (96 bits); storing it in a 64‑bit long requires conversion to a byte array or hexadecimal string.
5. Twitter’s Snowflake
Snowflake is another allocator, built on a Thrift service. It packs a single 64‑bit long as follows:
Timestamp: 42 bits (milliseconds since 2012), lasting 139 years.
Sequence: 12 bits (max 4096) per millisecond.
DataCenter ID: 5 bits (max 32), configurable.
Worker ID: 5 bits (max 32), configurable; a data center can have up to 32 workers, registered in Zookeeper.
Because Snowflake omits machine and process identifiers, a single long suffices, but each client can only request one ID at a time, adding latency.
6. Final Question – Can We Eliminate the Allocator and Fit UUID into a Long?
If the ID type is a 64‑bit long and we want to generate it locally without a central allocator, the challenge is encoding both the machine and process identifiers.
Idea 1: Compress other fields, reserving enough bits for machine + process. For example, use a second‑level timestamp (≈24 bits per year), a 16‑bit sequence for 60 k QPS, and allocate the remaining 20‑24 bits for a hashed combination of MAC address and process ID.
Idea 2: Use Zookeeper, MySQL, or Redis to manage the identifier. If the worker field is only 12 bits (4096 workers), ZK or etcd can allocate and recycle IDs on process shutdown. If more bits are available (e.g., 20 bits), a simple auto‑increment in Redis or MySQL suffices.
Idea 3: Continue using randomness. Take the low‑order long of UUID.randomUUID() (the high‑order long has fixed bits per the RFC) or directly use SecureRandom.nextLong(), sacrificing the three reserved bits.
Source: Jiangnan Baiyi URL: http://calvin1978.blogcn.com/articles/uuid.html
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
