Designing Indexes and Primary Keys for Distributed Database Architectures
Effective index and primary key design in distributed database architectures requires using globally unique, ordered identifiers instead of auto‑increment keys, embedding shard information in keys, employing auxiliary index tables, and handling global tables to ensure queries target a single shard, thereby preserving scalability and performance.
Primary Key Selection
In a distributed database, the primary key must be globally unique across all shards; it essentially serves as a global unique index. Using auto‑increment as a primary key causes problems because the value is generated after insertion, leading to duplicate keys on different shards.
For example, an e‑commerce orders table with o_orderkey as an auto‑increment primary key can produce the same o_orderkey value on multiple shards, as illustrated below:
Therefore, avoid using auto‑increment as the primary key in a distributed setup: it has poor performance, low safety, and does not suit sharding.
Instead, use a globally unique ordered identifier such as MySQL's sequential UUID, a business‑generated unique key, or an open‑source algorithm like Snowflake (with caution about time‑rollback).
Index Design
While the shard key routes queries to a specific shard, many business queries still need to access tables via other columns. Using the orders example, a query on o_orderkey would normally require scanning all shards because o_orderkey is not the shard key.
SELECT * FROM orders WHERE o_orderkey = 1;If the query must touch every shard (e.g., 1000 shards), performance degrades significantly.
Two design approaches can address this:
Make o_orderkey the shard key and replicate the table accordingly.
Add shard‑key information to the index.
Both approaches essentially use redundancy to trade space for time, avoiding full‑shard scans.
One practical solution is to create an auxiliary index table that stores o_orderkey and its shard key o_custkey :
CREATE TABLE idx_orderkey_custkey (
o_orderkey INT,
o_custkey INT,
PRIMARY KEY (o_orderkey)
);Querying then becomes a two‑step process, each step using the shard key:
SELECT * FROM orders WHERE o_orderkey = 1;
--> step 1
SELECT o_custkey FROM idx_orderkey_custkey WHERE o_orderkey = 1;
--> step 2
SELECT * FROM orders WHERE o_custkey = ? AND o_orderkey = 1;This ensures both SQL statements can be routed to a single shard, dramatically improving performance regardless of the total number of shards.
A more space‑efficient design embeds the shard key directly into the primary key, e.g., concatenating o_orderkey and o_custkey into a string:
o_orderkey = string(o_orderkey + o_custkey)Now a query like SELECT * FROM Orders WHERE o_orderkey = '1000-1'; instantly reveals the target shard, allowing a single‑shard lookup without extra index tables.
Although the primary key becomes larger, it remains ordered, preserving insert performance while providing fast shard resolution.
For non‑unique secondary indexes, the situation is less favorable; queries still need to scan all shards, e.g.:
SELECT * FROM Orders WHERE o_orderdate >= ? AND o_orderdate < ?;Thus, the guiding principle for distributed databases is that the majority of business requests should be routable to a single shard using the shard key.
Global Tables
Some small tables lack a natural shard key (e.g., a nation lookup table). Because they are tiny and rarely updated, they can be duplicated on every shard, eliminating cross‑shard queries.
Unique Indexes
Unique indexes in a distributed environment must also rely on globally unique identifiers (e.g., UUID) to guarantee uniqueness across shards. Even for single‑node MySQL, using globally unique keys is advisable for future scalability.
Conclusion
Use ordered, globally unique UUIDs for primary keys in distributed databases.
Design unique indexes with globally unique values to avoid local‑only uniqueness issues.
If a unique index is not the shard key, store shard information in the indexed column to enable single‑shard queries.
Replicate small, read‑only global tables on each shard to avoid cross‑shard lookups.
IT Architects Alliance
Discussion and exchange on system, internet, large‑scale distributed, high‑availability, and high‑performance architectures, as well as big data, machine learning, AI, and architecture adjustments with internet technologies. Includes real‑world large‑scale architecture case studies. Open to architects who have ideas and enjoy sharing.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.