Why PhxSQL Rejects Multi-Write, Sharding, and Serializability: Design Trade‑offs
This article explains how PhxSQL prioritizes linearizable consistency, high availability, serializable isolation, and full MySQL compatibility. It also explains why PhxSQL deliberately forgoes features such as multi-write and sharding, given the high cost of distributed transactions and the protocol complexity they bring.
7. Why Not?
7.1. Why not support multi‑write?
Multi‑write is attractive because it can fully utilize the write‑side resources of each machine, improve throughput, eliminate leader‑switch windows, and reduce cross‑data‑center latency. However, it introduces expensive distributed transactions.
There are two types of multi‑write: inter‑shard (or inter‑group) parallel writes, exemplified by Google Spanner, and intra‑shard parallel writes, used by Galera and MySQL Group Replication.
7.1.1. Inter‑group multi‑write
Data is partitioned into disjoint shards, each handled by a group of machines. A transaction that accesses data within a single group is a local transaction; if it spans multiple groups, it becomes a distributed transaction.
Local transactions can be executed independently within a group, typically by a group leader using Paxos or similar consensus to ensure consistency. Parallel execution across groups greatly boosts write performance for local‑type transactions.
The main obstacle to inter‑group multi‑write is the cost of distributed transactions, which require strict two‑phase locking (2PL) or multi‑version concurrency control (MVCC) across groups, leading to high communication overhead.
For example, a Spanner transaction involving two groups requires multiple coordinator‑leader and participant‑leader communications, plus additional Paxos writes, which become prohibitively expensive over wide‑area networks.
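To make the cost concrete, here is a rough latency model (an illustrative sketch with made-up numbers, not Spanner's actual protocol accounting): a cross-group transaction pays two-phase-commit coordination on top of a durable Paxos write per phase, while a local transaction pays only a single Paxos write.

```python
def local_txn_latency(paxos_rtt_ms: float) -> float:
    """A local transaction needs one Paxos write inside its own group."""
    return paxos_rtt_ms

def distributed_txn_latency(paxos_rtt_ms: float, coord_rtt_ms: float) -> float:
    """Two-phase commit across groups: the prepare phase and the commit
    phase each need a coordinator<->participant round trip plus a durable
    Paxos write in the participant group."""
    prepare_phase = coord_rtt_ms + paxos_rtt_ms
    commit_phase = coord_rtt_ms + paxos_rtt_ms
    return prepare_phase + commit_phase

# With 4 ms Paxos writes inside a data center but 30 ms round trips
# between groups over a WAN, the distributed transaction is an order
# of magnitude more expensive than the local one.
print(local_txn_latency(4.0))              # -> 4.0
print(distributed_txn_latency(4.0, 30.0))  # -> 68.0
```

The exact constants vary by deployment; the point is that every extra participating group multiplies wide-area round trips, which is why inter-group multi-write is expensive.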
7.1.2. Intra‑group multi‑master write
In this model each machine holds a full copy of a logical data set, but the data is partitioned into non‑overlapping collections, each with a designated primary. Writes are sent to the primary, which then uses atomic broadcast to propagate updates to all replicas, as done by Galera and MySQL Group Replication.
Atomic broadcast guarantees three properties:
1. If any machine delivers a message, every other machine also delivers it (agreement).
2. All machines deliver messages in the same order (total order).
3. If a machine successfully broadcasts a message, all machines eventually deliver and execute it (validity).
Using atomic broadcast changes the transaction lifecycle from prepare → committed/aborted to prepare → committing → committed/aborted.
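The extended lifecycle can be sketched as a small state machine (an illustrative sketch; the state names follow the text above, not any actual Galera or Group Replication API):

```python
from enum import Enum, auto

class TxnState(Enum):
    """Transaction lifecycle under atomic broadcast: the broadcast inserts
    a COMMITTING stage between PREPARE and the final outcome, during which
    the group decides whether the transaction may commit."""
    PREPARE = auto()
    COMMITTING = auto()
    COMMITTED = auto()
    ABORTED = auto()

# Legal transitions; compare with the classic prepare -> committed/aborted,
# which has no intermediate committing stage.
TRANSITIONS = {
    TxnState.PREPARE:    {TxnState.COMMITTING, TxnState.ABORTED},
    TxnState.COMMITTING: {TxnState.COMMITTED, TxnState.ABORTED},
}

def advance(state: TxnState, target: TxnState) -> TxnState:
    """Move a transaction to `target`, rejecting illegal transitions."""
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state.name} -> {target.name}")
    return target
```

Note that a transaction in COMMITTING is not yet durable: it can still be aborted when the group-wide arbitration described below rejects it.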
Complex transactions
When a transaction touches data from multiple collections, it becomes a complex transaction. The primary of each involved collection broadcasts the committing state (including read and write sets) to the whole group for arbitration. Galera and MySQL Group Replication only validate write sets, so they do not support serializable isolation.
Each machine can independently decide whether a committing transaction should be finalized, based on the guarantees of atomic broadcast.
Deferred Update Replication (or Certification‑based Replication) ensures that a transaction’s final outcome is decided only after all preceding messages have been processed, introducing a degree of serialization.
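The certification step can be sketched as follows (a simplified model, assuming a write-set-only check as in Galera; the function and parameter names are illustrative):

```python
def certify(txn_writes: set, txn_snapshot_seq: int, delivered: list) -> bool:
    """Certification-based replication, sketched: a replica aborts a
    delivered transaction if any transaction that committed after this
    transaction's snapshot wrote a key it also wrote.

    `delivered` is the totally ordered log of (seq, write_set) pairs that
    every replica sees in the same order thanks to atomic broadcast, so
    every replica reaches the same commit/abort decision independently."""
    for seq, committed_writes in delivered:
        if seq > txn_snapshot_seq and committed_writes & txn_writes:
            return False  # write-write conflict -> deterministic abort
    return True

# A log with two committed transactions: seq 1 wrote x, seq 2 wrote y.
log = [(1, {"x"}), (2, {"y"})]
print(certify({"z"}, 0, log))  # no overlap with anything   -> True
print(certify({"x"}, 0, log))  # conflicts with seq 1       -> False
print(certify({"x"}, 1, log))  # snapshot already saw seq 1 -> True
```

Because the check looks only at write sets, two transactions whose conflict is read-write (as in the write-skew example in section 7.3) both certify successfully, which is exactly why this scheme cannot provide serializable isolation.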
Atomic broadcast performs well in low‑latency LANs but its latency scales with the number of machines, making it less suitable for wide‑area networks. Protocols like Totem, used by Corosync and Spread, suffer from token‑passing delays that increase with cluster size.
In a typical two-city, three-data-center deployment, a single transaction can incur tens of milliseconds of network delay, whereas PhxSQL's Paxos-based writes can complete in about 4 ms under the same conditions.
However, if any node crashes or the network partitions, Totem times out and the entire cluster stalls until membership is re‑established.
Read‑only transactions require either reading from the primary, quorum reads, or techniques like Spanner’s TrueTime to achieve linearizability; Galera does not provide linearizable reads.
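A quorum read can be sketched like this (an illustrative model, not PhxSQL's actual API; the version numbers are assumed to come from the consensus log):

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    version: int  # position of the writing entry in the consensus log
    value: str

def quorum_read(replicas: list, key: str, quorum_size: int) -> str:
    """Read from a majority and return the value with the highest version.
    Any majority intersects the majority that acknowledged the latest
    committed write, so the highest version seen is at least as new as
    the last committed value."""
    answers = [replica[key] for replica in replicas[:quorum_size]]
    return max(answers, key=lambda v: v.version).value

# Three replicas; the first one lags behind at version 1.
replicas = [
    {"x": Versioned(1, "old")},
    {"x": Versioned(2, "new")},
    {"x": Versioned(2, "new")},
]
print(quorum_read(replicas, "x", 2))  # -> "new"
```

The price is one extra round of replica queries per read, which is why systems that can afford it prefer leader reads or leases instead.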
7.2. Why not support sharding?
Sharding offers unlimited read/write scalability but introduces costly distributed transactions and breaks full MySQL compatibility. Implementing sharding without violating the “minimal intrusion” principle would require substantial changes to MySQL.
7.3. Why insist on serializable isolation when repeatable‑read often suffices?
Serializable isolation prevents anomalies such as write‑skew, which can lead to incorrect business logic (e.g., exceeding a combined credit limit in a banking scenario). For critical workloads, serializable guarantees are essential.
Example (simplified pseudocode; A and B are two account balances sharing a combined credit limit of 2000):

```
sub_A(amount_a):
    begin transaction
    if (A + B + amount_a <= 2000) {
        A += amount_a
    }
    commit

sub_B(amount_b):
    begin transaction
    if (A + B + amount_b <= 2000) {
        B += amount_b
    }
    commit
```

Under repeatable-read, sub_A and sub_B can both pass the check against the same snapshot and both commit, violating the combined credit limit; under serializable isolation one of them is forced to abort or retry.
7.4. Why not aim primarily at boosting MySQL performance?
PhxSQL already improves MySQL primary‑standby write performance by 15‑20 % over semi‑sync while keeping read performance unchanged, all without breaking MySQL compatibility.
7.5. Why require C++11 or newer for compilation?
C++11 introduces powerful language features, such as lambdas, move semantics, auto type deduction, smart pointers, and a standardized memory model, that greatly enhance development productivity, code correctness, and runtime performance.
Conclusion
PhxSQL provides a MySQL-compatible cluster with Zookeeper-level strong consistency and availability, deliberately sacrificing features like multi-write and sharding to maintain simplicity, compatibility, and minimal intrusion into MySQL.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
WeChat Backend Team
Official account of the WeChat backend development team, sharing their experience in large-scale distributed system development.
