When Should You Choose Distributed Over Centralized Databases? A Practical Guide
This article examines the current landscape of Chinese databases, compares centralized and distributed architectures, outlines when distributed solutions are truly needed, provides performance test data, and offers practical advice on sharding, SQL design, and avoiding cross‑node bottlenecks.
Usage Status Analysis
In 2022, China had more than 200 domestic database vendors. Traditional centralized products (e.g., 人大金仓 (Kingbase), 达梦 (Dameng)) dominate the market, while newer offerings such as GaussDB, Kingbase, TDSQL, GoldenDB and OceanBase support both centralized and distributed deployment modes. Some distributed products still require a compute node (CN) that routes client connections to data nodes, adding a parsing layer that can increase latency; others provide direct JDBC/ODBC or VIP connections to bypass this overhead.
The 2022 Financial Industry Database Supply‑Chain Security Report shows that centralized databases account for 89% of the overall market (80% in banking, >90% in securities and insurance). Distributed databases represent about 7% overall but exceed 17% in banking, indicating that most financial workloads can be satisfied with centralized solutions.
Do You Really Need a Distributed Architecture?
Centralized databases are simple, easy to operate, highly compatible and cost‑effective, but they cannot break single‑machine hardware limits, lack horizontal scalability, and may hit performance or capacity bottlenecks. Before moving to a distributed system, consider the following questions:
Can the problem be solved by tuning the existing centralized database? (parameter adjustments, SQL optimization, business‑logic changes)
Can additional hardware resources resolve the issue? (more CPU, memory, or switching from VM to bare‑metal)
Is storage‑compute separation an option? (external storage or disaggregated architecture to overcome disk‑capacity limits)
Can the application layer address the challenge? (micro‑services, data partitioning, distributed transactions while keeping the DB centralized)
Do you fully understand the operational trade‑offs of a distributed system? (backup, maintenance, cost, complexity)
When to Adopt Distributed Databases
Historically, a table growing beyond about 20 million rows was treated as a trigger for sharding, a figure derived from B‑tree leaf‑node capacity calculations. Modern hardware and caching have reduced the I/O impact of deeper indexes, so decisions now often rely on throughput or data‑size thresholds, for example:
Single‑node TPS > 4 000
QPS > 80 000
Data size > 2 TB
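The 20-million-row rule of thumb mentioned above can be reproduced with a back-of-the-envelope calculation. The sketch below is only an illustration under assumptions that are not stated in the article: an InnoDB-style B+tree with 16 KB pages, about 14 bytes per internal entry (an 8-byte key plus a 6-byte page pointer), rows of roughly 1 KB, and a tree height of 3. The function name `btree_row_capacity` is hypothetical.

```python
def btree_row_capacity(page_bytes=16 * 1024, pointer_bytes=14,
                       row_bytes=1024, height=3):
    """Estimate how many rows a B+tree of the given height can hold.

    All defaults are illustrative assumptions, not values from the article.
    """
    fanout = page_bytes // pointer_bytes      # child pointers per internal page
    rows_per_leaf = page_bytes // row_bytes   # rows stored in one leaf page
    leaf_pages = fanout ** (height - 1)       # leaves reachable via internal levels
    return leaf_pages * rows_per_leaf


if __name__ == "__main__":
    print(btree_row_capacity())  # 1170 * 1170 * 16 = 21902400
```

Under these assumptions a three-level tree tops out at roughly 21.9 million rows; beyond that a fourth level adds one more page read per lookup. With smaller rows or a hot buffer pool the practical limit moves considerably, which is why the threshold is treated as historical.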
Experimental data on a Kunpeng ARM VM (16 vCPUs, 64 GB RAM, Kylin V10, SSD) shows that a four‑shard distributed setup can achieve up to a 5× performance improvement for full‑table scans and joins on 5 million rows, while point‑lookup latency remains comparable to that of a centralized node.
Another benchmark using sysbench on a mid‑range server indicates a centralized database reaches a maximum of ~4 595 TPS at 75 % CPU utilization with ~5 ms latency; exceeding ~5 000 TPS typically signals the need for sharding.
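The example thresholds listed earlier in this section (TPS > 4 000, QPS > 80 000, data size > 2 TB) can be folded into a simple capacity check. This is a minimal sketch: the `Workload` type and `exceeded_thresholds` function are hypothetical names, and the limits are the article's examples rather than universal constants, so they should be replaced with figures measured on your own hardware.

```python
from dataclasses import dataclass

# Example limits taken from the article; tune them for your environment.
TPS_LIMIT = 4_000
QPS_LIMIT = 80_000
DATA_TB_LIMIT = 2.0


@dataclass
class Workload:
    tps: float       # peak transactions per second on a single node
    qps: float       # peak queries per second on a single node
    data_tb: float   # total data size in terabytes


def exceeded_thresholds(w: Workload) -> list:
    """Return the names of the thresholds this workload exceeds."""
    reasons = []
    if w.tps > TPS_LIMIT:
        reasons.append("tps")
    if w.qps > QPS_LIMIT:
        reasons.append("qps")
    if w.data_tb > DATA_TB_LIMIT:
        reasons.append("data_tb")
    return reasons


if __name__ == "__main__":
    # A node near the sysbench ceiling quoted above exceeds only the TPS limit.
    print(exceeded_thresholds(Workload(tps=4595, qps=20_000, data_tb=0.5)))
```

An empty result suggests that tuning or scaling up the centralized database is still the cheaper path; a non-empty one is a prompt to investigate further, not an automatic decision to shard.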
How to Make the Most of Distributed Databases
Choose an appropriate sharding key: high cardinality, evenly distributed values; preferably the primary key. Avoid changing the sharding key after deployment.
Select a distribution method: hash‑based sharding gives uniform data spread; range or list sharding may suit specific query patterns. Define frequently accessed small tables as global tables to eliminate cross‑node joins.
Write SQL that includes the sharding key in predicates and join conditions to prevent costly cross‑node data transfer.
Minimize cross‑node traffic: every cross‑node hop adds network round‑trip and serialization cost that a local read does not incur.
Limit distributed transactions (often implemented via two‑phase commit, 2PC). Aim to keep them below 10 % of total transactions to avoid added latency and consistency issues.
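The guidelines above can be made concrete with a small sketch of hash-based routing and a check of the distributed-transaction share. Everything here is an assumption for illustration: the shard count of 4, the use of CRC32 as the hash, and the function names `shard_for`, `is_distributed` and `distributed_ratio`. Real distributed databases use their own routing functions.

```python
import zlib

NUM_SHARDS = 4  # assumed shard count for illustration


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a sharding-key value to a shard with a deterministic hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_shards


def is_distributed(txn_keys, num_shards=NUM_SHARDS):
    """A transaction is distributed if its keys land on more than one shard."""
    return len({shard_for(k, num_shards) for k in txn_keys}) > 1


def distributed_ratio(transactions, num_shards=NUM_SHARDS):
    """Fraction of transactions that touch more than one shard."""
    if not transactions:
        return 0.0
    cross = sum(1 for keys in transactions if is_distributed(keys, num_shards))
    return cross / len(transactions)


if __name__ == "__main__":
    # Each inner list holds the sharding-key values one transaction touches.
    sample = [["user:1"], ["user:2"], ["user:1", "user:2"], ["user:3"]]
    ratio = distributed_ratio(sample)
    print(f"distributed share: {ratio:.0%}, within 10% guideline: {ratio < 0.10}")
```

Replaying sharding-key values from a production trace through a function like `distributed_ratio` gives a rough idea, before migration, of whether a candidate sharding key keeps most transactions on a single node.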
Deep Dive: Database‑Level vs. Application‑Level Distribution
Two common approaches exist:
Database‑level distribution: the distributed DB handles sharding, distributed transactions and other complexities, presenting a unit‑based architecture to the application. This reduces application code complexity but shifts operational burden to the database.
Application‑level distribution: the application implements data partitioning, distributed transactions (e.g., TCC, Saga) and scaling while the underlying DB remains centralized. This gives developers fine‑grained control over failure handling but increases application complexity.
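To illustrate what application-level distribution asks of developers, here is a minimal sketch of the Saga pattern mentioned above: each step pairs an action with a compensation, and a failure triggers the compensations of the completed steps in reverse order. The `run_saga` helper and the transfer example are hypothetical, and the sketch deliberately ignores retries, persistence of saga state, and failures inside compensations, all of which a production implementation must handle.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, undo the already completed steps in reverse
    order and return False; return True if every action succeeds.
    """
    compensations = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(compensations):
                undo()
            return False
        compensations.append(compensate)
    return True


if __name__ == "__main__":
    events = []

    def credit_account_b():
        # Simulated failure on the second service/database.
        raise RuntimeError("account B unavailable")

    ok = run_saga([
        (lambda: events.append("debit A"), lambda: events.append("refund A")),
        (credit_account_b, lambda: events.append("reverse credit B")),
    ])
    print(ok, events)  # False ['debit A', 'refund A']
```

This is the failure-handling logic a database-level solution hides behind its own distributed-transaction protocol; with application-level distribution it lives in, and must be tested as part of, the application code.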
Conclusion
Distributed databases provide high availability, elastic scaling and strong performance, but they introduce architectural complexity, higher operational costs, and limited support for stored procedures, functions and triggers. Successful adoption hinges on careful sharding‑key selection, avoiding cross‑node data movement, and keeping distributed transactions to a minimal proportion of total workloads.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.