Mastering Horizontal Scaling with TDSQL: Design, Practices, and Performance
This article explains the motivations, challenges, and design principles of horizontal scaling for databases, using TDSQL as a case study to illustrate architecture, scaling procedures, shard key selection, high‑availability mechanisms, distributed transactions, and performance optimization techniques.
Background and Challenges of Database Horizontal Scaling
Horizontal scaling is needed when business traffic or data volume exceeds the capacity of a single node: TPS or QPS falls short, latency rises, or resources such as disk space and network bandwidth run out. Vertical scaling upgrades a single machine, whereas horizontal scaling adds more machines to meet demand.
Horizontal Scaling vs Vertical Scaling
Vertical scaling increases the CPU, memory, or storage of a single instance; in MySQL this often requires a master‑slave switchover. It is bounded by the resources of one machine, which eventually cannot keep up with rapid growth.
Horizontal scaling removes this ceiling: in theory, capacity can grow without limit by adding nodes. The cost is added complexity in data partitioning, hotspot handling, data migration, routing changes, rollback, consistency, and keeping performance growth linear.
TDSQL Horizontal Scaling Practice
TDSQL Architecture
TDSQL consists of three parts: a SQL engine layer that abstracts storage details; a data storage layer composed of multiple SETs, where each SET can be one primary with multiple replicas; and a Scheduler module that monitors and controls the cluster, handling scaling and failover without business impact.
TDSQL Horizontal Scaling Process
The process starts with data initially split into 256 shards on a single node. Scaling moves these shards to additional nodes, increasing the number of SETs from 1 to up to 256, while the shard count remains constant. The UI allows one‑click addition of SETs and routing adjustments.
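The fixed-shard scheme described above can be sketched as a routing table from shard number to SET. This is a minimal illustration, not TDSQL's actual implementation: the hash function, the names `shard_of` and `split_set`, and the "move the upper half" policy are all assumptions made for the example.

```python
import hashlib
from typing import Dict

NUM_SHARDS = 256  # the shard count stays constant, as described in the text


def shard_of(shard_key: str) -> int:
    """Map a shard key to a shard number (the SHA-256 hash is an assumption)."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def initial_routing() -> Dict[int, str]:
    """All 256 shards start on a single SET."""
    return {shard: "set_1" for shard in range(NUM_SHARDS)}


def split_set(routing: Dict[int, str], source_set: str, new_set: str) -> Dict[int, str]:
    """Return a new routing table with the upper half of source_set's shards on new_set."""
    owned = sorted(s for s, owner in routing.items() if owner == source_set)
    moved = owned[len(owned) // 2:]
    updated = dict(routing)  # leave the old table untouched so it can be kept for rollback
    for shard in moved:
        updated[shard] = new_set
    return updated


routing = split_set(initial_routing(), "set_1", "set_2")
target_set = routing[shard_of("user_42")]  # the SET that serves this key after scaling
```

Because a key always maps to the same shard number, scaling only changes which SET owns a shard; rows never need to be re-hashed.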
Design Principles Behind TDSQL Horizontal Scaling
Shard Key Selection for Compatibility and Performance
When creating tables, users specify a shard key field. This enables balanced data distribution and keeps related data on the same node, reducing cross‑node traffic. If no shard key is provided, TDSQL chooses one randomly, which may degrade performance.
High Availability and Reliability During Scaling
Data synchronization – a new instance is created and synchronized in real time without affecting business.
Data verification – continuous sync and verification until lag is within a small threshold (e.g., 5 seconds).
Routing update – writes are briefly frozen (seconds) while the new node catches up, then routing is switched atomically.
Redundant data deletion – delayed deletion avoids I/O spikes and ensures consistency.
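The four steps above can be sketched with in-memory stand-ins for the nodes. Everything here is hypothetical (the `Node` class, `scale_out`, and the `during_sync` hook that simulates a write arriving mid-migration); a real system verifies replication lag rather than comparing full contents, and freezes only the affected shards rather than the whole node.

```python
import copy
from typing import Callable, Dict, List, Optional, Tuple

Rows = Dict[str, str]


class Node:
    """Hypothetical in-memory stand-in for one SET."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.shards: Dict[int, Rows] = {}
        self.frozen = False

    def write(self, shard_id: int, key: str, value: str) -> None:
        if self.frozen:
            raise RuntimeError(f"{self.name}: writes are frozen")
        self.shards.setdefault(shard_id, {})[key] = value


def sync(source: Node, target: Node, shard_ids: List[int]) -> None:
    """Copy the current contents of the moving shards to the target."""
    for sid in shard_ids:
        target.shards[sid] = copy.deepcopy(source.shards.get(sid, {}))


def in_sync(source: Node, target: Node, shard_ids: List[int]) -> bool:
    return all(source.shards.get(s, {}) == target.shards.get(s, {}) for s in shard_ids)


def scale_out(source: Node, target: Node, routing: Dict[int, str],
              shard_ids: List[int],
              during_sync: Optional[Callable[[], None]] = None
              ) -> Tuple[Dict[int, str], List[int]]:
    # Step 1: data synchronization while the source keeps serving traffic.
    sync(source, target, shard_ids)
    if during_sync is not None:
        during_sync()  # simulated business write that lands mid-migration
    # Step 2: keep syncing until verification passes.
    while not in_sync(source, target, shard_ids):
        sync(source, target, shard_ids)
    # Step 3: brief write freeze, final catch-up, then switch routing in one step.
    source.frozen = True
    try:
        sync(source, target, shard_ids)
        new_routing = dict(routing)
        new_routing.update({sid: target.name for sid in shard_ids})
    finally:
        source.frozen = False
    # Step 4: do not delete now; hand the redundant shards to a delayed cleanup.
    return new_routing, list(shard_ids)


source, target = Node("set_1"), Node("set_2")
for sid in range(4):
    source.write(sid, "k", f"v{sid}")
routing = {sid: "set_1" for sid in range(4)}
routing, pending_delete = scale_out(
    source, target, routing, [2, 3],
    during_sync=lambda: source.write(2, "late", "x"),
)
```

Returning the shards to delete, instead of deleting them inline, mirrors the delayed-deletion step: the source copy stays available until cleanup runs at a quieter time.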
Distributed Transactions
After scaling, data spans multiple nodes. TDSQL uses two‑phase commit to guarantee atomicity across nodes, making distributed transactions transparent to applications. The system is decentralized, allowing linear performance growth.
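As a rough illustration of two‑phase commit, the sketch below moves a balance between two nodes. It shows the general protocol shape only, not TDSQL's implementation: `Participant`, `two_phase_commit`, and the balance model are hypothetical, and durability, timeouts, and coordinator recovery are left out.

```python
from typing import List, Optional, Tuple


class Participant:
    """Hypothetical node holding one account balance."""

    def __init__(self, name: str, balance: int) -> None:
        self.name = name
        self.balance = balance
        self.staged: Optional[int] = None

    def prepare(self, delta: int) -> bool:
        """Phase 1: vote yes only if the change can be applied."""
        if self.balance + delta < 0:
            return False
        self.staged = delta
        return True

    def commit(self) -> None:
        """Phase 2: apply the staged change."""
        if self.staged is not None:
            self.balance += self.staged
            self.staged = None

    def rollback(self) -> None:
        """Phase 2 (abort path): discard the staged change."""
        self.staged = None


def two_phase_commit(changes: List[Tuple[Participant, int]]) -> bool:
    prepared: List[Participant] = []
    for participant, delta in changes:
        if participant.prepare(delta):
            prepared.append(participant)
        else:
            # Any "no" vote aborts the whole transaction on every node.
            for p in prepared:
                p.rollback()
            return False
    for p in prepared:
        p.commit()
    return True


a, b = Participant("set_1", 100), Participant("set_2", 50)
ok = two_phase_commit([(a, -30), (b, 30)])        # both vote yes, both commit
rejected = two_phase_commit([(b, 10), (a, -1000)])  # a votes no, b is rolled back
```

Either every participant applies its change or none does, which is the atomicity property the text refers to.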
Achieving Linear Performance Growth
Key techniques include keeping related data on the same node through shard key design, executing queries in parallel across nodes with stream aggregation of the partial results, pushing predicates down to the storage nodes to reduce data transfer, and storing redundant copies of data to minimize cross‑node access.
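Predicate push‑down and parallel execution with aggregation can be sketched as follows. The node contents, `node_partial`, and `avg_amount` are made up for illustration; the point is that each node filters locally and returns only a small partial result, which the coordinator then merges.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional, Tuple

# Hypothetical rows held by two SETs.
NODES: Dict[str, List[dict]] = {
    "set_1": [{"user_id": 1, "amount": 30}, {"user_id": 2, "amount": 120}],
    "set_2": [{"user_id": 3, "amount": 200}, {"user_id": 4, "amount": 10}],
}


def node_partial(rows: List[dict], min_amount: int) -> Tuple[int, int]:
    """Runs on the storage node: apply the predicate, return (sum, count) only."""
    matched = [r["amount"] for r in rows if r["amount"] >= min_amount]
    return sum(matched), len(matched)


def avg_amount(min_amount: int) -> Optional[float]:
    """Coordinator: query all nodes in parallel and merge the partial aggregates."""
    total = 0
    count = 0
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        partials = pool.map(lambda rows: node_partial(rows, min_amount), NODES.values())
        for part_sum, part_count in partials:
            total += part_sum
            count += part_count
    return total / count if count else None


result = avg_amount(100)  # averages the two rows with amount >= 100
```

Only two integers per node cross the network here, regardless of how many rows each node scans.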
Practical Cases and Guidelines
Choosing a Shard Key
Typical choices are user ID for internet services, player ID for games, buyer/seller ID for e‑commerce, or device ID for IoT. A good shard key ensures balanced distribution and enables queries that include the key to be routed to a single node.
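One way to sanity-check a candidate shard key is to measure how evenly its values spread over the shards. The sketch below is an assumption-laden illustration: the hash function, the `skew` metric (largest shard divided by the ideal even share), and the sample keys are all invented for the example.

```python
import hashlib
from collections import Counter
from typing import Iterable

NUM_SHARDS = 256


def shard_of(key: str) -> int:
    """Map a key to a shard number (the SHA-256 hash is an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def skew(keys: Iterable[str]) -> float:
    """Largest shard size divided by the ideal even share; 1.0 is perfectly balanced."""
    counts = Counter(shard_of(k) for k in keys)
    if not counts:
        return 0.0
    total = sum(counts.values())
    return max(counts.values()) / (total / NUM_SHARDS)


# High-cardinality key: one value per user.
user_keys = [f"user_{i}" for i in range(10_000)]
# Low-cardinality key: only three distinct values, so at most three shards are used.
region_keys = [("north", "south", "east")[i % 3] for i in range(10_000)]

skew_user = skew(user_keys)
skew_region = skew(region_keys)
```

A key such as user ID spreads rows over all shards, while a key with only a few distinct values concentrates them on a few shards and creates hotspots no matter how many SETs are added.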
When to Scale
Scaling decisions are driven by monitoring metrics such as disk usage, CPU, or QPS. Thresholds (e.g., 80 % CPU) trigger scaling, and anticipated traffic spikes (e.g., upcoming promotions) can prompt proactive scaling.
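A threshold check of this kind might look like the sketch below. Only the 80 % CPU figure comes from the text; the disk and QPS thresholds, the `Metrics` type, and the assumption that CPU and QPS grow linearly with expected traffic are placeholders to be replaced with values from real monitoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    cpu_pct: float
    disk_pct: float
    qps: float


# 80 % CPU is the example from the text; disk and QPS limits are hypothetical.
THRESHOLDS = Metrics(cpu_pct=80.0, disk_pct=80.0, qps=50_000.0)


def should_scale(current: Metrics, traffic_multiplier: float = 1.0) -> List[str]:
    """Return the metrics that call for scaling, given expected traffic growth.

    traffic_multiplier > 1 models an anticipated spike such as a promotion.
    CPU and QPS are projected linearly (a simplification); disk is not.
    """
    reasons = []
    if current.cpu_pct * traffic_multiplier >= THRESHOLDS.cpu_pct:
        reasons.append("cpu")
    if current.disk_pct >= THRESHOLDS.disk_pct:
        reasons.append("disk")
    if current.qps * traffic_multiplier >= THRESHOLDS.qps:
        reasons.append("qps")
    return reasons


reactive = should_scale(Metrics(cpu_pct=85.0, disk_pct=40.0, qps=1_000.0))
proactive = should_scale(Metrics(cpu_pct=50.0, disk_pct=40.0, qps=30_000.0),
                         traffic_multiplier=2.0)
```

The first call reacts to a threshold already crossed; the second flags a node that is healthy today but would be overloaded if traffic doubled.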
ITFLY8 Architecture Home
