Designing Cloud‑Native Distributed Database Architecture: Insights from TiDB
This article explores how to design a cloud‑native distributed database architecture using TiDB as a case study, discussing cost reduction, storage choices, compute‑storage separation, fault tolerance, scaling strategies, and the trade‑offs of caching and persistent storage in modern cloud environments.
Introduction The talk introduces the concept of cloud‑native databases, emphasizing that customers seek elastic resources, pay‑as‑you‑go pricing, and flexible service‑level options rather than a generic cloud‑native label.
Understanding Cloud‑Native Cloud‑native is viewed as a means to leverage cloud infrastructure to lower total cost of ownership, not just a buzzword. For a third‑party database vendor, reducing costs requires using cloud provider features effectively.
TiDB Architecture Overview TiDB consists of stateless TiDB processes that perform computation and stateful TiKV nodes that store data and handle transactions, with Raft ensuring consistency. The compute‑storage boundary currently lies between TiDB and TiKV.
Limitations of Existing Design The current TiDB/TiKV design targets IDC environments and does not fully exploit cloud capabilities. Moving the compute‑storage separation to the boundary between TiKV and persistent storage can better utilize cloud services.
Storage Options: EBS vs. S3 EBS offers high IOPS comparable to local disks but at higher cost, while S3 provides cheap, virtually unlimited capacity with 11‑nines durability but higher latency. A hybrid approach can combine the strengths of both.
Proposed Cloud‑Native Architecture Store persistent data in S3 (or other cloud storage) and keep a cache layer co‑located with TiDB processes. Writes go to the cache first and are asynchronously flushed to S3; reads prefer the cache and fall back to S3. Transaction logs (WAL) are written synchronously and replicated across AZs for durability.
Fault Tolerance and Disaster Recovery Using S3 as the default storage provides cross‑AZ durability, eliminating the need for three‑copy replication of TiDB nodes. WAL replication enables cross‑region failover, simplifying disaster recovery.
Scaling and Expansion New nodes can asynchronously load data from S3 without disrupting existing services, avoiding costly data rebalancing. Cache size and placement can be tuned to meet latency requirements while controlling costs.
Cache‑Related Trade‑offs Cache miss (cache‑hit‑rate) impacts latency; offering different service tiers with varying cache guarantees allows price differentiation. The architecture sacrifices the flexible TiDB‑TiKV node ratio for predictable performance.
Data Format and HTAP Considerations While the current KV model uses LSM‑Tree, the new design can adopt more compact formats or columnar storage for HTAP workloads, leveraging asynchronous S3 writes and compaction without affecting compute nodes.
Multi‑Tenant Deployment Deploying the combined TiDB‑cache process as a pod in Kubernetes/EKS enables multi‑tenant isolation at the process level, simplifying resource and security isolation.
Conclusion The proposed architecture leverages cloud storage, asynchronous I/O, and caching to achieve lower cost, better scalability, and easier disaster recovery for cloud‑native distributed databases, though practical implementation details will determine its success.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
