How Tencent Powers Millions with SET‑Based NoSQL Clusters
Tencent’s operations team explains how its SET‑based NoSQL clusters deliver low latency, high availability, and disaster recovery for billions of users. The talk covers deployment models, synchronization mechanisms, cost‑saving techniques, and the Data‑as‑a‑Service approach that underpins its social platforms.
Speaker Profile
Zhou Xiaojun, a senior operations engineer at Tencent, leads large‑scale NoSQL cluster operations for social services and has over a decade of experience in data centers, cloud computing, and automation.
Topic Overview
Tencent operates three major NoSQL distributed storage systems: Grocery (supporting QQ services), CKV (supporting QQ Space, Cloud, etc.), and Quorum_KV (supporting WeChat services). The data operations team manages CKV and Grocery clusters across thousands of servers in multiple regions.
Deployment Model
NoSQL clusters are organized into SETs (storage warehouses), each a physical unit with four server roles: access machines, storage machines (primary and standby), warehouse management machines, and migration machines.
Each SET can be deployed across racks, data centers, or cities, providing container‑like isolation and disaster‑tolerant capabilities.
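The SET layout described above can be modeled as a small data structure. This is a minimal sketch, not Tencent's actual schema: the names `Role`, `Server`, and `StorageSet`, and the host, rack, and IDC labels in the usage example, are all hypothetical. It shows how one might check that a SET contains every server role and is spread across more than one rack.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    # Server roles named in the talk; storage is split into primary and standby.
    ACCESS = "access"
    STORAGE_PRIMARY = "storage_primary"
    STORAGE_STANDBY = "storage_standby"
    MANAGEMENT = "management"
    MIGRATION = "migration"


@dataclass
class Server:
    host: str
    role: Role
    rack: str
    idc: str


@dataclass
class StorageSet:
    """Hypothetical model of one SET (storage warehouse)."""
    name: str
    servers: list = field(default_factory=list)

    def missing_roles(self):
        # Roles that no server in this SET currently fills.
        return set(Role) - {s.role for s in self.servers}

    def spans_multiple_racks(self):
        # True when servers sit in more than one (IDC, rack) location.
        return len({(s.idc, s.rack) for s in self.servers}) > 1
```

For example, a SET holding only an access machine and a primary/standby storage pair would report the management and migration roles as missing, which a deployment tool could use as a pre-flight check.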
SET‑Based Disaster Recovery Example
During the Tianjin explosion on August 12, 2015, the team migrated over 200 million active users from the affected data center to Shenzhen and Shanghai without user impact, demonstrating the power of SET‑based management and multi‑site synchronization.
Data Synchronization
Business data is synchronized between geographically distributed warehouses via a synchronization center, ensuring consistency and rapid failover when an IDC experiences a catastrophic failure.
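The role of the synchronization center can be illustrated with a toy model. The talk does not describe the actual protocol, so this sketch assumes a simple design: every write carries a version number, the center forwards it to all other warehouses, and each warehouse keeps the highest version it has seen. `Warehouse` and `SyncCenter` are hypothetical names.

```python
class Warehouse:
    """Toy stand-in for a warehouse in one region."""

    def __init__(self, region):
        self.region = region
        self.data = {}  # key -> (version, value)

    def apply(self, key, version, value):
        # Accept the write only if it is newer than what is stored (assumed rule).
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)
            return True
        return False


class SyncCenter:
    """Hypothetical synchronization center that fans writes out to peers."""

    def __init__(self):
        self.warehouses = []

    def register(self, warehouse):
        self.warehouses.append(warehouse)

    def write(self, origin, key, version, value):
        # Apply locally first, then replicate to every other warehouse.
        origin.apply(key, version, value)
        for warehouse in self.warehouses:
            if warehouse is not origin:
                warehouse.apply(key, version, value)
```

Because every registered warehouse holds the same versioned data after a write, traffic can be redirected to another region when one IDC fails, which is the failover property the talk emphasizes.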
Technical Characteristics
Low Cost: Hot data resides in memory and cold data on SSD, so less than 20% of the data is kept in memory.
Scalability: Online, lossless storage expansion with no service disruption.
High Performance: Up to tens of millions of operations per second, with ~1 ms latency on gigabit networks.
Availability >99.95%: Full redundancy, active‑standby failover, cross‑rack deployment.
Data Durability >99.999999%: Multi‑copy storage in memory and on disk, with rollback after a disaster.
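The hot/cold split behind the Low Cost item can be sketched as a two-tier store. The talk does not say how keys are classified as hot or cold, so this example assumes a least-recently-used policy: the memory tier has a fixed capacity, and the coldest key is demoted to a dictionary standing in for SSD. `TieredStore` is a hypothetical name.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: LRU memory tier plus a dict standing in for SSD."""

    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.memory = OrderedDict()  # hot tier, ordered oldest to newest
        self.ssd = {}                # cold tier (assumption: any slower store)

    def put(self, key, value):
        self.ssd.pop(key, None)      # a fresh write makes the key hot again
        self.memory[key] = value
        self.memory.move_to_end(key)
        self._evict()

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        if key in self.ssd:
            value = self.ssd.pop(key)  # promote the cold key back to memory
            self.memory[key] = value
            self._evict()
            return value
        return None

    def _evict(self):
        # Demote least-recently-used keys until memory fits its capacity.
        while len(self.memory) > self.memory_capacity:
            cold_key, cold_value = self.memory.popitem(last=False)
            self.ssd[cold_key] = cold_value
```

With a memory capacity sized at a fraction of the key count, most keys live on the cold tier while recently used keys are served from memory.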
High‑Availability Architecture
Reliability: Primary‑standby redundancy with automatic failover and migration services.
Geographic Disaster Recovery: Multi‑region deployment ensures continuous service.
Strong Consistency: The primary handles all reads and writes; the standby serves as a disaster backup and, if the primary fails, takes over in read‑only mode.
Warehouse Cluster Mechanism: Standardized deployment, automatic capacity scaling, and adaptive service availability.
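The primary/standby behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the production mechanism: replication to the standby is modeled as a synchronous copy, failure detection is a manual `fail_primary()` call, and `ReplicaPair` and `ReadOnlyError` are hypothetical names.

```python
class ReadOnlyError(Exception):
    """Raised when a write arrives while only the standby is serving."""


class ReplicaPair:
    """Toy primary/standby pair with read-only fallback."""

    def __init__(self):
        self.primary = {}
        self.standby = {}
        self.primary_alive = True

    def write(self, key, value):
        if not self.primary_alive:
            raise ReadOnlyError("primary is down; standby serves reads only")
        self.primary[key] = value
        self.standby[key] = value  # assumed synchronous copy to the standby

    def read(self, key):
        # Reads go to the primary while it is healthy, else to the standby.
        source = self.primary if self.primary_alive else self.standby
        return source.get(key)

    def fail_primary(self):
        # Stand-in for real failure detection.
        self.primary_alive = False
```

After `fail_primary()`, reads keep returning the last replicated value while writes are rejected, which matches the read-only switch described in the Strong Consistency item.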
Data‑as‑a‑Service (DaaS) Vision
The data center adopts an IaaS model, transforming compute, storage, and networking into pooled resources. For the data layer, the goal is DaaS—delivering data as a service to users.
Building a Scalable Distributed Database
Storage resources are pooled and managed as fixed units. An intelligent scheduling system scales each business dynamically from 1 GB to 10 TB, expands memory within minutes, and targets four‑nines availability and 2 ms latency, all without manual intervention.
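Managing pooled storage as fixed units can be sketched as simple bookkeeping. The unit size is not given in the talk, so `UNIT_GB = 1` is an assumption, as are the `StoragePool` class and its `resize` method. The sketch rounds each request up to whole units and moves units between the free pool and a business as its quota grows or shrinks.

```python
import math

UNIT_GB = 1  # hypothetical size of one fixed storage unit


class StoragePool:
    """Toy pool that hands out fixed-size storage units to businesses."""

    def __init__(self, total_units):
        self.free_units = total_units
        self.allocated = {}  # business_id -> units currently held

    def resize(self, business_id, requested_gb):
        # Round up to whole units, then take or return the difference.
        needed = math.ceil(requested_gb / UNIT_GB)
        delta = needed - self.allocated.get(business_id, 0)
        if delta > self.free_units:
            raise RuntimeError("pool exhausted")
        self.free_units -= delta
        self.allocated[business_id] = needed
        return needed
```

A scheduler could call `resize` whenever a business crosses a usage threshold; shrinking a quota returns units to the pool for other businesses.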
Operations as a Service
Self‑service onboarding: Automated business ID creation, table space provisioning, and lifecycle management.
Automated deployment: Package installation and one‑click provisioning across racks.
Elastic scaling: Automatic adjustment of storage proxies and allocation based on usage.
Water‑level scheduling: Automatic traffic flow between access clusters and storage blocks.
Comprehensive reporting: Access trends, storage trends, hot‑cold distribution, and load metrics.
Multi‑protocol support: Private, Redis, and Memcache protocols.
Cost allocation: Monthly accounting by request volume and storage usage for transparent billing.
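The cost allocation item above amounts to a two-part monthly charge. The sketch below assumes a linear pricing model; both unit prices and the `Usage` record are hypothetical and chosen only to make the arithmetic visible.

```python
from dataclasses import dataclass

# Hypothetical unit prices; the talk gives no actual rates.
PRICE_PER_MILLION_REQUESTS = 2.0
PRICE_PER_GB_MONTH = 0.5


@dataclass
class Usage:
    business_id: str
    requests: int       # total requests in the month
    storage_gb: float   # storage used, in GB


def monthly_bill(usage):
    # Charge = request volume component + storage component.
    request_cost = usage.requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    storage_cost = usage.storage_gb * PRICE_PER_GB_MONTH
    return round(request_cost + storage_cost, 2)
```

Under these assumed prices, a business with 3 million requests and 10 GB of storage is charged 6.0 for requests plus 5.0 for storage.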
Cost Optimization Strategies
Use access density (requests per GB) as a measurable cost metric.
Regularly defragment storage blocks to improve space efficiency.
Tiered storage: hot keys in memory, cold keys on SSD, reducing memory footprint.
Repurpose standby machines for container workloads, maximizing resource utilization.
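Access density, as defined in the first item above, is requests divided by storage in GB. The following sketch computes it and ranks businesses from lowest to highest density. Treating low-density businesses as candidates for the cold tier is an assumption made for illustration, and the function names are hypothetical.

```python
def access_density(requests, storage_gb):
    """Requests per GB of storage over the same time window."""
    if storage_gb <= 0:
        raise ValueError("storage_gb must be positive")
    return requests / storage_gb


def rank_by_density(usage):
    """Return business names sorted from lowest to highest access density.

    usage maps a business name to a (requests, storage_gb) pair.
    """
    return sorted(usage, key=lambda name: access_density(*usage[name]))
```

Businesses at the front of the ranking store a lot of data per request served, so they are where moving data from memory to SSD is most likely to reduce cost.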
Nature of the Operations Team
The collaboration between R&D and DBAs resembles car making: R&D builds the car and writes its manual, and the DBA tunes and maintains it for optimal performance.
The operations team works closely with R&D to refine database engines for easier maintenance, higher performance, and broader feature support, continuously pursuing excellence in cost, security, quality, and efficiency.
Team members include development engineers (CI pipelines, APIs, code review) and operations engineers (product‑level development and task execution). A rich set of APIs—authentication, workflow engine, CMDB, monitoring, logging, package installation—greatly boosts tool development efficiency.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
