Operations 21 min read

How Tencent Scaled Social Data Storage While Cutting Costs

Facing massive user growth, Tencent’s social network team redesigned its KV storage architecture—introducing CKV and Grocery, automating capacity planning, data migration, and backup reuse—to dramatically lower costs, improve operational efficiency, and maintain high service quality across millions of devices.

Efficient Ops

Dec 26, 2016

How Tencent Scaled Social Data Storage While Cutting Costs

Introduction

Social services at Tencent are a typical massive‑user, high‑concurrency platform. Since 2007 the SNG team has evolved from MySQL, Memcache, and custom C4a to self‑built NoSQL stores such as CKV and Grocery to meet low‑latency, high‑availability requirements.

Today the social business runs on nearly 20,000 storage devices serving QQ, Qzone, Photo, Music, and Membership, with a peak QPS close to 100 million. Roughly 10 % of the devices are relational stores and 90 % are NoSQL distributed stores.

Architecture Overview

CKV and Grocery are high‑performance KV distributed stores. Data is sharded to storage machines and addressed via a routing table. Each storage machine has a primary (read/write) and a standby (hot‑backup). The cluster consists of management nodes, access nodes, and storage nodes.

Access nodes handle routing and request forwarding; storage nodes hold the data; management nodes provide central configuration. Dynamic routing changes and fast data migration enable loss‑less scaling.

Background and Challenges

Rapid growth increased storage devices to over ten thousand by 2012, overwhelming manual operations such as cluster building, machine onboarding, capacity expansion, and fault handling. Four major problems emerged:

High cost : No lifecycle management, low utilization of standby machines, and mixed hot‑cold workloads.

Low operational efficiency : Manual table scaling (≈100 operations per day) and complex data migration workflows.

Service‑quality impact from failures : Frequent master‑standby switchovers, long migration times, and high alert volume.

Poor ops tooling : Existing tools were not designed from an ops perspective.

Cost Optimization

Goals include automatic data lifecycle scheduling, defining low‑load metrics for distributed storage, reusing idle standby resources, and making storage usage transparent.

We introduced a two‑tier storage model (memory cache + SSD) with LRU‑based hot‑data promotion and automatic migration of cold records to SSD. Access‑density metric (QPS per GB) guides capacity planning; optimal targets are 600‑800 QPS/GB for memory and 40‑50 QPS/GB for SSD.

For a storage machine offering 50 GB and a max QPS of 50 000, the access‑density limit is 1 000 operations/GB.

After deployment, 80 % of cold records moved to SSD, saving tens of thousands of memory devices and raising average access density from 200 to 600.

Standby Machine Reuse

Idle standby machines are virtualized into VMs using CAE, providing compute resources for non‑real‑time workloads. In master‑standby failover scenarios, the VMs are automatically destroyed and resources reclaimed.

Challenges include kernel incompatibility, limited memory (max 4 GB per VM), and ensuring automatic VM teardown after failover.

Memory Fragmentation Management

CKV partitions a large shared memory segment into fixed‑size blocks (default 86 bytes). A free‑list chain tracks available blocks. When block size mismatches record length, fragmentation rises.

By sampling live records and applying an 80 % middle‑range packing algorithm, we derived optimal block sizes, migrated data, and freed unused blocks, reducing fragmentation by 5 % and saving hundreds of machines.

Data‑Ops Efficiency

Automation includes:

Auto scaling : Capacity triggers expansion at 93 % usage and contraction at 70 %, keeping storage at ~85 % occupancy.

High‑throughput data migration : Multi‑process concurrent migration with customizable throttling, cutting migration time by half.

Load auto‑balancing : Scheduler redistributes hot workloads across machines based on real‑time resource metrics.

Migration strategies include machine‑to‑machine swaps, intermediate relays, and scatter‑gather distribution.

Service Quality Improvements

Key quality issues stem from master/standby failures, pipeline congestion, kernel latency spikes, hotspot reads, network problems, process restarts, connection exhaustion, high load, NIC downgrades, and empty queries.

Mitigations include faster master‑standby switchover, automated data migration, enhanced sync processes, multi‑process engine redesign, and a fault‑self‑healing system that detects anomalies, triggers automated remediation, and escalates only when necessary.

DaaS Service Capability

The platform now offers Data‑as‑a‑Service, providing lifecycle management, on‑demand provisioning, automatic tiering, isolation, and self‑healing. This enables product teams to focus on core innovation while the underlying storage, monitoring, and scaling are handled autonomously.

These capabilities have been extended to Tencent Cloud’s CRS (Cloud Redis Store) for external customers.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Automation Operations scalability cost optimization storage

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.