How REDtao Scaled Xiaohongshu’s Social Graph to Trillions of Edges
Xiaohongshu built the REDtao graph storage system to handle a trillion‑scale social graph, replacing MySQL with a three‑layer architecture, custom graph APIs, high‑availability caches, cross‑cloud multi‑active deployment, and cloud‑native operators, achieving over 90% cache hit rate and dramatic cost savings.
Background
Xiaohongshu is a youth‑focused lifestyle platform whose social graph contains billions of users, notes, products and their relationships. The existing MySQL‑based storage could not keep up with the read‑heavy workload, reaching 55% CPU at only a few million requests per second and requiring costly scaling.
In early 2021 the team launched a from‑scratch project to build REDtao, a graph storage system inspired by Facebook’s TAO, to provide a unified graph query API, high performance, and lower operational cost.
Graph Model and API
Relationships are stored as <FromId, AssocType, ToId> → Value (JSON). For example, a "follow" edge from user A to user B is represented as a triple with a JSON payload:
<FromId: A_ID, AssocType: follow, ToId: B_ID> → Value (JSON fields)
Twenty‑five graph‑semantic APIs were exposed, covering CRUD operations and anti‑fraud filters. Typical usage examples include:
getAssocs("followed", userAId, offset, limit, onlyNormalUsers, orderDesc) – fetch the normal users following A, paged by offset and limit.
getAssocCount("followed", userAId, onlyNormalUsers) – count A’s followers while excluding cheating accounts.
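The edge model and the two calls above can be illustrated with a toy in-memory store. This is only a sketch under assumptions: the class AssocStore, the "ts" payload field used for ordering, and the flag_user helper standing in for the anti-fraud filter are all hypothetical, and the real REDtao API signatures may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Assoc:
    from_id: int
    assoc_type: str
    to_id: int
    value: dict  # JSON-like payload; "ts" is an assumed timestamp field


class AssocStore:
    """Toy store keyed by the (from_id, assoc_type, to_id) triple."""

    def __init__(self) -> None:
        self._edges: Dict[Tuple[int, str, int], Assoc] = {}
        self._flagged: Set[int] = set()  # hypothetical anti-fraud list

    def add_assoc(self, from_id: int, assoc_type: str, to_id: int, value: dict) -> None:
        self._edges[(from_id, assoc_type, to_id)] = Assoc(from_id, assoc_type, to_id, value)

    def flag_user(self, user_id: int) -> None:
        self._flagged.add(user_id)

    def _select(self, assoc_type: str, from_id: int, only_normal_users: bool) -> List[Assoc]:
        edges = [e for e in self._edges.values()
                 if e.from_id == from_id and e.assoc_type == assoc_type]
        if only_normal_users:
            edges = [e for e in edges if e.to_id not in self._flagged]
        return edges

    def get_assocs(self, assoc_type: str, from_id: int, offset: int = 0, limit: int = 20,
                   only_normal_users: bool = False, order_desc: bool = True) -> List[Assoc]:
        edges = self._select(assoc_type, from_id, only_normal_users)
        edges.sort(key=lambda e: e.value.get("ts", 0), reverse=order_desc)
        return edges[offset:offset + limit]

    def get_assoc_count(self, assoc_type: str, from_id: int,
                        only_normal_users: bool = False) -> int:
        return len(self._select(assoc_type, from_id, only_normal_users))


# Users 2, 3 and 4 follow user 1; user 3 is flagged as a cheating account.
store = AssocStore()
store.add_assoc(1, "followed", 2, {"ts": 100})
store.add_assoc(1, "followed", 3, {"ts": 200})
store.add_assoc(1, "followed", 4, {"ts": 300})
store.flag_user(3)

follower_ids = [e.to_id for e in store.get_assocs("followed", 1, only_normal_users=True)]
normal_count = store.get_assoc_count("followed", 1, only_normal_users=True)
```

Here the "followed" edge points from the followed user to the follower, which matches the way the examples above pass userAId as the source of the query.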
Architecture Design
REDtao follows a three‑layer design: an access layer (SDK), a distributed cache layer, and a persistent MySQL layer. The cache layer is an independent cluster, decoupled from storage, allowing independent scaling and plug‑and‑play replacement of the MySQL backend.
Read flow: The client sends a request to a router, which hashes the edge triple to a follower node. The follower checks its local cache; on a miss it forwards the request to the leader, which queries MySQL only if its own cache also misses.
Write flow: Writes follow the same routing to a follower, then to the leader, which writes to MySQL, invalidates the corresponding cache key, and propagates the invalidation to all followers.
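The read and write flows described above can be sketched as follows. This is a simplified, single-process illustration rather than REDtao's implementation: Storage is a dictionary standing in for MySQL, the route function uses a CRC32 hash chosen only for the example, and all class names are hypothetical.

```python
import zlib
from typing import Dict, List, Optional, Tuple

Key = Tuple[int, str, int]  # (from_id, assoc_type, to_id)


class Storage:
    """Stand-in for the MySQL layer; counts reads so cache effects are visible."""

    def __init__(self) -> None:
        self.rows: Dict[Key, dict] = {}
        self.reads = 0

    def get(self, key: Key) -> Optional[dict]:
        self.reads += 1
        return self.rows.get(key)

    def put(self, key: Key, value: dict) -> None:
        self.rows[key] = value


class Leader:
    def __init__(self, storage: Storage) -> None:
        self.storage = storage
        self.cache: Dict[Key, dict] = {}
        self.followers: List["Follower"] = []

    def read(self, key: Key) -> Optional[dict]:
        if key in self.cache:
            return self.cache[key]
        value = self.storage.get(key)  # leader miss: fall through to storage
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key: Key, value: dict) -> None:
        self.storage.put(key, value)   # 1. persist
        self.cache.pop(key, None)      # 2. invalidate the leader's copy
        for follower in self.followers:
            follower.invalidate(key)   # 3. propagate the invalidation


class Follower:
    def __init__(self, leader: Leader) -> None:
        self.leader = leader
        self.cache: Dict[Key, dict] = {}
        leader.followers.append(self)

    def read(self, key: Key) -> Optional[dict]:
        if key in self.cache:
            return self.cache[key]
        value = self.leader.read(key)  # follower miss: ask the leader
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key: Key, value: dict) -> None:
        self.leader.write(key, value)

    def invalidate(self, key: Key) -> None:
        self.cache.pop(key, None)


def route(key: Key, followers: List[Follower]) -> Follower:
    """Pick a follower by hashing the edge triple (hash choice is illustrative)."""
    return followers[zlib.crc32(repr(key).encode()) % len(followers)]


storage = Storage()
leader = Leader(storage)
followers = [Follower(leader) for _ in range(3)]
key = (1, "follow", 2)
node = route(key, followers)

node.write(key, {"ts": 1})
first = node.read(key)    # misses both caches, one storage read
second = node.read(key)   # served from the follower cache
node.write(key, {"ts": 2})
third = node.read(key)    # caches were invalidated, second storage read
```

After the second write, the invalidation forces the next read back to storage, so the follower never serves the stale {"ts": 1} value.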
High Availability
Both cache and storage layers are built as independent two‑tier clusters with leader/follower replication. Automatic fault detection, horizontal scaling, and cache‑only operation during storage failures ensure continuous service.
Rate‑limiting protects MySQL from cache‑miss storms, and a global version number per write prevents write‑conflict anomalies.
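One common way to cap the load that cache misses place on a database is a token bucket. The article does not say which algorithm REDtao uses, so the sketch below is an assumption for illustration only; TokenBucket and load_on_miss are hypothetical names, and the clock is injectable so the behavior can be checked deterministically.

```python
import time
from typing import Callable, Optional


class TokenBucket:
    """Allow at most `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def load_on_miss(key, bucket: TokenBucket,
                 query_storage: Callable[[object], dict]) -> Optional[dict]:
    """On a cache miss, query storage only if the limiter permits it."""
    if not bucket.allow():
        return None  # caller can retry later or return a degraded response
    return query_storage(key)


# Deterministic demo using a fake clock.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]  # third call exceeds the burst size
fake_now[0] = 1.0                           # one second later the bucket refills
after_refill = bucket.allow()
```

Rejecting excess misses at the cache layer keeps a sudden wave of cold keys from turning into an equally sudden wave of database queries.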
Performance
REDtao uses a three‑level nested hash table (from‑id → type → to‑id) with local secondary indexes and a time‑ordered list limited to the newest 1,000 edges per point, achieving high cache hit rates and low latency.
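The nested structure described above can be approximated in a few lines. This is a minimal sketch, not REDtao's data structure: it assumes the newest-edge limit applies per (from_id, assoc_type) pair and that older edges are evicted from the cache, neither of which the article states explicitly, and it omits the local secondary indexes. EdgeCache and its methods are hypothetical names.

```python
from collections import defaultdict, deque
from itertools import islice
from typing import List, Optional

MAX_RECENT = 1000  # limit taken from the text; its exact scope is an assumption


class EdgeCache:
    def __init__(self, max_recent: int = MAX_RECENT) -> None:
        # Three-level hash: from_id -> assoc_type -> to_id -> value
        self.edges = defaultdict(lambda: defaultdict(dict))
        # (from_id, assoc_type) -> to_ids ordered newest first, bounded in length
        self.recent = defaultdict(lambda: deque(maxlen=max_recent))

    def put(self, from_id: int, assoc_type: str, to_id: int, value: dict) -> None:
        by_to = self.edges[from_id][assoc_type]
        recent = self.recent[(from_id, assoc_type)]
        if to_id in by_to:
            recent.remove(to_id)       # re-inserted below as the newest entry
        elif len(recent) == recent.maxlen:
            del by_to[recent[-1]]      # drop the oldest edge from the hash too
        recent.appendleft(to_id)       # a full deque discards its right end
        by_to[to_id] = value

    def get(self, from_id: int, assoc_type: str, to_id: int) -> Optional[dict]:
        # Plain .get() calls avoid creating empty entries on lookups.
        return self.edges.get(from_id, {}).get(assoc_type, {}).get(to_id)

    def newest(self, from_id: int, assoc_type: str, limit: int) -> List[int]:
        return list(islice(self.recent.get((from_id, assoc_type), ()), limit))


# Demo with a tiny limit so eviction is visible.
cache = EdgeCache(max_recent=3)
for follower_id in [10, 11, 12, 13]:
    cache.put(1, "followed", follower_id, {"ts": follower_id})

latest_two = cache.newest(1, "followed", 2)
evicted = cache.get(1, "followed", 10)
present = cache.get(1, "followed", 13)
```

Point lookups go through the hash levels in constant time, while "latest followers" style queries read straight from the bounded list without sorting.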
In production, a 16‑core VM serves 1.5 million queries per second at only 22.5% CPU usage: a single node handles about 30k RPCs per second, and each RPC aggregates roughly 50 queries (30,000 × 50 = 1.5 million).
Ease of Use
All 25 APIs abstract away SQL, providing a consistent programming model. A unified access URL hides the underlying cluster topology; the SDK routes requests based on edge type to the appropriate REDtao cluster.
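Routing by edge type can be pictured as a simple lookup inside the SDK. The sketch below is illustrative only: ClusterRouter, the cluster names, and the fallback to a default cluster are assumptions, not details from the article.

```python
from typing import Dict


class ClusterRouter:
    """Map an edge type to the cluster that owns it, hiding topology from callers."""

    def __init__(self, routes: Dict[str, str], default_cluster: str) -> None:
        self._routes = dict(routes)
        self._default = default_cluster

    def cluster_for(self, assoc_type: str) -> str:
        return self._routes.get(assoc_type, self._default)


# Hypothetical cluster names used purely for the example.
router = ClusterRouter(
    routes={"follow": "redtao-social", "followed": "redtao-social", "like": "redtao-notes"},
    default_cluster="redtao-default",
)
follow_cluster = router.cluster_for("follow")
unknown_cluster = router.cluster_for("collect")
```

Because callers only name the edge type, an edge type can be moved to a different cluster by changing the routing table rather than application code.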
Data Consistency
Writes generate a globally increasing version; cache updates compare versions to avoid stale overwrites. For strong‑consistency reads, clients can flag requests to be routed to the MySQL master.
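The version comparison can be sketched as a cache that accepts an update only if it carries a newer version than the entry it would replace. VersionedCache is a hypothetical name, and itertools.count stands in for whatever mechanism actually issues the globally increasing version.

```python
import itertools
from typing import Dict, Optional, Tuple

# Stand-in for the global version generator (an assumption for this sketch).
version_counter = itertools.count(1)


class VersionedCache:
    def __init__(self) -> None:
        self._entries: Dict[tuple, Tuple[int, dict]] = {}  # key -> (version, value)

    def apply(self, key: tuple, version: int, value: dict) -> bool:
        """Store the update unless an equal or newer version is already cached."""
        current = self._entries.get(key)
        if current is not None and current[0] >= version:
            return False  # stale update arrived late; keep the newer value
        self._entries[key] = (version, value)
        return True

    def get(self, key: tuple) -> Optional[dict]:
        entry = self._entries.get(key)
        return entry[1] if entry is not None else None


# Two writes to the same edge whose cache updates arrive out of order.
edge = (1, "follow", 2)
v_follow = next(version_counter)    # earlier write
v_unfollow = next(version_counter)  # later write

versioned = VersionedCache()
applied_new = versioned.apply(edge, v_unfollow, {"state": "unfollowed"})
applied_old = versioned.apply(edge, v_follow, {"state": "followed"})
final_value = versioned.get(edge)
```

Without the comparison, the delayed "followed" update would overwrite the later "unfollowed" state and the cache would stay wrong until the key was next invalidated.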
Cross‑Cloud Multi‑Active
REDtao replicates MySQL binlogs across clouds for persistence and uses a DTS‑based subscription to invalidate caches, ensuring eventual consistency while allowing reads from any region.
Cloud‑Native Features
REDtao runs on Kubernetes with a custom Operator that creates a DuplicateSet resource to control shard placement, supports rolling upgrades, and automatically replaces failed pods.
Seamless Migration from Legacy MySQL
Migration was performed in stages, with low‑priority services moved first. A Tao Proxy SDK performed dual writes and dual reads and validated that both stores returned the same data. After DTS‑based incremental sync had caught up, the SDK switched to reading exclusively from REDtao, and final consistency checks were run against the binlogs.
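The dual-write and dual-read stage can be sketched as a thin proxy in front of both stores. This is an illustration under assumptions, not the Tao Proxy SDK itself: MigrationProxy, the plain dictionaries standing in for the two backends, and the mismatch list are all hypothetical.

```python
from typing import Dict, List, Optional


class MigrationProxy:
    """Write to both stores; read from one while comparing against the other."""

    def __init__(self, legacy: Dict[tuple, dict], redtao: Dict[tuple, dict]) -> None:
        self.legacy = legacy
        self.redtao = redtao
        self.read_from_redtao = False
        self.mismatches: List[tuple] = []

    def write(self, key: tuple, value: dict) -> None:
        self.legacy[key] = value
        self.redtao[key] = value

    def read(self, key: tuple) -> Optional[dict]:
        if self.read_from_redtao:
            return self.redtao.get(key)
        old, new = self.legacy.get(key), self.redtao.get(key)
        if old != new:
            self.mismatches.append(key)  # record for later validation and repair
        return old                       # legacy remains the source of truth

    def cut_over(self) -> None:
        self.read_from_redtao = True


edge = ("a", "follow", "b")
legacy_store = {edge: {"ts": 1}}  # existing data not yet synced to the new store
redtao_store: Dict[tuple, dict] = {}
proxy = MigrationProxy(legacy_store, redtao_store)

before_sync = proxy.read(edge)    # served from legacy, mismatch recorded
proxy.write(edge, {"ts": 2})      # dual write brings both stores in line
during_dual = proxy.read(edge)
proxy.cut_over()
after_cutover = proxy.read(edge)  # now served from the new store
```

Serving reads from the legacy store until the mismatch list stays empty is what lets the cut-over happen without callers seeing inconsistent data.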
The migration completed in early 2022 without downtime, moving trillions of edges and achieving a 21.3% cost reduction.
Results and Benefits
Post‑deployment, cache hit rate exceeds 90%, MySQL QPS drops by over 70%, and CPU usage falls dramatically. Overall infrastructure cost grew only ~15% while handling a 2.5× increase in request volume.
Conclusion and Outlook
REDtao demonstrates that a purpose‑built graph storage system can replace a massive MySQL deployment, delivering high performance, high availability, and cloud‑native operability. Future work includes merging REDgraph with REDtao into a unified graph database to serve broader internal scenarios.
dbaplus Community
