Evolution of Vivo's Trillions-Scale Data Architecture: Dual-Active Real-Time and Offline Computing
Vivo’s trillion‑scale data platform evolved into a dual‑active real‑time and offline architecture that combines multi‑datacenter clusters, Kafka/Pulsar caching, a unified sorting layer, HBase‑backed dimension tables, and micro‑batch Spark jobs to deliver low‑cost, high‑performance processing with 99.9% availability and 99.9995% data integrity.
This article, based on Liu Kaizhou’s talk at the 2023 Vivo Developer Conference, summarizes the evolution of Vivo’s foundational data architecture driven by trillion‑scale data growth. It describes how the company built a stable, reliable, low‑cost, high‑performance dual‑active (real‑time + offline) computing architecture to meet business development, data‑quality, and cost challenges.
Key objectives of the foundational data platform include ensuring timely and accurate data, high computational performance, low resource cost, strong disaster‑recovery capability, and high development efficiency.
1. Data growth and challenges
• Daily record volume grew from hundreds of billions to trillions, storage from GB to PB, and concurrent QPS reached millions.
• Computing scenarios expanded from batch to near‑real‑time, real‑time, and stream‑batch integrated models.
• Performance requirements tightened: end‑to‑end latency needed to drop from hours to seconds, while offline batch sizes grew from gigabytes to more than 10 TB per hour, outpacing the rate at which the old architecture could iterate.
2. Architectural practice
The new architecture adopts multi‑datacenter, multi‑cluster, dual‑active disaster‑recovery links, supporting various periodicities (seconds, minutes, hours, days). Offline collection now copies logs directly to HDFS, writes ODS tables, and co‑exists with the real‑time pipeline, enabling failover. A sorting layer aggregates consumption to avoid duplicate processing.
2.1 Dual‑link design
Real‑time data is cached in Kafka or Pulsar for 8‑24 h; offline data is cached in HDFS for 2‑7 days. This separation reduces storage and compute redundancy and improves disaster recovery.
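The two retention tiers can be pictured with a minimal Python sketch (invented for illustration, not Vivo's implementation): each tier evicts anything older than its window, so the real-time cache stays small while the offline cache covers multi-day replay.

```python
# Hypothetical retention sketch: the real-time tier (Kafka/Pulsar) keeps
# records for hours, the offline tier (HDFS) for days.
RETENTION_SECONDS = {
    "realtime": 24 * 3600,       # 8-24 h in the article; 24 h chosen here
    "offline": 7 * 24 * 3600,    # 2-7 days in the article; 7 days chosen here
}

def evict_expired(records, tier, now):
    """Keep only (timestamp, payload) pairs inside the tier's window."""
    window = RETENTION_SECONDS[tier]
    return [(ts, p) for ts, p in records if now - ts <= window]

now = 1_000_000_000
records = [(now - 3600, "fresh"), (now - 2 * 24 * 3600, "two days old")]
rt = evict_expired(records, "realtime", now)   # only the fresh record
off = evict_expired(records, "offline", now)   # both records survive
```

The point of the split is that each record is stored once per purpose: a short hot window for real-time consumers, a longer cold window for offline replay and disaster recovery.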
2.2 Real‑time performance optimization
A unified sorting layer consumes each topic once and routes required subsets to downstream consumers, cutting duplicate consumption and saving more than 30% of resources. To address Redis batch‑scale‑out latency, the team introduced HBase 2.0 as the primary dimension‑table cache for latency‑sensitive workloads, using Redis as a fallback.
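The sorting-layer idea can be sketched in a few lines of Python (consumer names and routing rules are invented here): each record is read once, and only the fields a downstream consumer actually needs are fanned out to it, instead of every consumer re-reading the full topic.

```python
# Hypothetical routing table: downstream consumer -> fields it needs.
ROUTES = {
    "realtime_dashboard": {"event", "ts"},
    "ads_attribution": {"event", "user_id"},
}

def sort_and_route(record: dict) -> dict:
    """Consume a record once and emit per-consumer field subsets."""
    out = {}
    for consumer, fields in ROUTES.items():
        subset = {k: v for k, v in record.items() if k in fields}
        if subset:
            out[consumer] = subset
    return out

record = {"event": "click", "ts": 1700000000, "user_id": "u42", "extra": "x"}
routed = sort_and_route(record)
```

With N downstream consumers, the source topic is consumed once rather than N times, which is where the reported resource saving comes from.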
2.3 Offline performance optimization
The team adopted a micro‑batch model inspired by stream processing: data is written to HDFS at minute granularity, and Spark jobs are triggered when a batch reaches a size threshold (e.g., 1 TB). This reduces per‑job data volume and shuffle overhead.
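The size-threshold trigger described above might look like the following sketch (threshold and class name are hypothetical): minute-granularity files accumulate until the pending batch crosses the threshold, at which point a Spark job would be launched for the accumulated files.

```python
# Hypothetical micro-batch trigger: accumulate minute-level HDFS file
# sizes and fire a batch job once the pending data reaches a threshold.
class MicroBatchTrigger:
    def __init__(self, threshold_bytes: int):
        self.threshold = threshold_bytes
        self.pending = 0      # bytes not yet submitted to a job
        self.triggered = 0    # number of jobs fired so far

    def on_minute_file(self, size_bytes: int) -> bool:
        """Register one minute-granularity file; return True when a
        Spark job should be launched for the accumulated batch."""
        self.pending += size_bytes
        if self.pending >= self.threshold:
            self.pending = 0  # files handed off to the launched job
            self.triggered += 1
            return True
        return False

t = MicroBatchTrigger(threshold_bytes=100)
fired = [t.on_minute_file(s) for s in (40, 40, 40, 90, 20)]
```

Triggering on size rather than on a fixed schedule keeps each job's shuffle volume bounded, which is the stated goal of the micro-batch model.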
Dimension tables were transformed from full snapshots (4–10 billion rows) to incremental tables (millions of rows) stored in HBase, and the join strategy was switched from sort‑merge join to shuffle‑hash join, yielding a performance gain of more than 60%.
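To see why a shuffle-hash join wins once the dimension side shrinks to millions of rows, here is a toy hash join in plain Python (not Spark code, data invented): the small incremental table is loaded into a hash table and the large fact stream probes it, avoiding the full sort of both sides that a sort-merge join requires.

```python
def hash_join(facts, dims, key):
    """Build a hash table on the small (dimension) side, then stream
    the large (fact) side through it -- roughly O(n + m), no sorting."""
    build = {row[key]: row for row in dims}   # small side fits in memory
    for fact in facts:
        dim = build.get(fact[key])
        if dim is not None:                   # inner-join semantics
            yield {**fact, **dim}

dims = [{"uid": 1, "region": "EU"}, {"uid": 2, "region": "US"}]
facts = [{"uid": 1, "clicks": 3}, {"uid": 3, "clicks": 5}]
joined = list(hash_join(facts, dims, "uid"))
```

In Spark SQL, the equivalent switch can be requested explicitly with the `SHUFFLE_HASH` join hint (available since Spark 3.0); whether Vivo used the hint or relied on planner heuristics is not stated in the talk.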
2.4 Data integrity
Three‑layer real‑time reconciliation (ingestion, ETL, component output) combined with multiple validation methods (short‑term year‑over‑year and period‑over‑period comparisons, historical trend algorithms, latency‑drift detection, holiday patterns, and operations‑event windows) raised anomaly‑detection accuracy from 85% to 99% and cut issue‑locating time from days to minutes.
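A period-over-period check is the simplest of these validations, and a minimal sketch (tolerance and function name are hypothetical) shows the shape: compare the current metric against the same period in a previous cycle and flag deviations beyond a tolerance band.

```python
# Hypothetical period-over-period validation: flag a metric whose
# current value drifts from the comparison period by more than a band.
def is_anomalous(current: float, same_period_prior: float,
                 tolerance: float = 0.2) -> bool:
    """Return True when the current/prior ratio leaves the 1 +/- tolerance band."""
    if same_period_prior == 0:
        return current != 0       # any value against a zero baseline is suspect
    ratio = current / same_period_prior
    return abs(ratio - 1.0) > tolerance

ok = is_anomalous(105.0, 100.0)   # 5% drift, within the band
bad = is_anomalous(60.0, 100.0)   # 40% drop, outside the band
```

The article's other methods (historical trend models, latency-drift detection, holiday patterns) layer on top of checks like this to cut false positives from calendar effects.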
3. Summary and outlook
The upgraded dual‑active architecture now supports data volumes from hundreds of billions to tens of trillions, with SLA availability at 99.9 % and data‑integrity SLA at 99.9995 %. Future plans focus on further lake‑warehouse integration, improving offline collection resilience, migrating shuffle to RSS, and enhancing real‑time metadata‑driven operations.
vivo Internet Technology