Evolution of Vivo's Trillions-Scale Data Architecture: Dual-Active Real-Time and Offline Computing
Vivo’s trillion‑scale data platform evolved into a dual‑active real‑time and offline architecture that combines multi‑datacenter clusters, Kafka/Pulsar caching, a unified sorting layer, HBase‑backed dimension tables, and micro‑batch Spark jobs to deliver low‑cost, high‑performance processing with 99.9% availability and 99.9995% data integrity.
This article, based on Liu Kaizhou’s talk at the 2023 Vivo Developer Conference, summarizes the evolution of Vivo’s foundational data architecture driven by trillion‑scale data growth. It describes how the company built a stable, reliable, low‑cost, high‑performance dual‑active (real‑time + offline) computing architecture to meet business development, data‑quality, and cost challenges.
Key objectives of the foundational data platform include ensuring timely and accurate data, high computational performance, low resource cost, strong disaster‑recovery capability, and high development efficiency.
1. Data growth and challenges
• Daily record volume grew from hundreds of billions to trillions, storage from GB to PB, and concurrent QPS reached millions.
• Computing scenarios expanded from batch to near‑real‑time, real‑time, and stream‑batch integrated models.
• Performance requirements tightened: end‑to‑end latency needed to drop from hours to seconds, while offline batch sizes grew from gigabytes to more than 10 TB per hour, outpacing the rate at which the old architecture could iterate.
2. Architectural practice
The new architecture adopts multi‑datacenter, multi‑cluster, dual‑active disaster‑recovery links, supporting various periodicities (seconds, minutes, hours, days). Offline collection now copies logs directly to HDFS, writes ODS tables, and co‑exists with the real‑time pipeline, enabling failover. A sorting layer aggregates consumption to avoid duplicate processing.
2.1 Dual‑link design
Real‑time data is cached in Kafka or Pulsar for 8‑24 h; offline data is cached in HDFS for 2‑7 days. This separation reduces storage and compute redundancy and improves disaster recovery.
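The two retention tiers can be pictured with a minimal Python sketch (invented for illustration, not Vivo's implementation): each tier evicts anything older than its window, so the real-time cache stays small while the offline cache covers multi-day replay.

```python
# Hypothetical retention sketch: the real-time tier (Kafka/Pulsar) keeps
# records for hours, the offline tier (HDFS) for days.
RETENTION_SECONDS = {
    "realtime": 24 * 3600,       # 8-24 h in the article; 24 h chosen here
    "offline": 7 * 24 * 3600,    # 2-7 days in the article; 7 days chosen here
}

def evict_expired(records, tier, now):
    """Keep only (timestamp, payload) pairs inside the tier's window."""
    window = RETENTION_SECONDS[tier]
    return [(ts, p) for ts, p in records if now - ts <= window]

now = 1_000_000_000
records = [(now - 3600, "fresh"), (now - 2 * 24 * 3600, "two days old")]
rt = evict_expired(records, "realtime", now)   # only the fresh record
off = evict_expired(records, "offline", now)   # both records survive
```

The point of the split is that each record is stored once per purpose: a short hot window for real-time consumers, a longer cold window for offline replay and disaster recovery.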
2.2 Real‑time performance optimization
A unified sorting layer consumes each topic once and routes required subsets to downstream consumers, cutting duplicate consumption and saving more than 30% of resources. To address Redis batch‑scale‑out latency, the team introduced HBase 2.0 as the primary dimension‑table cache for latency‑sensitive workloads, using Redis as a fallback.
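The sorting-layer idea can be sketched in a few lines of Python (consumer names and routing rules are invented here): each record is read once, and only the fields a downstream consumer actually needs are fanned out to it, instead of every consumer re-reading the full topic.

```python
# Hypothetical routing table: downstream consumer -> fields it needs.
ROUTES = {
    "realtime_dashboard": {"event", "ts"},
    "ads_attribution": {"event", "user_id"},
}

def sort_and_route(record: dict) -> dict:
    """Consume a record once and emit per-consumer field subsets."""
    out = {}
    for consumer, fields in ROUTES.items():
        subset = {k: v for k, v in record.items() if k in fields}
        if subset:
            out[consumer] = subset
    return out

record = {"event": "click", "ts": 1700000000, "user_id": "u42", "extra": "x"}
routed = sort_and_route(record)
```

With N downstream consumers, the source topic is consumed once rather than N times, which is where the reported resource saving comes from.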
2.3 Offline performance optimization
The team adopted a micro‑batch model inspired by stream processing: data is written to HDFS at minute granularity, and Spark jobs are triggered when a batch reaches a size threshold (e.g., 1 TB). This reduces per‑job data volume and shuffle overhead.
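The size-threshold trigger described above might look like the following sketch (threshold and class name are hypothetical): minute-granularity files accumulate until the pending batch crosses the threshold, at which point a Spark job would be launched for the accumulated files.

```python
# Hypothetical micro-batch trigger: accumulate minute-level HDFS file
# sizes and fire a batch job once the pending data reaches a threshold.
class MicroBatchTrigger:
    def __init__(self, threshold_bytes: int):
        self.threshold = threshold_bytes
        self.pending = 0      # bytes not yet submitted to a job
        self.triggered = 0    # number of jobs fired so far

    def on_minute_file(self, size_bytes: int) -> bool:
        """Register one minute-granularity file; return True when a
        Spark job should be launched for the accumulated batch."""
        self.pending += size_bytes
        if self.pending >= self.threshold:
            self.pending = 0  # files handed off to the launched job
            self.triggered += 1
            return True
        return False

t = MicroBatchTrigger(threshold_bytes=100)
fired = [t.on_minute_file(s) for s in (40, 40, 40, 90, 20)]
```

Triggering on size rather than on a fixed schedule keeps each job's shuffle volume bounded, which is the stated goal of the micro-batch model.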
Dimension tables were transformed from full snapshots (4–10 billion rows) to incremental tables (millions of rows) stored in HBase, and the join strategy was switched from sort‑merge join to shuffle‑hash join, yielding a performance gain of more than 60%.
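To see why a shuffle-hash join wins once the dimension side shrinks to millions of rows, here is a toy hash join in plain Python (not Spark code, data invented): the small incremental table is loaded into a hash table and the large fact stream probes it, avoiding the full sort of both sides that a sort-merge join requires.

```python
def hash_join(facts, dims, key):
    """Build a hash table on the small (dimension) side, then stream
    the large (fact) side through it -- roughly O(n + m), no sorting."""
    build = {row[key]: row for row in dims}   # small side fits in memory
    for fact in facts:
        dim = build.get(fact[key])
        if dim is not None:                   # inner-join semantics
            yield {**fact, **dim}

dims = [{"uid": 1, "region": "EU"}, {"uid": 2, "region": "US"}]
facts = [{"uid": 1, "clicks": 3}, {"uid": 3, "clicks": 5}]
joined = list(hash_join(facts, dims, "uid"))
```

In Spark SQL, the equivalent switch can be requested explicitly with the `SHUFFLE_HASH` join hint (available since Spark 3.0); whether Vivo used the hint or relied on planner heuristics is not stated in the talk.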
2.4 Data integrity
Three‑layer real‑time reconciliation (ingestion, ETL, component output) combined with multiple validation methods (short‑term year‑over‑year and period‑over‑period comparisons, historical trend algorithms, latency‑drift detection, holiday patterns, and operations‑event windows) raised anomaly‑detection accuracy from 85% to 99% and cut issue‑locating time from days to minutes.
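A period-over-period check is the simplest of these validations, and a minimal sketch (tolerance and function name are hypothetical) shows the shape: compare the current metric against the same period in a previous cycle and flag deviations beyond a tolerance band.

```python
# Hypothetical period-over-period validation: flag a metric whose
# current value drifts from the comparison period by more than a band.
def is_anomalous(current: float, same_period_prior: float,
                 tolerance: float = 0.2) -> bool:
    """Return True when the current/prior ratio leaves the 1 +/- tolerance band."""
    if same_period_prior == 0:
        return current != 0       # any value against a zero baseline is suspect
    ratio = current / same_period_prior
    return abs(ratio - 1.0) > tolerance

ok = is_anomalous(105.0, 100.0)   # 5% drift, within the band
bad = is_anomalous(60.0, 100.0)   # 40% drop, outside the band
```

The article's other methods (historical trend models, latency-drift detection, holiday patterns) layer on top of checks like this to cut false positives from calendar effects.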
3. Summary and outlook
The upgraded dual‑active architecture now supports data volumes from hundreds of billions to tens of trillions, with SLA availability at 99.9 % and data‑integrity SLA at 99.9995 %. Future plans focus on further lake‑warehouse integration, improving offline collection resilience, migrating shuffle to RSS, and enhancing real‑time metadata‑driven operations.
vivo Internet Technology