Vivo Real-Time Computing Platform: Architecture, Practices, and Applications
The Vivo Real‑Time Computing Platform, built on Apache Flink, is a one‑stop data construction and governance solution that processes up to 5 PB of data daily. It provides highly available submission and control services, robust stability, rich SQL usability, efficient Kubernetes deployment, and strong security, and it powers real‑time data warehouses and short‑video recommendation. Future work targets elastic scaling and lake‑house unification.
Author: vivo Internet Real-Time Computing Team – Chen Tao.
This article is compiled from Chen Tao’s presentation at the 2022 vivo Developer Conference. The Vivo Real-Time Computing Platform is a one‑stop data construction and governance platform built on Apache Flink, covering real‑time stream ingestion, development, deployment, operation, and maintenance.
In 2022, vivo Internet had 280 million users, with several apps reaching tens of millions or even over 100 million daily active users. To support this scale, the platform processes up to 5 PB of data per day, running more than 4,000 tasks across 98 projects, with year‑over‑year growth exceeding 100%.
The platform’s construction timeline:
Late 2019 – Initiated platform building.
2020 – Focused on stability, launched initial SQL capabilities.
2021 – Integrated Flink 1.13 and began containerization.
2022 – Emphasized efficiency (stream‑batch integration, task diagnostics).
Core Services
The backend architecture includes two core services:
SubmissionServer: Handles job submission and interacts with the resource manager, offering high availability, scalability, multi‑version Flink support, and various job types.
ControlServer: Maintains task runtime states via a built‑in state machine with nine defined states, providing second‑level state‑update latency.
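The article says ControlServer tracks task lifecycles through a state machine with nine states but does not enumerate them. A minimal sketch of such a machine, with hypothetical state names and transition rules, might look like:

```python
from enum import Enum

class TaskState(Enum):
    # Nine hypothetical states; the article does not list the actual ones.
    CREATED = "created"
    SUBMITTING = "submitting"
    SUBMITTED = "submitted"
    RUNNING = "running"
    RESTARTING = "restarting"
    CANCELLING = "cancelling"
    CANCELLED = "cancelled"
    FAILED = "failed"
    FINISHED = "finished"

# Allowed transitions (illustrative, not vivo's actual rules).
TRANSITIONS = {
    TaskState.CREATED: {TaskState.SUBMITTING},
    TaskState.SUBMITTING: {TaskState.SUBMITTED, TaskState.FAILED},
    TaskState.SUBMITTED: {TaskState.RUNNING, TaskState.FAILED},
    TaskState.RUNNING: {TaskState.RESTARTING, TaskState.CANCELLING,
                        TaskState.FAILED, TaskState.FINISHED},
    TaskState.RESTARTING: {TaskState.RUNNING, TaskState.FAILED},
    TaskState.CANCELLING: {TaskState.CANCELLED, TaskState.FAILED},
    TaskState.CANCELLED: set(),
    TaskState.FAILED: {TaskState.RESTARTING},
    TaskState.FINISHED: set(),
}

class TaskStateMachine:
    """Tracks one task's runtime state; rejects illegal transitions."""

    def __init__(self) -> None:
        self.state = TaskState.CREATED

    def advance(self, target: TaskState) -> None:
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {target}")
        self.state = target
```

Validating every transition centrally is what lets a control service keep its view of thousands of tasks consistent while pushing state updates at second‑level latency.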
Additional foundational services are a unified metadata service (HiveMetaStore with TiDB extensions) and a real‑time monitoring & alarm service built on Flink CEP.
Stability Construction
Key improvements include:
Upgrading HDFS to version 3, optimizing Flink sink performance, and building a Spark‑based small‑file merging service.
Enhancing Kafka’s load‑balancing and dynamic throttling, and improving Flink’s tolerance to Kafka broker restarts.
Strengthening Flink's ZooKeeper‑based HA (for Flink 1.10 and 1.13) and adding comprehensive ZooKeeper monitoring.
These measures reduced task failures and improved overall stability.
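The article does not describe how the Spark‑based small‑file merging service plans its work. One common approach is to group files below a size threshold into merge batches close to a target output size; a minimal Python sketch of that planning step (file names and sizes here are made up):

```python
def plan_merges(files, small_threshold, target_size):
    """Group files smaller than `small_threshold` bytes into merge
    batches of at most roughly `target_size` bytes each.

    files: list of (path, size_bytes) tuples.
    Returns a list of batches, each a list of paths to merge together.
    """
    # Consider only small files, largest first, so big-ish files
    # fill batches early and tiny files pack into the gaps.
    small = sorted((f for f in files if f[1] < small_threshold),
                   key=lambda f: f[1], reverse=True)
    batches, current, current_size = [], [], 0
    for path, size in small:
        if current and current_size + size > target_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each batch would then become one Spark job (or task) that reads the listed files and rewrites them as a single larger HDFS file, reducing NameNode pressure.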
For the content‑recommendation scenario, RocksDB was adopted as the state backend, with custom monitoring metrics and version upgrades, enabling TB‑scale state stability and improving recommendation effectiveness.
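The article does not show the configuration used; switching a Flink 1.13 job to the RocksDB state backend with incremental checkpoints typically looks like the following `flink-conf.yaml` fragment (the checkpoint path is a placeholder):

```yaml
# Use RocksDB to hold TB-scale keyed state on local disk.
state.backend: rocksdb
# Incremental checkpoints upload only changed SST files.
state.backend.incremental: true
# Durable checkpoint storage (placeholder path).
state.checkpoints.dir: hdfs:///flink/checkpoints
```

Flink also exposes optional native RocksDB metrics via configuration, which is one way the custom monitoring mentioned above can be fed.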
Usability Construction
From a development perspective, a feature‑rich FlinkSQL environment was provided, extending window triggers, DDL compatibility, and connector/UDF capabilities. A standalone SQL debugging cluster offers data sampling, DAG visualization, and real‑time result display, raising SQL job share from 5 % to 60 %.
From an operations perspective, end‑to‑end lineage and latency monitoring were built, allowing rapid pinpointing of abnormal task nodes without restarting jobs.
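The mechanics of pinpointing an abnormal node from lineage plus latency data are not detailed in the article. A minimal sketch, assuming a lineage graph of task nodes and per‑node observed latencies (all names and numbers hypothetical), is a breadth‑first walk that stops at the first node exceeding a threshold:

```python
from collections import deque

def first_lagging_node(lineage, source, latency_ms, threshold_ms):
    """Walk the task lineage breadth-first from `source` and return the
    first node whose observed latency exceeds `threshold_ms`.

    lineage: {node: [downstream nodes]}
    latency_ms: {node: observed end-to-end delay in milliseconds}
    """
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if latency_ms.get(node, 0) > threshold_ms:
            return node  # earliest point where the delay appears
        for downstream in lineage.get(node, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return None
```

Because the walk follows lineage order, the returned node is the upstream‑most offender, so operators can target it directly instead of restarting the whole pipeline.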
Efficiency Improvements
To reduce cost, the platform migrated from YARN to Kubernetes, leveraging native K8s support in Flink for containerized JAR deployment, achieving finer‑grained resource isolation and higher utilization.
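For containerized JAR deployment, Flink's native Kubernetes application mode is the standard path; a deployment command typically looks like the following sketch (cluster ID and image name are placeholders, and vivo's actual tooling wraps this in the platform):

```shell
flink run-application \
  --target kubernetes-application \
  -Dkubernetes.cluster-id=my-flink-job \
  -Dkubernetes.container.image=registry.example.com/flink-job:latest \
  local:///opt/flink/usrlib/job.jar
```

In this mode the JobManager and TaskManagers run as pods, so CPU and memory limits are enforced per container, which is the finer‑grained isolation the migration targets.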
Lake‑house integration was realized using Hudi for unified storage, Flink for unified ingestion, and HMS for unified metadata, enabling a stream‑batch unified architecture.
A diagnostic service was built to collect runtime metrics, logs, GC data, and configuration, applying heuristic rules to provide optimization suggestions for resource and anomaly diagnosis.
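The heuristic rules themselves are not listed in the article. A minimal sketch of how such a rule pass over collected metrics could produce suggestions (metric names and thresholds here are illustrative, not vivo's actual rules):

```python
def diagnose(metrics):
    """Apply simple threshold rules to runtime metrics and return a
    list of human-readable optimization suggestions."""
    suggestions = []
    # Rule 1: too much time spent in garbage collection.
    if metrics.get("gc_time_ratio", 0.0) > 0.1:
        suggestions.append("High GC overhead: consider a larger TaskManager heap.")
    # Rule 2: sustained backpressure on operators.
    if metrics.get("backpressure_ratio", 0.0) > 0.5:
        suggestions.append("Severe backpressure: consider raising operator parallelism.")
    # Rule 3: over-provisioned CPU.
    if metrics.get("cpu_utilization", 1.0) < 0.2:
        suggestions.append("Low CPU utilization: consider reducing allocated cores.")
    return suggestions
```

Running such rules across every task turns raw metrics, logs, and GC data into actionable resource and anomaly diagnoses without manual inspection.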
Security Capability Construction
Security measures include data classification, audit logging, column‑level transparent encryption for offline storage, and automatic detection of sensitive fields in real‑time data. Ongoing research into the Data Security Capability Maturity Model (DSMM) aims to further enhance data security.
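The article does not say how sensitive fields are detected in real‑time data; pattern matching on field values is one common approach. A minimal sketch with two illustrative patterns (production rule sets would be far richer and tuned per data class):

```python
import re

# Illustrative detection patterns, not vivo's actual rules.
SENSITIVE_PATTERNS = {
    "phone_cn": re.compile(r"^1[3-9]\d{9}$"),        # mainland China mobile number
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
}

def detect_sensitive_fields(record):
    """Return {field_name: category} for string values that match a
    sensitive-data pattern."""
    hits = {}
    for field, value in record.items():
        if not isinstance(value, str):
            continue
        for category, pattern in SENSITIVE_PATTERNS.items():
            if pattern.match(value):
                hits[field] = category
                break
    return hits
```

Flagged fields can then be routed to masking, encryption, or audit logging before the data leaves the ingestion layer.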
Application Scenarios
1. Real‑Time Data Warehouse : A layered architecture (ODS → DWD → DWS → ADS) built on the vStream platform supports reporting, marketing, recommendation, and advertising across multiple business lines, with ClickHouse as the final OLAP engine.
2. Short‑Video Real‑Time Recommendation : TB‑scale state tasks enable timely feature computation and sample stitching, supporting both real‑time and offline pipelines.
Exploration and Outlook
Future work focuses on elasticity at both task and cluster levels (Flink AutoScaling, mixed‑mode offline containers) and further lake‑house unification to reduce duplicated computation and storage, aligning with cloud‑native, data‑lake, and next‑generation compute trends.