
How WeChat Achieved Sub‑Second Real‑Time Analytics with StarRocks Lakehouse

WeChat moved its data platform from Hadoop and ClickHouse to a StarRocks‑based lakehouse to address massive data volume, ultra‑low‑latency requirements, and storage fragmentation. The new platform combines lake‑on‑warehouse and warehouse‑lake fusion architectures, real‑time incremental materialized views, and unified SQL access, and it delivered substantial cost reductions and performance gains.

StarRocks

Dec 19, 2023

Background

WeChat originally used a Hadoop‑based analytics stack, which suffered from slow queries, high data latency, and a cumbersome split between batch and stream pipelines. To meet growing personalization demands, the team introduced an ultra‑low‑latency OLAP warehouse built on ClickHouse. It achieved sub‑second responses over billions of rows, but data silos and duplicated storage remained.

Unified Real‑Time Requirement

The target is a single stack that provides both sub‑second and minute‑level query latency through a consistent SQL interface, so users no longer need to distinguish between “real‑time” and “ultra‑fast” back‑ends.

Lakehouse Architecture Options

Two technical routes were evaluated:

Lake‑on‑Warehouse (Lakehouse): Introduce Delta Lake, Hudi, Iceberg, or Hive 3.0 on top of Hadoop, add a SQL‑on‑Hadoop engine (Presto/Impala), and use Hive Metastore for unified metadata. At WeChat this route evolved from Presto + Hive to StarRocks + Iceberg, cutting query latency from minutes to seconds for roughly 80 % of large queries; the remaining, very large queries are handled by Spark.

Warehouse‑Lake Fusion: Add cross‑source federated query capabilities to the data warehouse, allowing direct analysis of lake data without ETL. Data is first ingested into the warehouse; cold data is then moved to the lake, with a meta‑server unifying metadata across both. This yields latency ranging from seconds to two minutes, but at higher cost and with reduced Hadoop compatibility.
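The warehouse‑lake fusion flow described above (ingest into the warehouse, move cold partitions to the lake, resolve both through unified metadata) can be sketched in Python. This is a minimal illustration under stated assumptions, not WeChat's implementation: `MetaServer`, `archive_older_than`, and the in‑memory dictionaries standing in for warehouse and lake storage are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class MetaServer:
    """Hypothetical unified metadata: maps each partition to its storage tier."""
    location: dict = field(default_factory=dict)   # partition date -> "warehouse" | "lake"
    warehouse: dict = field(default_factory=dict)  # hot tier: partition date -> rows
    lake: dict = field(default_factory=dict)       # cold tier: partition date -> rows

    def ingest(self, day: date, rows: list) -> None:
        # New data lands in the warehouse first (re-ingesting into an
        # already archived partition is not handled in this sketch).
        self.warehouse.setdefault(day, []).extend(rows)
        self.location[day] = "warehouse"

    def archive_older_than(self, cutoff: date) -> None:
        # Move cold partitions to the lake and update their metadata entry.
        for day in [d for d in self.warehouse if d < cutoff]:
            self.lake[day] = self.warehouse.pop(day)
            self.location[day] = "lake"

    def scan(self, start: date, end: date) -> list:
        # Single query path: the caller never needs to know which tier holds a partition.
        rows = []
        for day in sorted(self.location):
            if start <= day <= end:
                tier = self.warehouse if self.location[day] == "warehouse" else self.lake
                rows.extend(tier[day])
        return rows


meta = MetaServer()
meta.ingest(date(2023, 12, 1), [10, 20])
meta.ingest(date(2023, 12, 18), [30])
meta.archive_older_than(date(2023, 12, 10))
total = sum(meta.scan(date(2023, 12, 1), date(2023, 12, 31)))
```

The point of the sketch is that `scan` consults metadata rather than a hard‑coded storage location, so moving a partition between tiers does not change how it is queried.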

WeChat adopted a hybrid solution that combines lake‑on‑warehouse for cost‑effective offline analysis with warehouse‑lake fusion for real‑time workloads, letting users switch based on performance and cost requirements.
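To make the hybrid idea concrete, here is a small Python sketch of how a unified entry point might pick a back‑end per query. The article gives no routing rules, so the thresholds, the `QueryRequest` fields, and the back‑end names are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryRequest:
    sql: str
    max_latency_s: float   # latency the caller is willing to accept
    est_scan_rows: int     # planner estimate (assumed to be available)


# Illustrative thresholds only; not taken from the article.
REALTIME_LATENCY_S = 1.0
SPARK_SCAN_ROWS = 10_000_000_000


def choose_backend(req: QueryRequest) -> str:
    """Pick a back-end from latency and size requirements."""
    if req.max_latency_s <= REALTIME_LATENCY_S:
        return "warehouse"             # warehouse-lake fusion path (real-time)
    if req.est_scan_rows >= SPARK_SCAN_ROWS:
        return "spark"                 # very large offline queries
    return "starrocks_on_iceberg"      # lake-on-warehouse path (cost-effective)


dashboard = QueryRequest("SELECT ...", max_latency_s=0.5, est_scan_rows=1_000_000)
report = QueryRequest("SELECT ...", max_latency_s=60.0, est_scan_rows=50_000_000)
backfill = QueryRequest("SELECT ...", max_latency_s=3600.0, est_scan_rows=20_000_000_000)
```

Because the decision is made behind one interface, users express what they need (latency, cost) instead of choosing an engine by name.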

Real‑Time Incremental Materialized Views

StarRocks originally supported two MV types:

Asynchronous MV: Refreshes periodically or manually, requiring a full INSERT OVERWRITE of partitions. This is costly for large tables and unsuitable for real‑time scenarios.

Synchronous MV: Refreshes instantly with data writes but is hidden from users, limited to simple aggregations, and disallows complex expressions, column aliases, and joins.
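The cost difference between the two refresh styles can be illustrated with a short Python sketch. It models a materialized view as a per‑key sum and compares recomputing it from all base rows with folding in only the newly written rows. The function names and sample data are hypothetical, and this is a conceptual model rather than StarRocks internals.

```python
from collections import defaultdict


def full_refresh(base_rows):
    """Recompute the aggregate from every base row (overwrite-style refresh)."""
    result = defaultdict(int)
    for key, value in base_rows:
        result[key] += value
    return dict(result), len(base_rows)   # (aggregate, rows scanned)


def incremental_refresh(mv_state, new_rows):
    """Fold only the newly written rows into the existing aggregate."""
    for key, value in new_rows:
        mv_state[key] = mv_state.get(key, 0) + value
    return mv_state, len(new_rows)        # (aggregate, rows scanned)


base = [("live", 3), ("reading", 5), ("live", 2)]
mv, _ = full_refresh(base)                # initial build of the view

batch = [("live", 4), ("keyboard", 1)]    # newly written rows
base.extend(batch)

full_result, full_scanned = full_refresh(base)
incr_result, incr_scanned = incremental_refresh(dict(mv), batch)
```

Both paths produce the same aggregate, but the incremental path scans only the new batch. This shortcut works directly for aggregates such as sums and counts that can be merged from partial results; other functions need extra state.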

To satisfy WeChat’s high‑throughput, real‑time needs, an incremental MV framework was designed with the following features:

Decoupling of source ODS tables (3‑7 days retention) from MV result DWS tables (6‑12 months retention).

Multi‑stream synchronous MV that writes computation results to a shared target table, enabling metric stitching across multiple source tables.

Global dictionary support for dimension‑table joins, eliminating upstream Flink jobs and accelerating BI query joins.
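The second and third features can be pictured together in a Python sketch: two source streams write different metric columns into one shared target table, and a dictionary lookup at write time replaces a dimension‑table join. The class names, the `room -> category` mapping, and the sample events are hypothetical; the article does not describe the actual interfaces.

```python
class GlobalDictionary:
    """Hypothetical stand-in for a global dictionary: an in-engine key -> value lookup."""

    def __init__(self, mapping):
        self._mapping = dict(mapping)

    def lookup(self, key, default="unknown"):
        return self._mapping.get(key, default)


class SharedTargetTable:
    """Target table keyed by a dimension; each stream fills its own metric column."""

    def __init__(self, metrics):
        self._metrics = tuple(metrics)
        self.rows = {}

    def upsert_add(self, key, metric, delta):
        # Create the row with zeroed metrics on first sight, then accumulate.
        row = self.rows.setdefault(key, dict.fromkeys(self._metrics, 0))
        row[metric] += delta


room_category = GlobalDictionary({"room_a": "music", "room_b": "gaming", "room_c": "music"})
target = SharedTargetTable(metrics=["views", "gifts"])

view_stream = [("room_a", 1), ("room_b", 1), ("room_c", 1)]   # source table 1
gift_stream = [("room_a", 5)]                                 # source table 2

for room, n in view_stream:
    target.upsert_add(room_category.lookup(room), "views", n)
for room, amount in gift_stream:
    target.upsert_add(room_category.lookup(room), "gifts", amount)
```

After both loops, each category row holds metrics contributed by different streams, which is the "stitching" effect: no separate job has to join the two sources before they reach the result table.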

The roadmap progresses from multi‑stream sync MV → global‑dictionary joins → streaming MV (in development) → lake‑on‑warehouse incremental MV, aiming to add JOIN support and generic aggregation functions.

Deployment Results

The solution is deployed in dozens of WeChat business scenarios (live video, keyboard, reading, public accounts) on clusters of several hundred machines, with data ingestion approaching a trillion rows. In a live‑streaming use case, operational tasks for data developers were cut by 50 %, storage costs dropped by more than 65 %, and offline job turnaround time was shortened by two hours.

Future Work

Future efforts focus on refining the hybrid architecture to achieve fully unified SQL interaction, and extending streaming materialized view capabilities to support more complex analytics while maintaining low latency and cost efficiency.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

