
How Yidian Tianxia Built a Unified Real‑Time & Offline Data Warehouse with StarRocks

Yidian Tianxia tackled massive daily data volumes and complex analytics by defining a five‑layer data‑warehouse standard, comparing ClickHouse and StarRocks performance, and implementing a unified real‑time/offline architecture with StarRocks, DataPlus, and EasyJob, achieving multi‑fold query speedups and lower operational costs.


Background and Pain Points

Yidian Tianxia, a technology‑driven global marketing service provider, serves over 5,000 customers, including major internet firms. Its data platform processes dozens of terabytes and billions of records daily. The team faced four pain points: growing data‑processing demands, complex analytical metrics (e.g., retention, LTV), a fragmented stack of components (ClickHouse, Kafka, Flink, Spark, Hive), and weak real‑time capabilities.

Data‑Warehouse Standardization

The team designed a unified data‑warehouse specification covering five dimensions:

Data layering : ODS (source), DWD (detail), DWS (summary), ADS (application), and DIM (dimension).

Business and domain definitions : Clarify the scope and business types of data processed.

Metric taxonomy : Classify atomic (click, view, amount), composite (CTR, bounce rate), and derived metrics (7‑day spend, yearly balance).

Modeling standards : Naming, storage, and data conventions to improve maintainability and data quality.

Model evaluation criteria : Naming consistency, data completeness, growth ratio of intermediate tables, cross‑layer access, and shared logic reduction.
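The metric taxonomy above can be illustrated with a minimal Python sketch. The record type `DailyStats`, its field names, and the helper functions are hypothetical and chosen only for illustration; they are not taken from Yidian Tianxia's specification. Atomic metrics are stored as raw counts, a composite metric (CTR) is a ratio of two atomic metrics, and a derived metric (7‑day spend) is an atomic metric restricted to a time window.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DailyStats:
    """One day of atomic metrics for a campaign (hypothetical schema)."""
    day: date
    views: int     # atomic metric
    clicks: int    # atomic metric
    spend: float   # atomic metric (amount)


def ctr(stats: DailyStats) -> float:
    """Composite metric: click-through rate = clicks / views."""
    return stats.clicks / stats.views if stats.views else 0.0


def spend_last_n_days(rows, as_of: date, n: int = 7) -> float:
    """Derived metric: an atomic metric (spend) summed over a time window.

    The window covers the n calendar days ending on as_of, inclusive.
    """
    start = as_of - timedelta(days=n - 1)
    return sum(r.spend for r in rows if start <= r.day <= as_of)


# Ten days of identical sample data, purely illustrative.
sample = [DailyStats(date(2024, 1, d), views=1000, clicks=50, spend=10.0)
          for d in range(1, 11)]

print(ctr(sample[0]))                               # CTR for a single day
print(spend_last_n_days(sample, date(2024, 1, 10)))  # 7-day spend as of Jan 10
```

Keeping composite and derived metrics as functions of atomic ones means each definition lives in one place, which is the kind of consistency a metric taxonomy is meant to enforce.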

Technology Selection and Performance Benchmark

After standardization, the team evaluated mainstream database products. They benchmarked ClickHouse against StarRocks on a 600‑million‑row dataset using a 16‑core, 64 GB cloud instance. StarRocks, which combines vectorized execution, a cost‑based optimizer (CBO), materialized views, and runtime filters, ran queries 2.26× faster than ClickHouse across various SQL workloads.

Unified Real‑Time & Offline Architecture

The production pipeline loads both streaming and batch data into StarRocks. Real‑time ingestion uses Broker Load jobs triggered by the EasyJob scheduler, while offline loads run as scheduled EasyJob jobs. DataPlus, a data‑governance platform developed in‑house, monitors data quality, maintains metadata lineage, and automates model generation. The architecture supports ingestion latency on the order of seconds, ACID guarantees, and snapshot isolation.

Intelligent Data Modeling

Using metadata and lineage, the team automated model generation, producing standard SQL that runs in StarRocks. Key techniques include:

Materialized views : Accelerate queries on DWS‑level data, with support for synchronous single‑table views, asynchronous multi‑table views, and transparent SQL rewriting.

Analysis models : Standardized high‑order functions improve query performance by more than 50% in common analytical scenarios.

Retention analysis : The retention function simplifies multi‑date user‑behavior queries.

Funnel analysis : window_funnel efficiently computes conversion funnels across event sequences.

Path analysis : Window functions ROW_NUMBER(), LEAD(), LAG() enable before‑after behavior tracing.
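To make the semantics of the three analyses above concrete, here is a simplified Python sketch. It does not call StarRocks: the function names mirror the SQL functions, but the signatures, the event format, and the inclusive window boundary are assumptions made for illustration, and only the default matching behavior of a funnel is modeled.

```python
from datetime import date, datetime, timedelta


def window_funnel(events, steps, window_seconds):
    """Deepest funnel step reached in order, within a window that starts
    at an occurrence of the first step (simplified default-mode behavior).

    events: list of (timestamp, event_name) for ONE user.
    """
    if not steps:
        return 0
    ordered = sorted(events, key=lambda e: e[0])
    best = 0
    for i, (start_ts, name) in enumerate(ordered):
        if name != steps[0]:
            continue
        depth = 1
        # Assumption: an event exactly at the deadline still counts.
        deadline = start_ts + timedelta(seconds=window_seconds)
        for ts, later_name in ordered[i + 1:]:
            if ts > deadline:
                break
            if depth < len(steps) and later_name == steps[depth]:
                depth += 1
        best = max(best, depth)
    return best


def retention(active_days, check_days):
    """Flags per check day: the first is 'active on day 0'; each later flag
    is 'active on day 0 AND active on that day'."""
    if not check_days:
        return []
    first = check_days[0] in active_days
    return [first] + [first and day in active_days for day in check_days[1:]]


def path_steps(events):
    """Emulate ROW_NUMBER(), LAG() and LEAD() over one user's events
    ordered by time: (row_number, previous, current, next)."""
    names = [name for _, name in sorted(events, key=lambda e: e[0])]
    rows = []
    for i, name in enumerate(names):
        prev_name = names[i - 1] if i > 0 else None
        next_name = names[i + 1] if i + 1 < len(names) else None
        rows.append((i + 1, prev_name, name, next_name))
    return rows


# Hypothetical events for a single user.
events = [
    (datetime(2024, 1, 1, 10, 0), "view"),
    (datetime(2024, 1, 1, 10, 5), "click"),
    (datetime(2024, 1, 1, 12, 0), "purchase"),
]
funnel = ["view", "click", "purchase"]

print(window_funnel(events, funnel, 3600))    # 2: purchase falls outside 1 hour
print(window_funnel(events, funnel, 10800))   # 3: all steps within 3 hours
print(retention({date(2024, 1, 1), date(2024, 1, 3)},
                [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3)]))
print(path_steps(events))
```

In the warehouse these computations run as SQL aggregate and window functions grouped by user; the sketch only shows what each one returns for a single user's events.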

Construction Outcomes

Established data‑warehouse standards and completed technology research.

Performance testing showed StarRocks delivering a speedup of more than 2.2× over ClickHouse, with minute‑level load times for hourly data.

Pilot deployment yielded sub‑5‑second response for complex queries and interactive SQL self‑service.

Full rollout integrated StarRocks across all data products, with automated monitoring and ops.

After several months in production, StarRocks has become the core of the BI system, and the company plans to extend its use to all data scenarios, expecting further performance gains and cost reductions.

Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
