Big Data 13 min read

How Qimao Scales 20PB Data with StarRocks, Flink, and Real‑Time Analytics

Qimao, a Shanghai‑based cultural entertainment internet firm, details its 20 PB big‑data architecture built on StarRocks, Flink, Hive, and Redis, covering data ingestion, real‑time processing, audience selection, metric anomaly drill‑down, 730‑day aggregation, and future plans for metric acceleration and full‑link data governance.

Alibaba Cloud Big Data AI Platform

Sep 13, 2024

How Qimao Scales 20PB Data with StarRocks, Flink, and Real‑Time Analytics

Company Overview

Qimao is an internet company focused on cultural entertainment, headquartered in Shanghai. It launched the Qimao Chinese literature website in May 2017 and the free novel app in August 2018, now serving over 600 million users and ranking among the top digital‑reading platforms.

Big Data Architecture

The data‑warehouse team is responsible for offline and real‑time data development, metric construction, and data governance across all business lines. Since joining, the team introduced StarRocks and now operates five StarRocks clusters in production, handling roughly 20 PB of combined offline and real‑time data.

User‑behavior logs (exposures, clicks, reading time, likes, comments, ad views) are collected via log‑pointing, split into two Kafka topics, and consumed in real time by Flink. Flink writes the data into Hive tables, where a series of ETL jobs load it into StarRocks and Redis. StarRocks provides low‑latency analytical queries, while Redis handles high‑concurrency scenarios. Downstream services such as a unified BI system, metric management, A/B testing, and the general audience selection system consume these data services.

The middle layer of the warehouse stores core data, with two branches for market‑placement and commercialization business lines. These streams generate millions of records per second and are aggregated every five minutes via Flink CDC or Flink Connector into StarRocks, enriching other user‑behavior data for downstream consumption.

Other eight business lines also expose data‑as‑a‑service through a unified data‑asset management platform, offering capabilities such as data maps, self‑service metric analysis, data governance, and A/B testing.

General Audience Selection System

Before the system launch, the data‑warehouse team received 2‑3 audience‑selection requests per week; after launch, the frequency dropped to one request every 2‑3 months. Analysts write custom SQL to create audience packages, which are registered in the selection system. A job‑scheduling API then automatically generates and schedules selection jobs, handling upstream and downstream dependencies.

The first version stored audience data primarily in Redis, with a copy in Hive for offline analysis. To reduce cost, StarRocks replaced Redis for most use cases, offering larger storage capacity at lower cost. The scheduling system allows developers to choose between StarRocks, Redis, or Hive as the execution engine.

Screenshot of the audience selection interface.

Automatic Metric Anomaly Drill‑Down

This in‑house feature automatically discovers the root cause of metric anomalies across dimensions, reducing investigation time from half a day to under five minutes. It first predicts normal metric ranges using historical data; values outside the range are flagged as anomalies.

Anomaly nodes are linked in a tree structure, allowing step‑by‑step drill‑down to the finest‑grained dimension and identifying the responsible business owner.

730‑Day Flexible Drill‑Down Analysis

Frequent user‑tag updates caused inconsistencies in historical analysis. To maintain continuity, the team recomputes the past 730 days of data daily using the latest tags. With ~5 billion users and 150 billion metric records, the naive approach was infeasible.

The solution compresses 730 daily records into a single row in Hive, imports the reduced dataset into StarRocks, joins it with the latest tag table, groups by required dimensions, and creates a materialized view. Refreshing the view each day yields sub‑second query performance and resolves tag‑drift issues.

Future Outlook

Metric Acceleration Layer

Currently, after building models and metrics in Hive, data must be manually synced to StarRocks and registered in the BI system. The plan is to build a metric acceleration layer in the data‑asset management platform, using StarRocks + Kylin as the core OLAP engine. Developers will define models and metrics once, and the system will automatically generate BI datasets and accelerate data in StarRocks and Redis.

Full‑Link Data Governance for Log Points

Since March 2024, a data‑governance project has reduced monthly costs by ¥700,000. Work includes developing governance tools that auto‑detect low‑utilization tables and tasks, implementing blacklist killing for stale jobs, archiving cold partitions to object storage (cost ≈ 10 % of standard storage), and decommissioning low‑degree‑of‑freedom tables using data‑lineage.

Compute‑resource governance leverages Spark elastic scaling and YARN monitoring to shrink the number of permanent Spark nodes.

Future steps will extend governance to StarRocks, including usage‑based table retirement, hot‑cold data migration between EMR‑StarRocks and Hive, and automated data movement between local and object storage based on statistics.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Flink StarRocks Data Warehouse Data Governance

Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.