How Kuaigou Built a Scalable Real‑Time Data Warehouse with Spark, Flink, and Cloud

Facing massive, multi‑source traffic and the need for instant analytics, Kuaigou’s real‑time data warehouse evolved from Spark on‑premise to a cloud‑native stack using Alibaba Blink, Flink, and layered OLAP models, streamlining development, cutting costs, and enabling diverse real‑time applications.

ITPUB

Feb 7, 2023

Background and Business Requirements

Kuaigou (快狗打车) experienced rapid growth, generating several terabytes of data daily from mini‑programs, apps, web, and H5 front‑ends. The business demanded low‑latency, real‑time insights for dashboards, reports, and online intelligent control, while the existing batch‑oriented BI pipeline could not keep up.

Previous Development Process and Real‑Time Computing

The original workflow required data‑product managers to collect requirements and hand them to developers, who then manually built Kafka topics, Spark jobs, and downstream services for each request. This siloed, "chimney"‑style development caused duplicated effort, tangled data lineage, high operational and machine costs, and poor data reuse.

The real‑time pipeline at that time was simple: data flowed from DS into Kafka, Spark consumed it, and the results were exposed as services. As risk control, traffic monitoring, and cockpit dashboards were all built this way, the number of tasks grew until the pipeline was overloaded.

Cloud Migration and Architecture Evolution

In 2019 the team migrated both offline and real‑time warehouses to Alibaba Cloud. The stack shifted from Spark‑based processing to a cloud‑native Blink+Flink engine. Two strategic concepts were introduced: OneData for unified data handling and OneService for unified service exposure.

From 2020 onward, intelligent features such as auto‑tuning and smart operations were added on top of the real‑time platform.

Pain Points and Layered Model Solution

To eliminate chaotic development, the team adopted a multi‑layered data model that mirrors the offline warehouse design: ODS (raw data), DWS (service data), DWF (fact data), DWA (aggregated data), and DIM (dimension data). The model enforces identical schemas for batch and streaming jobs, which reduces code duplication and operational overhead.
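One way to picture "identical schemas for batch and streaming" is a single schema definition that both kinds of jobs validate against. The sketch below is illustrative only: the table `dwf_order_fact`, its fields, and the `validate_record` helper are hypothetical and are not taken from Kuaigou's platform.

```python
from dataclasses import dataclass

# The five layers named in the article.
LAYERS = ("ods", "dws", "dwf", "dwa", "dim")


@dataclass(frozen=True)
class TableSchema:
    """One schema definition shared by the batch job and the streaming job."""
    layer: str
    name: str
    fields: dict  # field name -> expected Python type

    def __post_init__(self):
        if self.layer not in LAYERS:
            raise ValueError(f"unknown layer: {self.layer}")


def validate_record(schema, record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected in schema.fields.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(
                f"wrong type for {field}: {type(record[field]).__name__}"
            )
    for field in record:
        if field not in schema.fields:
            problems.append(f"unexpected field: {field}")
    return problems


# Hypothetical fact table; both pipelines would import this same object.
ORDER_FACT = TableSchema(
    layer="dwf",
    name="dwf_order_fact",
    fields={"order_id": str, "city_id": int, "amount": float},
)
```

Because both pipelines would check against the same `ORDER_FACT` object, a schema change is made once and any drift between them shows up as a validation problem rather than as a silent mismatch.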

Flink replaced Spark as the primary compute engine, delivering higher efficiency and better resource utilization.

Template‑Based Development Workflow

1. Flink SQL reads fixed‑format Kafka sources; only topic and offset parameters vary.

2. Create views and apply a core UDF to unify offline and real‑time schemas, with strict validation at job start.

3. Handle stateful stream processing, paying attention to state size and resource allocation.

4. Output results to OLAP stores, MySQL, or Kafka; Kafka output also uses the core UDF for a fixed format.

Strict format control at both ingestion and egress stages ensures consistency between batch and streaming pipelines.
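The Kafka source step of this workflow can be sketched as a small template renderer in which only the topic and the starting offset change between jobs. This is an assumption‑laden illustration: the option names follow the open‑source Flink Kafka SQL connector, which may differ from what Blink on Alibaba Cloud accepts, and the envelope columns, default broker address, and default group id are hypothetical.

```python
import re

# Startup modes of the open-source Flink Kafka SQL connector that need no
# extra options (timestamp and specific-offsets modes are left out here).
STARTUP_MODES = {"earliest-offset", "latest-offset", "group-offsets"}

# Fixed envelope shared by every source table. Column names are hypothetical.
SOURCE_TEMPLATE = """CREATE TABLE {table} (
  `db` STRING,
  `tbl` STRING,
  `op` STRING,
  `payload` STRING,
  `event_time` TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = '{topic}',
  'properties.bootstrap.servers' = '{brokers}',
  'properties.group.id' = '{group_id}',
  'scan.startup.mode' = '{startup_mode}',
  'format' = 'json'
)"""


def render_source(table, topic, startup_mode="group-offsets",
                  brokers="localhost:9092", group_id="rt_dw"):
    """Render the source DDL; table, topic, and startup mode are validated."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"invalid table name: {table!r}")
    if not re.fullmatch(r"[A-Za-z0-9._-]+", topic):
        raise ValueError(f"invalid topic name: {topic!r}")
    if startup_mode not in STARTUP_MODES:
        raise ValueError(f"unsupported startup mode: {startup_mode!r}")
    return SOURCE_TEMPLATE.format(
        table=table, topic=topic, brokers=brokers,
        group_id=group_id, startup_mode=startup_mode,
    )


# Hypothetical job: replay the ods.order topic from the earliest offset.
print(render_source("ods_order", "ods.order", "earliest-offset"))
```

Keeping the column list inside the template, rather than passing it in per job, is what makes the source format fixed: a new job can vary the parameters but not the envelope.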

Data Integration and One‑Click Configuration

A one‑click configuration platform automates data subscription. Users specify a DTS identifier, database, table, and topic; the system then creates the necessary Kafka topics, formats the data, and forwards it to downstream pipelines, so no manual coordination with DBAs is needed.

Operations such as topic creation, deletion, and offset management are also handled through the same UI, enabling template‑driven development without custom code.
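As a rough sketch of what such a platform might do behind the form, the code below turns the four user inputs into an ordered list of steps. The `Subscription` type, the step names, and the example identifiers are all hypothetical; the article does not describe the platform's internals.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Subscription:
    """The four values the article says a user supplies."""
    dts_id: str
    database: str
    table: str
    topic: str


def plan_subscription(sub, existing_topics):
    """Return the ordered (step, target) pairs needed to serve a subscription.

    Step names are illustrative. Topic creation is skipped when the topic
    already exists, so submitting the same form twice is harmless.
    """
    for field, value in asdict(sub).items():
        if not value:
            raise ValueError(f"{field} must not be empty")
    steps = []
    if sub.topic not in existing_topics:
        steps.append(("create_topic", sub.topic))
    steps.append(("subscribe", f"{sub.dts_id}/{sub.database}.{sub.table}"))
    steps.append(("format_and_forward", sub.topic))
    return steps


# Hypothetical request for the orders table of a trade database.
request = Subscription("dts-001", "trade", "orders", "ods_trade_orders")
for step in plan_subscription(request, existing_topics=set()):
    print(step)
```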

Storage Systems and Real‑Time Query Engine

The final architecture combines three storage back‑ends: HBase + Elasticsearch for low‑latency queries, Alibaba Cloud ADB (cloud‑native data warehouse) for ad‑hoc analysis, and a jointly built Hologres cluster for PB‑scale, high‑concurrency OLAP with read‑write separation and federated queries.

Hologres supports unified queries over real‑time and offline data, so interactive analysis and API services both read from a single exit point.

Applications and Operational Use Cases

After the warehouse was built, dozens of applications were enabled, including HTTP APIs with table‑mapping for decoupled access, an internal interface management platform that auto‑generates API IDs and achieves minute‑level deployment, and comprehensive monitoring of response time, IP, and query frequency.
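The idea of table mapping for decoupled access can be illustrated with a small registry: callers hold only an API ID, and the platform resolves it to a physical table that can be swapped without touching the callers. The `ApiRegistry` class, the ID format, and the table names below are hypothetical, and the generated SQL is deliberately minimal.

```python
import itertools


class ApiRegistry:
    """Maps auto-generated API IDs to physical tables (illustrative only)."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._apis = {}  # api_id -> {"name", "table", "columns"}

    def register(self, name, table, columns):
        """Create a mapping and return its auto-generated API ID."""
        api_id = f"api-{next(self._counter):04d}"
        self._apis[api_id] = {
            "name": name, "table": table, "columns": list(columns),
        }
        return api_id

    def remap(self, api_id, new_table):
        """Point an existing API at another physical table."""
        self._apis[api_id]["table"] = new_table

    def build_query(self, api_id, filters):
        """Return (sql, params); filter keys must be registered columns."""
        api = self._apis[api_id]
        for key in filters:
            if key not in api["columns"]:
                raise KeyError(f"unknown filter column: {key}")
        sql = f"SELECT {', '.join(api['columns'])} FROM {api['table']}"
        if filters:
            sql += " WHERE " + " AND ".join(f"{k} = ?" for k in filters)
        return sql, list(filters.values())


registry = ApiRegistry()
orders_api = registry.register(
    "city_orders", "city_orders_v1", ["city_id", "order_cnt"]
)
print(registry.build_query(orders_api, {"city_id": 10}))

# Moving the data to another store changes the mapping, not the callers.
registry.remap(orders_api, "city_orders_v2")
print(registry.build_query(orders_api, {"city_id": 10}))
```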

Real‑time risk control pipelines ingest messages, enrich them, and feed graph‑computing engines; a unified metric management system (OneData) handles both real‑time and offline metrics, offering lineage, versioning, and alerting with SQL‑based or algorithmic rules.
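A SQL‑based alert rule can be thought of as a query that returns the offending rows, where any returned row means the rule fires. The sketch below uses an in‑memory SQLite database purely as a stand‑in for the real metric store; the table `metric_points`, the metric name, and the threshold are made up for illustration.

```python
import sqlite3


def evaluate_rule(conn, rule_sql):
    """Run a rule; the rows it returns are the violations (empty = healthy)."""
    return conn.execute(rule_sql).fetchall()


# Stand-in metric store with three per-minute data points.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metric_points (metric TEXT, minute TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO metric_points VALUES (?, ?, ?)",
    [
        ("order_cnt", "10:00", 120.0),
        ("order_cnt", "10:01", 118.0),
        ("order_cnt", "10:02", 35.0),
    ],
)

# Hypothetical rule: alert when orders per minute drop below 50.
RULE = (
    "SELECT minute, value FROM metric_points "
    "WHERE metric = 'order_cnt' AND value < 50 ORDER BY minute"
)

violations = evaluate_rule(conn, RULE)
for minute, value in violations:
    print(f"ALERT order_cnt={value} at {minute}")
```

Expressing the rule as plain SQL means the same rule text could in principle run against real‑time and offline metrics alike, which fits the unified metric management described above.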

Future Outlook

The team aims to achieve a true stream‑batch convergence where a single logical system handles both workloads, reducing the current dual‑system isolation. Ongoing work includes dynamic rule engines for intelligent marketing and automated operational strategies.
