Big Data 25 min read

How Weibo Built a Billion‑Log Real‑Time Data Platform with Flink

This article details how Weibo’s advertising team designed and implemented a real‑time data platform capable of processing over a hundred billion daily logs, covering technology selection, Flink advantages, architecture evolution, data processing pipelines, component libraries, fault‑tolerance strategies, and the construction of a multi‑layer real‑time data warehouse.

dbaplus Community

Oct 22, 2019

How Weibo Built a Billion‑Log Real‑Time Data Platform with Flink

Background and Motivation

With the rapid expansion of Weibo’s advertising business, the volume of various business logs grew dramatically, making traditional Hadoop‑based offline storage and computation insufficient for latency‑sensitive use cases. To meet real‑time requirements, the team built a platform that now handles more than one hundred billion logs per day across multiple product lines.

Technology Selection

Although Spark offers a mature ecosystem and strong machine‑learning integration, Flink was chosen for its true stream‑processing model, lightweight fault‑tolerance, and higher throughput. Compared with Storm, Flink’s checkpoint mechanism imposes far less overhead.

Comparison of real‑time computation frameworks

Key Flink Features

State Management : Flink’s checkpoint mechanism guarantees exactly‑once semantics, preserving application state across failures.

Event Time & Windowing : Supports event‑time processing and flexible windows (time, count, session) for complex stream patterns.

Fault Tolerance : Lightweight checkpointing enables rapid recovery without data loss.

High Throughput & Low Latency : Benchmarks show Flink outperforms Storm in distributed counting tasks.

Architecture Evolution

Initial Architecture : Two‑layer (compute + storage) with a separate Flink job for each new requirement, leading to low code reuse and duplicated storage.

Later Architecture : Introduced a five‑layer model (Access, Compute, Storage, Service, Application) built on Flink, ClickHouse, HBase, Redis, MySQL, etc., with unified configuration‑driven components.

Data Processing Flow

Raw logs are ingested (Kafka, RabbitMQ, files), cleaned and enriched in the Compute layer (Flink), then stored in a multi‑layer real‑time warehouse (ClickHouse, HBase, Redis, MySQL). Aggregation micro‑services provide 5‑minute, 10‑minute, and hourly metrics, while feature‑extraction services push data to downstream stores.

Challenges and Optimizations

Late‑arriving logs: combined real‑time + offline back‑fill using Kafka for unjoined records and Hive for batch reconciliation.

Flink performance: operator chaining, async I/O, checkpoint storage tuning (Memory, HDFS, RocksDB), incremental snapshots, and resource allocation.

Task stability: restart strategies (fixed delay, failure rate, no restart), HA via ZooKeeper, comprehensive monitoring of JobManager/TaskManager metrics.

Data Association Component

Three association strategies were evaluated:

Flink Table API – limited by memory for large windows.

RocksDB state backend – viable but slower on HDD.

External storage (HBase) – chosen for scalability.

Interval Join was introduced to reduce HBase queries by performing most joins in‑process; the default 10‑second window cut HBase calls by ~80%.

Data Cleansing Component (Logwash)

Logwash is a Flink‑based extraction engine using Freemarker templates to parse text or JSON logs, supporting arithmetic, conditionals, and loops. It serves hundreds of real‑time cleaning tasks for Weibo advertising.

FlinkStream Component Library

To avoid duplicated code, a reusable library named FlinkStream was created, abstracting Sources, Operators, Sinks, Streams, Topologies, and configuration‑driven deployment.

Example YAML configuration (excerpt):

run_env:
  timeCharacteristic: "ProcessingTime"
  restart:
    type: fixedDelayRestart
  checkpoint:
    type: "rocksdb"
streams:
  impStream:
    type: "DefaultStream"
    config:
      source:
        type: "Kafka011"
        config:
          parallelism: 20
      operates:
        - type: "StringToMap"
        - type: "SplitElement"
      transforms:
        - type: "KeyBy"
        - type: "CountWindowWithTimeOut"
      sink:
        - type: "Kafka"

Deployment is performed via a task‑management platform that uploads the executable JAR, selects Flink version, sets memory and parallelism, and optionally enables checkpoint recovery.

FlinkSQL Extensions

Because Flink 1.6/1.8 lacked full DDL support, the team extended the SQL API using Apache Calcite to enable creation of source tables, dimension tables, views, result tables, and custom UDFs.

Real‑Time Data Warehouse Construction

The warehouse follows a three‑layer model: ODS (raw ingestion), CDM (common data model with DIM, DWD, DWS), and ADS (application‑specific services). Data is progressively aggregated and transformed, providing unified metrics and supporting downstream analytics.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Flink Real-time Streaming Data Warehouse Data Architecture Checkpoint

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.