Big Data 33 min read

Real-Time Computing and Data Warehouse Solutions with Apache Flink: Architecture, Technology Selection, and Implementation

This article explores the evolution of real-time computing in the big data domain, detailing Apache Flink's capabilities, architectural designs, technology selections such as Kafka, Canal, HBase, ClickHouse, and provides practical implementation guides and case studies from Alibaba, Tencent, and other enterprises.

Big Data Technology & Architecture

Apr 6, 2021

Real-Time Computing and Data Warehouse Solutions with Apache Flink: Architecture, Technology Selection, and Implementation

Introduction to Real‑Time Computing in the Big Data Era

Since 2010, offline batch processing technologies such as Hadoop and Hive have entered the mainstream of large‑scale data processing. In recent years, real‑time stream processing frameworks like Storm, Spark, and Flink have emerged, with Flink becoming a leading solution due to its strong computation ability and advanced design.

Flink Real‑Time Computing

For developers in the big‑data or backend fields, the following scenarios are common:

我是抖音主播，我想看带货销售情况的排行？
我是运营，我想看到我们公司销售商品的 TOP10？
我是开发，我想看到我们公司所有生产环境中服务器的运行情况？
......

In the Hadoop era, data was typically stored in HDFS and reported via Hive, or stored in ClickHouse/PostgreSQL for direct SQL queries. These approaches suffer from latency (offline reports are delayed by a day) and scalability issues (complex SQL on large tables can crash the database).

Flink addresses these problems by providing low‑latency, high‑throughput real‑time computation, serving both real‑time dashboards and downstream data pipelines.

Flink Real‑Time Data Warehouse

A traditional offline data warehouse stores data in layered Hive tables (raw, detail, summary, business). The offline process introduces at least a T+1 day delay. Real‑time data warehouses eliminate this delay; a typical architecture is shown below.

Technology Selection

The author shares experience from production environments (e.g., Alibaba) to explain how to choose components for a real‑time computing platform and a real‑time data warehouse.

Real‑Time Computing Engine

Key requirements are real‑time latency, high stability, and zero error, especially for large‑scale e‑commerce events where QPS can reach tens of thousands.

Example: during a Double‑11 sales event, a stalled real‑time dashboard could lead to severe consequences for developers.

Canal – Binlog Synchronization Tool

Canal parses MySQL binlog, providing incremental data subscription without affecting the source database. It acts as a pseudo‑replica to read and parse binlog, supporting massive cross‑data‑center synchronization.

Kafka – Decoupling and Massive Data Support

Kafka is used to decouple data production and consumption, handle traffic spikes, and provide high‑throughput, low‑latency messaging. Its main features include:

High throughput – millions of messages per second.

Low latency – O(1) access time even for TB‑scale data.

High fault tolerance – node failures are tolerated.

Reliability – durable disk persistence with high read/write efficiency.

Rich ecosystem – tight integration with Flink and other stream processing frameworks.

Real‑Time Computing Service – Flink

Flink provides:

Powerful state management with strong fault‑tolerance.

Rich APIs (DataSet, DataStream, Flink SQL).

Comprehensive ecosystem support for sources (Kafka, MySQL) and sinks (HDFS, Elasticsearch).

Unified batch‑stream processing, allowing direct writes to Hive.

After processing, Flink can output results to three destinations:

Highly aggregated metrics stored in Redis or HBase for front‑end queries.

Detailed data for operational analysis.

Real‑time messages for downstream consumption.

OLAP Database Selection

Common open‑source OLAP engines include Hive, Hawq, Presto, Kylin, Impala, SparkSQL, Druid, ClickHouse, and Greenplum. Each has trade‑offs in data volume, flexibility, and performance. Typical recommendations:

Hive/Hawq/Impala – SQL‑on‑Hadoop for batch workloads.

Presto/SparkSQL – in‑memory query planning.

Kylin – pre‑computation using space‑time trade‑off.

Druid – real‑time ingestion + real‑time analytics.

ClickHouse – columnar OLAP engine with exceptional single‑table query speed.

Greenplum – PostgreSQL‑style OLAP with linear scalability.

Flink Real‑Time Data Warehouse

The architecture evolves from offline to real‑time, with a typical diagram shown below.

The design reuses offline concepts while ensuring stability and usability for real‑time workloads. Core technologies include Kafka, Flink, and HBase (for state storage) plus Redis for fast key‑value metrics.

Batch‑Stream Unified Future

Since Flink 1.12, Flink‑Hive integration allows Flink to read/write Hive directly, enabling a single codebase for both batch and streaming jobs, eliminating data inconsistency between offline and online pipelines.

Large‑Scale Real‑Time Platforms and Data Warehouse Solutions

Author’s Experience

The author uses a Kappa architecture and highlights three main challenges:

Too many data sources.

Large time gaps between sources (e.g., user actions spanning days).

Strong consistency requirements between offline and real‑time data for business metrics.

Key technical choices:

HBase as intermediate state storage to handle massive state size and ensure fault‑tolerance.

Real‑time trigger mode: Flink consumes HBase change streams, performs reverse lookups, and writes results back to HBase.

Dual‑write to ADB and Hologres for OLAP queries.

Offline synchronization by feeding the same Kafka stream into Hive, achieving unified batch‑stream logic after Flink 1.12.

Future plans include tiered metrics storage to reduce pressure on ADB/Hologres.

Tencent Kankan Real‑Time Data System Design

Tencent Kankan processes trillions of events daily, requiring sub‑second latency. Its three‑layer architecture consists of:

Data collection layer – Kafka decouples ingestion from business systems.

Real‑time data warehouse layer – Flink performs minute‑level and medium‑level aggregations.

Real‑time storage layer – ClickHouse and MySQL store aggregated results.

Flink is chosen for its exactly‑once semantics, lightweight snapshot mechanism, and efficient dimension table joins with HBase.

Alibaba Batch‑Stream Unified Data Warehouse

After Flink 1.12, Alibaba integrated Flink with Hive, enabling a single pipeline for both batch and streaming workloads. This unified approach reduces code duplication and eliminates data drift between offline and online layers.

Practical Case: Real‑Time Statistics Project

Architecture Design

The end‑to‑end pipeline includes:

Log data collection (Flume → Kafka).

Log cleaning.

Real‑time computation (Flink).

Result storage (Redis/HBase/MySQL).

Flume and Kafka Integration and Deployment

Download and extract Flume 1.8.0: tar zxf apache-flume-1.8.0-bin.tar.gz Configure two source agents (Agent 1 & Agent 2) to tail log files and send data via Avro to a third agent, which forwards to Kafka.

# Define component names
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source – tail log file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/logs/access.log

# Sink – Avro to downstream agent
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = flumeagent03
a1.sinks.k1.port = 9000

# Channel – file‑based buffer
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /home/temp/flume/checkpoint
a1.channels.c1.dataDirs = /home/temp/flume/data

# Bindings
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start the agents:

$ flume-ng agent -c conf -n a1 -f conf/log_kafka.conf >/dev/null 2>&1 &

Configure the third agent to receive Avro data and write to Kafka:

# Component names
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source – Avro listener
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 9000

# Sink – Kafka
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = log_kafka
a1.sinks.k1.brokerList = 127.0.0.1:9092
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.batchSize = 20

# Channel – memory buffer
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bindings
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start the third agent:

$ flume-ng agent -c conf -n a1 -f conf/flume_kafka.conf >/dev/null 2>&1 &

Consume the Kafka topic in Flink, deserialize JSON messages into a POJO:

public class UserClick {
    private String userId;
    private Long timestamp;
    private String action;
    // getters, setters, constructor omitted for brevity
}

enum UserAction {
    CLICK("CLICK"),
    PURCHASE("PURCHASE"),
    OTHER("OTHER");
    private String action;
    UserAction(String action) { this.action = action; }
}

Flink job setup (event‑time, watermark, state backend):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setStateBackend(new MemoryStateBackend(true));

Properties props = new Properties();
props.setProperty("bootstrap.servers", "127.0.0.1:9092");
props.setProperty(FlinkKafkaConsumerBase.KEY_PARTITION_DISCOVERY_INTERVAL_MILLIS, "10");
FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<>("log_user_action", new SimpleStringSchema(), props);
consumer.setStartFromEarliest();

DataStream<UserClick> dataStream = env.addSource(consumer)
    .name("log_user_action")
    .map(msg -> {
        JSONObject record = JSON.parseObject(msg);
        return new UserClick(record.getString("user_id"), record.getLong("timestamp"), record.getString("action"));
    })
    .returns(TypeInformation.of(UserClick.class));

SingleOutputStreamOperator<UserClick> withWatermarks = dataStream.assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor<UserClick>(Time.seconds(30)) {
        @Override public long extractTimestamp(UserClick element) { return element.getTimestamp(); }
    });

Windowed PV/UV calculation using a tumbling processing‑time window and a custom ProcessWindowFunction:

public class MyProcessWindowFunction extends ProcessWindowFunction<UserClick, Tuple3<String,String,Integer>, String, TimeWindow> {
    private transient MapState<String,String> uvState;
    private transient ValueState<Integer> pvState;
    @Override public void open(Configuration parameters) {
        uvState = getRuntimeContext().getMapState(new MapStateDescriptor<>("uv", String.class, String.class));
        pvState = getRuntimeContext().getState(new ValueStateDescriptor<>("pv", Integer.class));
    }
    @Override public void process(String key, Context ctx, Iterable<UserClick> elements, Collector<Tuple3<String,String,Integer>> out) {
        int pv = 0;
        for (UserClick e : elements) { pv++; uvState.put(e.getUserId(), null); }
        Integer curPv = pvState.value();
        pvState.update((curPv == null ? 0 : curPv) + pv);
        int uv = 0; for (String uid : uvState.keys()) uv++;
        out.collect(Tuple3.of(key, "uv", uv));
        out.collect(Tuple3.of(key, "pv", pvState.value()));
    }
}

withWatermarks.keyBy(UserClick::getUserId)
    .window(TumblingProcessingTimeWindows.of(Time.days(1), Time.hours(-8)))
    .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(20)))
    .process(new MyProcessWindowFunction())
    .addSink(new RedisSink<>(conf, new MyRedisSink()));

Redis sink implementation (using HASH):

public class MyRedisSink implements RedisMapper<Tuple3<String,String,Integer>> {
    @Override public RedisCommandDescription getCommandDescription() {
        return new RedisCommandDescription(RedisCommand.HSET, "flink_pv_uv");
    }
    @Override public String getKeyFromData(Tuple3<String,String,Integer> data) { return data.f1; }
    @Override public String getValueFromData(Tuple3<String,String,Integer> data) { return data.f2.toString(); }
}

Summary

Apache Flink, combined with Kafka, Canal, HBase, ClickHouse, and other big‑data components, forms a powerful real‑time computing platform and data warehouse. The unified batch‑stream approach introduced in Flink 1.12 simplifies development, ensures strong consistency, and meets the demanding latency and throughput requirements of modern internet enterprises.

欢迎点赞+收藏+转发朋友圈素质三连

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Flink Data Warehouse Real-Time Computing

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.