
Mastering Real‑Time Stream Processing with Flink: From Fundamentals to Kuaishou Production

This article walks through the evolution of big‑data systems to modern stream processing, explains core Flink concepts such as state, checkpoints, event‑time and windowing, and details Kuaishou’s real‑time UV calculation and fast‑failover techniques for high‑availability streaming jobs.

dbaplus Community

1. Introduction to Stream Computing

Stream computing processes unbounded data in real time, delivering results continuously. The talk first reviews the evolution of big‑data systems from batch‑oriented MapReduce to modern streaming engines, compares batch and stream processing, and outlines key challenges every streaming engine must address.

1.1 History of Big‑Data Systems

From Google’s 2003 MapReduce to Hadoop, Flume, Storm, Spark Streaming, Beam, and finally Flink (popular since 2016), each generation added capabilities such as fault‑tolerance, low latency, and exactly‑once semantics. Kafka, while not a compute engine, provides durable log storage that many stream processors rely on for replay and state recovery.

1.2 Batch vs. Stream

Batch (e.g., MapReduce) handles bounded data with high throughput but high latency; stream engines like Flink handle unbounded data with low latency, requiring sophisticated fault‑tolerance mechanisms such as checkpoints and state snapshots.

1.3 Key Streaming Problems

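Every streaming engine has to answer four questions: What results are computed (the operators: element‑wise, aggregating, or composite), Where in event time they are computed (the window type: tumbling, sliding, or session), When results are emitted (triggers, typically driven by watermarks), and How successive results for the same window relate (the accumulation mode: discarding, accumulating, or accumulating and retracting).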
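The three accumulation modes are easiest to compare on a concrete window. The following is a minimal sketch, not Flink code: the function name `emit_panes` and the pane values are made up for illustration. It shows what a downstream consumer would see if one window fired three times with partial values 3, 5, and 2.

```python
def emit_panes(pane_values, mode):
    """Return what downstream sees on each trigger firing of one window.

    Each output item is (value, retraction). `retraction` is the previously
    emitted value being withdrawn, or None when nothing is withdrawn.
    """
    out = []
    running = 0       # total over all panes seen so far
    previous = None   # last value emitted in retracting mode
    for v in pane_values:
        if mode == "discarding":
            # Each firing reports only the new pane; downstream must sum.
            out.append((v, None))
        elif mode == "accumulating":
            # Each firing reports the running total; downstream overwrites.
            running += v
            out.append((running, None))
        elif mode == "retracting":
            # Running total plus an explicit withdrawal of the old value.
            running += v
            out.append((running, previous))
            previous = running
        else:
            raise ValueError(f"unknown mode: {mode}")
    return out


panes = [3, 5, 2]
for mode in ("discarding", "accumulating", "retracting"):
    print(mode, emit_panes(panes, mode))
```

Discarding suits sinks that sum the values themselves, accumulating suits sinks that overwrite by key, and retracting is needed when a downstream aggregation must undo the earlier value before applying the new one.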

2. Flink Key Technologies

2.1 Flink Overview

Flink is a distributed engine supporting both stream and batch workloads. It runs on resource managers such as Kubernetes or YARN (Mesos was also supported until Flink 1.14 removed it), reads from sources like Kafka, and writes to downstream systems. Its two core concepts are State (managed and fault‑tolerant) and Event Time (watermarks drive window completion).

2.2 Checkpoint Mechanism

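Flink periodically injects barriers at the sources, and the barriers travel downstream with the data. When an operator has received barrier n on all of its inputs, it snapshots its state to shared storage. Once every operator, including the sinks, has acknowledged its snapshot, checkpoint n is complete and forms a consistent snapshot of the whole job. On failure, the job restores state from the latest completed checkpoint and the sources replay from the offsets stored in it, so no data is lost.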

Example snapshot flow:

Source receives barrier n and persists offset=7.

Task receives barrier n and persists sum=21.

Sink receives barrier n and acknowledges it; once every operator has acknowledged, checkpoint n is complete.
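The snapshot flow above can be replayed in a few lines. This is a toy simulation under stated assumptions, not Flink's implementation: the class `Pipeline`, its methods, and the input log are hypothetical, and the value stored at offset k is simply k so that the numbers match the example (offset=7, sum=21).

```python
class Pipeline:
    """Source -> summing task, with a dict standing in for shared storage."""

    def __init__(self, log):
        self.log = log          # replayable input, e.g. one Kafka partition
        self.offset = 0         # source state: next offset to read
        self.total = 0          # task state: running sum
        self.checkpoints = {}   # checkpoint id -> snapshot of operator state

    def process(self, n_records):
        for _ in range(n_records):
            self.total += self.log[self.offset]
            self.offset += 1

    def checkpoint(self, checkpoint_id):
        # The barrier passes the source, then the task; each persists its state.
        self.checkpoints[checkpoint_id] = {"offset": self.offset, "sum": self.total}

    def restore(self, checkpoint_id):
        snapshot = self.checkpoints[checkpoint_id]
        self.offset = snapshot["offset"]
        self.total = snapshot["sum"]


log = list(range(10))                 # value at offset k is k
job = Pipeline(log)
job.process(7)                        # reads offsets 0..6
job.checkpoint(1)                     # snapshot: offset=7, sum=21
job.process(2)                        # progress made after the checkpoint
job.restore(1)                        # failure: roll back to checkpoint 1
job.process(len(log) - job.offset)    # replay from offset 7 to the end
print(job.checkpoints[1], job.total)  # {'offset': 7, 'sum': 21} 45
```

Because the source offset and the running sum are saved together, the records processed after the checkpoint are replayed exactly once into the restored state, and the final sum equals the sum of the whole log.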

2.3 Event Time

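Event time is the timestamp carried by the original event; processing time is the system clock of the machine handling it. A watermark with timestamp t asserts that no more events with timestamps ≤ t are expected, which lets windows ending at or before t fire. The assertion is a heuristic: events that still arrive behind the watermark are late and need separate handling.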
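To make the interaction between watermarks and windows concrete, here is a simplified sketch assuming a bounded out‑of‑orderness watermark (largest timestamp seen minus a fixed delay) and tumbling windows. The function `run`, the integer timestamps, and the policy of dropping late events are assumptions for illustration, not Flink's API or its default behaviour in every configuration.

```python
def run(events, window_size, max_delay):
    """Process (event_time, value) pairs in arrival order.

    Returns (fired, late): `fired` lists (window_start, sum) in firing order,
    `late` lists events whose window had already fired when they arrived.
    """
    open_windows = {}            # window_start -> running sum
    fired, late = [], []
    watermark = float("-inf")
    max_ts = float("-inf")
    for ts, value in events:
        start = ts - ts % window_size
        if start + window_size - 1 <= watermark:
            late.append((ts, value))      # behind the watermark: dropped here
        else:
            open_windows[start] = open_windows.get(start, 0) + value
        # Advance the watermark: largest timestamp seen minus allowed delay.
        max_ts = max(max_ts, ts)
        watermark = max_ts - max_delay
        # Fire every window whose last timestamp the watermark has passed.
        for s in sorted(open_windows):
            if s + window_size - 1 <= watermark:
                fired.append((s, open_windows.pop(s)))
    return fired, late


arrivals = [(1, 1), (4, 1), (11, 1), (8, 1), (13, 1), (5, 1), (25, 1)]
fired, late = run(arrivals, window_size=10, max_delay=3)
print(fired)  # [(0, 3), (10, 2)]
print(late)   # [(5, 1)]
```

The event at time 8 arrives out of order but within the allowed delay, so it is still counted in window [0, 10). The event at time 5 arrives after the watermark has passed that window and is treated as late.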

2.4 Window Mechanism

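Flink supports tumbling, sliding, and session windows. The two main window functions are ProcessWindowFunction, which buffers every element of the window in state until the window fires (higher cost), and AggregateFunction, which folds each element into a small accumulator as it arrives (lower cost).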
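The sketch below illustrates both ideas in plain Python: how a timestamp maps to tumbling and sliding windows, and why an incremental aggregate needs less state than a full‑window function. The method names on `AverageAggregate` loosely mirror the shape of Flink's AggregateFunction, but everything here is an illustrative assumption rather than Flink code.

```python
def tumbling_windows(ts, size):
    """The single [start, end) window of length `size` that contains ts."""
    start = ts - ts % size
    return [(start, start + size)]


def sliding_windows(ts, size, slide):
    """All [start, end) windows of length `size`, one per `slide`, containing ts."""
    windows = []
    start = ts - ts % slide            # latest window start at or before ts
    while start > ts - size:           # walk back while the window still covers ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


class AverageAggregate:
    """Incremental style: state is a (sum, count) accumulator, not the elements."""

    def create_accumulator(self):
        return (0, 0)

    def add(self, value, acc):
        return (acc[0] + value, acc[1] + 1)

    def get_result(self, acc):
        return acc[0] / acc[1]


def full_window_average(buffered_values):
    """Full-window style: every element is kept until the window fires."""
    return sum(buffered_values) / len(buffered_values)


print(tumbling_windows(7, size=10))           # [(0, 10)]
print(sliding_windows(7, size=10, slide=5))   # [(0, 10), (5, 15)]

agg = AverageAggregate()
acc = agg.create_accumulator()
for v in [4, 8, 6]:
    acc = agg.add(v, acc)                     # state stays two numbers
print(agg.get_result(acc), full_window_average([4, 8, 6]))  # 6.0 6.0
```

Both styles return the same average, but the incremental version stores two numbers per window regardless of how many elements arrive, which matters for long windows such as the one‑day window used later in this article.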

3. Kuaishou Production Practices

3.1 System Architecture

Kuaishou’s pipeline consists of data ingestion, Flink real‑time computation, data application, and visualization. Each layer is decoupled yet tightly integrated.

3.2 Real‑time UV Metric

Unique visitor (UV) counting requires handling massive device IDs, low latency, and high stability. The solution includes:

Dictionary mapping deviceId → long id (via HBase, Redis, or Flink‑managed global dictionary).

Skew handling through data shuffling or local pre‑aggregation.

Incremental computation using a keyed‑state bitmap and per‑minute window triggers.
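The three steps above can be sketched together. This is a minimal illustration under stated assumptions: the in‑memory `Dictionary` stands in for the HBase, Redis, or Flink‑managed dictionary, a Python integer is used as an uncompressed bitmap, and round‑robin partitioning stands in for the shuffle that breaks up hot keys. None of the names come from Kuaishou's code.

```python
class Dictionary:
    """Maps string device IDs to dense integer IDs (stand-in for a real store)."""

    def __init__(self):
        self._ids = {}

    def encode(self, device_id):
        # New IDs get the next free integer; known IDs keep theirs.
        return self._ids.setdefault(device_id, len(self._ids))


def local_bitmaps(events, dictionary, n_partitions):
    """Pre-aggregate per partition: (partition, dimension) -> bitmap."""
    partial = {}
    for i, (dimension, device_id) in enumerate(events):
        partition = i % n_partitions            # spread a hot dimension around
        bit = 1 << dictionary.encode(device_id)
        key = (partition, dimension)
        partial[key] = partial.get(key, 0) | bit
    return partial


def merge_uv(partial):
    """Merge partial bitmaps per dimension with OR, then count the set bits."""
    merged = {}
    for (_, dimension), bitmap in partial.items():
        merged[dimension] = merged.get(dimension, 0) | bitmap
    return {dimension: bin(bitmap).count("1") for dimension, bitmap in merged.items()}


events = [
    ("home", "dev-a"), ("home", "dev-b"), ("home", "dev-a"),
    ("search", "dev-c"), ("home", "dev-c"), ("search", "dev-c"),
]
uv = merge_uv(local_bitmaps(events, Dictionary(), n_partitions=2))
print(uv)
```

Merging with bitwise OR is what makes pre‑aggregation safe for distinct counts: a device seen in two partitions sets the same bit in both partial bitmaps and is counted once after the merge. A production job would likely use a compressed bitmap rather than a plain integer.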

SQL example (early‑fire enabled):

-- Daily UV per dimension over a one-day tumbling event-time window
SELECT TUMBLE_ROWTIME(eventTime, INTERVAL '1' DAY) AS rowtime,
       dimension,
       COUNT(DISTINCT id) AS uv
FROM person
GROUP BY TUMBLE(eventTime, INTERVAL '1' DAY), dimension;

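Configuration to emit intermediate results every minute, before the one‑day window closes (these are experimental planner options, so verify them against the Flink version in use):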

table.exec.emit.early-fire.enabled=true
table.exec.emit.early-fire.delay=60s
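The effect of early firing on a long window can be illustrated with a small simulation. It is a sketch under assumptions, not a description of the planner's trigger: the function `early_fire_uv` is hypothetical, firing is driven by arrival time in seconds, and a firing is skipped when the count has not changed since the last emission.

```python
def early_fire_uv(arrivals, fire_every):
    """Simulate early firing for one long window.

    arrivals: (arrival_second, user_id) pairs sorted by arrival time.
    Returns (emitted, final_uv), where `emitted` lists (fire_time, uv)
    for every early firing at which the distinct count had changed.
    """
    seen = set()
    emitted = []
    last_emitted = None
    next_fire = fire_every
    for t, user_id in arrivals:
        # Fire every timer that is due before this event is processed.
        while t >= next_fire:
            if len(seen) != last_emitted:
                emitted.append((next_fire, len(seen)))
                last_emitted = len(seen)
            next_fire += fire_every
        seen.add(user_id)
    return emitted, len(seen)


arrivals = [(5, 1), (20, 2), (59, 1), (61, 3), (130, 2), (185, 4)]
emitted, final_uv = early_fire_uv(arrivals, fire_every=60)
print(emitted)   # [(60, 2), (120, 3)]
print(final_uv)  # 4
```

Downstream consumers see a UV value that grows through the day instead of a single result at midnight, so the sink has to accept updates for the same window and dimension.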

3.3 Fast Failover Strategies

Four mechanisms for rapid recovery:

Region failover: only the failed sub‑graph restarts.

Local recovery: checkpoints are also stored on local disks for quicker restart.

Container‑level redundancy: standby containers replace failed ones instantly.

Machine‑level redundancy: redundant containers are spread across hosts; failure detection within 5 seconds triggers immediate container substitution, reducing recovery from >3 min to <30 s.
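Region failover depends on knowing which tasks must restart together. One way to picture it, as a hedged sketch rather than Flink's scheduler code, is to treat tasks connected by pipelined (streaming) exchanges as one region and compute regions as connected components. The task names and both functions below are hypothetical.

```python
def failover_regions(tasks, pipelined_edges):
    """Group tasks into regions: connected components over pipelined edges."""
    parent = {t: t for t in tasks}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in pipelined_edges:
        parent[find(a)] = find(b)

    regions = {}
    for t in tasks:
        regions.setdefault(find(t), set()).add(t)
    return list(regions.values())


def tasks_to_restart(failed_task, regions):
    """Only the region containing the failed task is restarted."""
    for region in regions:
        if failed_task in region:
            return region
    raise KeyError(failed_task)


# Two parallel pipelines with forward connections only (no shuffle between them).
tasks = ["src-0", "map-0", "sink-0", "src-1", "map-1", "sink-1"]
edges = [("src-0", "map-0"), ("map-0", "sink-0"),
         ("src-1", "map-1"), ("map-1", "sink-1")]
regions = failover_regions(tasks, edges)
print(sorted(tasks_to_restart("map-1", regions)))  # ['map-1', 'sink-1', 'src-1']
```

The benefit depends on the job topology. If the two pipelines were joined by an all‑to‑all shuffle such as a keyBy, every task would fall into one region and a single failure would restart the whole job, which is where the local recovery and standby containers described above become important.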

4. Summary

The presentation traced the evolution of big‑data systems, explained core streaming concepts, detailed Flink’s state, checkpoint, event‑time, and window mechanisms, and shared Kuaishou’s real‑time UV computation and fast‑failover techniques for high‑availability streaming workloads.

Q&A

Can I skip Storm/Spark and start with Flink? Yes. Flink is a native stream engine with exactly‑once state guarantees, whereas Storm's core API is at‑least‑once by default and Spark Streaming processes data in micro‑batches.

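How to handle TaskManager heartbeat timeout? The default heartbeat interval is 10 s and the default timeout is 50 s (heartbeat.interval and heartbeat.timeout). When a TaskManager times out, its tasks fail and the job is restarted, provided HA is enabled.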

How to process windows spanning two days with late events? Use RocksDB state backend and incremental windowing to keep state size manageable.

Will YARN automatically restart Flink jobs after a restart? With high availability (ZooKeeper) enabled, jobs are automatically relaunched.

Does Flink replace Kafka Streams? Kafka is a messaging system; Flink provides richer computation capabilities.

What about Apache Beam? Beam offers a unified API that can run on multiple engines, including Flink.

