How Real-Time Data Platforms Evolve: From Storm to Flink and Kubernetes
This article summarizes Wang Xinchun's 2018 DAMS China Data Asset Management Summit talk, detailing the current state, core services, responsibilities, evolution, architecture, challenges, and future directions of a large‑scale real‑time data platform built on Storm, Spark, Flink, and Kubernetes, including a unified data management approach.
Real‑time Platform Overview
The platform runs three streaming engines: Storm, Spark Streaming and Flink. Flink is the strategic focus for new applications because of its native state handling, checkpointing and SQL support.
Core real‑time services include:
Personalised recommendation engine (millisecond‑level scoring).
Executive dashboards that aggregate PV/UV, sales and financial‑risk metrics.
Data ingestion pipelines: MySQL binlog (via VDP) and user‑behaviour clickstream are parsed and published to Kafka.
Financial risk‑control and price‑comparison services that consume the same streams.
The platform is split into two logical layers:
Real‑time compute platform – provides stable data feeds, monitors job health, and offers development assistance such as solution reviews and resource‑usage guidance.
Real‑time foundational data – cleans, widens and enriches raw streams (e.g., mapping device IDs to internal user IDs) and publishes unified wide tables for downstream business teams.
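The widening step in the foundational data layer can be pictured with a small sketch. It assumes a hypothetical click event and an in-memory device-to-user map; the case classes and field names are illustrative and do not come from the talk. In production the mapping would more likely live in an external store or in keyed state.

```scala
// Hypothetical event shapes; field names are illustrative only.
final case class RawClick(deviceId: String, itemId: String, timestampMs: Long)
final case class WideClick(
  userId: Option[String], // None when the device has no known user
  deviceId: String,
  itemId: String,
  timestampMs: Long
)

object Enrichment {
  // Attach the internal user ID when the device is known.
  // Unknown devices are kept rather than dropped, so downstream counts stay complete.
  def widen(click: RawClick, deviceToUser: Map[String, String]): WideClick =
    WideClick(deviceToUser.get(click.deviceId), click.deviceId, click.itemId, click.timestampMs)
}
```

Keeping unmatched records with an empty user ID is a design choice of this sketch; a real pipeline might instead route them to a separate stream for later backfill.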
Evolution Timeline
2013‑2014 (early stage) – Storm was the first framework mature enough for production use; attempts to run SQL on Storm failed because SQL‑on‑Storm tooling was not yet mature.
2015‑2017 – Introduction of VRC (real‑time task‑management, monitoring, alerting and data‑quality subsystem). Spark Streaming added as a secondary engine.
2018‑present – Migration to Flink and Spark Structured Streaming, adoption of Kubernetes for unified scheduling of Storm/Flink/Spark jobs, and integration with the machine‑learning platform. Resource spikes during major sales events (e.g., 618, Double‑11) drove the shift to a shared‑resource model.
Platform Architecture
Two primary data sources feed the platform:
User‑behaviour streams (mobile app, WeChat, etc.).
Online transactional databases (MySQL binlog).
Ingestion components (VDP, custom binlog parsers) write the normalized events to Kafka topics. Downstream compute engines (Storm, Spark, Flink) consume the topics, apply business logic, join with reference data (e.g., device‑to‑user mapping) and write results to:
Real‑time dashboards.
Recommendation services.
Feature stores for model training.
Metadata for real‑time data includes location, access protocol and format (PB, JSON). A thin abstraction layer makes these details transparent to developers.
Development & Technical Challenges
Complexity
Choosing appropriate parallelism for each operator; over‑provisioning wastes resources while under‑provisioning causes back‑pressure.
Selecting storage for intermediate state – large state may require RocksDB or external key‑value stores; slow external systems increase latency.
Exception handling across multiple streams – a failure can surface in any input stream, so there is no single place to catch it; jobs need custom retry and dead‑letter pipelines.
Exactly‑once semantics – requires checkpointing, idempotent sinks and coordinated commit protocols.
Metric consistency – real‑time aggregates must be reconciled with offline batch results for data‑quality checks.
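The idempotent-sink part of exactly-once delivery can be sketched without any streaming framework. The example below assumes every event carries a unique ID and uses an in-memory map as a stand-in for an external store that supports keyed upserts; `Payment` and `IdempotentSink` are hypothetical names, not part of the platform described here.

```scala
import scala.collection.mutable

// Hypothetical event; eventId is assumed to be unique per logical event.
final case class Payment(eventId: String, orderId: String, amountCents: Long)

// In-memory stand-in for an external store that supports keyed upserts.
final class IdempotentSink {
  private val rows = mutable.Map.empty[String, Payment]

  // Writing the same eventId twice overwrites the row instead of adding a second one,
  // so a replay after recovery does not inflate the totals.
  def write(p: Payment): Unit = rows.update(p.eventId, p)

  def totalCents: Long = rows.valuesIterator.map(_.amountCents).sum
  def size: Int = rows.size
}
```

With a sink like this, at-least-once delivery from the engine is enough to produce exactly-once results in the store, which is often simpler than a coordinated two-phase commit.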
Technical Difficulties
Out‑of‑order events – require watermark strategies and buffering to correctly join streams.
Throughput vs. latency trade‑off – larger batch windows increase TPS but add latency; tuning batch size per job is essential.
Massive keyed state (e.g., UV per product) – state size can reach billions of keys; requires state backend scaling and compaction.
State durability and recovery – Flink’s checkpoint/savepoint mechanism simplifies recovery compared with Storm’s manual state dumps.
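A watermark strategy for out-of-order events can be illustrated with a simplified bounded-out-of-orderness tracker. This is a plain Scala sketch of the idea, not Flink's implementation; the class name and the `maxDelayMs` parameter are assumptions made for the example.

```scala
// Tracks the highest event time seen and derives a watermark that trails it by maxDelayMs.
final class BoundedOutOfOrderness(maxDelayMs: Long) {
  private var maxSeenMs: Long = Long.MinValue

  // Current watermark: events at or before this time are treated as late.
  def watermarkMs: Long =
    if (maxSeenMs == Long.MinValue) Long.MinValue else maxSeenMs - maxDelayMs

  // Returns true if the event is on time, false if it arrived behind the watermark.
  def observe(eventTimeMs: Long): Boolean = {
    val onTime = eventTimeMs > watermarkMs
    if (eventTimeMs > maxSeenMs) maxSeenMs = eventTimeMs
    onTime
  }
}
```

A larger `maxDelayMs` tolerates more disorder but forces joins and windows to buffer longer, which is the same throughput-versus-latency tension listed above.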
Future Roadmap
Enrich foundational data – publish unified wide tables that hide raw‑data complexity; downstream jobs consume a single stream.
Real‑time / batch convergence – write cleaned streams to HDFS every five minutes, enabling batch jobs to reuse the same source without duplicate ETL.
Latency tiers – define “second‑level” (strict sub‑second) and “minute‑level” (near‑real‑time) services to set realistic SLA expectations.
Unified Kubernetes scheduling – all Storm, Flink and Spark jobs are launched as containers; the scheduler automatically scales resources based on load.
Streaming‑SQL & notebook environment – promote Flink SQL and Spark Structured Streaming APIs; provide a web‑based notebook for rapid prototyping.
Resource efficiency – case studies show that Flink uses roughly 66% less CPU and memory than Storm for UV calculations.
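The five-minute landing of cleaned streams can be sketched as a function that floors an event time to its window and derives a target directory. The path layout `/warehouse/realtime/<topic>/<date>/<HHmm>` and the use of UTC are assumptions for illustration; the talk does not specify how files are organised on HDFS.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object FiveMinuteBuckets {
  private val WindowMs: Long = 5 * 60 * 1000L
  private val Fmt: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd/HHmm").withZone(ZoneOffset.UTC)

  // Floor the event time to the start of its five-minute window.
  def windowStartMs(eventTimeMs: Long): Long =
    eventTimeMs - Math.floorMod(eventTimeMs, WindowMs)

  // Hypothetical directory layout: one directory per topic and window.
  def pathFor(topic: String, eventTimeMs: Long): String =
    s"/warehouse/realtime/$topic/${Fmt.format(Instant.ofEpochMilli(windowStartMs(eventTimeMs)))}"
}
```

Because the window is derived from event time rather than arrival time, a batch job reading one directory sees a stable slice of data once that window's late events have been accounted for.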
Unified Data Management (UDM)
UDM is a service layer that abstracts binary data stored in Kafka, Redis, Tair and HDFS. It consists of:
Location Manager – maps logical names to physical addresses.
Schema Metastore – stores protobuf/Avro schemas and exposes them via a web UI.
Client Proxy – provides audited, monitored client APIs (Java/Scala) that can be used inside Flink, Spark or Storm jobs.
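The Location Manager's role can be pictured as a lookup from a logical name to a physical address plus the schema needed to decode the payload. The sketch below is a guess at the shape of such a registry; `Location`, `LocationManager` and the sample address are hypothetical and are not the actual UDM API.

```scala
// Hypothetical registry entry: where the data lives and which schema decodes it.
final case class Location(store: String, address: String, schema: String)

final class LocationManager(entries: Map[String, Location]) {
  // Resolve a logical name, returning an error message instead of throwing
  // so callers can decide how to handle unknown names.
  def resolve(logicalName: String): Either[String, Location] =
    entries.get(logicalName).toRight(s"unknown logical name: $logicalName")
}
```

Job code that depends only on logical names can keep running unchanged when a topic or cache is moved, since only the registry entry needs updating.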
Example of a unified development model that runs unchanged on Flink or Spark:
import com.company.udm.ClientProxy
// Create a unified source that reads from Kafka (binary PB payload)
val source = ClientProxy.kafkaSource(topic = "user_events", schema = "UserEventProto")
// Simple transformation: keep click events and project each to a (userId, itemId, timestamp) tuple
val transformed = source
  .filter(_.eventType == "click")
  .map(event => (event.userId, event.itemId, event.timestamp))
// Write the result to a sink that works for both Flink and Spark
ClientProxy.writeSink(transformed, sinkName = "click_stream")
The same code can be compiled against the Flink DataStream API or Spark Structured Streaming API; only the execution environment changes.
Operational Practices
Containerization – all compute jobs run in Docker containers managed by Kubernetes; storage systems (Kafka, Redis, Tair) remain external because of state‑size and latency constraints.
State management – two‑level storage (in‑memory cache + external KV store) is used when external latency is high. Flink’s checkpointing replaces Storm’s manual state snapshots.
Data skew detection & mitigation – monitor per‑task output volume; if the max/min ratio exceeds 10×, trigger a two‑stage aggregation or isolate hot keys to dedicated partitions.
Ordering guarantees – local ordering is ensured by routing all records of the same logical key (e.g., order ID) to the same Kafka partition. Global ordering across partitions is not required for most business use‑cases.
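The skew check and the two-stage aggregation described above can be sketched in plain Scala. The 10× threshold comes from the text; the object name, the treatment of idle tasks, and the use of a record's index as the salt (instead of a random number) are assumptions made to keep the example deterministic.

```scala
object Skew {
  // True when the busiest task emits more than `threshold` times the quietest one.
  // A task with zero output while others are busy is also treated as skew (an assumption).
  def isSkewed(perTaskOutput: Seq[Long], threshold: Double = 10.0): Boolean =
    if (perTaskOutput.isEmpty) false
    else {
      val max = perTaskOutput.max
      val min = perTaskOutput.min
      if (min <= 0) max > 0 else max.toDouble / min > threshold
    }

  // Stage 1: spread each key over `salts` sub-keys and pre-aggregate per salted key.
  // Stage 2: drop the salt and combine the partial counts per original key.
  def twoStageCount(keys: Seq[String], salts: Int): Map[String, Long] = {
    require(salts > 0, "salts must be positive")
    val partial: Map[(String, Int), Long] =
      keys.zipWithIndex
        .groupBy { case (key, index) => (key, index % salts) }
        .map { case (saltedKey, group) => saltedKey -> group.size.toLong }
    partial
      .groupBy { case ((key, _), _) => key }
      .map { case (key, parts) => key -> parts.values.sum }
  }
}
```

In a distributed job, stage 1 would run on many tasks so that a hot key's records are split across them, and only the small partial counts are shuffled to the task that owns the original key.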
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
