Big Data 15 min read

How Ele.me Evolved Its Real‑Time Engine: From Storm to Flink

This article examines Ele.me’s big‑data platform evolution, comparing Storm, Spark Streaming, Structured Streaming, and Flink, detailing their architectures, consistency semantics, performance trade‑offs, and why Flink became the preferred real‑time computation engine for the company.

Alibaba Cloud Developer
Alibaba Cloud Developer
Alibaba Cloud Developer
How Ele.me Evolved Its Real‑Time Engine: From Storm to Flink

Platform Overview

Ele.me's big data platform ingests data from multiple sources into Kafka, then uses Storm, Spark, and Flink for real‑time computation, finally persisting results to various storage systems. The cluster processes about 60 TB of data daily, runs 1 billion compute jobs, and consists of 400 nodes. Spark and Flink run on YARN (Flink on YARN for job‑manager isolation), while Storm operates in standalone mode.

Application Scenarios

Consistency Semantics

The platform distinguishes four consistency models: at‑most‑once (fire‑and‑forget), at‑least‑once (re‑send until success), exactly‑once (checkpoint‑based state replay), and at‑least‑once + idempotent = exactly‑once when downstream sinks provide upsert semantics.

Storm

Ele.me used Storm in its early years (pre‑2016). Key characteristics include tuple‑based processing, millisecond‑level latency, primary Java support (extended to Python/Go via Apache Beam), and limited SQL capabilities through an internal wrapper (Typhon) and a YAML‑based tool (Flux).

Usability: high entry barrier limits adoption.

StateBackend: relies on external KV stores such as Redis.

Resource allocation: workers and slots pre‑configured, resulting in lower throughput.

Spark Streaming

Spark Streaming was introduced to allow users to define real‑time tasks with SQL. It follows a micro‑batch model, offering second‑level latency (≈500 ms), Java/Scala APIs, and tight integration with the Spark ecosystem (SparkSQL, MLlib, GraphX).

Advantages: unified Spark stack, high throughput, checkpointing on HDFS, YARN integration.

Limitations: multi‑stream joins are cumbersome, exactly‑once requires external idempotent sinks.

Structured Streaming

Adopting concepts from Google Dataflow, Structured Streaming introduces processing time, event time, and watermarks to handle out‑of‑order data. It provides stateful processing SQL/DSL, real multi‑stream joins (Spark 2.3+), and can achieve end‑to‑end exactly‑once when sinks support transactional writes.

Triggers: processing‑time and continuous (record‑by‑record) modes.

Continuous processing supports map‑like operations with low latency.

Exactly‑once guarantees require additional sink extensions (e.g., Kafka 0.11 transactions).

Complex Event Processing (CEP) is realized via the Drools rule engine.

Flink

Flink became the preferred engine because of its advanced streaming semantics, low latency, and robust exactly‑once support via a two‑phase commit protocol.

Architecture: JobManager (driver) and TaskManager (executors) communicate via Akka RPC; custom memory serialization and operator chaining reduce context‑switch overhead.

Checkpointing: fine‑grained snapshots stored in RocksDB or HDFS; supports savepoints for versioned recovery.

Parallelism: configurable per‑operator, enabling better resource utilization than Spark.

Features: supports processing‑time, event‑time, and ingestion‑time triggers; continuous processing; rich CEP; savepoints.

Drawbacks: SQL functionality still maturing; ML and graph processing are weaker than Spark’s.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataFlinkSparkStorm
Alibaba Cloud Developer
Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.