Big Data 17 min read

Comparative Study of Apache Flink and Spark Streaming at Xiaomi: Architecture, Performance, and Serialization

This article examines Xiaomi's migration from Spark Streaming to Apache Flink, comparing scheduling strategies, mini‑batch versus true streaming, resource utilization, latency, and serialization mechanisms, and concludes with practical insights and custom optimization techniques for large‑scale data processing.

Big Data Technology & Architecture

Nov 9, 2019

Comparative Study of Apache Flink and Spark Streaming at Xiaomi: Architecture, Performance, and Serialization

In this article, Xiaomi engineer Wang Jiasheng shares the evolution of streaming computation at Xiaomi, describing the transition from Storm and Spark Streaming to Apache Flink and the motivations behind the migration.

Flink was adopted because of its lower latency, richer time semantics, and built‑in state handling, which turned many complex batch‑style implementations into concise API calls. Over the past half‑year, Xiaomi has continuously improved stability, job management, logging, and monitoring to make Flink more user‑friendly and operable.

Key performance improvements observed after migrating from Spark Streaming to Flink include:

Unstateful job latency dropped from 16,129 ms to 926 ms (94.2% reduction).

Backend storage write latency fell from ~80 ms to ~20 ms due to the elimination of Spark's mini‑batch write spikes.

CPU core usage for a simple ETL job decreased from 210 cores to 32 cores (84.8% reduction).

These gains stem from Flink's "schedule‑data" model, which keeps computation operators running continuously and processes data as it arrives, avoiding the overhead of repeatedly scheduling short‑lived compute tasks as in Spark's "schedule‑compute" approach.

In Spark, the preferred location of each RDD partition is used to schedule compute close to data:

// RDD
/**
 * Optionally overridden by subclasses to specify placement preferences.
 */
protected def getPreferredLocations(split: Partition): Seq[String] = Nil

For KafkaRDD, Spark returns the leader host of the partition:

// KafkaRDD
override def getPreferredLocations(thePart: Partition): Seq[String] = {
    val part = thePart.asInstanceOf[KafkaRDDPartition]
    Seq(part.host) 
    // host: preferred kafka host, i.e. the leader at the time the rdd was created
}

Flink, like Storm, follows a "schedule‑data" model where the computation logic is initialized once and continuously consumes upstream data, reducing network transfer and resource initialization overhead.

However, Spark's schedule‑compute model can better handle slow or faulty nodes via blacklisting and speculation, which Flink currently lacks.

Mini‑batch versus true streaming is another major difference. Spark Streaming processes data in fixed‑size batches, leading to potential "long‑tail" effects where a large partition blocks the progress of others. Flink processes data as a continuous stream, improving resource utilization and reducing latency.

Serialization also differs significantly. Spark defaults to Java native serialization (or optional Kryo), while Flink implements its own high‑performance serializers (PojoSerializer, RowSerializer, TupleSerializer). Flink avoids storing class metadata for each record because data types are known ahead of time.

Example of Spark's connection‑pool pattern:

rdd.foreachPartition { partitionOfRecords =>
// ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}

To improve serialization of Thrift‑generated classes, the team registered a custom Kryo serializer:

final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// register the serializer included with Apache Thrift as the standard serializer
// TBaseSerializer states it should be initialized as a default Kryo serializer
env.getConfig().addDefaultKryoSerializer(MyCustomType.class, TBaseSerializer.class);

When users inadvertently add non‑POJO fields (e.g., a logger) to a POJO class, Flink's serializer may fail, as shown in the following example:

// Not a POJO demo.public class Person {  private Logger logger = LoggerFactory.getLogger(Person.class);  public String name;  public int age;}

By customizing Flink’s Kryo serializer to use Thrift’s own serializer for Thrift classes, the team achieved significant gains in serialization efficiency and lowered the barrier for developers.

In summary, Flink offers superior low‑latency streaming and resource efficiency compared to Spark Streaming, while Spark’s mini‑batch model provides easier fault recovery. The practical experience at Xiaomi demonstrates that Flink can reduce operational costs and improve throughput for large‑scale data pipelines.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Flink Streaming serialization resource optimization Spark Streaming Mini-Batch

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.