Big Data 20 min read

10 Common Pitfalls When Migrating Spark Jobs to Flink (And How to Avoid Them)

This article shares ten practical pitfalls encountered when moving hourly Spark session jobs to Flink, covering parallelism load imbalance, state TTL, checkpointing strategies, logging, JMX debugging, state migration risks, reduce vs process choices, input data validation, event‑time handling, and external storage considerations, along with concrete configuration snippets and performance tips.

dbaplus Community

Apr 20, 2021

10 Common Pitfalls When Migrating Spark Jobs to Flink (And How to Avoid Them)

Parallelism and Load Skew

When inspecting Flink UI you may notice that some subtasks process far more data than others because the operator does not receive an equal number of key groups. The function

public static int computeOperatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroupId) { return keyGroupId * parallelism / maxParallelism; }

distributes key groups based on maxParallelism, whose default is operatorParallelism + (operatorParallelism / 2). Setting maxParallelism to a multiple of the desired parallelism (e.g., parallelism = 10, maxParallelism = 20) balances the load and eases future scaling.

Keyed State TTL and mapWithState

For unbounded key spaces, configure a TTL timer to clean up unused keyed state; otherwise state can grow without bound. In default windows the TTL is applied automatically when the window clears. Using KeyedStateDescriptor you can add TTL, but the convenient mapWithState API hides the state and does not expose TTL configuration, so for production workloads you should implement a custom RichMapFunction or similar to control state lifecycle.

Checkpoint Restoration and Repartitioning

Large jobs (≈8 TB state) benefit from incremental checkpointing; in the example a checkpoint every 15 minutes transfers ~100 GB to object storage. Instead of creating time‑consuming savepoints, the team used retained checkpoints, allowing state recovery from the previous job without a full savepoint. Increasing RocksDB transfer threads from 1 to 8 ( state.backend.rocksdb.checkpoint.transfer.thread.num=8 and state.backend.rocksdb.thread.num=8) reduced publish and repartition time by tenfold.

Proactive Logging

Long‑running jobs should emit detailed logs when a window exceeds a threshold (e.g., 1 minute). This helps pinpoint data skew or unexpected input that causes a task to stall for hours. Log only on abnormal conditions to avoid performance impact.

Diagnosing Stuck Jobs via JMX

Configure TaskManagers for remote JMX monitoring by adding to flink-conf.yaml:

env.java.opts: "-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.port=1099 -Dcom.sun.management.jmxremote.rmi.port=1099 -Djava.rmi.server.hostname=127.0.0.1"

Then forward the port and connect with JConsole:

Add the property above to flink-conf.yaml.

Run kubectl port-forward flink-taskmanager-4 1099.

Open jconsole 127.0.0.1:1099.

JConsole reveals which thread is blocked, allowing rapid identification of the problematic method.

State Migration Risks

When moving data between states (e.g., from WindowContent to HistoricalSessions), ensure the first state is purged after migration. Using PurgingTrigger.of(EventTimeTrigger.create()) clears the window content, preventing duplicate events and OOM errors caused by late data merging.

// Purging the window's content allows us to receive late events without merging them twice with the old session
val sessionWindows = keyedStream
  .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
  .allowedLateness(Time.days(7))
  .trigger(PurgingTrigger.of(EventTimeTrigger.create()))

Reduce vs Process

Two ways to aggregate keyed data:

Store events in ListState and merge at session end (used with ProcessWindowFunction).

Use ReducingState to combine each new event on the fly (used with ReduceFunction).

For this workload, ListState gave better performance because RocksDB can merge without deserialization, whereas ReducingState incurs repeated serialization/deserialization.

Never Trust Input Data

Validate and filter malformed or excessive data early. For example, limit PV events per session to 300, discard events with page numbers > 300, and enforce a maximum session size (e.g., 4 MB) to avoid RocksDB limits (2³¹ bytes). Store per‑session metadata in a separate ValueState to decide when to stop accepting new data.

Event‑Time Pitfalls

Late‑arriving data from upstream (e.g., the Asimov Akka stream) can cause watermarks to advance based on fast partitions, making slow partitions appear late. Two mitigation strategies:

Partition Asimov’s output by the same key as Flink’s input, aligning watermarks.

Batch‑process late events with a custom trigger that always registers a timer, allowing downstream updates to be coalesced instead of per‑event.

Avoid Storing All Data in Flink

For very large datasets that are rarely accessed, store them outside Flink (e.g., in an external database). Keep only hot state in Flink’s RocksDB or memory. In the case study, historical session data was moved to Aerospike, reducing Flink state size and simplifying maintenance.

Conclusion

Migrating Spark jobs to Flink brings real‑time capabilities but requires careful attention to parallelism, state TTL, checkpointing, logging, debugging, and data validation. Following the ten lessons above helps avoid costly pitfalls and achieve a stable, performant streaming pipeline.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Flink stream processing State Management Performance Tuning checkpointing Spark migration

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.