
Performance Tuning and Stability Analysis of Large Offline Apache Flink Jobs

This article examines how to run large offline Apache Flink jobs stably by analyzing task slot and resource configurations, CPU‑to‑slot ratios, and memory usage, offering practical recommendations to improve speed, reduce resource consumption, and avoid Hadoop‑related failures.

360 Tech Engineering

Apache Flink is a distributed processing engine for both unbounded (real‑time) and bounded (offline) data streams. This article explains how to run large offline Flink jobs stably, reduce resource consumption, and lower cost.

Typical offline jobs on the platform involve HDFS‑to‑Hive data‑cleaning tasks that are stateless, processing each record independently.
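The per-record cleaning logic in such a job is a stateless regex extraction. A minimal sketch of what that core transformation might look like (the field names and pattern are hypothetical; in the actual job, a method like this would run inside a Flink `MapFunction`):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stateless extraction: pull a timestamp and a status code
// out of each log line. Because no state is kept between records, each
// record can be processed independently on any slot.
public class LogExtractor {
    private static final Pattern LINE =
        Pattern.compile("^(\\S+) \\S+ (\\d{3})$");

    public static String extract(String record) {
        Matcher m = LINE.matcher(record);
        // Non-matching records are dropped (returned as null here);
        // a real job might route them to a side output instead.
        return m.matches() ? m.group(1) + "\t" + m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("2024-01-01T00:00:00 GET 200"));
    }
}
```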

Example job: 301 files (~9.6 GB each, ~240 billion records). Processing uses regular‑expression extraction and runs with a parallelism of 301, each TaskManager (TM) configured as <10 cores, 10 slots, 15 GB memory>, creating 32 YARN containers.
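For reference, the configuration described above maps onto Flink‑on‑YARN settings roughly as follows (the keys are standard Flink options; treat this as an illustrative sketch, not the job's actual configuration file):

```yaml
# flink-conf.yaml (sketch)
taskmanager.numberOfTaskSlots: 10      # 10 slots per TM
taskmanager.memory.process.size: 15g   # 15 GB per TM
yarn.containers.vcores: 10             # 10 cores per TM
parallelism.default: 301               # one subtask per input file
```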

The job processes about 120 million records per minute but occasionally fails with Hadoop errors such as `Connection reset by peer` and `EOFException: End of File Exception`, causing data loss.

Related Theory

1. Task Slots and Resources

A TaskManager is a JVM process that can host one or more subtasks, each occupying a slot. Memory is partitioned per slot, while TCP connections, heartbeats, shared datasets, and CPU are shared across slots.

2. Recommended CPU‑to‑Slot Ratio

Stack Overflow suggests a default of one slot per CPU core; with hyper‑threading, each slot can take two or more hardware thread contexts. Thus, on hyper‑threaded machines use numOfSlots = 2 × numOfCores; otherwise use numOfSlots = numOfCores.

3. Is the TM Resource Configuration Appropriate?

The current TM setting (<10 cores, 10 slots, 15 GB>) leads to instability. Similar issues on Stack Overflow recommend deploying multiple smaller TaskManagers per host to improve load distribution.

4. Alibaba Recommendations for TM Resources

Too small a TM harms job stability and reduces slot sharing efficiency; too large a TM creates a single‑point‑of‑failure risk because many jobs share the same TM.

Problem Analysis and Solutions

Reduce TM resources (cores, memory) to improve stability when job scale grows.

For I/O‑intensive jobs, consider a CPU‑to‑slot ratio of 2:1 as Flink suggests, but validate experimentally.

Smaller TMs increase the number of TMs, which may raise JobManager scheduling pressure; evaluate impact on stability and efficiency.

Test Results Comparison

Job 1

Source: 301 files (~9.6 GB each, ~240 billion records). Transformation: regex extraction per record.

Speed: Increasing slots per TM yields modest speed gains; beyond a certain point, additional slots do not improve throughput.

Memory & Resource Sharing: More slots improve speed slightly due to shared TCP, heartbeat, and data structures, but memory increase alone does not boost Flink performance.

Stability: The configuration with the highest speed also runs stably across multiple executions.

Job 2

Source: 1,000+ files (~1 GB each, ~450 million records). Same regex processing.

Findings: Because the job is stateless, increasing memory does not improve speed; higher concurrency (more slots) improves scheduling efficiency without harming throughput.

Job 3

Source: 1,000+ files (~tens of KB each, ~3.2 million records). Stateless processing.

Observation: Raising the slot count while holding cores constant reduces the number of TMs required for the same parallelism, enhancing resource sharing and shortening scheduling time.

Offline Task Performance Tuning Summary

1. Task Stability

TM configuration should not exceed <10 cores, 10 slots, 15 GB>; reducing to a smaller profile (e.g., <4 cores, with slots reduced accordingly>) improves stability and avoids HDFS and Flink internal communication failures.
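A reduced profile along these lines might look like the following (only the 4‑core figure comes from the text; the slot count and memory value are assumptions, scaled down proportionally from the original <10 cores, 10 slots, 15 GB> setting):

```yaml
# flink-conf.yaml (sketch of a smaller, more stable TM)
yarn.containers.vcores: 4              # 4 cores per TM (from the text)
taskmanager.numberOfTaskSlots: 4       # assumed: slots reduced with cores
taskmanager.memory.process.size: 6g    # assumed: scaled down from 15g
```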

2. I/O‑Intensive Task Efficiency

For large files, match cores to slots (numOfSlots = numOfCores) to achieve peak speed (~120 M records/min). For many small files, a higher slot‑to‑core ratio (numOfSlots = n × numOfCores) improves scheduling despite a slight speed reduction.
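The two regimes can be expressed as alternative TM profiles (the specific values below are illustrative; only the ratios come from the text):

```yaml
# Large files, throughput first: one slot per core
yarn.containers.vcores: 4
taskmanager.numberOfTaskSlots: 4

# Many small files, scheduling first: more slots than cores (here n = 2)
# yarn.containers.vcores: 4
# taskmanager.numberOfTaskSlots: 8
```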

3. Memory Slimming

Stateless jobs do not require large memory; reducing the memory allotted per slot (e.g., from 1.5 GB to 1 GB) saves 0.5 GB per task — up to 500 GB across 1,000 concurrent tasks — and reduces YARN container pressure.

Tags: Performance, Big Data, Apache Flink, offline jobs, Resource Tuning, Task Slots
Written by

360 Tech Engineering

Official tech channel of 360, building the most professional technology aggregation platform for the brand.
