Big Data 8 min read

Flink Configuration Parameters and Related Tuning for Kafka and Yarn

This article provides a comprehensive guide to configuring Apache Flink—including job manager and task manager settings, high‑availability via Zookeeper, metrics reporting, as well as Kafka producer tuning and Yarn resource adjustments—to help practitioners optimize big‑data streaming jobs.

Big Data Technology & Architecture

Aug 6, 2020

Flink Configuration Parameters and Related Tuning for Kafka and Yarn

Flink Parameter Configuration

jobmanager.rpc.address – address of the JobManager.

jobmanager.rpc.port – port of the JobManager.

jobmanager.heap.mb – JobManager heap size (1‑2 GB recommended).

taskmanager.heap.mb – TaskManager heap size, depends on workload.

taskmanager.numberOfTaskSlots – number of slots; in YARN mode it must not exceed yarn.scheduler.maximum-allocation-vcores. The required number of TaskManagers can be calculated as shown below.

num_of_tm = ceil(parallelism / slot) – i.e., parallelism divided by slot count, rounded up.

parallelism.default – default parallelism for jobs without explicit setting.

web.port – port of the Flink Web UI.

jobmanager.archive.fs.dir – directory for archiving completed jobs.

history.web.port – port of the HistoryServer UI.

historyserver.archive.fs.dir – must contain the same path as jobmanager.archive.fs.dir for the HistoryServer to read archives.

historyserver.archive.fs.refresh-interval – refresh interval for archived job directories.

state.backend – backend storage for checkpoints (e.g., rocksdb, filesystem, hdfs).

state.backend.fs.checkpointdir – default directory for checkpoint data and metadata.

state.checkpoints.dir – checkpoint storage directory.

state.savepoints.dir – savepoint directory.

state.checkpoints.num-retained – number of recent checkpoints to retain.

state.backend.incremental – enable incremental checkpointing.

akka.ask.timeout – timeout for JobManager‑TaskManager RPC; increase if network is congested.

akka.watch.heartbeat.interval – heartbeat interval for TaskManager health checks.

akka.watch.heartbeat.pause – pause duration after which a missing heartbeat marks the TaskManager as dead.

taskmanager.network.memory.max – maximum network buffer memory.

taskmanager.network.memory.min – minimum network buffer memory.

taskmanager.network.memory.fraction – fraction of JVM memory used for network buffers; overridden if max/min are set.

fs.hdfs.hadoopconf – path to Hadoop config files (deprecated; use HADOOP_CONF_DIR).

yarn.application-attempts – number of JobManager restart attempts; should not exceed yarn.resourcemanager.am.max-attempts in yarn-site.xml.

Flink HA (Job Manager) Configuration

high-availability: zookeeper – use Zookeeper for HA.

high-availability.zookeeper.path.root: /flink – Zookeeper node name for Flink metadata.

high-availability.zookeeper.quorum: zk1,zk2,zk3 – addresses and ports of Zookeeper ensemble.

high-availability.storageDir: hdfs://nameservice/flink/ha/ – directory in HDFS where JobManager metadata is stored; Zookeeper only holds a pointer.

Flink Metrics Monitoring Configuration

metrics.reporters: prom – enable Prometheus reporter.

metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter

metrics.reporter.prom.port: 9250-9260 – port range for Prometheus metrics.

Kafka Related Tuning Configuration

linger.ms / batch.size – balance throughput and latency. batch.size defines the producer batch size; when the batch is full it is sent. linger.ms forces a send after the specified time even if the batch is not full.

ack determines the required acknowledgment level: all waits for all ISR replicas, 1 waits for the leader only, and 0 does not wait, affecting data reliability versus latency.

Kafka Topic Partitions and Flink Parallelism Relationship

The parallelism of Flink’s Kafka source should match the number of Kafka topic partitions to fully exploit parallel consumption.

Yarn Related Tuning Configuration

yarn.scheduler.maximum-allocation-vcores – maximum vCores per container.

yarn.scheduler.minimum-allocation-vcores – minimum vCores per container.

Flink TaskManager slot count must lie between the above two values.

yarn.scheduler.maximum-allocation-mb – maximum memory per container.

yarn.scheduler.minimum-allocation-mb – minimum memory per container.

Flink JobManager and TaskManager memory must not exceed the container’s maximum allocation.

yarn.nodemanager.resource.cpu-vcores – recommended 2‑3× physical CPU cores; setting too low underutilizes CPU.

Overall, careful adjustment of these parameters ensures stable, high‑performance Flink jobs running on YARN with Kafka sources and Zookeeper‑based HA.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Flink Configuration kafka HA

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.