Interview Experience: Flink Task Resource Allocation, Issues, and Monitoring

This article shares an interviewee's experience discussing core Flink interview questions, including typical resource allocation for large online tasks, common problems such as data, performance, stability, and resource issues, and the monitoring practices for clusters and tasks, while also containing a brief self‑promotion.

Big Data Technology & Architecture

May 21, 2025

Hello everyone, our interview experience sharing is back.

This is a record of one student's interview, and we will go through several of the core questions that were asked, one by one.

Because the whole interview record involves a lot of personal privacy, we only select a few core questions.

At the same time, a quick advertisement: if you also need interview accompaniment or a professional deep‑dive big‑data project, feel free to contact me and mention the purpose.

Question 1: How many resources were allocated to the largest online Flink task? How many tasks?

This question is very practical; it came up in the candidate's first project interview. It also determines the direction of the follow-up questions, so the answer needs to set the stage for the rest of the discussion.

The task's resource consumption cannot be too small. Why?

Because a task with few resources, however complex its DAG, is unlikely to run into difficult problems, which leaves the follow-up questions with nothing to build on. So what is a reasonable answer for a real online environment at a large company?

In most complex business scenarios involving sorting, joins, etc., we can start to consider a task's resource consumption large when it uses about 100‑200 cores and terabyte‑level memory, at which point we begin to see complex backpressure, network resource allocation, and hotspot aggregation issues.

In extremely high‑traffic scenarios, a single task may consume thousands of cores; in an extreme case, a 3000‑5000‑core cluster might run only a handful of tasks.

As for the number of tasks: the number of TaskManagers multiplied by the slots per TaskManager gives the number of task slots available to the job. That caps the parallelism the job can run at, but it does not by itself determine the task count.

Specifically, the number of parallel subtasks for a single operator equals its parallelism, and the theoretical total for the whole job equals the sum of the parallelisms of all operators.

In practice the total is usually lower than this theoretical sum, because operator chaining, which Flink enables by default, fuses eligible adjacent operators into a single task.
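The counting rules above can be sketched in a few lines of Python. This is only an illustration, not Flink code: the job layout, operator names, and parallelism values are hypothetical, and the slot calculation assumes Flink's default behavior where all operators share one slot sharing group.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ChainedTask:
    """One task after chaining: one or more operators fused together."""
    operators: Tuple[str, ...]
    parallelism: int


def unchained_subtasks(tasks: List[ChainedTask]) -> int:
    """Theoretical count: every operator contributes its own parallelism."""
    return sum(len(t.operators) * t.parallelism for t in tasks)


def total_subtasks(tasks: List[ChainedTask]) -> int:
    """Actual count after chaining: one set of subtasks per chained task."""
    return sum(t.parallelism for t in tasks)


def slots_required(tasks: List[ChainedTask]) -> int:
    """Assumption: default slot sharing, so the job needs max-parallelism slots."""
    return max(t.parallelism for t in tasks)


def slots_available(task_managers: int, slots_per_tm: int) -> int:
    """Slots offered by the cluster: TaskManagers x slots per TaskManager."""
    return task_managers * slots_per_tm


# Hypothetical job: source and parser chained, then a keyed aggregation, then a sink.
job = [
    ChainedTask(("kafka_source", "parse"), 64),
    ChainedTask(("window_agg",), 128),
    ChainedTask(("sink",), 32),
]
```

For this hypothetical job the theoretical count is 288 subtasks, chaining brings it down to 224, and the job needs 128 slots, which 16 TaskManagers with 8 slots each would provide exactly.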

Question 2: What problems did the task encounter? What was the solution process and thinking?

This question is an extension of the first one; if your first scenario is very small and simple, this question will also be hard to answer well.

The most common problems an online task may face fall into the following categories:

Data issues – e.g., incorrect sorting leading to disorder, data hotspots, data loss, etc.

Performance issues – backpressure, latency, throughput.

Stability issues – OOM in JobManager/TaskManager, checkpoint failures, insufficient network resources, etc.

Resource issues – improper resource configuration, parameter errors, etc.

Operations and monitoring issues – missing monitoring, missing logs, etc.

Pick one of the above points and discuss it in depth.
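As an example of going deep on one point, take data hotspots. One commonly used mitigation (our choice for illustration, not something the interview record prescribes) is two-phase aggregation with key salting: append a random suffix to each key so a hot key is spread over several partial aggregators, then strip the suffix and merge. The sketch below simulates the idea in plain Python; the function names, the `#` separator, and the bucket count are all hypothetical.

```python
import random
from collections import Counter


def salted_key(key: str, buckets: int, rng: random.Random) -> str:
    """Append a random bucket suffix so one hot key maps to several sub-keys."""
    return f"{key}#{rng.randrange(buckets)}"


def two_phase_count(events, buckets: int, seed: int = 0):
    """Count events per key in two phases: salted partials, then a merge."""
    rng = random.Random(seed)

    # Phase 1: partial counts per salted key (spreads the hot key's load).
    partial = Counter(salted_key(k, buckets, rng) for k in events)

    # Phase 2: strip the salt and merge the partial counts per original key.
    final = Counter()
    for salted, count in partial.items():
        key, _, _ = salted.rpartition("#")
        final[key] += count
    return partial, final


# Hypothetical skewed input: one key carries 90% of the traffic.
events = ["hot"] * 9000 + ["cold"] * 1000
partial, final = two_phase_count(events, buckets=8)
```

The final counts match a direct count of the input, while no single partial aggregator sees the full load of the hot key. The cost is an extra shuffle and a merge step, so it is worth discussing when the trade-off pays off.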

Question 3: What monitoring have you done for the online cluster and tasks?

This is also an open‑ended question.

Generally, cluster monitoring includes capacity, overall memory and CPU usage, disk usage, and basic health status of each node.

At the cluster level, we usually maintain a detailed list of running tasks and their resource consumption for governance.
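A minimal sketch of what such an inventory might look like, assuming each job is recorded with an owner and its core and memory allocation. The record fields, job names, and numbers are hypothetical and only illustrate the kind of governance questions the list can answer.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class JobRecord:
    """One row in the (hypothetical) inventory of running jobs."""
    name: str
    owner: str
    cores: int
    memory_gb: int


def top_consumers(inventory: List[JobRecord], n: int) -> List[JobRecord]:
    """Return the n jobs that hold the most cores."""
    return sorted(inventory, key=lambda j: j.cores, reverse=True)[:n]


def cores_by_owner(inventory: List[JobRecord]) -> Dict[str, int]:
    """Total cores held by each owner, for capacity reviews."""
    totals: Dict[str, int] = {}
    for job in inventory:
        totals[job.owner] = totals.get(job.owner, 0) + job.cores
    return totals


# Hypothetical inventory.
inventory = [
    JobRecord("order_join", "trade", 200, 1024),
    JobRecord("click_agg", "ads", 1200, 4096),
    JobRecord("risk_rules", "trade", 80, 256),
]
```

Here `top_consumers(inventory, 1)` returns the `click_agg` record, and `cores_by_owner(inventory)` reports 280 cores for `trade` and 1200 for `ads`.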

For an individual task, the basic metrics are consumer lag, consumption rate, resource usage, JVM health (GC, threads, etc.), and checkpoint duration, failure count, and size. Per‑operator metrics include input/output volume, lookup hit rate and latency, and CPU and memory usage. Custom metrics may also be added depending on the scenario.
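To make the task-level metrics concrete, here is a small sketch of threshold-based alerting over a metrics snapshot. The metric names and limits are hypothetical placeholders, not Flink's actual metric identifiers; in practice they would be mapped to whatever your metrics reporter exposes and tuned per job.

```python
# Each rule: (metric name, upper limit, alert message). All values are assumptions.
RULES = [
    ("consumer_lag_records", 1_000_000, "consumer lag is high"),
    ("checkpoint_duration_ms", 60_000, "checkpoint is slow"),
    ("failed_checkpoints", 0, "checkpoints are failing"),
    ("gc_pause_ms_per_min", 5_000, "GC pressure is high"),
]


def evaluate(snapshot, rules=RULES):
    """Return the alert messages for every metric that exceeds its limit.

    Metrics missing from the snapshot are treated as 0, i.e. not alerting.
    """
    return [msg for metric, limit, msg in rules if snapshot.get(metric, 0) > limit]


# Hypothetical snapshot for one job.
snapshot = {
    "consumer_lag_records": 2_500_000,
    "checkpoint_duration_ms": 12_000,
    "failed_checkpoints": 3,
    "gc_pause_ms_per_min": 800,
}
alerts = evaluate(snapshot)
```

For this snapshot the lag and failed-checkpoint rules fire and the other two stay quiet. Keeping the rules as data makes it easy to add the custom, scenario-specific metrics mentioned above without touching the evaluation logic.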

Question 4: What data quality monitoring have you done?

Skipped – we have discussed this many times; answer with a real scenario.

Thus, the core questions of a project are basically covered. Of course, specific solution designs will be explored based on the details in the resume, which we will not discuss here.

Finally, you are welcome to join our knowledge‑sharing community:

"3 million words! The most comprehensive big data learning and interview community on the web is waiting for you."

If this article helped you, don't forget to "view", "like", and "collect" – the three‑click combo!

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
