Big Data 15 min read

How FlinkSQL Auto‑Tuning Saves Resources and Guarantees SLA

This article describes the design and implementation of an automated FlinkSQL tuning system that monitors metrics, evaluates task health with rule‑based logic, calculates optimal resource adjustments, and performs fast scaling to reduce cluster waste, lower operational costs, and maintain SLA compliance.

Zuoyebang Tech Team
Zuoyebang Tech Team
Zuoyebang Tech Team
How FlinkSQL Auto‑Tuning Saves Resources and Guarantees SLA

Background

FlinkSQL greatly improves real‑time task development efficiency, but the growing number of jobs introduces problems such as unreasonable resource allocation, SLA violations, pronounced traffic peaks and valleys, and task OOM.

Goal

Automatically analyze tasks and adjust resources to handle sudden traffic spikes and sustained low‑traffic periods, thereby freeing manpower and reducing costs.

Key Benefits

Save cluster resources and improve utilization.

Lower operational costs while ensuring task effectiveness.

Architecture Design

The tuning process follows four steps: metric collection, state judgment, target resource determination, and scaling.

System components (Akka‑based):

DataManager – handles database operations.

JobCoordinator – starts, stops, and load‑balances tuning actors.

JobController – executes scaling actions via REST APIs.

JobSupervisor – core module that monitors tasks, decides scaling, and performs it (one actor per job).

JobSupervisor workflow is divided into three stages:

MetricsReader – reads predefined metrics from Prometheus and Elasticsearch.

RuleEvaluator – loads tuning rules, decides whether to scale, and generates action information.

ActionMerger – selects the optimal action when multiple actions are generated.

Metrics Collection

Monitored indicators include task latency, CPU/memory usage, back‑pressure, GC, in/out QPS, slot usage, and exception logs (e.g., OOM).

Data sources: Prometheus for metrics, Elasticsearch for logs, and logs collected via Log4j2 Kafka Appender.

A custom PrometheusReporter registers task information in Zookeeper for service discovery, adjusts CPU metrics to reflect core count, and exposes Kafka connector metrics to Flink.

State Judgment & Rules

Rules are stored in MySQL and expressed with Google Avator lightweight expressions, allowing real‑time updates without code changes.

Common strategies:

cpu‑based – adjust parallelism based on TaskManager CPU usage.

source‑delay‑based – adjust parallelism based on source delay (Kafka, Hologres).

slot‑utilization‑based – adjust parallelism based on slot utilization.

memory‑utilization‑based – adjust TaskManager memory based on memory usage and GC.

job‑exception‑based – increase memory when OOM or other resource‑related exceptions occur.

Strategies are combined (e.g., source‑delay + cpu + memory + exception) for better results.

Target Resource Calculation

ActionType indicates what to adjust (memory, parallelism, slots). Expressions compute the new target values, for example:

min(max(Math.ceil(max(max_job_out_qps_for_long / (max_job_out_qps + 0.001), 1) * 1.2 * job_config_parallelism),job_params_parallelism_min), job_params_parallelism_max)

Tuning Workflow

For scaling up, latency is checked every 30 seconds to 1 minute to enable rapid response; for scaling down, a full check runs every 30 minutes to reduce pressure on Prometheus.

Results

The auto‑scaling feature is deployed on over 400 Flink jobs (real‑time data warehouse and service operations), saving roughly 20% of cluster resources while keeping latency within minutes during peak or burst traffic.

Future Plans

Optimize task startup to reduce restart time.

Introduce predictive scaling.

Follow community developments in elastic scaling.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

FlinkMetricsResource OptimizationAuto ScalingAkkaReal‑Time Computing
Zuoyebang Tech Team
Written by

Zuoyebang Tech Team

Sharing technical practices from Zuoyebang

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.