Flink Task Auto-scaling Design and Implementation
This article presents the design and implementation of Flink task auto‑scaling, covering background, manual and automatic scaling mechanisms, architecture with RescaleCoordinator, persistence via Zookeeper and HDFS, scaling policies for parallelism, CPU and memory, and future plans for fine‑grained and time‑based resource adjustments.
Background: As Flink usage grew internally, the need to reduce costs and improve efficiency led to the development of task scaling capabilities.
Manual scaling: Provides manual adjustment of resources, reducing business impact from minutes to seconds by pre‑allocating containers and performing a Recover instead of a full restart.
Automatic scaling: Supports automatic adjustment of parallelism, TaskManager CPU and memory based on user‑defined policies.
Design steps: (1) Request new Container from ResourceManager via SlotPool, marking it; (2) Stop the job and delete ExecutionGraph; (3) Release old TaskManager, rebuild ExecutionGraph and restore from savepoint on marked TaskManager; (4) Persist the new resource settings to Zookeeper and HDFS for HA recovery.
Architecture: Added RescaleCoordinator in JobManager (HA‑maintained) that periodically checks scaling needs, notifies Dispatcher, which informs JobMaster to request TaskManagers from ResourceManager, release old ones, reschedule, and persist results.
Scaling policies: Adjust parallelism when Kafka latency high and CPU low (IO‑intensive) or idle slots; scale CPU based on utilization; scale memory based on usage and GC.
Future plans: Leverage offline/online workload peaks to improve utilization, and explore fine‑grained scaling of individual TaskManagers.
HomeTech
HomeTech tech sharing
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.