
How WeChat Boosted Flink Stability with TaskManager Recovery and Load Balancing

This article details WeChat’s Gemini‑2.0 real‑time streaming platform built on Flink, explaining two key stability enhancements: a TaskManager‑level partial failure recovery strategy that avoids full‑job restarts and minimizes data loss during node crashes, and a load‑balancing scheduler that evenly distributes tasks across TaskManagers to improve resource utilization and reduce latency.

WeChat Backend Team

Introduction

Flink offers high throughput and low latency for big data stream processing, and it powers WeChat’s Gemini‑2.0 real‑time platform, serving recommendation, real‑time data warehouse, risk control, and other scenarios.

Gemini‑2.0 is WeChat’s internal cloud‑native big data platform built on Tencent Cloud TKE, providing compute and AI support with the following features:

Complete separation of compute and storage.

Unified orchestration and scheduling of big data and AI frameworks.

Optimized high‑performance computing components.

Flexible and efficient scalability.

As WeChat’s business rapidly expands, real‑time data applications have become ubiquitous, raising the requirements for streaming engine stability and performance. Since 2020, WeChat has built a cloud‑native, high‑performance, and reliable Flink‑on‑K8s platform. During operation, two main stability issues were identified: (1) job restarts caused by machine failures or network jitter; (2) scheduling imbalance leading to back‑pressure and OOM.

1. Partial Failure Recovery

1.1 Problem

In real‑time recommendation scenarios, occasional data loss is tolerable, but a machine failure restarts the entire Flink job; for jobs with complex execution graphs, recovery can take more than ten minutes, degrading recommendation quality.

Flink’s current task‑failure recovery strategies focus on data integrity:

Region: restart only the smallest connected execution sub‑graph that contains the failed task.

Full: restart the entire execution graph.

Region recovery helps only when the execution graph is not fully connected; most production jobs are fully connected, so the strategy degrades to a full restart, still resulting in long recovery times and output gaps.
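For reference, the two built‑in behaviors are selected through the standard Flink option that the custom strategy described below extends with a new value:

jobmanager.execution.failover-strategy=region

(region is the default in recent Flink versions; full is the other built‑in value.)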

Community proposal FLIP‑135 attempted lossy fast recovery but stalled. In practice, most partial failures stem from TaskManager loss due to node crashes, network jitter, OOM‑Kill, etc.

1.2 Solution

We implemented a TaskManager failure recovery strategy that keeps tasks on healthy TaskManagers running while quickly restoring tasks on the failed TaskManager, minimizing data loss.

The implementation requires changes at both the execution layer and the control layer.

1.2.1 Execution‑layer implementation

When a TaskManager fails, its tasks fail. We handle two cases:

1) Upstream failure, downstream continues

Upstream failure triggers InputChannel exceptions; we replace the InputChannel with one that clears its queue and pauses, preventing downstream from pulling data. After the upstream task restarts, JobManager notifies downstream tasks of new connection info, allowing them to resume.
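The behavior can be sketched with a simplified, self‑contained class; the names below are illustrative and do not match Flink’s internal InputChannel API exactly:

import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Simplified stand-in for the replacement input channel: while the upstream
 * task is down it drops buffered data and reports "no data available", so the
 * downstream task keeps running instead of failing with the channel exception.
 */
class PausedInputChannelSketch {
    private final Deque<byte[]> bufferQueue = new ArrayDeque<>();
    private volatile boolean paused = true;

    /** Called when the upstream failure is detected. */
    void pause() {
        paused = true;
        bufferQueue.clear();                 // discard anything already queued
    }

    /** Called after the JobManager sends the restarted upstream's connection info. */
    void resumeWith(String newUpstreamAddress) {
        // reconnect to the restarted upstream task (omitted) and resume polling
        paused = false;
    }

    /** The downstream task polls here; while paused it simply sees an empty channel. */
    byte[] pollNext() {
        return paused ? null : bufferQueue.poll();
    }
}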

2) Downstream failure, upstream continues

Downstream failure closes SubpartitionView, causing ResultSubpartition queues to fill and back‑pressure. We introduce a new ResultSubpartition that, when paused, discards incoming data. When the downstream task restarts, it reconnects to the upstream, which creates a new SubpartitionView and resumes normal data flow.
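The downstream‑failure case mirrors this; again a simplified sketch, not Flink’s actual ResultSubpartition API:

/**
 * Simplified stand-in for the modified result subpartition: while the
 * downstream consumer is gone, newly produced buffers are dropped instead of
 * accumulating and back-pressuring the still-healthy upstream task.
 */
class DroppingResultSubpartitionSketch {
    private final java.util.Deque<byte[]> buffers = new java.util.ArrayDeque<>();
    private volatile boolean paused;

    /** Called when the failure closes the downstream's SubpartitionView. */
    void pause() {
        paused = true;
        buffers.clear();
    }

    /** Called when the restarted downstream task creates a new SubpartitionView. */
    void resume() {
        paused = false;
    }

    /** Called by the producing task for every new buffer. */
    void add(byte[] buffer) {
        if (paused) {
            return;                          // drop data while the consumer is down
        }
        buffers.add(buffer);
    }
}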

1.2.2 Control‑layer implementation

JobManager monitors TaskManager heartbeats. When it detects that a TaskManager has been lost, it wraps the failure as a TaskManagerLostException and invokes a custom FailoverStrategy (RestartTaskManagerFailoverStrategy) that restarts only the tasks on the failed TaskManager, delegating all other failures to the default region strategy.
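The decision logic can be sketched as follows; this is illustrative and not tied to any specific Flink version’s internal FailoverStrategy interface:

import java.util.HashSet;
import java.util.Set;

/** Simplified decision logic of the RestartTaskManagerFailoverStrategy. */
class RestartTaskManagerFailoverStrategySketch {

    /** Marker the JobManager attaches when a TaskManager heartbeat times out. */
    static class TaskManagerLostException extends Exception {
        final String taskManagerId;
        TaskManagerLostException(String taskManagerId) {
            super("TaskManager lost: " + taskManagerId);
            this.taskManagerId = taskManagerId;
        }
    }

    /** Returns the set of task ids that must be restarted for a given failure. */
    Set<String> tasksToRestart(Throwable cause,
                               Set<String> tasksOnLostTaskManager,
                               Set<String> regionOfFailedTask) {
        if (cause instanceof TaskManagerLostException) {
            // Restart only the tasks that were running on the lost TaskManager.
            return new HashSet<>(tasksOnLostTaskManager);
        }
        // Every other failure falls back to the default region strategy.
        return new HashSet<>(regionOfFailedTask);
    }
}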

Configuration to enable the strategy:

jobmanager.execution.failover-strategy=taskmanager

Tests with a job running on four TaskManagers show no zero‑flow window (output never drops to zero), with the affected tasks recovering within roughly one minute, preserving real‑time recommendation latency.

2. Load‑Balancing Scheduler Optimization

2.1 Problem

Flink’s default scheduling places tasks essentially at random, so they are unevenly distributed across TaskManagers; the heavily loaded TaskManagers hit OOM or suffer back‑pressure.

Example: a job reading a 64‑partition Pulsar source runs on a 16‑TaskManager cluster with 32 slots each. Ideally each TaskManager would handle 4 of the 64 source partitions, but the actual placement is skewed.

2.2 Application‑layer optimization

Before kernel changes, we equalized the parallelism of all operators to form a long operator chain, then used a thread‑pool inside each slot to increase concurrency. This reduces overall job parallelism, shortens startup time, and lowers Flink scheduling overhead.
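A minimal sketch of the thread‑pool pattern, written as plain Java rather than the actual Flink operator classes (the real rewrite embeds this inside the FlatMap and Sink functions, taking care that any results handed back to Flink are emitted from a single thread):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of a sink that fans work out to an internal thread pool, so a single
 * slot can drive several concurrent external writes without increasing the
 * operator's parallelism.
 */
class ThreadPoolSinkSketch implements AutoCloseable {
    private final ExecutorService pool;

    ThreadPoolSinkSketch(int threadsPerSlot) {
        this.pool = Executors.newFixedThreadPool(threadsPerSlot);
    }

    /** Called once per record by the enclosing sink operator. */
    void invoke(String record) {
        pool.submit(() -> writeToExternalSystem(record));
    }

    private void writeToExternalSystem(String record) {
        // blocking I/O against the external store goes here
    }

    @Override
    public void close() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);   // drain in-flight writes
    }
}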

2.3 Application‑layer effect

After rewriting the FlatMap and Sink operators to use an internal thread pool, the source, FlatMap, and sink form a single operator chain with parallelism 64, and the load across TaskManagers becomes balanced.

2.4 Kernel‑layer optimization

We redesigned slot allocation by grouping operators into ExecutionSlotSharingGroups. After all slots are available, we assign groups to TaskManagers based on category IDs, ensuring an even distribution.
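A simplified sketch of the balanced assignment; the real change lives inside Flink’s slot allocator and operates on ExecutionSlotSharingGroup objects, so the types and names below are illustrative:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of balanced slot assignment: slot-sharing groups are dealt out to
 * TaskManagers round-robin within each category, so every TaskManager ends up
 * hosting roughly the same number of groups of each kind.
 */
class BalancedGroupAssignerSketch {

    /** Maps each group id to the TaskManager that should host it. */
    Map<String, String> assign(Map<String, List<String>> groupsByCategory,
                               List<String> taskManagers) {
        Map<String, String> placement = new HashMap<>();
        for (List<String> groups : groupsByCategory.values()) {
            int i = 0;
            for (String groupId : groups) {
                // deal the groups of this category out evenly across TaskManagers
                placement.put(groupId, taskManagers.get(i % taskManagers.size()));
                i++;
            }
        }
        return placement;
    }
}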

The final placement yields balanced load across TaskManagers without requiring user‑level code changes.

Configuration to enable the balanced scheduler:

jobmanager.ng-scheduler.balance-match-between-taskmanagers=true

In production, memory utilization becomes balanced, allowing a 15‑20% reduction in resource allocation for typical jobs and up to 30% for large jobs.

3. Summary

This article presents two internal WeChat optimizations for Flink stability: (1) a TaskManager‑level partial failure recovery strategy that minimizes data loss and quickly restores service for real‑time recommendation workloads, and (2) application‑level and kernel‑level load‑balancing strategies that distribute tasks evenly, improve stability, and reduce resource consumption.
