How NetEase Media Scaled Flink with Kubernetes: Architecture, Optimizations, and Lessons Learned

This article details how NetEase Media migrated most of its Flink jobs to a self‑built real‑time platform on Kubernetes. It covers the benefits of K8s isolation, the chosen native deployment mode, performance‑critical optimizations, monitoring, resource recommendation, and future directions for cloud‑native streaming workloads.

NetEase Media Technology Team

May 23, 2023

Benefits of Flink on Kubernetes

Moving Flink jobs from YARN to Kubernetes (K8s) solved resource‑isolation problems, reduced interference between jobs, and allowed real‑time and batch workloads to share a unified resource pool, which led to noticeable cost savings and efficiency gains.

Deployment Architecture

Flink supports three deployment modes on K8s: Standalone, Operator, and Native. The Native mode first shipped as a beta feature in Flink 1.10, matured over the following releases, and is now community‑recommended. It was selected because it offers better resource utilization and lower maintenance overhead.

The Flink client calls the K8s API server to create the ConfigMap, the JobManager Deployment, and the Services. Once the JobManager is running, its ResourceManager requests TaskManager Pods according to the job's requirements.
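As a rough illustration of the submission step, the sketch below assembles a `flink run-application` command for native Kubernetes application mode. The source does not say whether NetEase uses application or session mode, so that choice is an assumption, and the cluster ID, image name, and JAR path are hypothetical placeholders. The option keys follow the Flink documentation; newer releases also accept `kubernetes.container.image.ref`.

```python
from typing import Dict, List


def build_submit_command(cluster_id: str, image: str, jar_uri: str,
                         extra_conf: Dict[str, str]) -> List[str]:
    """Assemble a native-Kubernetes application-mode submission command."""
    conf = {
        "kubernetes.cluster-id": cluster_id,
        "kubernetes.container.image": image,
    }
    conf.update(extra_conf)

    cmd = ["flink", "run-application", "--target", "kubernetes-application"]
    # Each Flink option is passed as a -Dkey=value dynamic property.
    cmd += [f"-D{key}={value}" for key, value in sorted(conf.items())]
    cmd.append(jar_uri)
    return cmd


command = build_submit_command(
    cluster_id="demo-job",                               # hypothetical
    image="registry.example.com/flink:latest",           # hypothetical
    jar_uri="local:///opt/flink/usrlib/demo-job.jar",    # hypothetical
    extra_conf={"taskmanager.numberOfTaskSlots": "2"},
)
print(" ".join(command))
```

The command is returned as an argument list so that it can be handed to `subprocess.run` without shell quoting issues.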

Key Challenges and Solutions

Job launch speed: Instead of building a dedicated Docker image for each job, an Init Container dynamically mounts job JARs and dependencies from HDFS/S3 at pod start, dramatically reducing image size and launch time.
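One way to picture the Init Container approach is a pod template in which an init container downloads the job artifacts into a shared `emptyDir` volume that the Flink container also mounts. The sketch below builds such a template as a Python dictionary and prints it as JSON, which is also valid YAML. The fetch image, the HDFS path, the `hadoop fs -get` download command, and the mount path are assumptions for illustration, not the platform's actual setup. `flink-main-container` is the container name Flink expects in pod templates.

```python
import json
from typing import Any, Dict


def build_pod_template(jar_uri: str, fetch_image: str) -> Dict[str, Any]:
    """Pod template whose init container copies job artifacts into a shared volume."""
    volume = {"name": "job-artifacts", "emptyDir": {}}
    mount = {"name": "job-artifacts", "mountPath": "/opt/flink/usrlib"}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "flink-pod-template"},
        "spec": {
            "initContainers": [{
                "name": "fetch-artifacts",
                "image": fetch_image,
                # Hypothetical download step; any HDFS/S3 client would do.
                "command": ["hadoop", "fs", "-get", jar_uri, "/opt/flink/usrlib/"],
                "volumeMounts": [mount],
            }],
            "containers": [{
                # Flink merges its own settings into the container with this name.
                "name": "flink-main-container",
                "volumeMounts": [mount],
            }],
            "volumes": [volume],
        },
    }


template = build_pod_template(
    jar_uri="hdfs:///flink/jobs/demo-job.jar",            # hypothetical
    fetch_image="registry.example.com/hadoop-client:3",   # hypothetical
)
print(json.dumps(template, indent=2))
```

Because the job JAR lives outside the image, a single base Flink image can serve every job, and only the pod template parameters change per job.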

Stdout collection: Replaced the default flink-console.sh with a custom flink-daemon.sh that runs Flink as a background service, so that stdout is redirected to a file again and is visible in the Flink web UI.

Web UI exposure: Exposed the Flink REST service as ClusterIP and routed external traffic to it through an Ingress backed by the Nginx Ingress Controller, which provides stable, cost‑effective external access to the Flink UI.
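A minimal sketch of the ClusterIP plus Ingress setup follows. It builds a `networking.k8s.io/v1` Ingress that routes a hostname to the REST service Flink creates in native mode, named `<cluster-id>-rest` and listening on port 8081 by default. The one-host-per-job routing scheme, the host name, the namespace, and the `nginx` ingress class name are assumptions; the source does not describe the routing rules actually used.

```python
import json
from typing import Any, Dict


def build_ui_ingress(cluster_id: str, namespace: str, host: str) -> Dict[str, Any]:
    """Ingress that forwards a hostname to the Flink REST/UI service of one cluster."""
    backend = {
        "service": {
            "name": f"{cluster_id}-rest",   # service created by native Flink
            "port": {"number": 8081},        # default rest.port
        }
    }
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": f"{cluster_id}-ui", "namespace": namespace},
        "spec": {
            "ingressClassName": "nginx",     # assumed controller class name
            "rules": [{
                "host": host,
                "http": {"paths": [
                    {"path": "/", "pathType": "Prefix", "backend": backend}
                ]},
            }],
        },
    }


ingress = build_ui_ingress("demo-job", "flink", "demo-job.flink.example.com")
print(json.dumps(ingress, indent=2))
```

On the Flink side, this pairs with setting `kubernetes.rest-service.exposed.type` to `ClusterIP`, so that no load balancer or node port is allocated per job.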

State backend storage: Initially used a small HDFS cluster for snapshots and ZooKeeper for HA; later mounted local SSDs for RocksDB state, labeled the SSD nodes, and scheduled state‑heavy jobs onto them.
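The SSD scheduling idea can be sketched as a function that returns extra Flink options only for state‑heavy jobs. The option keys (`state.backend`, `state.backend.rocksdb.localdir`, `kubernetes.taskmanager.node-selector`) are real Flink settings, but the size threshold, the node label, and the local directory are hypothetical values, and how the platform actually classifies a job as state‑heavy is not stated in the source.

```python
from typing import Dict

# Hypothetical cutoff above which a job is treated as state-heavy.
STATE_HEAVY_THRESHOLD_GB = 50.0


def placement_overrides(state_size_gb: float,
                        ssd_label: str = "disktype:ssd",
                        local_dir: str = "/mnt/ssd/flink-rocksdb") -> Dict[str, str]:
    """Return extra Flink options for state-heavy jobs, or {} for other jobs."""
    if state_size_gb < STATE_HEAVY_THRESHOLD_GB:
        return {}
    return {
        "state.backend": "rocksdb",
        # RocksDB working files go to the SSD-backed directory.
        "state.backend.rocksdb.localdir": local_dir,
        # key:value pairs that TaskManager pods must match on the node.
        "kubernetes.taskmanager.node-selector": ssd_label,
    }


print(placement_overrides(10.0))
print(placement_overrides(200.0))
```

The SSD directory itself would still need to be made available inside the pod, for example through a volume in the pod template; that part is not shown here.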

Mixed‑workload interference: Tuned several Flink parameters to improve resilience, e.g., increased kubernetes.transactional-operation.max-retries, set high-availability.zookeeper.client.tolerate-suspended-connections to true, and raised akka.ask.timeout and heartbeat.timeout.
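For reference, these settings could be rendered into a flink-conf.yaml fragment as sketched below. The keys are the ones named above; the values are illustrative guesses only, since the source does not give the numbers NetEase chose.

```python
from typing import Dict

# Values are illustrative assumptions, not the ones used in production.
RESILIENCE_OVERRIDES: Dict[str, str] = {
    "kubernetes.transactional-operation.max-retries": "15",
    "high-availability.zookeeper.client.tolerate-suspended-connections": "true",
    "akka.ask.timeout": "60 s",
    "heartbeat.timeout": "180000",  # milliseconds
}


def render_flink_conf(options: Dict[str, str]) -> str:
    """Render options as 'key: value' lines in flink-conf.yaml style."""
    return "\n".join(f"{key}: {value}" for key, value in sorted(options.items()))


conf_text = render_flink_conf(RESILIENCE_OVERRIDES)
print(conf_text)
```

Longer timeouts and more retries trade slower failure detection for fewer spurious restarts when a noisy neighbor briefly starves a pod, so the right values depend on how quickly real failures must be noticed.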

Riverrun Real‑Time Computing Platform

After the PoC, NetEase built the Riverrun platform, which supports JAR and SQL jobs, online build, automatic resource recommendation, log collection, metrics aggregation, alerting, intelligent diagnosis, and batch operations.

Online build & release: Users configure Git repo, branch, and modules; the platform builds JARs in a standardized Maven environment, caches artifacts, and publishes them without manual packaging.

Resource recommendation: Based on P99 CPU and memory usage over the most recent two weeks, the system suggests resource limits with an added buffer; recommendations are pushed to users and may be applied automatically in future trials.
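The recommendation rule can be sketched as a percentile plus a buffer. The nearest‑rank percentile method and the 20% buffer below are assumptions; the source specifies only that P99 usage over two weeks and a buffer are involved.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of samples; pct must be in (0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def recommend_limit(samples: Sequence[float], buffer_ratio: float = 0.2) -> float:
    """Suggest a limit: P99 of observed usage plus a proportional buffer."""
    return percentile(samples, 99) * (1 + buffer_ratio)


# Hypothetical CPU usage samples in cores, e.g. one per collection interval.
cpu_samples = [0.8, 1.1, 0.9, 1.4, 1.0, 3.2, 1.2, 1.3]
print(recommend_limit(cpu_samples))
```

Using P99 rather than the maximum keeps a single outlier from inflating the limit when the sample count is large, while the buffer leaves headroom for traffic growth.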

Log and metric collection: Logs are captured via emptyDir, shipped to Kafka, and indexed in OpenSearch; metrics are reported through a custom Kafka Metrics Reporter and stored in TDengine.

Alerting: Basic alerts monitor job liveness; complex alerts detect conditions such as checkpoint failures and back‑pressure, with configurable thresholds. Automatic threshold recommendation is planned.
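A threshold alert of the kind described might look like the sketch below, which fires only when the most recent samples all exceed a configured threshold. The rule structure, the metric choice, and the numbers are hypothetical; the source does not describe how Riverrun evaluates its alert rules.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ThresholdRule:
    metric: str
    threshold: float
    min_consecutive: int  # how many trailing samples must breach the threshold

    def __post_init__(self) -> None:
        if self.min_consecutive < 1:
            raise ValueError("min_consecutive must be at least 1")

    def fires(self, values: Sequence[float]) -> bool:
        """True when the last min_consecutive samples all exceed the threshold."""
        if len(values) < self.min_consecutive:
            return False
        recent = values[-self.min_consecutive:]
        return all(value > self.threshold for value in recent)


# Hypothetical rule: back-pressured for more than 800 ms per second, 3 samples in a row.
backpressure_rule = ThresholdRule("backPressuredTimeMsPerSecond", 800.0, 3)

print(backpressure_rule.fires([100, 900, 950, 990]))  # sustained breach
print(backpressure_rule.fires([900, 950, 100, 990]))  # breach interrupted
```

Requiring several consecutive breaches is a simple way to avoid alerting on a single noisy sample.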

Intelligent diagnosis: A rule‑based engine correlates logs, metrics, and events to pinpoint root causes and suggest remediation steps; it is currently in gray‑scale (canary) testing.
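To make the idea concrete, here is a toy rule‑based diagnoser that checks each rule against a bundle of logs, metrics, and events and returns the advice of every rule that matches. The signal layout, the two rules, and their advice text are invented for illustration and are not taken from the Riverrun rule set.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Signals = Dict[str, Any]  # {"logs": [...], "metrics": {...}, "events": [...]}


@dataclass
class Rule:
    name: str
    matches: Callable[[Signals], bool]
    advice: str


def diagnose(signals: Signals, rules: List[Rule]) -> List[str]:
    """Return 'name: advice' for every rule whose condition holds."""
    return [f"{rule.name}: {rule.advice}" for rule in rules if rule.matches(signals)]


# Hypothetical rules combining different signal sources.
RULES = [
    Rule(
        name="container-oom",
        matches=lambda s: "OOMKilled" in s.get("events", []),
        advice="raise the TaskManager memory limit or reduce state per slot",
    ),
    Rule(
        name="slow-checkpoint-under-backpressure",
        matches=lambda s: (
            any("Checkpoint expired" in line for line in s.get("logs", []))
            and s.get("metrics", {}).get("backPressuredTimeMsPerSecond", 0) > 800
        ),
        advice="relieve back-pressure before increasing the checkpoint timeout",
    ),
]

signals = {
    "logs": ["Checkpoint expired before completing."],
    "metrics": {"backPressuredTimeMsPerSecond": 950},
    "events": [],
}
for finding in diagnose(signals, RULES):
    print(finding)
```

Keeping each rule as data plus a predicate makes it straightforward to add, disable, or roll out rules gradually.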

Batch operations: Users can select jobs by node, tag, or manual input and perform concurrent start/stop or cross‑cluster migration, with progress notifications sent to operation groups.
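The selection and concurrent execution steps could be sketched as follows. The job record layout and the `stop_job` stand‑in are hypothetical; a real implementation would call the platform's own job‑control API and send the resulting summary to the operations group.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Optional, Set


def select_jobs(jobs: Iterable[dict], node: Optional[str] = None,
                tag: Optional[str] = None,
                manual_ids: Optional[Set[str]] = None) -> List[str]:
    """Return IDs of jobs that satisfy every filter that was provided."""
    selected = []
    for job in jobs:
        if node is not None and job["node"] != node:
            continue
        if tag is not None and tag not in job["tags"]:
            continue
        if manual_ids is not None and job["id"] not in manual_ids:
            continue
        selected.append(job["id"])
    return selected


def run_batch(job_ids: List[str], action: Callable[[str], None],
              max_workers: int = 4) -> Dict[str, str]:
    """Run action for each job concurrently and record per-job outcomes."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {job_id: pool.submit(action, job_id) for job_id in job_ids}
    results = {}
    for job_id, future in futures.items():
        try:
            future.result()
            results[job_id] = "ok"
        except Exception as exc:  # one failure must not abort the batch
            results[job_id] = f"failed: {exc}"
    return results


def stop_job(job_id: str) -> None:
    """Hypothetical stand-in for the platform's stop call."""
    if job_id == "job-2":
        raise RuntimeError("savepoint timed out")


JOBS = [
    {"id": "job-1", "node": "node-a", "tags": ["news"]},
    {"id": "job-2", "node": "node-a", "tags": ["ads"]},
    {"id": "job-3", "node": "node-b", "tags": ["news"]},
]
results = run_batch(select_jobs(JOBS, node="node-a"), stop_job)
print(results)
```

Collecting a per‑job outcome instead of raising on the first error lets the progress notification report exactly which jobs still need attention.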

Future Directions

Planned improvements include moving snapshot and resource files to object storage, implementing automatic job autoscaling based on traffic patterns, and building a lakehouse using Apache Paimon (formerly Flink Table Store) to enable unified batch‑stream processing.
