
Practical Experience of Flink on Kubernetes at Kuaishou

This article presents Kuaishou's journey of adopting Flink on Kubernetes, covering its background, architecture evolution, production migration, observability, testing, and future plans, and shows how large‑scale streaming workloads were moved to a cloud‑native environment.

DataFunSummit

Background

Kuaishou has evolved its Flink architecture over five years, in three stages:

2018‑2020 – built a real‑time computing platform and achieved production readiness in the Flink runtime, SQL, and state engine.

2021‑2022 – optimized usability, stability, and functionality for larger‑scale scenarios and explored stream‑batch and lake‑warehouse integration.

2022‑2023 – migrated Flink to Kubernetes, implemented runtime adaptation, scaled up AI applications, and improved the surrounding ecosystem.

Flink Application Scenarios at Kuaishou

Real‑time data streams for core services such as audio‑video and recommendation.

Unified stream‑batch and lake‑warehouse construction.

Large‑scale AI workloads including feature engineering and data processing, running on over one million CPU cores with 10,000‑20,000 Flink jobs, peak throughput exceeding 1 billion events per second, and daily data volume over 100 trillion.

Flink Architecture Evolution

2018‑2021: Flink ran on Yarn because Yarn offered better scheduling performance for tens of thousands of nodes and seamless integration with the Hadoop ecosystem.

2022‑2023: Flink migrated to Kubernetes to leverage a unified cloud‑native ecosystem, unified resource and application management, better isolation, and more stable production guarantees.

Current Overall Architecture

The architecture consists of four layers:

Resource & storage layer – Kubernetes and Yarn for resources; HDFS and Kwaistore for storage.

Compute layer – Flink Streaming & Batch providing a unified runtime.

Application layer – separated online and offline platforms.

Business layer – serving all company departments.

Production Transformation

To follow the cloud‑native trend, Kuaishou developed and migrated Flink to Kubernetes.

Core Pain Points

Design – achieving a smooth transition from Yarn to Kubernetes with minimal user impact.

Development – Flink 1.10 used in production required extensive refactoring due to missing features and bugs.

Testing – comprehensive testing was required to guarantee stability before production jobs could be moved.

System Design

The user interface layer abstracts the underlying cluster; users select Yarn or Kubernetes at job submission, while the backend unifies Yarn queues and Kubernetes custom resources (CRDs) to provide a seamless experience.

Feature Development

(1) Overall Architecture – consists of three parts: Flink client (defines pod templates and job topology), Kubernetes master (control and storage, launches Flink tasks, stores metadata in ETCD), and Flink components (Dispatcher, Resource Manager, JobMaster, LogService, MetricReporter, Ingress/Service, Kwaistore).

(2) Runtime Modes – session mode (a long‑lived cluster shared by multiple jobs, with weak isolation), per‑job mode (one cluster per job, better isolation; the primary mode at Kuaishou), and application mode (one cluster per job with a thin client, which moves job launch work to the cluster and spreads launch pressure; initially incomplete but later improved).

(3) Observability – Flink itself provides metrics for throughput, memory, CPU, and checkpoints, but scraping Kubernetes‑level metrics per pod causes a connection explosion and Prometheus scaling problems, and a unified view across Yarn and Kubernetes metrics is needed.

Solution: a KafkaGateway aggregates metrics from Yarn (machine‑level aggregation) and Kubernetes (pod‑level aggregation) before forwarding them to a unified OLAP engine and Grafana.
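The gateway's core job is pre‑aggregation: collapsing per‑pod series into per‑job series before anything reaches the OLAP engine. A minimal sketch of that step, assuming a simple tuple format for samples (the function and field names are illustrative, not Kuaishou's actual implementation):

```python
from collections import defaultdict

def aggregate_pod_metrics(samples):
    """Collapse per-pod metric samples into per-job sums.

    Each sample is (job_id, pod_name, metric_name, value); only the
    per-job totals are forwarded downstream, so the OLAP engine never
    sees one time series per pod.
    """
    totals = defaultdict(float)
    for job_id, _pod, metric, value in samples:
        totals[(job_id, metric)] += value
    return dict(totals)

samples = [
    ("job-a", "pod-1", "recordsInPerSecond", 1200.0),
    ("job-a", "pod-2", "recordsInPerSecond", 800.0),
    ("job-b", "pod-1", "recordsInPerSecond", 500.0),
]
print(aggregate_pod_metrics(samples))
# {('job-a', 'recordsInPerSecond'): 2000.0, ('job-b', 'recordsInPerSecond'): 500.0}
```

The same shape works for Yarn, with machine‑level instead of pod‑level keys, which is what lets both sources land in one Grafana view.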

(4) Issue Diagnosis – logs disappear when pods terminate; solution: detach logs from pods, store them on hostPath, expose via a web service, and optionally forward to Elasticsearch.

(5) Testing – includes integration testing, fault testing (Flink component failures, Kubernetes component failures, hardware failures), performance testing (Flink performance on K8s vs. bare metal, API server load, scheduler optimization), and regression testing.

Migration Practice

The migration work addresses four major pain points: seamless user migration, batch migration tools, health checks with one‑click rollback, and rapid resource reallocation.

1. User Migration – split configuration into common and cluster parts; the system auto‑generates cluster‑specific configs based on the selected Yarn or Kubernetes target.
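The config split above can be sketched as a simple merge: common settings travel with the job, and the platform injects cluster‑specific keys for whichever target the user picked. The Flink option keys below are real, but the values and overall schema are illustrative, not Kuaishou's actual configuration:

```python
# Settings shared by every cluster target.
COMMON = {
    "parallelism.default": "64",
    "execution.checkpointing.interval": "60s",
}

# Cluster-specific fragments the platform auto-generates.
CLUSTER = {
    "yarn": {"yarn.application.queue": "realtime"},
    "kubernetes": {"kubernetes.namespace": "flink-prod"},
}

def build_job_config(target: str) -> dict:
    """Merge the shared config with the fragment for the chosen target."""
    if target not in CLUSTER:
        raise ValueError(f"unknown target: {target}")
    return {**COMMON, **CLUSTER[target]}

print(build_job_config("kubernetes"))
```

Because only the auto‑generated fragment differs, a job can be resubmitted against either cluster without the user touching its configuration.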

2. Batch Migration – use Flink queues as migration units, prioritize low‑priority/simple jobs, generate snapshots on Yarn, restore on Kubernetes, and provide health monitoring with rollback capability.
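The per‑job loop inside a queue migration looks roughly like the sketch below, where `take_savepoint`, `restore_job`, `is_healthy`, and `rollback` are hypothetical callables standing in for the platform's real snapshot, deployment, and monitoring APIs:

```python
def migrate_queue(jobs, take_savepoint, restore_job, is_healthy, rollback):
    """Migrate one Flink queue job-by-job, lowest priority first.

    Each job is snapshotted on Yarn, restored on Kubernetes, then
    health-checked; unhealthy jobs are rolled back to Yarn.
    """
    results = {}
    for job in sorted(jobs, key=lambda j: j["priority"]):
        savepoint = take_savepoint(job)      # snapshot state on Yarn
        restore_job(job, savepoint)          # restore on Kubernetes
        if is_healthy(job):
            results[job["name"]] = "migrated"
        else:
            rollback(job, savepoint)         # one-click rollback to Yarn
            results[job["name"]] = "rolled_back"
    return results

# Toy run with stub callables standing in for the platform APIs.
jobs = [{"name": "join-job", "priority": 1}, {"name": "etl-job", "priority": 2}]
result = migrate_queue(
    jobs,
    take_savepoint=lambda j: f"hdfs://savepoints/{j['name']}",
    restore_job=lambda j, sp: None,
    is_healthy=lambda j: True,
    rollback=lambda j, sp: None,
)
print(result)  # {'join-job': 'migrated', 'etl-job': 'migrated'}
```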

3. Job Health Scoring – model key metrics (latency, snapshot success rate, GC, CPU usage, back‑pressure, data skew, custom metrics) and assign a 0‑10 score; unhealthy jobs can be rolled back.
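A minimal sketch of such a 0‑10 score as a weighted average of normalized metrics; the dimension names and weights here are illustrative, since the article does not specify Kuaishou's exact model:

```python
# Weight per health dimension; each metric is normalized to [0, 1],
# where 1.0 means fully healthy on that dimension.
WEIGHTS = {
    "snapshot_success_rate": 0.3,
    "low_latency": 0.2,
    "low_gc_pressure": 0.2,
    "no_backpressure": 0.2,
    "no_data_skew": 0.1,
}

def health_score(metrics: dict) -> float:
    """Weighted average of normalized metrics, scaled to 0-10."""
    score = sum(WEIGHTS[k] * metrics.get(k, 0.0) for k in WEIGHTS)
    return round(score * 10, 1)

def should_rollback(metrics: dict, threshold: float = 6.0) -> bool:
    """Flag a migrated job for rollback when its score drops too low."""
    return health_score(metrics) < threshold

m = {"snapshot_success_rate": 1.0, "low_latency": 0.9,
     "low_gc_pressure": 0.8, "no_backpressure": 1.0, "no_data_skew": 1.0}
print(health_score(m))  # 9.4
```

Missing metrics default to 0 (worst case), which biases the system toward rolling back rather than keeping a job it cannot observe.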

4. Resource Benefits – unified resource configuration reduces operational cost and improves asset utilization.

Future Transformations

Compute‑Storage Separation – leverage Kwaistore for large state storage and eliminate Flink snapshots.

Resource Management – implement priority‑based preemption (P0‑P3 levels) and mixed offline‑online deployment with isolation and scheduling.
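The P0‑P3 preemption idea can be sketched as a victim‑selection step: when a high‑priority job needs resources, evict strictly lower‑priority jobs, least important first, until enough cores are freed. This is a simplified model of the policy, not the actual scheduler:

```python
def select_victims(running, needed_cores, requester_priority):
    """Pick jobs to preempt for an incoming high-priority request.

    Only strictly lower-priority jobs (higher P-number) are eligible;
    P3 is evicted before P2, and so on, until enough cores are freed.
    Returns [] if preemption cannot satisfy the request.
    """
    candidates = [j for j in running if j["priority"] > requester_priority]
    candidates.sort(key=lambda j: -j["priority"])  # least important first
    victims, freed = [], 0
    for job in candidates:
        if freed >= needed_cores:
            break
        victims.append(job["name"])
        freed += job["cores"]
    return victims if freed >= needed_cores else []

running = [
    {"name": "etl-p3", "priority": 3, "cores": 200},
    {"name": "report-p2", "priority": 2, "cores": 300},
    {"name": "core-p0", "priority": 0, "cores": 500},
]
print(select_victims(running, needed_cores=400, requester_priority=0))
# ['etl-p3', 'report-p2']
```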

Runtime Adaptation – provide dynamic scaling and operator add/remove capabilities.

Unified Ecosystem – consolidate real‑time, near‑real‑time, and batch jobs onto Kubernetes, enhancing service‑orientation.

Thank you for your attention.

Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
