
Flink on Kubernetes: Kuaishou’s Practice, Migration, and Future Refactoring

This article details Kuaishou’s five‑year evolution of Flink, covering its background, production refactoring to Kubernetes, migration practices, and future improvements, highlighting architecture layers, resource management, observability, and testing strategies for large‑scale stream processing.

DataFunTalk

Background: Kuaishou’s Flink architecture has evolved over five years in three stages: initial real‑time platform construction (2018‑2020), deep optimization for stability and scale (2021‑2022), and migration to Kubernetes with runtime adaptation and AI integration (2022‑2023).

Application Scenarios: Flink is used extensively for real‑time data streams (audio‑video, recommendation), unified batch‑stream processing, and large‑scale AI workloads, handling over a million CPU cores, more than 10 billion events per second, and petabytes of data daily.

Architecture Evolution: Early deployments ran on Yarn because of its scheduling performance and Hadoop ecosystem integration. From 2022‑2023 the system shifted to Kubernetes for unified resource and application management, better isolation, and other cloud‑native benefits.

Current Architecture: The stack consists of a resource/storage layer (K8s/Yarn, HDFS, Kwaistore), a compute layer (Flink Streaming & Batch runtime), an application layer (online/offline platforms), and a business layer serving various company departments.

Production Refactoring: The migration to K8s addressed core pain points: a smooth Yarn‑to‑K8s transition, minimal user impact, unified resource abstraction, and extensive testing (integration, fault, performance, and regression). System components such as the Dispatcher, Resource Manager, JobMaster, LogService, MetricReporter, Ingress/Service, and Kwaistore were redesigned for Kubernetes.

Observability & Debugging: Metrics from Flink and K8s are aggregated through a KafkaGateway into a unified OLAP store, reducing metric explosion. A dedicated log service decouples logs from pod lifecycles, storing them on hostPath volumes and exposing them via a web service for easier troubleshooting.
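The metric‑explosion problem described above comes from high‑cardinality labels (one series per subtask). A minimal sketch of the idea, pre‑aggregating metrics before they reach the OLAP store, is shown below. The label names and metric shapes are illustrative assumptions, not Kuaishou’s actual KafkaGateway code.

```python
from collections import defaultdict

def preaggregate(metrics, drop_labels=("task_id", "subtask_index")):
    """Collapse high-cardinality labels before shipping metrics downstream.

    Each metric is a (name, labels_dict, value) tuple. Dropping per-subtask
    labels and summing values over the remaining label set reduces the
    series count from one-per-subtask to one-per-job-metric.
    """
    buckets = defaultdict(float)
    for name, labels, value in metrics:
        # Key on the metric name plus the labels we keep, in stable order.
        key = (name, tuple(sorted(
            (k, v) for k, v in labels.items() if k not in drop_labels)))
        buckets[key] += value
    return [
        (name, dict(label_items), total)
        for (name, label_items), total in buckets.items()
    ]
```

The same reduction can be pushed further downstream (e.g. rollups by time window), but collapsing subtask labels at the gateway is where the biggest cardinality win usually lies.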

Migration Practice: Migration includes seamless user‑level configuration switching between Yarn and K8s, batch migration using Flink queues, health checks with a 0‑10 scoring model, and one‑click rollback for unhealthy jobs.
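To make the 0‑10 scoring model concrete, here is a minimal sketch of how such a health check could feed a rollback decision. The specific signals, weights, and threshold are assumptions for illustration; the talk does not disclose Kuaishou’s actual model.

```python
def health_score(job):
    """Score a migrated job from 0 to 10.

    `job` is a dict of post-migration health signals; the signal names
    and penalty weights here are hypothetical.
    """
    score = 10.0
    # Frequent restarts are the strongest unhealthiness signal.
    score -= min(4.0, float(job.get("restarts_last_hour", 0)))
    # Checkpoints failing more than 10% of the time.
    if job.get("checkpoint_failure_ratio", 0.0) > 0.1:
        score -= 3.0
    # Consumer lag beyond five minutes.
    if job.get("consumer_lag_seconds", 0) > 300:
        score -= 3.0
    return max(0.0, score)

def should_roll_back(job, threshold=6.0):
    """One-click rollback candidate: jobs scoring below the threshold
    are sent back to Yarn."""
    return health_score(job) < threshold
```

A scoring model like this lets the migration tooling batch‑evaluate jobs after each wave and trigger rollback automatically rather than waiting for user reports.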

Future Refactoring: Planned work focuses on compute‑storage separation with Kwaistore, priority‑based resource preemption, runtime adaptation for dynamic scaling, and unifying real‑time, near‑real‑time, and batch jobs on Kubernetes.
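Priority‑based preemption boils down to choosing which running jobs to evict when a higher‑priority job needs resources. A minimal greedy sketch of that selection step follows; the tuple layout and policy are illustrative assumptions, not Kuaishou’s scheduler.

```python
def select_preemption_victims(running, needed_cores):
    """Pick lowest-priority jobs to evict until enough cores are freed.

    `running` is a list of (job_id, priority, cores) tuples, where a
    higher priority means more important. Returns the victim job ids,
    or an empty list if preemption cannot free enough cores.
    """
    victims, freed = [], 0
    # Evict from the lowest priority upward.
    for job_id, priority, cores in sorted(running, key=lambda j: j[1]):
        if freed >= needed_cores:
            break
        victims.append(job_id)
        freed += cores
    return victims if freed >= needed_cores else []
```

Real schedulers add refinements on top of this (minimizing the number of victims, respecting per‑queue quotas, draining gracefully via savepoints), but the core decision is this priority‑ordered scan.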

Overall, the talk outlines Kuaishou’s comprehensive journey of scaling Flink on Kubernetes, emphasizing architecture redesign, operational robustness, and forward‑looking enhancements.

Tags: Migration, Cloud Native, Big Data, Flink, Observability, Kubernetes
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
