Practical Implementation of Data Integration with Flink on Kubernetes at Li Auto
Li Auto built a cloud‑native data‑integration platform by deploying Flink on Kubernetes, unifying batch and streaming workloads with a storage layer (JuiceFS + BOS) and Flink Operator, enabling simple source‑sink pipelines, elastic scaling, automated checkpointing, and centralized monitoring while addressing earlier fragmentation and resource inefficiencies.
This article presents Li Auto's practical experience of implementing data integration using Flink on Kubernetes (Flink on K8s).
It first outlines the evolution of Li Auto's data integration across four stages: (1) July 2020 – offline data exchange built with DataX; (2) July 2021 – real‑time processing platform based on Flink; (3) September 2022 – first integration pipeline (Kafka → Hive); (4) April 2023 – unified batch‑and‑stream capabilities.
In the early stages, these capabilities were spread across multiple heterogeneous data products. This fragmentation led to pain points such as missing product capabilities, multiple development stacks (Flink, Spark, DataX, etc.), difficulty sharing resources between batch and stream workloads, and low resource utilization.
From these pain points three core requirements were derived: a unified platform that abstracts heterogeneous sources, a single compute engine that handles both batch and stream, and separation of compute and storage for elastic scaling.
To satisfy the requirements, Li Auto chose Flink as the unified compute engine and deployed it in a cloud‑native manner on Kubernetes. The architecture consists of a storage layer (JuiceFS + BOS) and a compute layer (Flink Operator‑based images). The operator provides a CRD, lifecycle management, history service, and standard APIs for users.
The design model defines source and sink plugins; users only need to declare a source, transformation logic, and a sink to create a data flow. Example pipelines include Kafka → Hive and Oracle → Kafka, and new connectors such as MySQL can be added as further plugins.
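The source/transform/sink model can be illustrated with a minimal Python sketch. It is not Li Auto's implementation: the registry, the `register_source` and `register_sink` decorators, the `Pipeline` class, and the in-memory `memory` and `list` connectors are all hypothetical names chosen for illustration. The point is only that a new connector is a new registered plugin, while the pipeline declaration keeps the same shape.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

Record = dict

# Hypothetical plugin registries, keyed by connector name.
SOURCES: Dict[str, Callable[..., Iterable[Record]]] = {}
SINKS: Dict[str, Callable[..., Callable[[Record], None]]] = {}


def register_source(name: str):
    """Register a factory that returns an iterable of records."""
    def decorator(factory):
        SOURCES[name] = factory
        return factory
    return decorator


def register_sink(name: str):
    """Register a factory that returns a write(record) callable."""
    def decorator(factory):
        SINKS[name] = factory
        return factory
    return decorator


@register_source("memory")
def memory_source(rows):
    # Stand-in for a real connector such as Kafka or Oracle.
    return iter(rows)


@register_sink("list")
def list_sink(target):
    # Stand-in for a real connector such as Hive or Kafka.
    return target.append


@dataclass
class Pipeline:
    """A declared data flow: source -> transform -> sink."""
    source: str
    source_options: dict
    transform: Callable[[Record], Record]
    sink: str
    sink_options: dict

    def run(self) -> int:
        records = SOURCES[self.source](**self.source_options)
        write = SINKS[self.sink](**self.sink_options)
        count = 0
        for record in records:
            write(self.transform(record))
            count += 1
        return count


out = []
pipeline = Pipeline(
    source="memory",
    source_options={"rows": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]},
    transform=lambda r: {**r, "name": r["name"].upper()},
    sink="list",
    sink_options={"target": out},
)
written = pipeline.run()
```

In this sketch, adding a MySQL connector would mean registering one more source or sink factory; existing pipeline declarations would not change.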
Typical scenarios cover offline integration (full and incremental sync via scheduling) and real‑time pipelines (CDC from Oracle to Kafka, automatic parallelism tuning, and checkpointing). The platform also handles Hive partitioning and automatic Kafka partition scaling.
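The difference between scheduled full and incremental sync can be sketched as follows. This is an assumption about how such a job might select its data, not a description of the platform's internals: the function `build_sync_query`, the `update_time` column, the `orders` table, and the one-day interval are hypothetical. A full run reads the whole table, while an incremental run reads only the half-open window that ends at the scheduled time.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_sync_query(table, mode, schedule_time,
                     interval=timedelta(days=1), time_column="update_time"):
    """Build the read query for one scheduled run (illustrative only).

    Identifiers are interpolated directly, so this sketch assumes they come
    from trusted job configuration rather than user input.
    """
    if mode == "full":
        return f"SELECT * FROM {table}"
    if mode == "incremental":
        # Half-open window [start, end) so consecutive runs never overlap.
        start = (schedule_time - interval).strftime(TIME_FORMAT)
        end = schedule_time.strftime(TIME_FORMAT)
        return (
            f"SELECT * FROM {table} "
            f"WHERE {time_column} >= '{start}' AND {time_column} < '{end}'"
        )
    raise ValueError(f"unknown sync mode: {mode}")


full_query = build_sync_query("orders", "full", datetime(2023, 4, 2))
incremental_query = build_sync_query("orders", "incremental", datetime(2023, 4, 2))
```

The half-open window is a design choice: a row updated exactly on a window boundary is picked up by exactly one run.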
For cloud‑native deployment, Flink Operator is used for task management, providing declarative YAML submission, ingress‑exposed Web UI, full lifecycle control, checkpointing, and automatic restart. Monitoring and alerting are integrated via Prometheus, and shared storage is realized with JuiceFS mounted on each pod.
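A declarative submission of this kind might look like the sketch below, which builds a `FlinkDeployment` manifest as a Python dictionary and serializes it to JSON (which Kubernetes accepts alongside YAML). The overall shape follows the Flink Kubernetes Operator's `FlinkDeployment` resource, but every concrete value is an assumption: the image, jar path, `/mnt/jfs` mount path, `juicefs-pvc` claim name, ingress host, and resource sizes are placeholders, and field names should be checked against the operator version in use.

```python
import json


def build_flink_deployment(name, image, jar_uri, parallelism,
                           shared_dir="/mnt/jfs", pvc_name="juicefs-pvc"):
    """Return a FlinkDeployment manifest as a dict (all values are placeholders)."""
    return {
        "apiVersion": "flink.apache.org/v1beta1",
        "kind": "FlinkDeployment",
        "metadata": {"name": name},
        "spec": {
            "image": image,
            "flinkVersion": "v1_17",
            "serviceAccount": "flink",
            "flinkConfiguration": {
                # Checkpoints and savepoints go to the shared mount
                # (assumed here to be a JuiceFS volume).
                "execution.checkpointing.interval": "60s",
                "state.checkpoints.dir": f"file://{shared_dir}/{name}/checkpoints",
                "state.savepoints.dir": f"file://{shared_dir}/{name}/savepoints",
            },
            # Expose the Flink Web UI through an ingress (hypothetical host).
            "ingress": {"template": "{{name}}.{{namespace}}.flink.example.com"},
            "podTemplate": {
                "spec": {
                    "containers": [{
                        "name": "flink-main-container",
                        "volumeMounts": [{"name": "shared", "mountPath": shared_dir}],
                    }],
                    "volumes": [{
                        "name": "shared",
                        "persistentVolumeClaim": {"claimName": pvc_name},
                    }],
                }
            },
            "jobManager": {"resource": {"memory": "2048m", "cpu": 1}},
            "taskManager": {"resource": {"memory": "2048m", "cpu": 1}},
            "job": {
                "jarURI": jar_uri,
                "parallelism": parallelism,
                "upgradeMode": "savepoint",
                "state": "running",
            },
        },
    }


manifest = build_flink_deployment(
    name="kafka-to-hive",
    image="registry.example.com/flink-integration:latest",
    jar_uri="local:///opt/flink/usrlib/integration-job.jar",
    parallelism=4,
)
manifest_json = json.dumps(manifest, indent=2)
```

The resulting JSON could be written to a file and applied with `kubectl apply -f`; the operator would then reconcile the job toward the declared state.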
Future plans include expanding supported data sources, improving elastic scaling, enhancing massive‑data transfer performance, and adding predicate push‑down for Flink batch jobs.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
