Li Auto’s Flink on Kubernetes Data Integration Practice
This article presents Li Auto’s end‑to‑end data integration journey, detailing the evolution of its data platform, the challenges of heterogeneous sources, and how a unified Flink‑on‑K8s solution with cloud‑native architecture, operator management, monitoring, and checkpointing addresses batch‑stream convergence and future scalability.
Introduction – The article introduces Li Auto’s practical implementation of data integration using Flink on Kubernetes.
Development Stages – Four phases are described: (1) July 2020 – offline data exchange built with DataX; (2) July 2021 – real‑time processing platform based on Flink; (3) September 2022 – first integration chain (Kafka → Hive); (4) April 2023 – unified batch‑stream capabilities.
Current Pain Points – Multiple heterogeneous data sources require different engines (DataX, Flink, Spark SQL, native DB engines), leading to product capability gaps, diverse development languages, difficulty sharing resources, and low resource utilization.
Key Requirements – A unified platform to hide engine differences, a single compute engine for batch and stream, and separation of compute and storage for elastic scaling.
Solution Choice – Flink was selected for its unified batch‑stream engine and native Kubernetes support, enabling elastic scaling of compute and storage resources.
Platform Architecture – The storage layer uses JuiceFS + BOS with node‑local storage; the compute layer runs Flink kernels extended with various connectors, packaged as a standard image and managed via Flink Operator on K8s. A history service records task status and logs.
Design Model – Data integration is modeled as source‑to‑sink plugins; users define sources, sinks, and transformation logic, allowing rapid addition of new connectors (e.g., MySQL).
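The source-to-sink plugin model can be sketched as a small registry: each connector registers itself under a name, and a pipeline is just a source, a sink, and a transform wired together. This is an illustrative sketch only; all class and function names here (`SourcePlugin`, `register_source`, `run_pipeline`, etc.) are hypothetical, not Li Auto's actual internal API.

```python
# Minimal sketch of a source-to-sink plugin registry. Adding a new
# connector (e.g. MySQL) means registering one more plugin class;
# the pipeline runner itself never changes.
from abc import ABC, abstractmethod


class SourcePlugin(ABC):
    @abstractmethod
    def read(self):
        """Yield rows from the upstream system."""


class SinkPlugin(ABC):
    @abstractmethod
    def write(self, rows):
        """Write rows to the downstream system."""


SOURCES: dict = {}  # connector name -> source plugin class
SINKS: dict = {}    # connector name -> sink plugin class


def register_source(name):
    def deco(cls):
        SOURCES[name] = cls
        return cls
    return deco


def register_sink(name):
    def deco(cls):
        SINKS[name] = cls
        return cls
    return deco


@register_source("mysql")
class MySqlSource(SourcePlugin):
    def read(self):
        # Stand-in for a real JDBC read.
        yield {"id": 1, "name": "demo"}


@register_sink("hive")
class HiveSink(SinkPlugin):
    def __init__(self):
        self.buffer = []

    def write(self, rows):
        # Stand-in for a real Hive write.
        self.buffer.extend(rows)


def run_pipeline(source_name, sink_name, transform=lambda r: r):
    source = SOURCES[source_name]()
    sink = SINKS[sink_name]()
    sink.write(transform(row) for row in source.read())
    return sink


sink = run_pipeline("mysql", "hive")
```

The point of the registry indirection is that users only declare source, sink, and transform; the platform hides which engine executes the copy.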
Typical Scenarios – Offline integration synchronizes table relationships and handles full‑/incremental loads via a scheduler; real‑time pipelines use Flink CDC with parallelism tuning to avoid time‑outs; Hive partitioning strategies and automatic Kafka partition detection are also covered.
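For the real-time CDC path, a Flink SQL source table gives a feel for the knobs involved; this is a hedged sketch (hostname, database, and table names are placeholders), using documented options of the open-source Flink MySQL CDC connector rather than Li Auto's exact configuration.

```sql
-- Sketch of a Flink SQL source table backed by the MySQL CDC connector.
CREATE TABLE orders_src (
  id BIGINT,
  amount DECIMAL(10, 2),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql.example.internal',
  'port' = '3306',
  'username' = 'flink',
  'password' = '***',
  'database-name' = 'shop',
  'table-name' = 'orders',
  -- Incremental snapshotting splits the initial full load across
  -- parallel subtasks, so large tables finish before timing out.
  'scan.incremental.snapshot.enabled' = 'true'
);
```

The incremental-snapshot option is what makes parallelism tuning effective for the full-load phase; without it the snapshot runs in a single reader.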
Heterogeneous Source Handling – Source and target types are mapped to Flink types, abstracting away individual database type differences.
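Mapping each dialect's types onto one Flink type system can be modeled as per-dialect lookup tables behind a single function. The mappings below are illustrative examples, not Li Auto's actual type table.

```python
# Sketch of normalizing heterogeneous database types to Flink SQL types.
# Each dialect gets its own mapping; downstream code only sees Flink types.
MYSQL_TO_FLINK = {
    "TINYINT": "TINYINT",
    "INT": "INT",
    "BIGINT": "BIGINT",
    "DECIMAL": "DECIMAL",
    "VARCHAR": "STRING",
    "TEXT": "STRING",
    "DATETIME": "TIMESTAMP(3)",
}

POSTGRES_TO_FLINK = {
    "INTEGER": "INT",
    "BIGINT": "BIGINT",
    "NUMERIC": "DECIMAL",
    "TEXT": "STRING",
    "TIMESTAMP": "TIMESTAMP(3)",
}

DIALECTS = {"mysql": MYSQL_TO_FLINK, "postgres": POSTGRES_TO_FLINK}


def to_flink_type(dialect: str, db_type: str) -> str:
    # Normalize e.g. "varchar(255)" -> "VARCHAR" before the lookup.
    base = db_type.upper().split("(")[0]
    return DIALECTS[dialect][base]
```

With this indirection, connector code compares and converts values in Flink's type space and never needs to know two databases spell the same type differently.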
SQL‑Based Filtering – Common WHERE‑condition functions are provided for filtering during transformation.
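The reusable WHERE-condition helpers might look like small composable builders; the helper names below (`eq`, `between`, `in_list`, `and_`) are hypothetical stand-ins for whatever functions the platform actually exposes.

```python
# Sketch of composable WHERE-clause helpers for transformation-time filtering.
def quote(v):
    # Quote strings, pass numbers through. (Real code would also escape.)
    return f"'{v}'" if isinstance(v, str) else str(v)


def eq(col, val):
    return f"{col} = {quote(val)}"


def between(col, lo, hi):
    return f"{col} BETWEEN {quote(lo)} AND {quote(hi)}"


def in_list(col, vals):
    return f"{col} IN ({', '.join(quote(v) for v in vals)})"


def and_(*conds):
    return " AND ".join(f"({c})" for c in conds)


where = and_(eq("status", "paid"), between("amount", 100, 500))
# -> "(status = 'paid') AND (amount BETWEEN 100 AND 500)"
```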
Cloud‑Native Implementation – Flink Operator manages the lifecycle of Flink jobs on K8s, providing declarative deployment, ingress‑based UI access, checkpointing, restart strategies, and stateful/stateless upgrades.
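A declarative job under the Flink Kubernetes Operator is a `FlinkDeployment` custom resource. The sketch below follows the upstream operator CRD; the image, jar path, and resource sizes are placeholders, not the article's actual values.

```yaml
# Sketch of a FlinkDeployment managed by the Flink Kubernetes Operator.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: integration-job
spec:
  image: registry.example.internal/flink-integration:latest
  flinkVersion: v1_16
  flinkConfiguration:
    # Checkpoints on shared storage so recovery survives pod loss.
    state.checkpoints.dir: file:///jfs/checkpoints/integration-job
    execution.checkpointing.interval: "60s"
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "4096m"
      cpu: 2
  job:
    jarURI: local:///opt/flink/usrlib/integration-job.jar
    parallelism: 4
    upgradeMode: savepoint   # stateful upgrade; "stateless" restarts from empty state
```

The `upgradeMode` field is what distinguishes the stateful and stateless upgrade paths mentioned above: `savepoint` drains to a savepoint and resumes from it, while `stateless` simply redeploys.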
Monitoring & Alerting – Task metrics are exported to Prometheus; alerts trigger notifications when jobs fail, finish, cancel, or suspend.
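Exporting task metrics to Prometheus is typically done with Flink's built-in Prometheus reporter; the fragment below shows the documented reporter options (the port range is a placeholder), not necessarily the article's exact setup.

```yaml
# flink-conf.yaml fragment: expose metrics for Prometheus to scrape.
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 9249-9259
```

Each JobManager and TaskManager then serves a `/metrics` endpoint on a port from the range, which Prometheus scrapes and alert rules evaluate for the failed/finished/cancelled/suspended transitions.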
Shared Storage – JuiceFS is mounted via CSI on each pod, storing checkpoints and history for fault‑tolerant recovery.
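Mounting JuiceFS into every pod via its CSI driver usually goes through a StorageClass and a shared claim; the names below (`juicefs-sc`, `jfs-checkpoints`) and the size are placeholders for illustration.

```yaml
# Sketch: a ReadWriteMany claim on a JuiceFS StorageClass, mountable by
# every Flink pod so checkpoints and history survive pod restarts.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jfs-checkpoints
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: juicefs-sc
  resources:
    requests:
      storage: 100Gi
```

`ReadWriteMany` is the key property: JobManager and all TaskManagers see the same checkpoint directory, so a restarted job can recover from any node.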
Future Plans – Expand support for more data sources, improve elastic scaling, raise throughput for massive data transfers, and work around Flink's current lack of predicate push‑down for WHERE‑filtered jobs.
Conclusion – The case study demonstrates how a unified, cloud‑native Flink platform on Kubernetes can resolve data integration challenges and provide scalable, maintainable batch‑stream processing for large‑scale automotive data workloads.
DataFunSummit
Official account of the DataFun community, which shares big data and AI industry summit news and speaker talks.