Continuous Optimization and Practice of Flink at Kuaishou
This article presents Kuaishou's comprehensive engineering practices for improving Flink's stability, task startup latency, and SQL performance, including high‑availability Kafka connectors, fault‑recovery mechanisms, I/O reductions, asynchronous job upgrades, aggregation optimizations, and future resource‑utilization plans.
The talk, delivered by Kuaishou's real‑time computing lead Dong Tingting, outlines a series of practical optimizations applied to Flink in production.
1. Continuous optimization of Flink stability – Introduces a high‑availability Kafka Cluster Source that reads from dual‑cluster topics and tolerates single‑cluster failures, and a Cluster Sink that automatically switches to a healthy cluster. It also covers three Kafka‑related fault‑tolerance strategies: sink loss tolerance, one‑click lag discard, and dynamic broker‑list retrieval.
2. Flink task startup optimization – Analyzes the lengthy startup pipeline (client, JobMaster, TaskManager) and reduces I/O by sharing engine libraries, pre‑publishing user JARs, passing configuration via environment variables, and caching JobMaster file checks. It further proposes an asynchronous job‑upgrade scheme that launches the new job before the old one stops, achieving seamless switches within 20 seconds.
3. Flink SQL practice and optimization – Reports that Flink SQL accounts for ~30% of Kuaishou's streaming jobs, processing up to 400 million events per second. Optimizations address aggregation skew (mini‑batch aggregation, local‑global two‑stage aggregation, split distinct aggregation) and UDF reuse, which cuts duplicate function calls and doubles performance.
4. Future work – Plans focus on improving cluster resource balance, enhancing Flink SQL stability and resource efficiency, and exploring unified stream‑batch processing to eliminate duplicated code bases.
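The Cluster Sink behaviour from item 1 (switch to a healthy cluster when the current one fails) can be illustrated with a small sketch. This is not Kuaishou's connector and uses no Kafka client: `FailoverSink`, `ClusterUnavailable`, `make_sender`, and the cluster names are all hypothetical, and each cluster is modelled as a plain callable that may raise.

```python
from typing import Callable, Dict, List


class ClusterUnavailable(Exception):
    """Raised by a (hypothetical) per-cluster send function when that cluster is down."""


class FailoverSink:
    """Sends each record to the active cluster and switches to the next one on failure."""

    def __init__(self, senders: Dict[str, Callable[[str], None]]) -> None:
        self.senders = senders
        self.order: List[str] = list(senders)
        self.active = self.order[0]

    def send(self, record: str) -> str:
        """Deliver the record and return the name of the cluster that accepted it."""
        start = self.order.index(self.active)
        for offset in range(len(self.order)):
            name = self.order[(start + offset) % len(self.order)]
            try:
                self.senders[name](record)
            except ClusterUnavailable:
                continue  # try the next cluster in the ring
            self.active = name  # stay on the cluster that worked
            return name
        raise ClusterUnavailable("no healthy cluster available")


# Simulated clusters: a record is "delivered" unless the cluster is marked down.
delivered: Dict[str, List[str]] = {"cluster_a": [], "cluster_b": []}
down = {"cluster_a"}  # assumption for the demo: cluster_a is currently unreachable


def make_sender(name: str) -> Callable[[str], None]:
    def sender(record: str) -> None:
        if name in down:
            raise ClusterUnavailable(name)
        delivered[name].append(record)
    return sender


sink = FailoverSink({name: make_sender(name) for name in delivered})
used = sink.send("event-1")
```

In this sketch the sink stays on the cluster it switched to rather than switching back as soon as the original recovers. That is a design choice made here for simplicity, not something the talk specifies.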
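The local‑global two‑stage aggregation named in item 3 can be shown with a toy count‑per‑key example. The sketch below is plain Python rather than Flink SQL, and the function names and sample data are assumptions made for illustration. Each upstream task first collapses its mini‑batch into per‑key partial counts, so a hot key sends one partial result per task downstream instead of one record per event.

```python
from collections import Counter
from typing import Iterable, List


def local_aggregate(events: Iterable[str]) -> Counter:
    """Stage 1: collapse one task's mini-batch into per-key partial counts."""
    return Counter(events)


def global_aggregate(partials: Iterable[Counter]) -> Counter:
    """Stage 2: merge the partial counts from all upstream tasks."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return total


# Two upstream tasks, both dominated by the skewed key "hot".
batches: List[List[str]] = [
    ["hot"] * 5 + ["cold"],
    ["hot"] * 4 + ["warm"],
]

partials = [local_aggregate(batch) for batch in batches]
result = global_aggregate(partials)

raw_records = sum(len(batch) for batch in batches)            # sent without stage 1
shuffled_records = sum(len(partial) for partial in partials)  # sent with stage 1
```

Here `raw_records` counts what the global stage would receive without pre‑aggregation and `shuffled_records` counts what it receives with it. The gap widens as the hot key dominates more of each batch, which is why the technique helps with skew.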
The article concludes with a brief recruitment notice for Kuaishou's data platform team.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
