Practices for Monitoring, Resource Optimization, and Containerization of Large-Scale Flink Jobs at Beike
This article describes Beike's real‑time computing team's end‑to‑end practices for collecting and storing Flink metrics, building visual monitoring dashboards, implementing multi‑level alerting, analyzing logs, estimating CPU and memory resources, and deploying Flink on Kubernetes with containerization and storage separation to improve stability, resource utilization, and operational efficiency.
Flink, as a next‑generation real‑time computing engine, is widely used at Beike in scenarios such as real‑time metrics, ETL, and monitoring, with over 4,000 jobs online processing trillions of records daily. Improving the stability, resource utilization, and operational efficiency of Flink jobs at this scale is a challenging problem.
Metric Collection and Storage – Flink provides rich built‑in metrics. Initially, metrics were reported to InfluxDB via the official reporter, but scaling issues arose due to InfluxDB's single‑node limitation. The platform upgraded to a unified collection solution: metrics are filtered via Redis rules, sent to Kafka through a custom reporter, and then stored in both InfluxDB (full metrics) and Druid (core metrics such as CPU load, memory, GC, checkpoint). A distributed InfluxDB cluster with OpenResty proxy and Lua scripts was built to achieve high‑throughput writes and millisecond‑level queries.
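The filter‑then‑forward step of the custom reporter can be sketched as follows. This is an illustrative Python sketch only: the Redis key name, rule patterns, and metric‑name schema are assumptions, not Beike's actual implementation (which runs inside a Flink `MetricReporter` in Java).

```python
import fnmatch

def load_rules(redis_client, key="flink:metric:rules"):
    """Fetch whitelist patterns (e.g. '*.numRecordsIn') that decide which
    metrics the custom reporter forwards to Kafka. Key name is hypothetical."""
    return [r.decode() for r in redis_client.smembers(key)]

def filter_metrics(metrics, rules):
    """Keep only metrics whose fully qualified name matches some rule,
    so that low-value metrics never reach Kafka or the storage backends."""
    return {name: value for name, value in metrics.items()
            if any(fnmatch.fnmatch(name, rule) for rule in rules)}
```

In the real pipeline the surviving metrics would then be serialized and produced to a Kafka topic, from which separate consumers write the full set to InfluxDB and the core subset to Druid.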
Monitoring and Alerting – Using the full‑metric data in InfluxDB, comprehensive dashboards were created to show task health scores based on stability, checkpoint success, operator back‑pressure, resource usage, and GC frequency. Alerts are categorized into basic (e.g., checkpoint success rate), custom (user‑defined metrics), and platform‑level (e.g., task restart counts). Core metrics are stored in Druid for efficient aggregation and alert detection.
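A health score that combines the dimensions above might be computed as a weighted sum of per‑dimension sub‑scores. The weights and penalty curves below are purely illustrative assumptions, not Beike's actual scoring model:

```python
def health_score(checkpoint_success_rate, backpressure_ratio,
                 cpu_utilization, gc_seconds_per_min, restarts_last_day):
    """Combine per-dimension sub-scores (each 0-100) into one weighted
    health score. All weights and penalty factors here are illustrative."""
    scores = {
        "stability":    max(0, 100 - 20 * restarts_last_day),
        "checkpoint":   100 * checkpoint_success_rate,
        "backpressure": 100 * (1 - backpressure_ratio),
        # Penalize both over- and under-provisioning around a 60% CPU target.
        "resources":    100 - abs(cpu_utilization - 0.6) * 100,
        "gc":           max(0, 100 - 10 * gc_seconds_per_min),
    }
    weights = {"stability": 0.30, "checkpoint": 0.25, "backpressure": 0.20,
               "resources": 0.15, "gc": 0.10}
    return sum(scores[k] * weights[k] for k in scores)
```

A single scalar like this makes it easy to rank thousands of jobs on a dashboard and to attach platform‑level alerts to score drops.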
Log Analysis – Error logs are collected via Log4j Kafka appender, written to Kafka as JSON, and processed in two streams: one stores detailed logs in Elasticsearch, the other aggregates key dimensions into Druid for further analysis and alerting.
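The Druid‑bound aggregation stream can be sketched as counting records per key dimension. The JSON field names below (`job_name`, `exception_class`) are assumed for illustration, not Beike's actual log schema:

```python
import json
from collections import Counter

def aggregate_errors(raw_records):
    """Group JSON error-log records (as emitted by a Log4j Kafka appender)
    by (job, exception class). Field names are hypothetical."""
    counts = Counter()
    for raw in raw_records:
        rec = json.loads(raw)
        counts[(rec["job_name"], rec["exception_class"])] += 1
    return counts
```

In production this aggregation would run as a streaming job keyed by those dimensions, with windowed counts written to Druid for alerting, while the unmodified records go to Elasticsearch for full‑text search.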
Resource Estimation – Resource usage is analyzed at both the operator‑chain level (enhancing metrics to include per‑message processing time, detecting data skew, and low‑efficiency operators) and the resource‑metric level (memory and CPU). Memory estimation follows Java performance guidelines, using post‑Full‑GC old‑generation size, while CPU estimation relies on 90th‑percentile load and per‑message processing cost.
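The two estimation rules can be made concrete as follows. The multipliers and the target utilization are assumptions drawn from common Java performance guidance (heap sized at roughly 3–4x the post‑Full‑GC old‑generation footprint), not Beike's exact coefficients:

```python
def estimate_heap_mb(post_full_gc_old_gen_mb, factor=3.0):
    """Rule of thumb from Java performance tuning: size the heap at about
    3-4x the old-generation size measured right after a Full GC."""
    return post_full_gc_old_gen_mb * factor

def estimate_cores(p90_load, allocated_cores, target_utilization=0.8):
    """Resize the CPU allocation so that the observed 90th-percentile load
    (expressed as a fraction of currently allocated cores) lands near the
    target utilization. Target value is illustrative."""
    observed_cores = p90_load * allocated_cores
    return max(1, round(observed_cores / target_utilization))
```

For example, a job allocated 4 cores but showing a p90 load of 0.4 is using about 1.6 cores and could be shrunk to 2, freeing capacity without risking back‑pressure at peak.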
Containerization – Deploying Flink on Kubernetes brings fine‑grained resource control, dynamic scaling, and better isolation. Three deployment options are compared: Standalone on K8s, Native Flink on K8s, and Flink K8s Operator. The Operator provides lifecycle management, declarative APIs, and automatic savepoint handling, and was ultimately chosen.
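With the Flink Kubernetes Operator, a job is described declaratively through a `FlinkDeployment` custom resource; the operator then manages submission, upgrades, and savepoints. The manifest below is a minimal sketch — the job name, image tag, jar path, and resource sizes are placeholder assumptions:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-job            # placeholder name
spec:
  image: flink:1.17            # illustrative image tag
  flinkVersion: v1_17
  serviceAccount: flink
  jobManager:
    resource: {memory: "2048m", cpu: 1}
  taskManager:
    resource: {memory: "4096m", cpu: 2}
  job:
    jarURI: local:///opt/flink/usrlib/example-job.jar   # placeholder path
    parallelism: 4
    upgradeMode: savepoint     # operator takes a savepoint before upgrades
```

The `upgradeMode: savepoint` setting is what gives the automatic savepoint handling mentioned above: redeploying the resource triggers a savepoint, a graceful stop, and a restore.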
Storage‑Compute Separation – Using PVC/PV and ChubaoFS, Flink separates checkpoint, log, and dependency storage, enabling persistent, high‑performance, and concurrent access across pods.
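Checkpoint storage on ChubaoFS can then be requested through an ordinary PVC. The StorageClass name and capacity below are assumptions for illustration; the key property is `ReadWriteMany`, which lets JobManager and TaskManager pods mount the same volume concurrently:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: flink-checkpoints
spec:
  accessModes: ["ReadWriteMany"]   # concurrent access across pods
  storageClassName: chubaofs       # assumed name for the ChubaoFS CSI StorageClass
  resources:
    requests:
      storage: 100Gi               # illustrative capacity
```

Separate PVCs for checkpoints, logs, and job dependencies keep each concern independently sized and persistent across pod restarts.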
Network Optimization – Calico BGP mode is used to reduce network overhead compared to VXLAN‑based plugins, improving data‑plane performance for high‑throughput task communication.
Production Optimizations – Dual‑Operator architecture caches task status in MySQL to alleviate K8s API server pressure, and startup latency is reduced by shortening readiness probes and simplifying launch scripts, cutting task start time from ~4 minutes to ~2 minutes.
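The probe‑tuning part of that startup‑latency work amounts to tightening the pod's readiness check. The snippet below is a sketch with illustrative values, not Beike's actual configuration; 6123 is Flink's default JobManager RPC port:

```yaml
readinessProbe:
  tcpSocket:
    port: 6123               # Flink JobManager RPC port
  initialDelaySeconds: 5     # shortened so readiness is detected sooner
  periodSeconds: 5
  failureThreshold: 3
```

Shorter delays and periods let Kubernetes mark the pod ready almost as soon as the JobManager is listening, rather than waiting out a conservative default.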
Overall, more than 1,000 Flink jobs now run on K8s, achieving a 35% improvement in resource utilization and reduced operational cost, with future work focusing on automatic parallelism adjustment, machine‑learning‑driven resource prediction, and enhanced HA mechanisms.
Beike Product & Technology
As Beike's official product and technology account, we are committed to building a platform for sharing Beike's product and technology insights, targeting internet/O2O developers and product professionals. We share high-quality original articles, tech salon events, and recruitment information weekly. Welcome to follow us.
