Building a Real‑Time Service Monitoring Framework with Flink at NetEase Cloud
This article explains how NetEase Cloud Communication designed and implemented a Flink‑based streaming aggregation framework that processes massive heartbeat logs in real time, handles data skew with two‑stage aggregation, and outputs metrics to Kafka and InfluxDB for monitoring and alerting.
Overall Architecture
NetEase Cloud Communication, a PaaS service, needs real‑time monitoring of service health indicators such as heartbeat, pulse, and blood pressure. The monitoring platform collects massive, unordered SDK and server logs, performs real‑time analysis and aggregation, and visualizes core metrics for stakeholders.
Modules
source : Periodically loads aggregation rules, creates Kafka consumers as needed, and continuously consumes data.
process : Handles grouping, windowing, aggregation, and ring‑ratio calculations. To mitigate data skew, aggregation is split into two stages.
sink : Outputs aggregated metrics to Kafka (for downstream processing) and InfluxDB (for visualization and queries).
reporter : Collects full‑link statistics such as QPS, latency, backlog, and late data volume.
Source – Rule Configuration
Key parameters for metric computation are abstracted and managed via a visual configuration page, allowing easy creation and maintenance of aggregation rules.
Data Consumption
Metrics sharing the same data source use a common Kafka consumer; after parsing, each metric invokes collect() for distribution. Optional filter rules discard non‑matching records before distribution.
Process – Core Aggregation Code
SingleOutputStreamOperator<MetricContext> aggResult = src
.assignTimestampsAndWatermarks(new MetricWatermark())
.keyBy(new MetricKeyBy())
.window(new MetricTimeWindow())
.aggregate(new MetricAggFuction());The supporting functions are: MetricWatermark(): Generates watermarks from a configured timestamp field. MetricKeyBy(): Defines aggregation dimensions, similar to SQL GROUP BY. MetricTimeWindow(): Configures sliding or tumbling windows based on rule settings. MetricAggFuction(): Implements various aggregation operators described later.
Two‑Stage Aggregation for Data Skew
To address hot keys caused by data skew, the framework performs:
Stage 1: Randomly shuffles data and performs pre‑aggregation.
Stage 2: Aggregates the pre‑aggregation results to produce final metrics.
If the parallelism parameter exceeds 1, a random key is generated and concatenated with the original grouping key to achieve the shuffle.
Aggregation Operators
The platform provides common operators, optimized for large‑scale data:
median / tp90 / tp95 : Approximate percentiles using a non‑uniform histogram algorithm similar to Hive’s percentile_approx.
count‑distinct : Exact distinct count via RoaringBitmap compression.
count‑distinct (approx.) : Approximate distinct count using HyperLogLog.
Post‑Processing
Composite Metrics : Combine basic metrics (e.g., success rate = successes / attempts) based on configurable rules.
Relative Metrics : Compute period‑over‑period changes (YoY, MoM) using Flink state.
Handling Abnormal Data
Late data beyond the allowed lateness is collected via sideOutputLateData and reported; minor lateness triggers window recomputation with FIRE_AND_PURGE to avoid double counting. Early data caused by clock drift is manually adjusted to prevent watermark disruption.
Sink
Aggregated metrics are sent to:
Kafka : Topic named after the metric identifier for downstream processing such as alert generation.
InfluxDB : Time‑series table named after the metric identifier for API queries and dashboards.
Reporter
Using InfluxDB + Grafana, the system monitors QPS, latency, backlog, and late data across all pipeline stages, providing full‑link visibility.
Summary
The generic aggregation framework now powers over 100 distinct metrics for NetEase Cloud Communication, delivering rapid, configuration‑driven metric creation (development time reduced from days to minutes), simplified maintenance (one Flink job for all metrics, resource usage cut from 300+ CU to 40 CU), and transparent operation through comprehensive monitoring.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
NetEase Smart Enterprise Tech+
Get cutting-edge insights from NetEase's CTO, access the most valuable tech knowledge, and learn NetEase's latest best practices. NetEase Smart Enterprise Tech+ helps you grow from a thinker into a tech expert.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
