
Building a Real‑Time Service Monitoring Framework with Flink at NetEase Cloud

This article explains how NetEase Cloud Communication designed and implemented a Flink‑based streaming aggregation framework that processes massive heartbeat logs in real time, handles data skew with two‑stage aggregation, and outputs metrics to Kafka and InfluxDB for monitoring and alerting.

NetEase Smart Enterprise Tech+

Overall Architecture

NetEase Cloud Communication, a PaaS offering, needs real‑time monitoring of service health indicators — metaphorically, the service's heartbeat, pulse, and blood pressure. The monitoring platform collects massive volumes of unordered SDK and server logs, analyzes and aggregates them in real time, and visualizes the core metrics for stakeholders.

Modules

source : Periodically loads aggregation rules, creates Kafka consumers as needed, and continuously consumes data.

process : Handles grouping, windowing, aggregation, and period‑over‑period (ring‑ratio) calculations. To mitigate data skew, aggregation is split into two stages.

sink : Outputs aggregated metrics to Kafka (for downstream processing) and InfluxDB (for visualization and queries).

reporter : Collects full‑link statistics such as QPS, latency, backlog, and late data volume.

Source – Rule Configuration

Key parameters for metric computation are abstracted and managed via a visual configuration page, allowing easy creation and maintenance of aggregation rules.

Data Consumption

Metrics sharing the same data source use a common Kafka consumer; after parsing, each metric invokes collect() for distribution. Optional filter rules discard non‑matching records before distribution.
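The routing step can be sketched as a pure function: one shared consumer parses a record once, then emits it to every metric whose optional filter rule matches. Class and field names below are illustrative, not the framework's actual API; in the real Flink job the emit would be a call to `collect()` on the operator's collector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class MetricRouter {
    // A metric rule: an identifier plus an optional filter predicate.
    record MetricRule(String metricId, Predicate<Map<String, String>> filter) {}

    // Returns the ids of all metrics the parsed record should be distributed to.
    static List<String> route(Map<String, String> record, List<MetricRule> rules) {
        List<String> targets = new ArrayList<>();
        for (MetricRule rule : rules) {
            // No filter means the metric consumes every record from this source.
            if (rule.filter() == null || rule.filter().test(record)) {
                targets.add(rule.metricId()); // in Flink: collector.collect(...)
            }
        }
        return targets;
    }
}
```

Sharing one consumer per data source keeps Kafka connection count proportional to the number of topics, not the number of metrics.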

Process – Core Aggregation Code

SingleOutputStreamOperator<MetricContext> aggResult = src
    .assignTimestampsAndWatermarks(new MetricWatermark())  // event-time watermarks
    .keyBy(new MetricKeyBy())                              // grouping dimensions
    .window(new MetricTimeWindow())                        // tumbling/sliding windows
    .aggregate(new MetricAggFunction());                   // aggregation operators

The supporting functions are:

MetricWatermark() : Generates watermarks from a configured timestamp field.

MetricKeyBy() : Defines the aggregation dimensions, similar to SQL GROUP BY.

MetricTimeWindow() : Configures sliding or tumbling windows based on rule settings.

MetricAggFunction() : Implements the various aggregation operators described later.
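For a concrete feel of what such an aggregation function looks like, here is a minimal average operator following the shape of Flink's AggregateFunction contract (createAccumulator / add / merge / getResult), written as plain Java so the accumulator logic is visible. The real framework's operators are not shown in the article; this is only a sketch of the pattern.

```java
class AvgAggFunction {
    // Accumulator: running sum and count, kept in Flink window state.
    static class Acc { double sum; long count; }

    static Acc createAccumulator() { return new Acc(); }

    static Acc add(double value, Acc acc) {
        acc.sum += value;
        acc.count += 1;
        return acc;
    }

    // merge() is what makes two-stage (pre-aggregate, then combine) possible.
    static Acc merge(Acc a, Acc b) {
        a.sum += b.sum;
        a.count += b.count;
        return a;
    }

    static double getResult(Acc acc) {
        return acc.count == 0 ? 0.0 : acc.sum / acc.count;
    }
}
```

Because the accumulator is mergeable, partial results from stage 1 can be combined losslessly in stage 2.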

Two‑Stage Aggregation for Data Skew

To address hot keys caused by data skew, the framework performs:

Stage 1: Randomly shuffles data and performs pre‑aggregation.

Stage 2: Aggregates the pre‑aggregation results to produce final metrics.

If the parallelism parameter exceeds 1, a random key is generated and concatenated with the original grouping key to achieve the shuffle.
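The salting trick described above can be sketched as follows: append a random bucket id to the hot grouping key so stage‑1 pre‑aggregation spreads across subtasks, then strip the salt so stage 2 can group by the original key again. Method names and the `#` separator are illustrative.

```java
import java.util.concurrent.ThreadLocalRandom;

class KeySalter {
    // Stage 1: spread one hot key across `parallelism` buckets.
    static String salt(String key, int parallelism) {
        if (parallelism <= 1) return key;          // no skew handling configured
        int bucket = ThreadLocalRandom.current().nextInt(parallelism);
        return key + "#" + bucket;                 // e.g. "appA|login#3"
    }

    // Stage 2: recover the original key and combine the partial aggregates.
    static String unsalt(String saltedKey) {
        int idx = saltedKey.lastIndexOf('#');
        return idx < 0 ? saltedKey : saltedKey.substring(0, idx);
    }
}
```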

Aggregation Operators

The platform provides common operators, optimized for large‑scale data:

median / tp90 / tp95 : Approximate percentiles using a non‑uniform histogram algorithm similar to Hive’s percentile_approx.

count‑distinct : Exact distinct count via RoaringBitmap compression.

count‑distinct (approx.) : Approximate distinct count using HyperLogLog.
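The bitmap-based exact count-distinct can be illustrated with a tiny sketch. The article's framework uses RoaringBitmap (a compressed bitmap); `java.util.BitSet` stands in here only to keep the example dependency‑free — the add/merge/cardinality pattern is the same.

```java
import java.util.BitSet;

class DistinctCounter {
    private final BitSet bits = new BitSet();

    // Each user/device must first be mapped to a dense integer id.
    void add(int id) { bits.set(id); }

    // Pre-aggregated bitmaps from stage 1 are OR-merged in stage 2.
    void merge(DistinctCounter other) { bits.or(other.bits); }

    long cardinality() { return bits.cardinality(); }
}
```

The OR-merge is why bitmaps compose cleanly across the two aggregation stages: duplicates across partial results are deduplicated for free.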

Post‑Processing

Composite Metrics : Combine basic metrics (e.g., success rate = successes / attempts) based on configurable rules.

Relative Metrics : Compute period‑over‑period changes (YoY, MoM) using Flink state.
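The post-processing math for both metric types is simple; a minimal sketch, assuming the previous-period value is read from Flink state:

```java
class PostProcessing {
    // Composite metric: e.g. success rate = successes / attempts.
    static double successRate(long successes, long attempts) {
        return attempts == 0 ? 0.0 : (double) successes / attempts;
    }

    // Relative metric: period-over-period change versus the value stored
    // in state for the comparison window (previous day, month, etc.).
    static double relativeChange(double current, double previous) {
        return previous == 0.0 ? 0.0 : (current - previous) / previous;
    }
}
```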

Handling Abnormal Data

Late data beyond the allowed lateness is collected via sideOutputLateData and reported; minor lateness triggers window recomputation with FIRE_AND_PURGE to avoid double counting. Early data caused by clock drift is manually adjusted to prevent watermark disruption.
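The decision logic above can be modeled as a pure function of the window end, the current watermark, and the allowed lateness. In the Flink job this is done by the window operator together with `allowedLateness(...)` and `sideOutputLateData(...)`; the thresholds below mirror that behavior for illustration only.

```java
class LatenessClassifier {
    enum Disposition { ON_TIME, RECOMPUTE_WINDOW, SIDE_OUTPUT }

    static Disposition classify(long windowEnd, long watermark, long allowedLateness) {
        if (watermark < windowEnd) {
            return Disposition.ON_TIME;            // window has not fired yet
        }
        if (watermark < windowEnd + allowedLateness) {
            return Disposition.RECOMPUTE_WINDOW;   // minor lateness: re-fire with FIRE_AND_PURGE
        }
        return Disposition.SIDE_OUTPUT;            // too late: report via side output
    }
}
```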

Sink

Aggregated metrics are sent to:

Kafka : Topic named after the metric identifier for downstream processing such as alert generation.

InfluxDB : Time‑series table named after the metric identifier for API queries and dashboards.
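For the InfluxDB side, each aggregated point is ultimately serialized into line protocol (`measurement,tag=... field=value timestamp`). The formatter below is a hypothetical sketch — the article does not show the sink code, and the real job presumably uses an InfluxDB client library; the `value` field name is an assumption.

```java
import java.util.Map;
import java.util.TreeMap;

class InfluxLineFormatter {
    static String format(String metricId, Map<String, String> tags,
                         double value, long timestampNs) {
        StringBuilder sb = new StringBuilder(metricId);   // measurement = metric identifier
        // Sort tags for a deterministic, index-friendly key order.
        for (Map.Entry<String, String> tag : new TreeMap<>(tags).entrySet()) {
            sb.append(',').append(tag.getKey()).append('=').append(tag.getValue());
        }
        sb.append(" value=").append(value).append(' ').append(timestampNs);
        return sb.toString();
    }
}
```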

Reporter

Using InfluxDB + Grafana, the system monitors QPS, latency, backlog, and late data across all pipeline stages, providing full‑link visibility.

Summary

The generic aggregation framework now powers over 100 distinct metrics for NetEase Cloud Communication, delivering rapid, configuration‑driven metric creation (development time reduced from days to minutes), simplified maintenance (one Flink job for all metrics, resource usage cut from 300+ CU to 40 CU), and transparent operation through comprehensive monitoring.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Flink, stream processing, real-time monitoring, data skew, metric computation, aggregation
Written by

NetEase Smart Enterprise Tech+

Get cutting-edge insights from NetEase's CTO, access the most valuable tech knowledge, and learn NetEase's latest best practices. NetEase Smart Enterprise Tech+ helps you grow from a thinker into a tech expert.
