
How Tencent Cloud Boosted APM Metric Computation Speed 2‑3× with Flink Optimizations

This article explains how Tencent Cloud's APM metric calculation, which uses Flink to transform massive volumes of Span data into aggregated metrics, hit performance bottlenecks, and how job splitting, batch merging, and dimension pruning raised computation speed 2‑3× while cutting resource usage to about 30% of the original.


Introduction

Tencent Cloud Application Performance Monitoring (APM) is an APM product that collects data via multi‑language probes and provides distributed performance analysis and self‑diagnosis of faults. This article presents several optimization schemes that improve APM metric‑calculation performance by 2‑3×.

What is APM metric calculation?

APM reports raw Span data; to compute error rate, average response time, Apdex, and other indicators, the spans must be transformed into metric data and aggregated per minute using Flink. This process is called APM metric calculation.
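The per‑minute aggregation described above can be sketched as follows. This is an illustrative Python sketch, not the actual Flink job; the span record shape and the Apdex threshold (500 ms) are assumptions.

```python
from collections import defaultdict

# Hypothetical span record: (timestamp_sec, service, duration_ms, is_error).
# The Apdex threshold T is an assumption (500 ms); APM's real value may differ.
APDEX_T_MS = 500

def aggregate_per_minute(spans):
    """Roll raw spans up into per-minute metrics, analogous to a
    one-minute tumbling window in the Flink job (illustrative only)."""
    buckets = defaultdict(list)
    for ts, service, duration_ms, is_error in spans:
        minute = ts // 60 * 60          # truncate to the minute boundary
        buckets[(minute, service)].append((duration_ms, is_error))

    metrics = {}
    for key, rows in buckets.items():
        n = len(rows)
        errors = sum(1 for _, e in rows if e)
        satisfied = sum(1 for d, e in rows if not e and d <= APDEX_T_MS)
        tolerating = sum(1 for d, e in rows if not e and APDEX_T_MS < d <= 4 * APDEX_T_MS)
        metrics[key] = {
            "count": n,
            "error_rate": errors / n,
            "avg_response_ms": sum(d for d, _ in rows) / n,
            # Standard Apdex formula: (satisfied + tolerating/2) / total
            "apdex": (satisfied + tolerating / 2) / n,
        }
    return metrics
```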

Challenges of massive data reporting

As more businesses integrate, reporting traffic reaches hundreds of millions of records per minute, causing Flink CPU overload, node failures, high network packet loss, and unstable jobs. Simply adding resources stops helping once the job reaches a certain CU count.

Flink failure log

Improving stability: job splitting

1. Why does high load persist after scaling?

Although the Barad baseline job runs the same program stably, the APM job becomes unstable under high CPU load: with so many nodes in one large job, network transmission between them consumes a large share of CPU.

Job splitting diagram

2. Job splitting criteria

Business reporting volume

Reporting data latency

Business stability requirements

Overall time window of the full call chain

Based on these, the APM metric job was split into three smaller jobs.
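The splitting criteria above amount to a routing decision per business. A minimal sketch of such routing, with entirely hypothetical thresholds and job names (the article does not specify them):

```python
# Illustrative sketch: route each business's traffic to one of three split
# jobs. Thresholds and job names are hypothetical, not APM's actual values.
def assign_job(volume_per_min: int, latency_sensitive: bool, stability_critical: bool) -> str:
    if stability_critical:
        return "apm-metric-job-core"      # isolated job for stability-critical tenants
    if volume_per_min > 10_000_000 or latency_sensitive:
        return "apm-metric-job-highvol"   # dedicated job for heavy or low-latency traffic
    return "apm-metric-job-default"       # everything else shares one job
```

Isolating jobs this way means one business's traffic spike or failure no longer destabilizes the others.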

Job splitting result

Increasing throughput: batch merging

1. Metric protocol

```protobuf
message MetricList {
  string appID = 1;
  string namespace = 2;
  repeated Metric metrics = 3;
}
```

Comparing APM with Barad shows that APM processes far fewer messages per CU even though its reporting volume is lower, indicating that each MetricList carries too few metrics, i.e. a low compression ratio.

Metric compression comparison

2. Batch solution

Adding a time window to batch MetricList improves compression, reduces network transmission, and raises Flink processing efficiency.
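The batching idea can be sketched as follows. This is an assumption-laden Python sketch, not APM's implementation; the window length, batch cap, and dict-based MetricList shape are all placeholders.

```python
import time
from collections import defaultdict

class MetricBatcher:
    """Sketch of the batch-merge idea: buffer individual metrics per
    (appID, namespace) key and emit one merged MetricList per time window,
    so each downstream message carries many metrics instead of a few.
    The window length and size cap are illustrative assumptions."""

    def __init__(self, window_sec=1.0, max_batch=500):
        self.window_sec = window_sec
        self.max_batch = max_batch
        self.buffers = defaultdict(list)   # (appID, namespace) -> [metric, ...]
        self.opened = {}                   # (appID, namespace) -> window start time

    def add(self, app_id, namespace, metric, now=None):
        now = time.monotonic() if now is None else now
        key = (app_id, namespace)
        self.opened.setdefault(key, now)
        self.buffers[key].append(metric)
        # Flush when the window expires or the batch is large enough.
        if len(self.buffers[key]) >= self.max_batch or now - self.opened[key] >= self.window_sec:
            return self.flush(key)
        return None   # still buffering

    def flush(self, key):
        batch = {"appID": key[0], "namespace": key[1], "metrics": self.buffers.pop(key)}
        self.opened.pop(key, None)
        return batch
```

Merging many small MetricList messages into one larger one is what raises the effective compression ratio and cuts per-message network and serialization overhead.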

Batch optimization diagram

Reducing memory: dimension pruning

APM currently converts every Span tag dimension into a Metric field, which wastes Kafka and Flink resources on dimensions no one queries. Limiting output to the 17 essential dimensions significantly reduces memory usage.

Dimension pruning

Overall optimization results

After applying job splitting, batch merging, and dimension pruning, Flink resource consumption decreased to about 30% of the original.

| Business Name | Before Optimization | After Optimization |
| --- | --- | --- |
| Some Business | 650 CU | 230 CU |

The optimizations significantly improved APM metric calculation performance and reduced resource usage, demonstrating that fine‑grained improvements in big‑data processing yield large gains at scale.

Big Data · Flink · APM · Performance Tuning · Cloud Monitoring · Metric Optimization
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and will accompany you throughout your operations career as we grow together.
