How Tencent Cloud Boosted APM Metric Computation Speed 2‑3× with Flink Optimizations
This article explains how Tencent Cloud's APM metric pipeline, which uses Flink to transform massive Span data into aggregated metrics, ran into performance bottlenecks, and how job splitting, batch merging, and dimension pruning sped it up 2‑3× while cutting resource usage to about 30% of the original.
Introduction
Tencent Cloud Application Performance Monitoring (APM) is an APM product that collects data through multi‑language probes and provides distributed performance analysis and self‑service fault diagnosis. This article presents several optimization schemes that improved APM metric‑calculation performance by 2‑3×.
What is APM metric calculation?
Probes report raw Span data to APM; to compute error rate, average response time, Apdex, and other indicators, Flink must transform the spans into metric data and aggregate them per minute. This process is called APM metric calculation.
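To make the span‑to‑metric step concrete, here is a minimal sketch in plain Python (not the production Flink job; the span fields, the Apdex threshold, and the bucketing by `timestamp // 60` are all illustrative assumptions):

```python
from collections import defaultdict

# Illustrative span records: (service, timestamp_sec, duration_ms, is_error)
spans = [
    ("checkout", 60, 120, False),
    ("checkout", 75, 480, True),
    ("checkout", 130, 90, False),
]

APDEX_T = 500  # Apdex "satisfied" threshold in ms (illustrative value)

def aggregate_per_minute(spans):
    """Group spans into 1-minute windows keyed by (service, minute)
    and compute basic APM indicators for each window."""
    buckets = defaultdict(list)
    for service, ts, duration, is_error in spans:
        buckets[(service, ts // 60)].append((duration, is_error))
    metrics = {}
    for key, rows in buckets.items():
        n = len(rows)
        errors = sum(1 for _, e in rows if e)
        satisfied = sum(1 for d, _ in rows if d <= APDEX_T)
        tolerating = sum(1 for d, _ in rows if APDEX_T < d <= 4 * APDEX_T)
        metrics[key] = {
            "count": n,
            "error_rate": errors / n,
            "avg_ms": sum(d for d, _ in rows) / n,
            "apdex": (satisfied + tolerating / 2) / n,
        }
    return metrics
```

In the real pipeline this aggregation runs continuously inside Flink windows; the sketch only shows the shape of the reduction from spans to per‑minute metrics.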
Challenges of massive data reporting
As more businesses integrate, reporting traffic reaches hundreds of millions of records per minute, causing Flink CPU overload, node failures, high network packet loss, and unstable jobs. Simply adding resources stops helping once the job reaches a certain CU count.
Improving stability: job splitting
1. Why does high load persist after scaling?
Although the Barad baseline job runs stably, the same program becomes unstable under APM's workload, with high CPU load driven largely by the network transmission handled across its many nodes.
2. Job splitting criteria
Business reporting volume
Reporting data latency
Business stability requirements
Overall time window of the full call chain
Based on these, the APM metric job was split into three smaller jobs.
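A hypothetical routing rule illustrates how these criteria might map each business onto one of the three smaller jobs (the thresholds, tier names, and job names here are invented for the sketch, not APM's actual configuration):

```python
def route_job(reports_per_min, stability_tier):
    """Assign a business to one of three smaller jobs based on its
    reporting volume and stability requirements (illustrative rule)."""
    if stability_tier == "critical":
        return "job-critical"      # isolate tenants with strict stability needs
    if reports_per_min > 1_000_000:
        return "job-high-volume"   # heavy reporters get a dedicated job
    return "job-default"           # everything else shares the default job
```

The benefit of splitting is isolation: a traffic spike or failure in one job no longer destabilizes the whole pipeline.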
Increasing throughput: Batch technique
1. Metric protocol
<code>message MetricList {
  string appID = 1;
  string namespace = 2;
  repeated Metric metrics = 3;
}</code>

Comparing APM with Barad shows that APM processes far fewer messages per CU despite its lower reporting volume, which points to a poor compression ratio for the many small MetricList messages.
2. Batch solution
Adding a time window that merges many small MetricList messages into one larger batch improves the compression ratio, reduces network transmission, and raises Flink processing efficiency.
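The batching idea can be sketched as follows. This is plain Python rather than the production Flink operator, JSON plus zlib stands in for protobuf plus the transport compression, and the class and field names are illustrative; the point is that one merged payload compresses much better than many tiny ones:

```python
import json
import time
import zlib

class MetricBatcher:
    """Buffer metrics per (appID, namespace) and flush one merged
    MetricList-like batch per time window, instead of emitting one
    message per metric (illustrative sketch)."""

    def __init__(self, window_sec=1.0, now=time.monotonic):
        self.window_sec = window_sec
        self.now = now
        self.buffer = {}
        self.window_start = now()

    def add(self, app_id, namespace, metric):
        self.buffer.setdefault((app_id, namespace), []).append(metric)
        if self.now() - self.window_start >= self.window_sec:
            return self.flush()
        return None  # still inside the window, keep buffering

    def flush(self):
        batches = [
            {"appID": a, "namespace": ns, "metrics": ms}
            for (a, ns), ms in self.buffer.items()
        ]
        self.buffer.clear()
        self.window_start = self.now()
        return batches

# Merged payloads compress better: repeated keys and dimension values
# sit next to each other, so the compressor finds longer matches.
metrics = [{"name": "rpc_duration", "value": i} for i in range(200)]
merged_size = len(zlib.compress(json.dumps(
    {"appID": "a", "namespace": "apm", "metrics": metrics}).encode()))
separate_size = sum(len(zlib.compress(json.dumps(
    {"appID": "a", "namespace": "apm", "metrics": [m]}).encode()))
    for m in metrics)
```

Fewer, larger messages also mean fewer network round trips and less per‑message framing overhead downstream of Flink.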
Reducing memory: dimension pruning
APM previously converted every Span tag dimension into a Metric field, which inflated Kafka and Flink resource consumption unnecessarily. By restricting metrics to 17 essential dimensions, memory usage drops substantially.
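A sketch of the pruning step, with a whitelist standing in for the 17 essential dimensions (the tag names listed here are hypothetical; the article does not enumerate the actual set):

```python
# Hypothetical whitelist; the article caps metrics at 17 essential
# dimensions but does not list them, so these names are illustrative.
ESSENTIAL_DIMENSIONS = {
    "service", "instance", "method", "status_code", "peer",
}

def prune_dimensions(span_tags):
    """Keep only whitelisted tags when converting a Span to a Metric,
    so high-cardinality or unused tags never reach Kafka/Flink."""
    return {k: v for k, v in span_tags.items() if k in ESSENTIAL_DIMENSIONS}
```

Dropping unused dimensions early shrinks every record in the pipeline, which is why the saving shows up in both Kafka bandwidth and Flink memory.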
Overall optimization results
After applying job splitting, batch merging, and dimension pruning, Flink resource consumption decreased to about 30% of the original.
| Business Name | Before Optimization | After Optimization |
| --- | --- | --- |
| Some Business | 650 CU | 230 CU |
The optimizations significantly improved APM metric calculation performance and reduced resource usage, demonstrating that fine‑grained improvements in big‑data processing yield large gains at scale.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.