How Tencent Cloud Boosted APM Metric Computation Speed 2‑3× with Flink Optimizations
This article explains how Tencent Cloud's APM metric pipeline, which uses Flink to transform massive Span data into aggregated metrics, ran into performance bottlenecks, and how job splitting, batch merging, and dimension pruning sped it up 2‑3× while cutting resource usage to about 30% of the original.
Introduction
Tencent Cloud Application Performance Monitoring (APM) is an APM product that collects data through multi‑language probes and provides distributed performance analysis and self‑service fault diagnosis. This article presents several optimization schemes that improved APM metric‑calculation performance by 2‑3×.
What is APM metric calculation?
Probes report raw Span data to APM; to compute error rate, average response time, Apdex, and other indicators, Flink must transform the spans into metric data and aggregate them per minute. This process is called APM metric calculation.
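To make the span‑to‑metric step concrete, here is a minimal sketch in plain Python (not the production Flink job; the span fields, the Apdex threshold, and the bucketing by `timestamp // 60` are all illustrative assumptions):

```python
from collections import defaultdict

# Illustrative span records: (service, timestamp_sec, duration_ms, is_error)
spans = [
    ("checkout", 60, 120, False),
    ("checkout", 75, 480, True),
    ("checkout", 130, 90, False),
]

APDEX_T = 500  # Apdex "satisfied" threshold in ms (illustrative value)

def aggregate_per_minute(spans):
    """Group spans into 1-minute windows keyed by (service, minute)
    and compute basic APM indicators for each window."""
    buckets = defaultdict(list)
    for service, ts, duration, is_error in spans:
        buckets[(service, ts // 60)].append((duration, is_error))
    metrics = {}
    for key, rows in buckets.items():
        n = len(rows)
        errors = sum(1 for _, e in rows if e)
        satisfied = sum(1 for d, _ in rows if d <= APDEX_T)
        tolerating = sum(1 for d, _ in rows if APDEX_T < d <= 4 * APDEX_T)
        metrics[key] = {
            "count": n,
            "error_rate": errors / n,
            "avg_ms": sum(d for d, _ in rows) / n,
            "apdex": (satisfied + tolerating / 2) / n,
        }
    return metrics
```

In the real pipeline this aggregation runs continuously inside Flink windows; the sketch only shows the shape of the reduction from spans to per‑minute metrics.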
Challenges of massive data reporting
As more businesses integrate, reporting traffic reaches hundreds of millions of records per minute, causing Flink CPU overload, node failures, high network packet loss, and unstable jobs. Simply adding resources stops helping once the job reaches a certain CU count.
Improving stability: job splitting
1. Why does high load persist after scaling?
Although the Barad baseline job runs stably, the same program becomes unstable under APM's workload, with high CPU load driven largely by the network transmission handled across its many nodes.
2. Job splitting criteria
Business reporting volume
Reporting data latency
Business stability requirements
Overall time window of the full call chain
Based on these, the APM metric job was split into three smaller jobs.
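A hypothetical routing rule illustrates how these criteria might map each business onto one of the three smaller jobs (the thresholds, tier names, and job names here are invented for the sketch, not APM's actual configuration):

```python
def route_job(reports_per_min, stability_tier):
    """Assign a business to one of three smaller jobs based on its
    reporting volume and stability requirements (illustrative rule)."""
    if stability_tier == "critical":
        return "job-critical"      # isolate tenants with strict stability needs
    if reports_per_min > 1_000_000:
        return "job-high-volume"   # heavy reporters get a dedicated job
    return "job-default"           # everything else shares the default job
```

The benefit of splitting is isolation: a traffic spike or failure in one job no longer destabilizes the whole pipeline.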
Increasing throughput: Batch technique
1. Metric protocol
<code>message MetricList {
  string appID = 1;
  string namespace = 2;
  repeated Metric metrics = 3;
}</code>

Comparing APM with Barad shows that APM processes far fewer messages per CU despite its lower reporting volume, which points to a poor compression ratio for the many small MetricList messages.
2. Batch solution
Adding a time window that merges many small MetricList messages into one larger batch improves the compression ratio, reduces network transmission, and raises Flink processing efficiency.
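The batching idea can be sketched as follows. This is plain Python rather than the production Flink operator, JSON plus zlib stands in for protobuf plus the transport compression, and the class and field names are illustrative; the point is that one merged payload compresses much better than many tiny ones:

```python
import json
import time
import zlib

class MetricBatcher:
    """Buffer metrics per (appID, namespace) and flush one merged
    MetricList-like batch per time window, instead of emitting one
    message per metric (illustrative sketch)."""

    def __init__(self, window_sec=1.0, now=time.monotonic):
        self.window_sec = window_sec
        self.now = now
        self.buffer = {}
        self.window_start = now()

    def add(self, app_id, namespace, metric):
        self.buffer.setdefault((app_id, namespace), []).append(metric)
        if self.now() - self.window_start >= self.window_sec:
            return self.flush()
        return None  # still inside the window, keep buffering

    def flush(self):
        batches = [
            {"appID": a, "namespace": ns, "metrics": ms}
            for (a, ns), ms in self.buffer.items()
        ]
        self.buffer.clear()
        self.window_start = self.now()
        return batches

# Merged payloads compress better: repeated keys and dimension values
# sit next to each other, so the compressor finds longer matches.
metrics = [{"name": "rpc_duration", "value": i} for i in range(200)]
merged_size = len(zlib.compress(json.dumps(
    {"appID": "a", "namespace": "apm", "metrics": metrics}).encode()))
separate_size = sum(len(zlib.compress(json.dumps(
    {"appID": "a", "namespace": "apm", "metrics": [m]}).encode()))
    for m in metrics)
```

Fewer, larger messages also mean fewer network round trips and less per‑message framing overhead downstream of Flink.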
Reducing memory: dimension pruning
APM previously converted every Span tag dimension into a Metric field, which inflated Kafka and Flink resource consumption unnecessarily. By restricting metrics to 17 essential dimensions, memory usage drops substantially.
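A sketch of the pruning step, with a whitelist standing in for the 17 essential dimensions (the tag names listed here are hypothetical; the article does not enumerate the actual set):

```python
# Hypothetical whitelist; the article caps metrics at 17 essential
# dimensions but does not list them, so these names are illustrative.
ESSENTIAL_DIMENSIONS = {
    "service", "instance", "method", "status_code", "peer",
}

def prune_dimensions(span_tags):
    """Keep only whitelisted tags when converting a Span to a Metric,
    so high-cardinality or unused tags never reach Kafka/Flink."""
    return {k: v for k, v in span_tags.items() if k in ESSENTIAL_DIMENSIONS}
```

Dropping unused dimensions early shrinks every record in the pipeline, which is why the saving shows up in both Kafka bandwidth and Flink memory.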
Overall optimization results
After applying job splitting, batch merging, and dimension pruning, Flink resource consumption decreased to about 30% of the original.
| Business Name | Before Optimization | After Optimization |
| --- | --- | --- |
| Some Business | 650 CU | 230 CU |
The optimizations significantly improved APM metric calculation performance and reduced resource usage, demonstrating that fine‑grained improvements in big‑data processing yield large gains at scale.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.