
How Ping An Health Scaled SkyWalking to Billions of Traces: A Full‑Link Monitoring Journey

This article recounts the end‑to‑end design, implementation, and iterative optimization of a billion‑scale full‑link tracing system at Ping An Health using SkyWalking, covering why full‑link monitoring is needed, the selection of SkyWalking, architecture choices, performance bottlenecks, and the roadmap for future enhancements.

dbaplus Community

Oct 9, 2022


Background

Ping An Health operates a large‑scale microservice platform with dozens of business units. To reduce communication overhead, quickly locate failures, and improve service reliability, a full‑link monitoring system capable of handling tens of billions of trace segments per day was required. Traditional monolithic tracing could not meet the scale.

Why Apache SkyWalking

SkyWalking 8.5.0 was selected because of its mature Java ecosystem, rich plugin library, active community, non‑intrusive Java Agent, streaming topology design, and fast release cycle. It supports both gRPC and Kafka transport, and provides distributed tracing, topology analysis, and metric aggregation.

Pre‑research

The team examined three layers:

Agent side – Java Agent mechanism, configuration, plugin lifecycle, data collection and reporting.

Server side – OAP roles, modules, data ingestion, metric construction, aggregation, and storage.

Storage side – Expected data volume (50‑100 billion Segments per day), Elasticsearch architecture, and resource sizing.

The team also evaluated the risk of functional interference with business code and the Agent's performance cost (expected CPU overhead below 5%).
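The storage-side estimate above can be turned into rough throughput and capacity numbers. The following is a back-of-the-envelope sketch only: the 1 KiB average Segment size and the single replica are assumptions made for illustration, not figures from the team's sizing exercise.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def avg_segments_per_second(segments_per_day: int) -> float:
    """Average ingest rate implied by a daily Segment count."""
    return segments_per_day / SECONDS_PER_DAY


def daily_storage_tib(segments_per_day: int,
                      avg_segment_bytes: int,
                      replicas: int = 1) -> float:
    """Raw daily volume in TiB, counting the primary copy plus replicas.

    avg_segment_bytes and replicas are assumed inputs; compression and
    index overhead are ignored.
    """
    total_bytes = segments_per_day * avg_segment_bytes * (1 + replicas)
    return total_bytes / 2 ** 40


if __name__ == "__main__":
    for per_day in (50_000_000_000, 100_000_000_000):
        rate = avg_segments_per_second(per_day)
        size = daily_storage_tib(per_day, avg_segment_bytes=1024)
        print(f"{per_day:,} segments/day -> {rate:,.0f}/s, {size:.1f} TiB/day")
```

At 50 to 100 billion Segments per day the average rate alone is roughly 0.58 to 1.16 million Segments per second, before accounting for peaks, which is one reason a buffering transport such as Kafka is attractive at this scale.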

Architecture Design

The initial design used Kafka as the transport channel and Elasticsearch for persistent storage. Data flow:

Agents report trace Segments and metrics to OAP via the Kafka reporter (chosen for peak‑shaving).

OAP performs Level‑1 (L1) aggregation per JVM instance, producing instance‑level metrics.

A separate Aggregation cluster performs Level‑2 (L2) aggregation on the L1 results before persisting to Elasticsearch.

Both trace and metric data are stored in Elasticsearch hot‑warm clusters to support long‑term retention (up to half a year) while keeping recent data hot.
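The two aggregation levels in the data flow above can be illustrated with a small sketch. This is not SkyWalking's OAP code: the tuple layout, the metric (call count and average latency), and the roll-up from instance level to service level in L2 are assumptions chosen to keep the example short.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record: (service, instance, minute_bucket, latency_ms)
Segment = Tuple[str, str, int, int]
L1Result = Dict[Tuple[str, str, int], Tuple[int, int]]


def l1_aggregate(segments: Iterable[Segment]) -> L1Result:
    """L1: combine raw records into per-instance partial sums."""
    cells = defaultdict(lambda: [0, 0])  # key -> [calls, total latency]
    for service, instance, minute, latency_ms in segments:
        cell = cells[(service, instance, minute)]
        cell[0] += 1
        cell[1] += latency_ms
    return {key: (calls, total) for key, (calls, total) in cells.items()}


def l2_aggregate(partials: Iterable[L1Result]) -> Dict[Tuple[str, int], dict]:
    """L2: merge the L1 partials from all receivers into final metrics."""
    cells = defaultdict(lambda: [0, 0])
    for partial in partials:
        for (service, _instance, minute), (calls, total) in partial.items():
            cell = cells[(service, minute)]
            cell[0] += calls
            cell[1] += total
    return {
        key: {"calls": calls, "avg_ms": total / calls}
        for key, (calls, total) in cells.items()
    }
```

Because L1 emits sums and counts rather than averages, L2 can merge partial results from any number of receivers without losing accuracy.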

Proof‑of‑Concept (POC)

Key activities:

Integrated dozens of seed applications in a non‑production environment.

Connected SkyWalking to the internal configuration center and release platform.

Implemented gray‑scale rollout: application‑level and instance‑level version control for the Java Agent.

Developed custom plugins for the company’s OLTP stack (e.g., Dubbo, HTTP) to fill gaps in the community plugin set.

After plugin integration, trace visualizations showed complete request execution chains across services.
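The application-level and instance-level version control used for the gray-scale rollout can be pictured as a lookup with increasingly specific overrides. The class and field names below are hypothetical; the article does not describe how the release platform stores this mapping.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class AgentRolloutPolicy:
    """Hypothetical policy: the most specific override wins."""
    default_version: str
    app_versions: Dict[str, str] = field(default_factory=dict)
    instance_versions: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def resolve(self, app: str, instance: str) -> str:
        # 1. An instance-level pin lets a single JVM try a new Agent build.
        pinned = self.instance_versions.get((app, instance))
        if pinned is not None:
            return pinned
        # 2. Otherwise fall back to the application-level version.
        # 3. Applications without an override use the fleet-wide default.
        return self.app_versions.get(app, self.default_version)


if __name__ == "__main__":
    policy = AgentRolloutPolicy(
        default_version="8.5.0",
        app_versions={"order-service": "8.5.0-custom.2"},
        instance_versions={("order-service", "10.0.0.7"): "8.5.0-custom.3"},
    )
    print(policy.resolve("order-service", "10.0.0.7"))
    print(policy.resolve("order-service", "10.0.0.8"))
    print(policy.resolve("pay-service", "10.0.1.1"))
```

Pinning a single instance first keeps the blast radius of a faulty Agent build to one JVM before the version is promoted to the whole application.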

Optimization Phase

Four major bottlenecks were identified and resolved:

Agent startup latency – Original startup >16 s due to Kafka reporter initialization and class‑matching. Optimizations (eager reporter init, refined class‑matching, partition‑aware metric reporting) reduced startup to <3 s.

Kafka segment backlog – OAP consumption lagged behind ingestion, so millions of Segments accumulated in Kafka. The mixed‑role OAP cluster was split into a Receiver cluster (data ingestion and L1 aggregation) and an Aggregation cluster (L2 aggregation), and the bulk‑processing parameters for both the trace and metric streams were tuned, which eliminated the backlog.

Trace query latency – Queries on 25 billion daily Segments took >20 s. Migrated Elasticsearch to a hot‑warm architecture, enabled ZSTD compression, and refined shard routing. Query time dropped to a few milliseconds for both trace‑by‑ID and trace list queries.

Dashboard & topology latency – Metric dashboard and topology graph requests hit the 60 s timeout. Internal query loops were consolidated, unnecessary queries were removed, and index data was isolated. Response times dropped to the millisecond level.

The original article documents each optimization with before/after metrics and diagrams.
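The bulk-processing parameters mentioned for the Kafka backlog usually come down to two knobs: how many records to collect before writing, and how long to wait at most. The sketch below shows that size-or-time flush pattern in isolation. The class name, parameter names, and default values are assumptions and do not correspond to actual OAP or Elasticsearch settings.

```python
import time
from typing import Callable, List, Optional


class BulkBuffer:
    """Collect items and flush them as one batch by size or by age."""

    def __init__(self,
                 flush: Callable[[List[object]], None],
                 max_items: int = 1000,
                 max_wait_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush
        self._max_items = max_items
        self._max_wait_s = max_wait_s
        self._clock = clock
        self._items: List[object] = []
        self._first_at: Optional[float] = None

    def add(self, item: object) -> None:
        if not self._items:
            self._first_at = self._clock()  # age is measured from the first item
        self._items.append(item)
        too_big = len(self._items) >= self._max_items
        too_old = self._clock() - self._first_at >= self._max_wait_s
        # The age check runs only when an item arrives; a production
        # version would also flush from a timer.
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if self._items:
            batch, self._items = self._items, []
            self._flush(batch)
```

Larger batches reduce the number of write requests per Segment, while the age limit bounds how stale buffered data can become under light traffic.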

Technical Details

The Java Agent uses ByteBuddy for bytecode enhancement and a lightweight lock‑free ring buffer to decouple data collection from reporting. Plugins follow a template‑method AOP pattern, so a failure inside a plugin does not affect business logic. The reporter can send data over gRPC or Kafka; Kafka was chosen because it smooths traffic spikes.
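The template-method pattern that keeps plugin failures away from business logic can be sketched as a wrapper with fixed hook points. This is an illustration in Python, not the Agent's actual Java interceptor API; the hook names before, after, and on_error are made up for the example.

```python
import functools
import logging
from typing import Any, Callable

log = logging.getLogger("agent-sketch")


class Interceptor:
    """Plugin hooks (hypothetical names); subclasses override what they need."""

    def before(self, name: str, args: tuple) -> None: ...
    def after(self, name: str, result: Any) -> None: ...
    def on_error(self, name: str, exc: Exception) -> None: ...


def _safely(hook: Callable[..., None], *args: Any) -> None:
    """Run a plugin hook and swallow any failure it raises."""
    try:
        hook(*args)
    except Exception:
        log.warning("plugin hook failed; business call unaffected", exc_info=True)


def enhance(func: Callable[..., Any], interceptor: Interceptor) -> Callable[..., Any]:
    """Template method: fixed call order, plugin supplies only the hooks."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        _safely(interceptor.before, func.__name__, args)
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            _safely(interceptor.on_error, func.__name__, exc)
            raise  # business exceptions always propagate unchanged
        _safely(interceptor.after, func.__name__, result)
        return result
    return wrapper
```

The wrapper owns the control flow, so a plugin can only observe the call. An exception in a hook is logged and dropped, while an exception from the business method is reported to the plugin and then re-raised.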

Configuration & Release Integration – The startup script mounts the SkyWalking Agent with -javaagent:/path/to/skywalking-agent.jar. The release system injects SW_AGENT_NAME and SW_AGENT_GROUP environment variables to set the application name and group, enabling gray‑scale rollout at both application and instance granularity.
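As an illustration of what the release system does at start-up, the sketch below assembles the environment and command line for one instance. The function name and the application jar path are hypothetical; the -javaagent path and the two environment variable names are taken from the description above.

```python
from typing import Dict, List, Tuple


def build_launch(app_name: str,
                 group: str,
                 agent_jar: str,
                 app_jar: str) -> Tuple[Dict[str, str], List[str]]:
    """Return (extra environment, argv) for starting one JVM with the Agent."""
    env = {
        "SW_AGENT_NAME": app_name,  # application name shown in SkyWalking
        "SW_AGENT_GROUP": group,    # group variable as named in the article
    }
    # JVM options such as -javaagent must come before -jar; anything after
    # the jar path is passed to the application as program arguments.
    argv = ["java", f"-javaagent:{agent_jar}", "-jar", app_jar]
    return env, argv


if __name__ == "__main__":
    env, argv = build_launch(
        app_name="order-service",
        group="trade",
        agent_jar="/path/to/skywalking-agent.jar",
        app_jar="/path/to/app.jar",  # hypothetical application jar
    )
    print(env)
    print(" ".join(argv))
```

Choosing the Agent jar path per instance at this point is what would make instance-level gray-scale rollout possible without rebuilding the application.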

Performance Impact – Benchmarks on Dubbo and HTTP RPC showed CPU overhead <5 % and no functional interference. The lock‑free ring buffer uses a discard‑on‑full policy, preventing OOM.
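The discard-on-full policy is what bounds the Agent's memory use: when the reporter falls behind, new data is dropped instead of queued. The sketch below shows only that policy with a fixed-capacity ring buffer. It is single-threaded and makes no attempt to reproduce the lock-free behaviour of the real Agent buffer; the class and method names are assumptions.

```python
from typing import Any, List, Optional


class DiscardingRingBuffer:
    """Fixed-capacity ring buffer that rejects new items when full."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots: List[Optional[Any]] = [None] * capacity
        self._head = 0    # index of the oldest unread item
        self._size = 0    # number of items currently stored
        self.dropped = 0  # how many items were discarded

    def offer(self, item: Any) -> bool:
        """Store item if there is room; otherwise count it as dropped."""
        if self._size == len(self._slots):
            self.dropped += 1
            return False
        tail = (self._head + self._size) % len(self._slots)
        self._slots[tail] = item
        self._size += 1
        return True

    def drain(self, max_items: int) -> List[Any]:
        """Remove and return up to max_items items in arrival order."""
        out: List[Any] = []
        while self._size and len(out) < max_items:
            out.append(self._slots[self._head])
            self._slots[self._head] = None  # release the reference
            self._head = (self._head + 1) % len(self._slots)
            self._size -= 1
        return out
```

Memory is allocated once at construction, so a slow or unavailable reporter costs trace completeness rather than heap space. Exposing the dropped counter makes that trade-off visible.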

Future Plans

Roadmap includes:

Extending SkyWalking SDKs to native mobile platforms (iOS/Android) using the SkyWalking protocol.

Implementing full‑link business state tracking to correlate trace data with high‑level business events.

Adding structured‑log tracing, storing logs in ClickHouse while keeping trace and metric data in Elasticsearch.

The goal is to provide a unified observability platform for operations, support, and development teams.
