
How We Scaled SkyWalking to Billions of Segments: A Full‑Stack Monitoring Journey

This article recounts a year-long, hands-on experience of deploying and continuously optimizing Apache SkyWalking for full-link monitoring in a large micro-service environment. It covers the motivations, architecture choices, pre-research, POC integration, and a series of performance-tuning steps that brought query latency down to the millisecond level while handling billions of segments per day.

ITPUB

Background

SkyWalking is an open‑source APM inspired by Google Dapper. The company adopted version 8.5.0 for a micro‑service environment with thousands of JVM instances and tens of billions of Segment records per day.

Architecture

SkyWalking consists of a Java Agent, an OAP server (receiver, aggregator, or mixed role) and a storage backend. Agents send trace and metric data to OAP via gRPC or Kafka; OAP aggregates metrics in two stages (L1 and L2) and stores traces and metrics in Elasticsearch. The data flow is:

Agent → (gRPC/Kafka) → OAP (Receiver) → L1 aggregation → OAP (Aggregator) → L2 aggregation → Elasticsearch
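The two aggregation stages in this flow can be illustrated with a minimal sketch. It is not SkyWalking's implementation: the function names, the (service, minute, latency) sample shape, and the choice of average latency as the metric are all assumptions made for illustration. The point is only that each receiver reduces raw samples to partial sums (L1) and the aggregator merges those partials into final values (L2).

```python
from collections import defaultdict

def l1_aggregate(samples):
    """Receiver-side step (hypothetical): reduce raw samples to partial sums per key."""
    partial = defaultdict(lambda: [0, 0])  # key -> [latency sum in ms, sample count]
    for service, minute, latency_ms in samples:
        bucket = partial[(service, minute)]
        bucket[0] += latency_ms
        bucket[1] += 1
    return {key: tuple(value) for key, value in partial.items()}

def l2_aggregate(partials):
    """Aggregator-side step (hypothetical): merge partial sums from every receiver."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            merged[key][0] += total
            merged[key][1] += count
    # Only the final stage turns sums into averages, so merging stays exact.
    return {key: total / count for key, (total, count) in merged.items()}

receiver_a = l1_aggregate([("order", "12:00", 100), ("order", "12:00", 300)])
receiver_b = l1_aggregate([("order", "12:00", 200), ("pay", "12:00", 50)])
final = l2_aggregate([receiver_a, receiver_b])
```

Carrying sums and counts between stages, rather than averages, is what lets partial results from different receivers be merged without losing accuracy.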

Four sub‑applications are deployed for flexibility: Webapp, Agent, OAP‑Receiver, OAP‑Aggregator.

Potential side effects

Functional interference – none observed because plugins are isolated with a custom ClassLoader.

Performance overhead – typically under 5 %, kept low by ByteBuddy bytecode enhancement and a lock-free MPSC ring buffer.

Proof‑of‑Concept integration

Embedded the Java Agent via startup scripts; application name and group are supplied by the internal configuration center.

Implemented gray‑release upgrades at application and instance granularity.

Developed missing plugins for the company’s OLTP stack and a native SDK for mobile apps.

Connected SkyWalking to the internal configuration center (similar to Apollo) to externalize settings for Ribbon, Hystrix, etc.

Set up self-monitoring of SkyWalking components using kafka-manager, Prometheus and custom dashboards.
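One way to picture gray release at both application and instance granularity is a deterministic rollout check. The sketch below is an assumption, not the team's actual mechanism: `agent_enabled`, the allow-list of applications, and the per-instance percentage are hypothetical names chosen for illustration.

```python
import zlib

def agent_enabled(app, instance, enabled_apps, instance_percent):
    """Decide whether a JVM instance should load the new agent version.

    Application granularity: the app must be on the allow-list.
    Instance granularity: a stable hash puts each instance in a bucket 0..99,
    and only buckets below instance_percent are enabled.
    """
    if app not in enabled_apps:
        return False
    bucket = zlib.crc32(f"{app}/{instance}".encode("utf-8")) % 100
    return bucket < instance_percent

# Roll out to roughly 10 % of the instances of one application.
enabled = [
    host for host in ("10.0.0.1", "10.0.0.2", "10.0.0.3")
    if agent_enabled("order-service", host, {"order-service"}, 10)
]
```

Because the bucket comes from a stable hash rather than a random draw, an instance keeps the same decision across restarts, and raising the percentage only ever adds instances.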

Optimization phases

Startup time reduction

Recorded startup latency as a metric.

Optimized Kafka reporter initialization, saving 3–4 s.

Improved class‑matching and bytecode enhancement, reducing total startup from >16 s to <3 s.

Adjusted Kafka partitioning so that all metrics from a JVM go to the same partition, enabling per‑JVM L1 aggregation.
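The partitioning change can be sketched as keying every metric record by its JVM instance identifier. The code below is illustrative only: `partition_for` and the instance identifiers are hypothetical, and CRC32 stands in for the hash (Kafka's default partitioner for keyed records uses murmur2).

```python
import zlib

def partition_for(instance_id, num_partitions):
    """Map a JVM instance id to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(instance_id.encode("utf-8")) % num_partitions

# Every record keyed by the same instance id lands on the same partition,
# so a single consumer sees all metrics for that JVM.
records = [("jvm-a", "cpu"), ("jvm-b", "heap"), ("jvm-a", "gc"), ("jvm-a", "threads")]
by_partition = {}
for instance_id, metric in records:
    by_partition.setdefault(partition_for(instance_id, 12), []).append((instance_id, metric))
```

With this keying, the L1 step for a given JVM never has to combine data across consumers, which is what makes per-JVM aggregation possible at the receiving side.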

Kafka backlog mitigation

When OAP consumption lag caused segment backlogs of hundreds of millions per day, the mixed‑role cluster was split into separate Receiver and Aggregator clusters. This doubled consumption throughput and eliminated Full GC spikes.
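Backlog of this kind is usually tracked as consumer lag: for each partition, the log end offset minus the consumer group's committed offset. A minimal sketch of that arithmetic follows; the function name and the sample offsets are made up for illustration, and treating a partition with no committed offset as starting from 0 is an assumption.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag = log end offset - committed offset, never below zero."""
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }

# Illustrative offsets for a three-partition topic.
end_offsets = {0: 1_500_000, 1: 1_480_000, 2: 1_510_000}
committed = {0: 1_200_000, 1: 1_480_000, 2: 1_000_000}

lag = consumer_lag(end_offsets, committed)
total_lag = sum(lag.values())
```

Watching the total alongside the per-partition values helps separate a cluster that is simply under-provisioned from one where a few partitions are stuck.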

Trace query performance

Initial 15‑day trace queries took >20 s. After migrating to an Elasticsearch hot‑warm architecture, tuning BulkProcessor threads, using SSD storage and routing queries to the hot index, latency dropped to 2–3 s and eventually to sub‑second (millisecond) levels.
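Routing queries to the hot tier can be sketched as choosing which daily indices a time-range query needs to touch. This is a simplified illustration under stated assumptions: the `segment-YYYYMMDD` index naming, the `hot_days` window, and the function name are hypothetical and not taken from the team's setup.

```python
from datetime import date, timedelta

def indices_for_query(start, end, today, hot_days, prefix="segment-"):
    """Split the daily indices covering [start, end] into hot and warm lists."""
    hot_cutoff = today - timedelta(days=hot_days - 1)  # oldest day still on hot nodes
    hot, warm = [], []
    day = start
    while day <= end:
        name = f"{prefix}{day:%Y%m%d}"
        (hot if day >= hot_cutoff else warm).append(name)
        day += timedelta(days=1)
    return hot, warm

hot, warm = indices_for_query(
    start=date(2024, 1, 7), end=date(2024, 1, 10),
    today=date(2024, 1, 10), hot_days=3,
)
```

A query whose range falls entirely inside the hot window gets an empty warm list, so it can be served from the SSD-backed nodes alone.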


Dashboard and topology speedup

Applied hot‑warm architecture to metric clusters, achieving millisecond‑level dashboard queries.

Removed unnecessary loops and full‑index scans in topology generation.

Isolated indices per service to avoid cross‑interference.

Topology page now responds within milliseconds.

Additional optimizations

Enabled ZSTD compression in Elasticsearch to reduce storage footprint.

Fine‑tuned ES routing and shard allocation for trace and metric data.

Evaluated sampling reduction and ClickHouse as an alternative store; retained Elasticsearch due to existing custom optimizations.

Key takeaways

Systematic performance engineering transformed SkyWalking from a barely usable observability layer to a high‑performance platform capable of handling billions of Segments daily with millisecond query latency. The effort also demonstrated that Java Agent overhead can be kept below 5 %, gray‑release deployment is feasible at both application and instance levels, and hot‑warm Elasticsearch architectures are effective for large‑scale trace and metric workloads.
