How We Scaled SkyWalking to Billions of Segments: A Full‑Stack Monitoring Journey
This article recounts a year‑long, hands‑on experience of deploying and continuously optimizing Apache SkyWalking for full‑link monitoring in a large micro‑service environment. It covers the motivations, architecture choices, pre‑research, POC integration, and a series of performance‑tuning steps that brought query latency down to the millisecond level while the platform ingested billions of segments.
Background
SkyWalking is an open‑source APM inspired by Google Dapper. The company adopted version 8.5.0 for a micro‑service environment with thousands of JVM instances and tens of billions of Segment records per day.
Architecture
SkyWalking consists of a Java Agent, an OAP server (which can run in the receiver, aggregator, or mixed role) and a storage backend. Agents send trace and metric data to OAP via gRPC or Kafka; OAP aggregates metrics in two stages (L1 and L2) and stores traces and metrics in Elasticsearch. The data flow is:
Agent → (gRPC/Kafka) → OAP (Receiver) → L1 aggregation → OAP (Aggregator) → L2 aggregation → Elasticsearch
Four sub‑applications are deployed for flexibility: Webapp, Agent, OAP‑Receiver, OAP‑Aggregator.
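The two aggregation stages in the flow above can be sketched as follows. This is a minimal illustration, not SkyWalking's implementation: the function names, the (service, minute, latency) sample shape and the average‑latency metric are all assumptions chosen to show the idea that each receiver pre‑aggregates locally (L1) and the aggregator merges the partial results (L2) before writing to storage.

```python
from collections import defaultdict

def l1_aggregate(samples):
    """Pre-aggregate raw samples on one receiver node (hypothetical shape).

    samples: iterable of (service, minute_bucket, latency_ms).
    Returns {(service, minute_bucket): (count, total_latency_ms)}.
    """
    partial = defaultdict(lambda: [0, 0])
    for service, minute, latency_ms in samples:
        cell = partial[(service, minute)]
        cell[0] += 1
        cell[1] += latency_ms
    return {key: tuple(cell) for key, cell in partial.items()}

def l2_aggregate(partials):
    """Merge the partial results from all receivers into final values."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (count, total) in partial.items():
            merged[key][0] += count
            merged[key][1] += total
    # The average latency per (service, minute) is what would be stored.
    return {key: total / count for key, (count, total) in merged.items()}

receiver_a = l1_aggregate([("order", 1, 100), ("order", 1, 300)])
receiver_b = l1_aggregate([("order", 1, 200), ("pay", 1, 50)])
final = l2_aggregate([receiver_a, receiver_b])
```

Carrying (count, total) pairs rather than averages through L1 keeps the merge in L2 exact, since averages of averages would be skewed when receivers see different sample counts.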
Potential side effects
Functional interference – none observed because plugins are isolated with a custom ClassLoader.
Performance overhead – typically under 5 %; the agent instruments classes with ByteBuddy bytecode enhancement and buffers collected data in a lock‑free MPSC ring buffer.
Proof‑of‑Concept integration
Embedded the Java Agent via startup scripts; application name and group are supplied by the internal configuration center.
Implemented gray‑release upgrades at application and instance granularity.
Developed missing plugins for the company’s OLTP stack and a native SDK for mobile apps.
Connected SkyWalking to the internal configuration center (similar to Apollo) to externalize settings for Ribbon, Hystrix, etc.
Set up self‑monitoring of the SkyWalking components using kafka‑manager, Prometheus and custom dashboards.
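Gray release at application and instance granularity, as described in the list above, could be decided along these lines. The sketch assumes a rollout table (application name to percentage of instances) delivered by the configuration center; the function name, the table shape and the use of CRC32 as a stable hash are illustrative choices, not details from the original deployment.

```python
import zlib

def agent_enabled(app, instance_id, rollout):
    """Decide whether this JVM instance should start with the agent attached.

    rollout maps application name -> percentage (0-100) of instances enabled.
    Applications absent from the map start without the agent.
    """
    percent = rollout.get(app, 0)
    # A stable hash keeps the decision identical across restarts of an instance.
    bucket = zlib.crc32(f"{app}/{instance_id}".encode("utf-8")) % 100
    return bucket < percent

rollout = {"order-service": 100, "pay-service": 0, "cart-service": 30}
```

Because the bucket of an instance never changes, raising an application's percentage only adds instances to the rollout; instances that already run with the agent keep it.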
Optimization phases
Startup time reduction
Recorded startup latency as a metric.
Optimized Kafka reporter initialization, saving 3–4 s.
Improved class‑matching and bytecode enhancement, reducing total startup from >16 s to <3 s.
Adjusted Kafka partitioning so that all metrics from a JVM go to the same partition, enabling per‑JVM L1 aggregation.
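The partitioning change in the last item amounts to using a per‑JVM identifier as the record key, so that every metric from one instance hashes to the same partition. A minimal sketch, assuming a key of the form service@instance (a hypothetical format) and using CRC32 as a stand‑in for the hash; Kafka's default Java partitioner applies murmur2 to the key bytes instead:

```python
import zlib

def partition_for(instance_key, num_partitions):
    """Map a JVM instance key to a fixed Kafka partition.

    CRC32 is only a stand-in here; any stable hash gives the same property.
    """
    return zlib.crc32(instance_key.encode("utf-8")) % num_partitions

# Every record from the same JVM lands on the same partition, so a single
# consumer sees all of that JVM's metrics and can pre-aggregate them locally.
p1 = partition_for("order-service@10.0.0.1", 12)
p2 = partition_for("order-service@10.0.0.1", 12)
```

Note that the mapping changes if the partition count changes, so resizing the topic temporarily breaks the per‑JVM locality.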
Kafka backlog mitigation
When OAP consumption lag caused segment backlogs of hundreds of millions per day, the mixed‑role cluster was split into separate Receiver and Aggregator clusters. This doubled consumption throughput and eliminated Full GC spikes.
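Consumption lag of the kind described here is the gap between the latest offset in each partition and the offset the consumer group has committed. The helper below is a sketch of that arithmetic only; the function name and the dictionary inputs are assumptions, and in practice the offsets would come from a tool such as kafka‑manager or a Kafka client.

```python
def total_lag(end_offsets, committed_offsets):
    """Sum per-partition lag: latest offset minus committed offset.

    Both arguments map partition id -> offset. A partition with no committed
    offset is treated as fully unconsumed (committed offset 0).
    """
    lag = 0
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        lag += max(end - committed, 0)
    return lag

backlog = total_lag({0: 1_000, 1: 2_500, 2: 400}, {0: 900, 1: 2_500})
```

Tracking this value over time shows whether the backlog is growing (consumers are too slow) or draining after a change such as the cluster split.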
Trace query performance
Initially, trace queries over a 15‑day range took more than 20 s. After migrating to an Elasticsearch hot‑warm architecture, tuning BulkProcessor threads, using SSD storage and routing queries to the hot index, latency dropped to 2–3 s and eventually to the millisecond level.
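Routing queries to the hot index presupposes knowing which indices are hot for a given time range. The sketch below assumes daily indices named with a date suffix and a fixed number of recent days kept on hot (SSD) nodes; the prefix, the naming pattern and the three‑day threshold are all hypothetical and not taken from the original setup.

```python
from datetime import date, timedelta

HOT_DAYS = 3  # assumed number of most recent days kept on hot (SSD) nodes

def indices_for_range(start, end, today, prefix="segment"):
    """Split the daily indices covering [start, end] into hot and warm lists."""
    hot, warm = [], []
    day = start
    while day <= end:
        name = f"{prefix}-{day:%Y%m%d}"
        age_days = (today - day).days
        (hot if age_days < HOT_DAYS else warm).append(name)
        day += timedelta(days=1)
    return hot, warm

hot, warm = indices_for_range(date(2024, 1, 7), date(2024, 1, 10), date(2024, 1, 10))
```

A query layer could search the hot list first and touch the warm list only when the requested range actually extends past the hot window.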
Dashboard and topology speedup
Applied hot‑warm architecture to metric clusters, achieving millisecond‑level dashboard queries.
Removed unnecessary loops and full‑index scans in topology generation.
Isolated indices per service to avoid cross‑interference.
Topology page now responds within milliseconds.
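As an illustration of what removing redundant loops in topology generation can look like, the sketch below builds the service graph in a single pass over relation records. The record shape (source, destination, call count) and the function name are assumptions; the original article does not show how its topology code is structured.

```python
from collections import defaultdict

def build_topology(relations):
    """Build {source: {dest: call_count}} in one pass over relation records.

    relations: iterable of (source_service, dest_service, call_count).
    """
    graph = defaultdict(lambda: defaultdict(int))
    for source, dest, calls in relations:
        graph[source][dest] += calls
    return {source: dict(dests) for source, dests in graph.items()}

topology = build_topology([
    ("gateway", "order", 5),
    ("gateway", "order", 3),
    ("order", "pay", 2),
])
```

Each record is visited once and merged through dictionary lookups, so the cost grows linearly with the number of relation records rather than with the number of service pairs.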
Additional optimizations
Enabled ZSTD compression in Elasticsearch to reduce storage footprint.
Fine‑tuned ES routing and shard allocation for trace and metric data.
Evaluated sampling reduction and ClickHouse as an alternative store; retained Elasticsearch due to existing custom optimizations.
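For the sampling option mentioned in the last item, one common approach is to cap the number of traces kept per fixed time window, which is the idea behind the SkyWalking agent setting agent.sample_n_per_3_secs. The class below is a simplified sketch of that idea with an injected clock, not the agent's actual code; the class and method names are hypothetical.

```python
class WindowSampler:
    """Keep at most `limit` traces per fixed time window."""

    def __init__(self, limit, window_secs=3.0):
        self.limit = limit
        self.window_secs = window_secs
        self.window_start = None
        self.taken = 0

    def try_sample(self, now):
        """Return True if the trace starting at time `now` (seconds) is kept."""
        if self.window_start is None or now - self.window_start >= self.window_secs:
            # A new window begins: reset the counter.
            self.window_start = now
            self.taken = 0
        if self.taken < self.limit:
            self.taken += 1
            return True
        return False

sampler = WindowSampler(limit=2)
decisions = [sampler.try_sample(t) for t in (0.0, 0.5, 1.0, 3.0, 3.1, 3.2)]
```

With a limit of 2, the third trace in each 3‑second window is dropped, which bounds storage growth at the cost of losing some traces during traffic bursts.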
Key takeaways
Systematic performance engineering transformed SkyWalking from a barely usable observability layer to a high‑performance platform capable of handling billions of Segments daily with millisecond query latency. The effort also demonstrated that Java Agent overhead can be kept below 5 %, gray‑release deployment is feasible at both application and instance levels, and hot‑warm Elasticsearch architectures are effective for large‑scale trace and metric workloads.