How DeWu Built a Scalable Cloud‑Native Trace2.0 Observability Platform

This article details DeWu's evolution from a sneaker marketplace to a full‑stack e‑commerce platform and explains how its cloud‑native monitoring system, based on OpenTelemetry, ClickHouse, and object storage, was architected, optimized, and scaled to handle billions of spans daily.

MaGe Linux Operations

Sep 30, 2023

Monitoring System Evolution

The DeWu App launched in 2015 as a youthful, fashion‑focused community platform and gradually expanded from sneakers to apparel, bags, and cosmetics, becoming a full‑category e‑commerce site. As the business grew in scale and complexity, the scope of monitoring expanded with it.

The core tech stack consists of Java and Go. Before 2021, the team relied on popular open‑source monitoring tools: Loki for logs, the Prometheus ecosystem for metrics with VictoriaMetrics as the metrics store, and Jaeger for tracing, with dashboards built on Grafana.

In 2021, a monitoring governance standard was introduced, adding application monitoring, anomaly analysis, slow MySQL statistics, and Redis hotspot analysis (including hit rate and large keys).

In 2022, the team deepened its tracing capabilities, cut storage costs by 90% by moving trace details to object storage (OSS), and introduced OpenTelemetry (OT) to achieve end‑to‑end tracing that correlates metrics, traces, and logs. The front‑end charts were rebuilt with antv, replacing Grafana.

Why OpenTelemetry?

To simplify trace data collection, DeWu moved from SDK‑based agents to a Java Agent that uses bytecode instrumentation. The team evaluated Pinpoint, SkyWalking, and OpenTelemetry, and ultimately selected OpenTelemetry for its standardized data model and broad ecosystem support, including compatibility with Prometheus and OpenTracing.

Distributed Tracing – Trace2.0 Architecture

Collection side: integrates and customizes multi‑language OT SDKs (Agent) for Java, Go, Python, and JavaScript, producing unified data.

Control plane: centralized configuration service pushes settings to collectors, supports gray‑release per instance, dynamic switches, performance profiling, and version management.

Data gateway: an OTel Server compatible with the OT protocol exposes gRPC and HTTP endpoints and writes incoming data to Kafka.

Compute side: stores Span data and provides scenario‑specific analysis such as SpanMetric calculation, Redis hotspot analysis, MySQL hotspot analysis, and single‑order trace linking.

Storage side: index data stored in ClickHouse, detailed data in OSS, and metadata in a graph database.
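
The SpanMetric calculation on the compute side can be pictured as a fold over the spans consumed from Kafka. The sketch below is a minimal illustration under stated assumptions, not DeWu's implementation: the `Span` fields and the `aggregate_span_metrics` helper are hypothetical names, and a plain list stands in for the Kafka stream.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    # Hypothetical, simplified span shape; real OpenTelemetry spans carry more fields.
    trace_id: str
    service: str
    operation: str
    duration_ms: float
    is_error: bool


def aggregate_span_metrics(spans):
    """Fold spans into per-(service, operation) request count, error count, and total latency."""
    metrics = defaultdict(lambda: {"count": 0, "errors": 0, "total_ms": 0.0})
    for span in spans:
        bucket = metrics[(span.service, span.operation)]
        bucket["count"] += 1
        bucket["errors"] += int(span.is_error)
        bucket["total_ms"] += span.duration_ms
    return dict(metrics)


# A list stands in for spans read from Kafka.
spans = [
    Span("t1", "order", "GET /order", 12.0, False),
    Span("t2", "order", "GET /order", 30.0, True),
    Span("t2", "pay", "POST /pay", 8.0, False),
]
result = aggregate_span_metrics(spans)
order = result[("order", "GET /order")]
print(order["count"], order["errors"], order["total_ms"] / order["count"])
```

Keeping counts and sums rather than averages lets partial results from several consumers be merged by simple addition.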

Trace2.0 Capabilities

Trace2.0 preserves valuable traces in full and links traces to their related metrics and logs. The Java Agent is injected transparently through a JVM parameter, with no application code changes, and provides request‑parameter capture, custom instrumentation, and diagnostic tools.

Storage Layer Evolution

Phase 1 stored all trace details in ClickHouse (SpanIndex and SpanData), taking advantage of its high‑throughput writes and sparse indexing.

Phase 2 introduced hot‑cold separation using Kafka delayed consumption and Bloom‑filter encoding: the hot cluster retains recent data (7 days) in ClickHouse, while the cold cluster keeps only valuable traces for 30 days, reducing storage costs.
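
The article does not spell out how delayed consumption and the Bloom filter fit together. One plausible reading, sketched here as an assumption, is that trace IDs judged valuable are recorded in a Bloom filter while the data is hot, and a consumer that re-reads Kafka after a delay forwards only spans whose trace ID is in the filter to the cold cluster. The `BloomFilter` class, `select_for_cold` helper, and span dictionaries are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def select_for_cold(delayed_spans, valuable):
    """Keep spans whose trace ID was marked valuable before the delayed read."""
    return [s for s in delayed_spans if s["trace_id"] in valuable]


valuable = BloomFilter()
valuable.add("trace-err-1")  # e.g. marked because a span in this trace failed

# A list stands in for the delayed Kafka consumer's input.
delayed = [
    {"trace_id": "trace-err-1", "name": "GET /order"},
    {"trace_id": "trace-ok-2", "name": "GET /health"},
]
kept = select_for_cold(delayed, valuable)
```

Because a Bloom filter never produces false negatives, every span of a marked trace reaches cold storage; an occasional false positive only costs a little extra space.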

Phase 3 added object storage for detailed trace data. After consuming spans from Kafka, data is buffered in memory, compressed with ZSTD, and appended to OSS files. ClickHouse stores file offsets and locations, enabling efficient random‑read retrieval of 4 MB blocks.
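
The Phase 3 write and read paths can be sketched as follows. This is a minimal illustration under stated assumptions, not the production design: `zlib` is swapped in for ZSTD so the example runs on the standard library alone, an in-memory `io.BytesIO` stands in for an appendable OSS file, a Python list stands in for the ClickHouse index table, and the flush threshold is a span count rather than a block size. `BlockWriter` and `read_trace` are hypothetical names.

```python
import io
import json
import zlib  # stand-in for ZSTD in this sketch


class BlockWriter:
    """Buffers spans in memory, then appends one compressed block to a file-like object."""

    def __init__(self, blob, file_name, max_buffered=2):
        self.blob = blob
        self.file_name = file_name
        self.max_buffered = max_buffered
        self.buffer = []
        self.index = []  # rows that would live in ClickHouse: (trace_id, file, offset, length)

    def write(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.max_buffered:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        payload = zlib.compress(json.dumps(self.buffer).encode())
        offset = self.blob.seek(0, io.SEEK_END)  # append position of the new block
        self.blob.write(payload)
        for trace_id in sorted({s["trace_id"] for s in self.buffer}):
            self.index.append((trace_id, self.file_name, offset, len(payload)))
        self.buffer = []


def read_trace(blob, index, trace_id):
    """Range-read only the blocks that hold the trace, then filter inside each block."""
    spans = []
    for tid, _file, offset, length in index:
        if tid != trace_id:
            continue
        blob.seek(offset)
        block = json.loads(zlib.decompress(blob.read(length)))
        spans.extend(s for s in block if s["trace_id"] == trace_id)
    return spans


blob = io.BytesIO()
writer = BlockWriter(blob, "spans-000.bin")
for i, tid in enumerate(["a", "b", "a"]):
    writer.write({"trace_id": tid, "span_id": i})
writer.flush()
found = read_trace(blob, writer.index, "a")
```

The index stays small because it holds only trace IDs and block locations, while the bulky span payloads sit in cheap object storage and are fetched by offset and length.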

Implementation Effects – Current Status

Peak ingestion of 12 million spans per second.

Trace point‑query latency: P50 ~300 ms, P90 ~800 ms.

Daily data growth exceeds 700 TB.

Hot storage retains 6 days (4 PB) and cold storage 30 days (1 PB) with a 12× compression ratio.

ClickHouse can handle up to 400k spans per second per node.

Observability Platform – Front‑End Monitoring

By linking front‑end exceptions to back‑end traces, developers can view the full request path, analyze latency breakdowns, and trace user sessions via SessionID, reconstructing page loads, API calls, and user actions.
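
Session reconstruction of this kind amounts to grouping events by SessionID, ordering them in time, and following the trace ID attached to an exception into the back end. The sketch below assumes a hypothetical event shape (`session_id`, `ts`, `kind`, optional `trace_id`); none of these field names come from the article.

```python
from collections import defaultdict


def build_session_timelines(events):
    """Group events by session ID and order each group by timestamp."""
    sessions = defaultdict(list)
    for event in events:
        sessions[event["session_id"]].append(event)
    for timeline in sessions.values():
        timeline.sort(key=lambda e: e["ts"])
    return dict(sessions)


def backend_traces_for_errors(timeline):
    """Return the trace IDs attached to front-end exceptions in one session."""
    return [
        e["trace_id"]
        for e in timeline
        if e["kind"] == "exception" and e.get("trace_id")
    ]


events = [
    {"session_id": "s1", "ts": 3, "kind": "exception", "trace_id": "t9"},
    {"session_id": "s1", "ts": 1, "kind": "page_load"},
    {"session_id": "s1", "ts": 2, "kind": "api_call", "trace_id": "t9"},
    {"session_id": "s2", "ts": 1, "kind": "click"},
]
timelines = build_session_timelines(events)
error_traces = backend_traces_for_errors(timelines["s1"])
```

Each returned trace ID is the key a developer would use to open the matching back-end trace.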

Observability Platform – Container Monitoring

With full containerization, a dedicated K8s monitoring product was built to unify metric definitions, provide drill‑down from cluster to node to pod, and integrate control‑plane components (API‑Server, etcd) into a single dashboard.

Observability Platform – Application Monitoring

Interface analysis displays three key metrics, associated traces, instance dimensions, upstream/downstream analysis, and latency decomposition, helping pinpoint whether latency stems from DB, Redis, or other services.

Anomaly analysis includes exception aggregation, error‑code classification, MySQL hotspot analysis (SQL fingerprint, slow queries), and Redis hotspot analysis (hit rate, large keys, slow calls).
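
A SQL fingerprint groups statements that differ only in their literal values, so hot or slow query shapes can be counted together. The sketch below is a simplified, regex-based illustration and an assumption about the general technique, not DeWu's implementation; `sql_fingerprint` is a hypothetical helper and does not handle every SQL dialect detail (comments, double-quoted strings, and so on).

```python
import re
from collections import Counter

_STRING = re.compile(r"'(?:[^'\\]|\\.)*'")            # single-quoted string literals
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")            # standalone numeric literals
_IN_LIST = re.compile(r"\(\s*\?(?:\s*,\s*\?)*\s*\)")  # (?, ?, ?) placeholder lists
_SPACE = re.compile(r"\s+")


def sql_fingerprint(sql):
    """Normalize a statement by replacing literals with '?' and collapsing whitespace."""
    s = _STRING.sub("?", sql)
    s = _NUMBER.sub("?", s)
    s = _IN_LIST.sub("(?)", s)
    s = _SPACE.sub(" ", s).strip()
    return s.lower()


queries = [
    "SELECT * FROM orders WHERE id = 42",
    "select *  from orders where id = 7",
    "SELECT name FROM users WHERE city = 'Shanghai'",
    "SELECT * FROM orders WHERE id IN (1, 2, 3)",
]
# Count how often each query shape occurs and pick the hottest one.
hot = Counter(sql_fingerprint(q) for q in queries).most_common(1)
```

Replacing string literals before numbers keeps digits inside quoted values from being rewritten separately.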

Trace details present all spans of a trace in a graph, with options for aggregated statistics or detailed list view, supporting custom columns (e.g., thread name, environment tag) and linking to CMDB or container platforms.

Alerting uses Prometheus as its data source, offers more than 50 alert templates, supports comparative rules, and delivers notifications via Feishu, SMS, or phone call, aggregated at minute, hour, or day granularity.
