
Designing a Next‑Gen Observability Platform: From Zipkin to Hera

This article chronicles the evolution of ShouQianBa's monitoring system from a Zipkin‑based tracing solution to a cloud‑native observability platform called Hera, detailing design goals, technology choices, challenges with MySQL storage, and the adoption of Prometheus‑compatible metrics, Jaeger tracing, and Kubernetes operators.


Introduction

With the rapid growth of distributed systems and microservices, observability has become an urgent need in both development and operations. Observability, a term borrowed from control theory, is commonly understood as three interrelated pillars:

Tracing

Metrics

Logging

These concepts are complementary rather than independent. The classic Venn diagram from Peter Bourgon’s article illustrates their overlap.

History of the ShouQianBa Monitoring System

Starting in 2017, the team gradually built an application monitoring system focused on tracing and performance metrics.

Tracing: Chose Twitter’s open‑source Zipkin, using Elasticsearch as the backend store.

Metrics: Aggregated minute‑level metrics from Zipkin‑format data consumed from Kafka, stored in MySQL.

Instrumentation was provided via Java modules: MySQL driver interceptor, custom JSON‑RPC wrapper for RPC tracing, Spring HandlerInterceptor for REST interception, Spring AOP for Redis tracing, etc.
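To make this style of instrumentation concrete, the sketch below shows what the REST interceptor could look like. It is a minimal illustration under stated assumptions, not the team's actual code; TraceCollector is a hypothetical reporter class standing in for whatever ships spans downstream.

// Hypothetical sketch of REST interception via Spring's HandlerInterceptor.
// TraceCollector is an illustrative stand-in, not a real class from the project.
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.springframework.web.servlet.HandlerInterceptor;

public class TracingInterceptor implements HandlerInterceptor {

    private static final String START_ATTR = "trace.startNanos";

    @Override
    public boolean preHandle(HttpServletRequest req, HttpServletResponse res, Object handler) {
        // Record the start time so the span duration can be computed on completion.
        req.setAttribute(START_ATTR, System.nanoTime());
        return true; // continue the handler chain
    }

    @Override
    public void afterCompletion(HttpServletRequest req, HttpServletResponse res,
                                Object handler, Exception ex) {
        Long start = (Long) req.getAttribute(START_ATTR);
        if (start == null) return;
        long durationMicros = (System.nanoTime() - start) / 1_000;
        // Emit a Zipkin-style span: operation name, duration, status, error flag.
        TraceCollector.report(req.getRequestURI(), durationMicros, res.getStatus(), ex != null);
    }
}

The same intercept-time-report pattern recurs in the MySQL driver interceptor, the JSON-RPC wrapper, and the Redis AOP advice.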

This architecture eventually showed its limits. Using MySQL for time‑series storage caused severe performance problems: its storage engine is not optimized for such workloads, and it lacks rich time‑series query operators.

Custom aggregation of unsampled Zipkin‑format data made upgrades difficult, especially after the Zipkin server stopped supporting server‑side customization.

Business teams required frequent collector upgrades, and the intrusive instrumentation approach demanded high development effort.

Next‑Generation Monitoring System – Hera

To address the above issues the team designed a new system with three main goals:

Low storage cost: retain metrics for at least four weeks and traces for one week, eliminating Elasticsearch.

High real‑time query performance and flexibility: replace MySQL with Prometheus‑compatible storage.

Improved developer efficiency: use bytecode weaving for non‑intrusive instrumentation and tighter DevOps integration.
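The third goal builds on the standard Java instrumentation API. Below is a minimal sketch of such an agent's entry point; the HeraAgent name and the package filter are illustrative, and a real agent would rewrite bytecode with a library such as ASM or Byte Buddy rather than leave the hook empty.

// Skeleton of a Java agent; packaged with Premain-Class: HeraAgent in the jar manifest
// and attached at startup with -javaagent:hera-agent.jar.
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public final class HeraAgent {

    public static void premain(String args, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain pd, byte[] classfileBuffer) {
                // Decide whether this class should be woven with tracing hooks.
                if (className != null && className.startsWith("com/example/")) {
                    // A real agent returns rewritten bytecode here (ASM, Byte Buddy, ...).
                }
                return null; // null keeps the original class bytes
            }
        });
    }
}

Because the agent attaches at JVM startup, business code and build files need no changes, which removes the frequent collector-upgrade burden of the old intrusive approach.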

Distributed Tracing

The concept of distributed tracing follows Google’s Dapper paper, which defines a trace as a tree of nested RPC spans.

As the paper puts it: "We tend to think of a Dapper trace as a tree of nested RPCs."

Dapper also introduces the span concept, where each span represents a basic unit of work.
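In code, the model reduces to a small recursive shape. The class below is purely an illustrative data sketch of the Dapper model, not any project's actual span type.

// A trace is the tree reachable from the root span; each span is one unit of work.
import java.util.ArrayList;
import java.util.List;

public class Span {
    final String traceId;      // shared by every span in the same trace
    final String spanId;       // unique within the trace
    final String parentSpanId; // null for the root span
    final String operation;    // e.g. "GET /orders"
    long startMicros;
    long durationMicros;
    final List<Span> children = new ArrayList<>();

    Span(String traceId, String spanId, String parentSpanId, String operation) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.operation = operation;
    }
}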

Open‑source tracing projects evaluated at the time included Zipkin, Apache SkyWalking v6.6.0, and Jaeger v1.16 (open‑sourced by Uber and donated to CNCF).

Jaeger’s range of backend storage options and its gRPC plugin mechanism provide excellent extensibility, as the diagram below shows.

+----------------------------------+                  +-----------------------------+
|  jaeger-component                |                  |  plugin-impl                |
|                                  |                  |                             |
|        +-------------+           |   unix-socket    |    +-------------+          |
|        | grpc-client +-----------------------------------> grpc-server |          |
|        +-------------+           |                  |    +-------------+          |
|                                  |                  |                             |
+----------------------------------+                  +-----------------------------+

          parent process                                  child sub-process

The team implemented an SLS (Alibaba Cloud Log Service) gRPC storage backend for Jaeger, achieving 30‑day retention, over 4 billion spans per day (≈6 TB), query latency of 3‑5 seconds, and a daily cost of about ¥70.

Metrics Monitoring

The legacy system stored metrics in a relational database, leading to bottlenecks. The new design adopts Prometheus‑compatible storage, ultimately selecting VictoriaMetrics for its superior performance, simple cluster architecture, and ease of operation.

VictoriaMetrics demonstrates excellent performance in benchmark tests.

Its author, Aliaksandr Valialkin, is also the creator of high‑performance Go components such as fasthttp.

The cluster architecture is simple: only vmstorage is stateful; other components are stateless.

For metric collection the team chose the push model, using Kubernetes CRDs for service discovery and attaching labels to metrics at push time, which is harder to achieve with pull.

Pull's main advantage is easy health checking, since a target's /metrics endpoint can be queried directly.

At large scale, however, pull requires complex service discovery, which tipped the balance toward push.
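A push in this setup can be as simple as POSTing Prometheus text-format lines. The sketch below assumes VictoriaMetrics' /api/v1/import/prometheus endpoint in its single-node form (a cluster deployment routes the same data through vminsert); the metric names, labels, and host are illustrative.

// Push two gauge samples to VictoriaMetrics in Prometheus text format.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class MetricsPusher {
    public static void main(String[] args) throws Exception {
        // Labels such as job and pod are attached by the pushing agent itself.
        String body =
            "tomcat_threads_busy{job=\"order-service\",pod=\"order-7f9c\"} 42\n" +
            "hikaricp_active_connections{job=\"order-service\",pod=\"order-7f9c\"} 8\n";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://victoriametrics:8428/api/v1/import/prometheus"))
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<Void> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.discarding());
        System.out.println("push status: " + response.statusCode()); // 2xx on success
    }
}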

Operational statistics of VictoriaMetrics include 60‑day retention, 6.99 trillion data points (≈800 GB), ~5 million active series, insertion rate of ~130 k QPS, and P99 query latency of ~1.5 seconds.

A custom query panel limits time ranges to three days and supports job‑based queries.

Key metric plugins: Tomcat busy threads, HikariCP/Druid pool usage, Redis/Caffeine/EhCache hit rates, Kubernetes pod CPU/memory, and Docker event monitoring.
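As one example of where such a plugin gets its raw numbers, Tomcat publishes its connector thread pools through JMX. The sketch below, meant to run inside the Tomcat JVM, reads the standard Catalina:type=ThreadPool MBeans; the surrounding plugin plumbing and the push to storage are omitted.

// Read busy-thread counts from Tomcat's ThreadPool MBeans via the platform MBean server.
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class TomcatThreadMetrics {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Set<ObjectName> pools =
            server.queryNames(new ObjectName("Catalina:type=ThreadPool,name=*"), null);
        for (ObjectName pool : pools) {
            Object busy = server.getAttribute(pool, "currentThreadsBusy");
            // Format as a Prometheus-style sample, ready to be pushed.
            System.out.printf("tomcat_threads_busy{pool=%s} %s%n",
                              pool.getKeyProperty("name"), busy);
        }
    }
}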

Full Cloud‑Native Adoption

By 2020, most services had migrated to self‑managed Kubernetes clusters (upgraded to v1.20 and hosted on Alibaba Cloud ACK). The monitoring stack runs entirely on Kubernetes:

Java agents are delivered as Docker images and injected via initContainers using shared emptyDir volumes.

Jaeger components (ingester, collector) and VictoriaMetrics are managed by Kubernetes Operators with HPA for CPU/memory scaling.


Future Outlook

Tail Sampling Implementation

Current head‑based sampling can miss error traces. Three industry approaches were reviewed:

OpenTelemetry's tail sampling, which combines the load‑balancing exporter, the group‑by‑trace processor, and the tail‑sampling processor.

ByteDance’s method flips the sampling decision when an error occurs, but the data recorded before the flip is lost.

Huolala’s solution combines delayed Kafka consumption with Bloom filters to separate hot and cold data paths.
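Despite the differences, all three approaches share one core idea: buffer spans until a trace is complete, then decide, so error traces are never dropped. The toy sketch below shows only that decision logic; production systems add delay windows, memory bounds, and the Kafka and Bloom-filter machinery described above, and every name in it is illustrative.

// Toy tail sampler: keep every trace containing an error, plus a small baseline.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TailSampler {

    record Span(String traceId, boolean error) {}

    private final Map<String, List<Span>> buffer = new ConcurrentHashMap<>();
    private final double baselineRate = 0.01; // keep 1% of healthy traces

    public void onSpan(Span span) {
        // Buffer spans until the whole trace can be judged.
        buffer.computeIfAbsent(span.traceId(), id -> new ArrayList<>()).add(span);
    }

    // Called once the trace is considered complete (e.g. after a delay window).
    public boolean decide(String traceId) {
        List<Span> spans = buffer.remove(traceId);
        boolean hasError = spans != null && spans.stream().anyMatch(Span::error);
        return hasError || Math.random() < baselineRate;
    }
}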

Time‑Series Anomaly Detection

Time‑series anomaly detection is gaining traction: examples include GitLab’s Prometheus‑based simple anomaly detection, CTrip’s Prophet platform, and Meituan’s order‑volume prediction model. Leveraging big data and AI for system anomaly detection is an emerging trend.
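To give a sense of the simplest starting point, a z-score check against a trailing window, in the spirit of GitLab's Prometheus-based approach, is sketched below; platforms like Prophet layer far more sophisticated models on top. The numbers and threshold are illustrative.

// Flag a point as anomalous when it deviates from the trailing window by > k sigma.
public class ZScoreDetector {

    public static boolean isAnomaly(double[] window, double latest, double threshold) {
        double mean = 0;
        for (double v : window) mean += v;
        mean /= window.length;

        double variance = 0;
        for (double v : window) variance += (v - mean) * (v - mean);
        double stdDev = Math.sqrt(variance / window.length);

        if (stdDev == 0) return false; // flat series: nothing to flag
        return Math.abs(latest - mean) / stdDev > threshold;
    }

    public static void main(String[] args) {
        double[] lastHour = {100, 98, 103, 99, 101, 97, 102, 100};
        System.out.println(isAnomaly(lastHour, 160, 3.0)); // true: sudden spike
    }
}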
