Building a Scalable APM System: Inside Dog’s Architecture and Implementation

This article explains what an APM system is, explores the three core components—Logs, Traces, and Metrics—and details the design and implementation of the Dog APM platform, covering client data models, automatic instrumentation, server processing pipelines, and storage strategies.

Programmer DD

Jan 4, 2022

Introduction

This article explains what an APM (Application Performance Management) system is, why it matters, and how to build one. The example system is called Dog, which aims to ingest data from most of the company's applications at a rate of 500‑1000 MB/s using several ordinary AWS EC2 instances.

APM Overview

APM consists of three main aspects: Logs, Traces, and Metrics. When evaluating any APM solution, consider these three dimensions.

In some contexts, APM may refer only to Metrics; this article does not focus on that narrow definition.

Logs

The Logs component collects the raw log output of applications and makes it searchable. Centralized storage and a search interface are essential because logging in to many machines over SSH to read files does not scale. The typical implementation is the ELK stack (Elasticsearch, Logstash, Kibana): Logstash gathers logs, either shipped by Filebeat or consumed from Kafka, Elasticsearch stores and indexes them, and Kibana visualizes the data.

Traces

Traces record the call chain of a request, showing which services and methods were invoked, how long each took, and where exceptions occurred. OpenTracing provides a standard API for client instrumentation. Popular tracing systems include Jaeger, SkyWalking (OpenTracing‑compatible), and Pinpoint (rich built‑in instrumentation). The original article illustrates these with sequence diagrams and UI screenshots of SkyWalking, Pinpoint, and Jaeger.

Metrics

Metrics focus on statistical data such as request counts, latency percentiles (P90, P99), and error rates. The open‑source Cat system (by Dianping) provides a rich transaction view and problem view for metrics analysis.

Dog Overview

Dog is a Metrics‑centric APM system that also offers limited tracing. It ingests data via Kafka, stores core data in Cassandra and ClickHouse, and provides UI components for transaction reports, sample inspection, problem analysis, and heartbeat monitoring.

Client Data Model

Each message contains type, name, and status. Three aggregation dimensions are supported: type alone, type+name, and type+name+status. Messages are either Event (count‑only) or Transaction (includes duration for statistical calculations). Transactions can nest, forming a tree for tracing.
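The data model above can be sketched roughly as follows. This is a minimal illustration, not Dog's actual code: the class names (Message, EventMessage, TransactionMessage), the "|" key separator, and the depth helper are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical message base type; names are illustrative, not Dog's real classes.
class Message {
    final String type;
    final String name;
    final String status;

    Message(String type, String name, String status) {
        this.type = type;
        this.name = name;
        this.status = status;
    }

    // The three aggregation dimensions: type, type+name, type+name+status.
    String keyByType() { return type; }
    String keyByTypeName() { return type + "|" + name; }
    String keyByTypeNameStatus() { return type + "|" + name + "|" + status; }
}

// An Event is count-only: it carries no duration.
class EventMessage extends Message {
    EventMessage(String type, String name, String status) {
        super(type, name, status);
    }
}

// A Transaction adds a duration and may hold child messages, forming a tree.
class TransactionMessage extends Message {
    final long durationMillis;
    final List<Message> children = new ArrayList<>();

    TransactionMessage(String type, String name, String status, long durationMillis) {
        super(type, name, status);
        this.durationMillis = durationMillis;
    }

    TransactionMessage addChild(Message child) {
        children.add(child);
        return this;
    }

    // Depth of the tree rooted here; a transaction without children has depth 1.
    int depth() {
        int deepestChild = 0;
        for (Message child : children) {
            int d = (child instanceof TransactionMessage)
                    ? ((TransactionMessage) child).depth()
                    : 1;
            deepestChild = Math.max(deepestChild, d);
        }
        return 1 + deepestChild;
    }
}
```

Because events and nested transactions share the same three fields, a single aggregation path can serve both; only transactions contribute durations.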

Client Design

Dog provides a Java client library. Automatic instrumentation is achieved via filters, MyBatis interceptors, a Javassist‑enhanced RedisTemplate, Feign/Dubbo/gRPC proxies, and HttpClient/OkHttp interceptors. Manual instrumentation can be added with Dog.logEvent and Dog.newTransaction. The client assembles a Tree (root transaction plus metadata) and sends it to Kafka.

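import com.google.common.base.Throwables; // Guava, used for getStackTraceAsString

public void test() {
  // Root transaction for the incoming request: type "URL", name "/test/user".
  Transaction transaction = Dog.newTransaction("URL", "/test/user");
  try {
    // Count-only event carrying type, name, and status.
    Dog.logEvent("User", "name-xxx", "status-yyy");

    // Nested transaction: becomes a child of the root in the message tree.
    Transaction sql = Dog.newTransaction("SQL", "UserMapper.insert");
    try {
      // ... execute the SQL statement here ...
      sql.setSuccess(true);
    } finally {
      sql.finish(); // always close the child before the parent
    }

    transaction.setStatus("xxxx");
    transaction.setSuccess(true);
  } catch (Throwable t) {
    // Mark the root as failed and attach the stack trace for the problem view.
    transaction.setSuccess(false);
    transaction.setData(Throwables.getStackTraceAsString(t));
    throw t;
  } finally {
    transaction.finish();
  }
}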

The treeId format is ${appName}-${encode(ip)}-${minute}-${incrementalId}, enabling cross‑service trace correlation.
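As a rough sketch of how such an ID could be assembled: the source does not say what encode(ip) does or how the minute is expressed, so the hex encoding of the IPv4 octets, the epoch-minute bucket, and the TreeIds class itself are assumptions for illustration only.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical treeId builder; the IP encoding and minute bucket are assumptions.
class TreeIds {
    private static final AtomicLong COUNTER = new AtomicLong();

    // Assumed encoding: each IPv4 octet as two lowercase hex digits,
    // so 10.0.1.255 becomes 0a0001ff.
    static String encodeIp(String ipv4) {
        String[] octets = ipv4.split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ipv4);
        }
        StringBuilder sb = new StringBuilder(8);
        for (String octet : octets) {
            int value = Integer.parseInt(octet);
            if (value < 0 || value > 255) {
                throw new IllegalArgumentException("octet out of range: " + octet);
            }
            sb.append(String.format("%02x", value));
        }
        return sb.toString();
    }

    // Deterministic form: appName-encodedIp-minute-incrementalId.
    static String build(String appName, String ipv4, long epochMillis, long incrementalId) {
        long minute = epochMillis / 60_000L; // assumed: minutes since the Unix epoch
        return appName + "-" + encodeIp(ipv4) + "-" + minute + "-" + incrementalId;
    }

    // Convenience form using the current time and a per-process counter.
    static String next(String appName, String ipv4) {
        return build(appName, ipv4, System.currentTimeMillis(), COUNTER.incrementAndGet());
    }
}
```

For example, TreeIds.build("order-service", "10.0.1.255", 120_000L, 7) yields order-service-0a0001ff-2-7. Since an application name may itself contain hyphens, a parser for this layout would need to read the last three fields from the right.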

Dog Server Design

The server consumes Tree objects from Kafka, flattens them, and dispatches them to two Disruptor pipelines (single‑producer, single‑consumer) for high‑performance processing. Processors handle transactions, samples, problems, heartbeats, full message trees, and business data.
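The single-producer, single-consumer hand-off can be illustrated without the Disruptor library. The sketch below deliberately swaps in a bounded java.util.concurrent.ArrayBlockingQueue as a stand-in for the ring buffer; it shows only the shape of the pipeline (one publishing thread, one consumer thread fanning out to processors), not Disruptor's API or performance characteristics. The class name SingleConsumerPipeline is hypothetical.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Stand-in for one pipeline: a bounded queue drained by exactly one consumer thread.
class SingleConsumerPipeline<T> {
    private static final Object STOP = new Object(); // poison pill that ends the consumer

    private final BlockingQueue<Object> queue;
    private final Thread consumer;

    SingleConsumerPipeline(int capacity, List<Consumer<T>> processors) {
        BlockingQueue<Object> q = new ArrayBlockingQueue<>(capacity);
        this.queue = q;
        this.consumer = new Thread(() -> {
            try {
                while (true) {
                    Object item = q.take();
                    if (item == STOP) {
                        return;
                    }
                    @SuppressWarnings("unchecked")
                    T message = (T) item;
                    // Every processor sees every message, in order.
                    for (Consumer<T> processor : processors) {
                        processor.accept(message);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "pipeline-consumer");
        this.consumer.start();
    }

    // Called by the single producer (for example, a Kafka consumer thread).
    // Blocks when the queue is full, which applies back-pressure.
    void publish(T message) {
        try {
            queue.put(message);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while publishing", e);
        }
    }

    // Drains everything already published, then stops the consumer thread.
    void shutdown() {
        try {
            queue.put(STOP);
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while shutting down", e);
        }
    }
}
```

Keeping a single consumer per pipeline means the processors can use plain, unsynchronized data structures, which is one reason this arrangement is attractive for per-minute aggregation.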

Transaction Processor

Aggregates statistics per minute, keyed by appName, ip, type, name, and status. Stores count, failCount, min, max, avg, and percentile estimates (using Apache DataSketches) in ClickHouse.
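A minimal sketch of this per-minute aggregation is shown below. The TransactionAggregator class, its key layout, and the method names are assumptions; percentile estimation is left out so the example needs only the standard library, whereas the text above describes using Apache DataSketches for that part.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-minute aggregator; percentile sketches are intentionally omitted.
class TransactionAggregator {
    static final class Stats {
        long count;
        long failCount;
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        long sum;

        double avg() {
            return count == 0 ? 0.0 : (double) sum / count;
        }
    }

    // Not thread-safe: intended to be driven by a single consumer thread.
    private final Map<String, Stats> buckets = new HashMap<>();

    // One bucket per minute and per (appName, ip, type, name, status).
    static String key(long minute, String appName, String ip,
                      String type, String name, String status) {
        return String.join("|", Long.toString(minute), appName, ip, type, name, status);
    }

    void record(long epochMillis, String appName, String ip, String type, String name,
                String status, long durationMillis, boolean success) {
        long minute = epochMillis / 60_000L;
        Stats s = buckets.computeIfAbsent(
                key(minute, appName, ip, type, name, status), k -> new Stats());
        s.count++;
        if (!success) {
            s.failCount++;
        }
        s.min = Math.min(s.min, durationMillis);
        s.max = Math.max(s.max, durationMillis);
        s.sum += durationMillis;
    }

    // Returns null when nothing was recorded for the key.
    Stats stats(String key) {
        return buckets.get(key);
    }
}
```

In a real pipeline each completed minute's buckets would be flushed to storage and evicted from the map; that step is not shown here.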

Sample Processor

Keeps up to five successful, five failed, and five slow samples per minute for each type+name+status combination, enabling detailed trace inspection.
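The capped sampling described here might look like the following sketch. The source does not define what counts as "slow", so the fixed duration threshold is an assumption, as are the SampleCollector class, the category names, and the choice to classify a failed request as "fail" even when it was also slow.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sampler: keeps at most five treeIds per minute, key, and category.
class SampleCollector {
    static final int LIMIT = 5;

    private final long slowThresholdMillis; // assumption: "slow" means at or above this
    private final Map<String, List<String>> buckets = new HashMap<>();

    SampleCollector(long slowThresholdMillis) {
        this.slowThresholdMillis = slowThresholdMillis;
    }

    private static String key(long minute, String type, String name,
                              String status, String category) {
        return minute + "|" + type + "|" + name + "|" + status + "|" + category;
    }

    // Returns true if the treeId was kept as a sample, false if the bucket was full.
    boolean offer(long minute, String type, String name, String status,
                  boolean success, long durationMillis, String treeId) {
        String category;
        if (!success) {
            category = "fail";
        } else if (durationMillis >= slowThresholdMillis) {
            category = "slow";
        } else {
            category = "success";
        }
        List<String> bucket = buckets.computeIfAbsent(
                key(minute, type, name, status, category), k -> new ArrayList<>());
        if (bucket.size() >= LIMIT) {
            return false;
        }
        bucket.add(treeId);
        return true;
    }

    List<String> samples(long minute, String type, String name,
                         String status, String category) {
        return buckets.getOrDefault(
                key(minute, type, name, status, category), Collections.emptyList());
    }
}
```

Storing only the treeId keeps the sample table small; the full tree can then be looked up by that ID when a user opens a sample.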

Problem Processor

Collects all messages with success=false into a problem list for error statistics and sample storage.

Heartbeat Processor

Aggregates system health metrics (CPU load, memory, thread counts, GC stats, heap usage) into a Map<String, Double> and stores them in ClickHouse for dashboard visualization.
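On the client side, a heartbeat of this shape can be gathered from the JDK's standard management beans. The sketch below is one plausible way to fill such a map; the HeartbeatCollector class and the metric key names are assumptions, not Dog's actual naming.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical heartbeat snapshot built from java.lang.management beans.
class HeartbeatCollector {
    static Map<String, Double> collect() {
        Map<String, Double> metrics = new LinkedHashMap<>();

        // One-minute system load average; negative when the platform cannot report it.
        metrics.put("system.load.average",
                ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage());

        // Live thread counts, including daemon threads.
        metrics.put("thread.count",
                (double) ManagementFactory.getThreadMXBean().getThreadCount());
        metrics.put("thread.daemon.count",
                (double) ManagementFactory.getThreadMXBean().getDaemonThreadCount());

        // Heap usage in bytes.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        metrics.put("heap.used.bytes", (double) heap.getUsed());
        metrics.put("heap.committed.bytes", (double) heap.getCommitted());

        // Cumulative collection count and time for each garbage collector.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            metrics.put("gc." + gc.getName() + ".count", (double) gc.getCollectionCount());
            metrics.put("gc." + gc.getName() + ".time.ms", (double) gc.getCollectionTime());
        }
        return metrics;
    }
}
```

A flat string-to-double map keeps the storage schema stable when new metrics are added, at the cost of losing type information for each metric.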

MessageTree Processor

Persists the full tree (including trace data) in Cassandra as a gzipped blob, keyed by treeId, to support complete trace reconstruction across services.
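The compression step can be sketched with the JDK's gzip streams. The source does not state how a tree is serialized before compression, so this example assumes it has already been turned into a string; the TreeBlobCodec class is hypothetical, and the Cassandra write itself is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical codec turning a serialized tree into a gzipped blob and back.
class TreeBlobCodec {
    // Assumes the tree was already serialized to text by some other component.
    static byte[] compress(String serializedTree) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(serializedTree.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Read the buffer only after the gzip stream is closed, so the trailer is written.
        return out.toByteArray();
    }

    static String decompress(byte[] blob) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(blob))) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The resulting byte array would be stored in a blob column with the treeId as the row key, and decompressed only when a user opens a specific trace.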

Business Processor

Stores optional business‑specific data (userId, bizId, ext1‑ext3, extVal1‑extVal2) in ClickHouse, allowing downstream analytics without the APM system needing to understand the semantics.

Conclusion

Dog combines ideas from existing open‑source APM tools (Cat, Pinpoint, SkyWalking) with a lightweight, Metrics‑first approach. It demonstrates how to design a scalable monitoring pipeline using Kafka, Disruptor, ClickHouse, and Cassandra, while providing enough tracing capability for most debugging scenarios.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
