How to Build a Scalable APM System: Inside the Dog Architecture

This article explains what an APM system is, compares logs, traces, and metrics, reviews popular tools, and then details the design and implementation of the in-house Dog APM platform, including the client data model, the Kafka ingestion path, the server-side processing pipelines, storage in ClickHouse and Cassandra, and UI visualizations.

Java Interview Crash Guide

APM Introduction

An APM (Application Performance Management) system collects and analyzes the logs, traces, and metrics of running applications; this article focuses on distributed Java web services. It uses a fictional system named Dog to illustrate the design choices.

APM Overview

Dog aims to ingest data from most of the company's applications, handling 500-1000 MB/s overall on ordinary AWS EC2 instances. At roughly 100 MB/s per node, that works out to about 5 to 10 nodes. The article assumes Java services deployed as microservices, with each instance running on its own IP.

Logs

The logging component collects application-generated log lines and makes them searchable. Centralised storage (e.g., ELK: Elasticsearch, Logstash, Kibana) replaces logging in to each machine over SSH to inspect files by hand. Logstash can receive logs via FileBeat or directly from Kafka. Each log entry includes the IP, thread, class, timestamp, traceId, message, and similar fields, and is retained for at least a week.

Traces

Traces record the call chain of a request across services, showing latency per node and exceptions. OpenTracing defines a standard API; Jaeger and SkyWalking follow it, while Zipkin and Pinpoint do not. Visualisations (e.g., sequence diagrams) help pinpoint slow or failing nodes.

Metrics

Metrics aggregate statistical data such as request counts, P90 latency (the latency that 90% of requests stay at or below), and error rates. The open-source Cat system (by Dianping) provides rich metric dashboards.
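To make the P90 figure concrete, here is a minimal sketch of a nearest-rank percentile over raw latency samples. The class name LatencyStats and the nearest-rank definition are assumptions made for illustration; production systems often use histograms or other estimators instead of sorting every sample.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Cat or Dog. */
final class LatencyStats {
    private LatencyStats() {}

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }

    /** Fraction of failed requests; 0 when there were no requests at all. */
    static double errorRate(long failed, long total) {
        return total == 0 ? 0.0 : (double) failed / total;
    }
}
```

With ten samples, the P90 is the ninth-smallest value, so one slow outlier does not move it the way it would move the maximum.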

Dog Overview

Dog focuses on metrics, with optional tracing. Automatic instrumentation covers HTTP entry points, MySQL (a MyBatis interceptor), Redis (an enhanced RedisTemplate), cross-service calls (Feign/Dubbo/gRPC proxies), HTTP client calls, and error-level log reporting. Manual instrumentation is also supported via a plugin.
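The proxy-based instrumentation of cross-service calls can be illustrated with a plain JDK dynamic proxy. This is only a sketch of the general idea: the real integrations would hook into Feign, Dubbo, or gRPC extension points, and UserService, CallRecord, and TimingProxy are hypothetical names that do not come from Dog.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical remote-service interface used only for this sketch. */
interface UserService {
    String findName(long id);
}

/** Hypothetical record of one instrumented call. */
final class CallRecord {
    final String name;
    final boolean success;
    final long durationNanos;

    CallRecord(String name, boolean success, long durationNanos) {
        this.name = name;
        this.success = success;
        this.durationNanos = durationNanos;
    }
}

final class TimingProxy {
    /** Stand-in for reporting to the APM client. */
    static final List<CallRecord> RECORDS = Collections.synchronizedList(new ArrayList<>());

    private TimingProxy() {}

    /** Wraps target so that every interface call is timed and recorded. */
    static <T> T wrap(Class<T> iface, T target) {
        InvocationHandler handler = (proxy, method, args) -> {
            long start = System.nanoTime();
            boolean ok = false;
            try {
                Object result = method.invoke(target, args);
                ok = true;
                return result;
            } catch (InvocationTargetException e) {
                throw e.getCause(); // surface the callee's original exception
            } finally {
                String name = iface.getSimpleName() + "." + method.getName();
                RECORDS.add(new CallRecord(name, ok, System.nanoTime() - start));
            }
        };
        Object proxy = Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] {iface}, handler);
        return iface.cast(proxy);
    }
}
```

The caller keeps using the interface as before; the wrapper records the outcome in the finally block, so failed calls are captured as well as successful ones.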

Client Data Model

Each Message has type, name, and status fields, enabling aggregation by type, by type+name, or by type+name+status. Two subclasses exist:

Event – counts occurrences.

Transaction – includes a duration and can nest children to form a trace tree.
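The data model described above might look roughly like the following. Only the field names (type, name, status, duration, children) come from the article; the class layout, the key() helper, and the use of System.nanoTime() are assumptions for the sake of a compact sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Common base: every message can be aggregated by type, name, and status. */
abstract class Message {
    final String type;
    final String name;
    String status;

    Message(String type, String name, String status) {
        this.type = type;
        this.name = name;
        this.status = status;
    }

    /** Hypothetical aggregation key at the finest granularity. */
    String key() {
        return type + "|" + name + "|" + status;
    }
}

/** An Event only marks that something happened; it has no duration. */
final class Event extends Message {
    Event(String type, String name, String status) {
        super(type, name, status);
    }
}

/** A Transaction has a duration and may contain child messages. */
final class Transaction extends Message {
    private final List<Message> children = new ArrayList<>();
    private final long startNanos = System.nanoTime();
    private long durationNanos = -1; // -1 until finish() is called

    Transaction(String type, String name) {
        super(type, name, null);
    }

    void addChild(Message child) {
        children.add(child);
    }

    List<Message> children() {
        return Collections.unmodifiableList(children);
    }

    /** Records the elapsed time once; later calls are ignored. */
    void finish() {
        if (durationNanos < 0) {
            durationNanos = System.nanoTime() - startNanos;
        }
    }

    long durationNanos() {
        return durationNanos;
    }
}
```

Because a Transaction can hold both Events and other Transactions, a single root transaction is enough to represent the whole trace tree of a request.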

Sample code demonstrates creating nested transactions and events:

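// Throwables comes from Guava: import com.google.common.base.Throwables;
public void test() {
  // Root transaction for the incoming HTTP request
  Transaction transaction = Dog.newTransaction("URL", "/test/user");
  try {
    Dog.logEvent("User", "name-xxx", "status-yyy");
    // do something

    // Nested transaction: becomes a child of the root in the trace tree
    Transaction sql = Dog.newTransaction("SQL", "UserMapper.insert");
    try {
      // run the SQL statement here
      sql.setSuccess(true);
    } finally {
      sql.finish(); // assumed: a child must be finished just like the root
    }

    transaction.setStatus("xxxx");
    transaction.setSuccess(true);
  } catch (Throwable throwable) {
    // Mark the root as failed and attach the stack trace before rethrowing
    transaction.setSuccess(false);
    transaction.setData(Throwables.getStackTraceAsString(throwable));
    throw throwable;
  } finally {
    transaction.finish();
  }
}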

Dog Server Design

Clients send Tree objects (the root transaction plus environment info) to Kafka. On the server, a single consumer thread reads messages in batches, flattens each tree into a flat list of messages, and dispatches them to two Disruptor pipelines (single-producer, single-consumer) for high-performance processing.
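The step that turns a tree into a flat structure can be sketched as a pre-order traversal. TreeNode, FlatMessage, and TreeFlattener are hypothetical names, and the hand-off to Disruptor is left out because it depends on the LMAX library; this shows only the flattening itself.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical tree node: a message with nested children. */
final class TreeNode {
    final String type;
    final String name;
    final List<TreeNode> children = new ArrayList<>();

    TreeNode(String type, String name) {
        this.type = type;
        this.name = name;
    }

    TreeNode add(TreeNode child) {
        children.add(child);
        return this;
    }
}

/** Hypothetical flat record handed to the processors. */
final class FlatMessage {
    final String type;
    final String name;
    final int depth;

    FlatMessage(String type, String name, int depth) {
        this.type = type;
        this.name = name;
        this.depth = depth;
    }
}

final class TreeFlattener {
    private TreeFlattener() {}

    /** Iterative pre-order walk, so deep trees cannot overflow the call stack. */
    static List<FlatMessage> flatten(TreeNode root) {
        List<FlatMessage> out = new ArrayList<>();
        Deque<TreeNode> nodes = new ArrayDeque<>();
        Deque<Integer> depths = new ArrayDeque<>();
        nodes.push(root);
        depths.push(0);
        while (!nodes.isEmpty()) {
            TreeNode node = nodes.pop();
            int depth = depths.pop();
            out.add(new FlatMessage(node.type, node.name, depth));
            // Push children in reverse so they are visited in their original order.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                nodes.push(node.children.get(i));
                depths.push(depth + 1);
            }
        }
        return out;
    }
}
```

Once flattened, each message can be aggregated independently, which is what lets the downstream processors work on simple records rather than walking trees themselves.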

Processors

Transaction Processor – aggregates per‑minute statistics (count, failCount, min, max, avg, P90/P95/P99) keyed by time, host, type/name/status, and stores results in ClickHouse.
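A minimal sketch of the per-minute aggregation done by the Transaction Processor is shown below. The key layout, the class names, and the choice to omit percentiles (which need a histogram or the raw samples) are assumptions. The map is deliberately not thread-safe, on the assumption that a single consumer thread owns it.

```java
import java.util.HashMap;
import java.util.Map;

/** Running statistics for one key within one minute. */
final class MinuteStats {
    long count;
    long failCount;
    long minMs = Long.MAX_VALUE;
    long maxMs = Long.MIN_VALUE;
    long sumMs;

    void record(long durationMs, boolean success) {
        count++;
        if (!success) {
            failCount++;
        }
        minMs = Math.min(minMs, durationMs);
        maxMs = Math.max(maxMs, durationMs);
        sumMs += durationMs;
    }

    double avgMs() {
        return count == 0 ? 0.0 : (double) sumMs / count;
    }
}

final class TransactionAggregator {
    private final Map<String, MinuteStats> buckets = new HashMap<>();

    /** Builds the bucket key: minute, host, then type/name/status. */
    static String key(long epochMillis, String host, String type, String name, String status) {
        long minute = epochMillis / 60_000L; // truncate the timestamp to the minute
        return minute + "|" + host + "|" + type + "|" + name + "|" + status;
    }

    void record(long epochMillis, String host, String type, String name,
                String status, long durationMs, boolean success) {
        buckets.computeIfAbsent(key(epochMillis, host, type, name, status),
                k -> new MinuteStats()).record(durationMs, success);
    }

    MinuteStats stats(String key) {
        return buckets.get(key);
    }

    int bucketCount() {
        return buckets.size();
    }
}
```

When a minute closes, each bucket would become one row in ClickHouse; coarser views (by type only, or by type+name) can then be derived by summing rows at query time.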

Sample Processor – retains up to 5 successful, 5 failed, and 5 slow samples per minute for each key.
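The retention rule of the Sample Processor can be sketched as a small bucket with three capped collections. The article does not say how "slow" is decided, so this sketch assumes it means the five slowest samples seen in the minute and keeps them in a min-heap; Sample and SampleBucket are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical pointer to a stored trace tree. */
final class Sample {
    final String treeId;
    final long durationMs;
    final boolean success;

    Sample(String treeId, long durationMs, boolean success) {
        this.treeId = treeId;
        this.durationMs = durationMs;
        this.success = success;
    }
}

/** Samples retained for one key within one minute. */
final class SampleBucket {
    static final int LIMIT = 5;

    final List<Sample> successes = new ArrayList<>();
    final List<Sample> failures = new ArrayList<>();
    // Min-heap by duration: the head is the fastest of the retained slow samples.
    final PriorityQueue<Sample> slowest =
            new PriorityQueue<>((a, b) -> Long.compare(a.durationMs, b.durationMs));

    void offer(Sample s) {
        // Keep the first LIMIT successes and the first LIMIT failures.
        List<Sample> target = s.success ? successes : failures;
        if (target.size() < LIMIT) {
            target.add(s);
        }
        // Keep the LIMIT slowest samples overall.
        if (slowest.size() < LIMIT) {
            slowest.add(s);
        } else if (slowest.peek().durationMs < s.durationMs) {
            slowest.poll(); // evict the fastest retained sample
            slowest.add(s);
        }
    }
}
```

Capping each category keeps memory bounded no matter how busy a key is, while still leaving a handful of concrete requests to inspect for every minute.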

Problem Processor – collects messages with success=false for error reporting.

Heartbeat Processor – gathers system health metrics (CPU load, GC stats, heap usage, thread counts) and stores them in ClickHouse.
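On the client side, the health metrics that feed the Heartbeat Processor can be read from the JVM's standard management beans. The sketch below uses only java.lang.management; the metric names and the HeartbeatCollector class are assumptions, and the article does not say which source Dog actually uses.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical collector that takes one snapshot of JVM health metrics. */
final class HeartbeatCollector {
    private HeartbeatCollector() {}

    static Map<String, Number> snapshot() {
        Map<String, Number> metrics = new LinkedHashMap<>();

        // One-minute system load average; negative when the platform cannot report it.
        metrics.put("system.loadAverage",
                ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage());

        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        metrics.put("heap.usedBytes", heap.getUsed());
        metrics.put("heap.committedBytes", heap.getCommitted());

        metrics.put("thread.count", ManagementFactory.getThreadMXBean().getThreadCount());

        // Sum over all collectors; a collector reports -1 when a value is undefined.
        long gcCount = 0;
        long gcTimeMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            gcCount += Math.max(0, gc.getCollectionCount());
            gcTimeMillis += Math.max(0, gc.getCollectionTime());
        }
        metrics.put("gc.count", gcCount);
        metrics.put("gc.timeMillis", gcTimeMillis);
        return metrics;
    }
}
```

The GC figures are cumulative since JVM start, so a per-minute view would come from subtracting consecutive snapshots.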

Business Processor – persists optional business data (userId, bizId, extensible fields) for downstream analysis.

Storage

Metrics and samples are written to ClickHouse in bulk (≥10 000 rows per batch). Full trace trees are compressed (gzip) and stored in Cassandra keyed by treeId to enable later reconstruction of request flows.
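The gzip step for trace trees can be sketched with the JDK's own java.util.zip classes. How Dog serializes a tree before compressing it is not stated, so this example assumes the tree has already been turned into a String; TreeCodec is a hypothetical name, and the Cassandra write itself is omitted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical codec for the blob stored under a treeId. */
final class TreeCodec {
    private TreeCodec() {}

    static byte[] compress(String serializedTree) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the gzip stream writes the trailer, so close before reading the bytes.
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(serializedTree.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static String decompress(byte[] compressed) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Trace trees tend to repeat the same type and name strings many times, which is the kind of input gzip compresses well; the compressed bytes would then be stored as a blob column keyed by treeId.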

Conclusion

Dog mirrors many concepts from Cat but implements its own pipeline, leveraging Kafka, Disruptor, ClickHouse, and Cassandra to provide a lightweight, metrics‑centric APM solution with optional tracing capabilities.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
