Building a Scalable APM System: Inside Dog’s Architecture and Implementation
This article explains what an APM system is, explores the three core components—Logs, Traces, and Metrics—and details the design and implementation of the Dog APM platform, covering client data models, automatic instrumentation, server processing pipelines, and storage strategies.
Introduction
This article explains what an APM (Application Performance Management) system is, why it matters, and how to build one. The example system is called Dog, which aims to ingest data from most company applications at a rate of 500‑1000 MB/s, using multiple ordinary AWS EC2 instances.
APM Overview
APM consists of three main aspects: Logs, Traces, and Metrics. When evaluating any APM solution, consider these three dimensions.
In some contexts, APM may refer only to Metrics; this article does not focus on that narrow definition.
Logs
Logs collect and query the raw log output from applications. Centralized log storage and searchable interfaces are essential because manual SSH access to many machines is inefficient. The typical implementation is the ELK stack (Elasticsearch, Logstash, Kibana). Logstash gathers logs via Filebeat or direct Kafka ingestion, and Kibana visualizes the data.
Traces
Traces record the call chain of a request, showing which services and methods were invoked, their latency, and where exceptions occurred. OpenTracing provides a standard API for client instrumentation. Popular tracing systems include Jaeger, SkyWalking (OpenTracing‑compatible), and Pinpoint (rich automatic instrumentation through a Java agent).
Metrics
Metrics focus on statistical data such as request counts, latency percentiles (P90, P99), and error rates. The open‑source Cat system (by Dianping) provides a rich transaction view and problem view for metrics analysis.
Dog Overview
Dog is a Metrics‑centric APM system that also offers limited tracing. It ingests data via Kafka, stores core data in Cassandra and ClickHouse, and provides UI components for transaction reports, sample inspection, problem analysis, and heartbeat monitoring.
Client Data Model
Each message contains type, name, and status. Three aggregation dimensions are supported: type alone, type+name, and type+name+status. Messages are either Event (count‑only) or Transaction (includes duration for statistical calculations). Transactions can nest, forming a tree for tracing.
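The data model described above can be sketched in a few small classes. This is an illustration only: the class names (DogMessage, DogEvent, DogTransaction), the field names, and the "|" key separator are hypothetical and are not taken from the Dog client.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; the real Dog client classes are not shown in this article.
abstract class DogMessage {
    final String type;
    final String name;
    String status = "";

    DogMessage(String type, String name) {
        this.type = type;
        this.name = name;
    }

    // The three aggregation dimensions: type, type+name, type+name+status.
    List<String> aggregationKeys() {
        List<String> keys = new ArrayList<>();
        keys.add(type);
        keys.add(type + "|" + name);
        keys.add(type + "|" + name + "|" + status);
        return keys;
    }
}

// An Event carries no duration, so only counts can be derived from it.
final class DogEvent extends DogMessage {
    DogEvent(String type, String name, String status) {
        super(type, name);
        this.status = status;
    }
}

// A Transaction has a duration and may contain child messages, forming a tree.
final class DogTransaction extends DogMessage {
    long durationMillis;
    final List<DogMessage> children = new ArrayList<>();

    DogTransaction(String type, String name) {
        super(type, name);
    }

    void addChild(DogMessage child) {
        children.add(child);
    }
}
```

Keeping Event and Transaction under one base type lets a single aggregation path count both, while only Transaction feeds the duration statistics.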
Client Design
Dog provides a Java client library. Automatic instrumentation is achieved via filters, MyBatis interceptors, Javassist‑enhanced RedisTemplate, Feign/Dubbo/gRPC proxies, and HttpClient/OkHttp interceptors. Manual instrumentation can be added with Dog.logEvent and Dog.newTransaction. The client assembles a Tree (root transaction plus metadata) and sends it to Kafka.
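    import com.google.common.base.Throwables; // Guava; Dog and Transaction come from the Dog client

    public void test() {
        // Root transaction for the incoming request.
        Transaction transaction = Dog.newTransaction("URL", "/test/user");
        try {
            // Count-only event: type, name, status.
            Dog.logEvent("User", "name-xxx", "status-yyy");
            // Nested transaction; it becomes a child of the root in the tree.
            Transaction sql = Dog.newTransaction("SQL", "UserMapper.insert");
            try {
                // ... run the SQL statement here ...
                sql.setSuccess(true);
            } finally {
                sql.finish(); // a nested transaction must be finished as well
            }
            transaction.setStatus("xxxx");
            transaction.setSuccess(true);
        } catch (Throwable t) {
            transaction.setSuccess(false);
            transaction.setData(Throwables.getStackTraceAsString(t));
            throw t;
        } finally {
            transaction.finish();
        }
    }

The treeId format is ${appName}-${encode(ip)}-${minute}-${incrementalId}, enabling cross‑service trace correlation.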
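A treeId in this format could be generated roughly as follows. The article does not define encode(ip) or the exact form of the minute component, so this sketch assumes hex‑encoded IPv4 octets and minutes since the Unix epoch. The class and method names (TreeIds, encodeIp, next) are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: hex-encoded IPv4 octets and epoch minutes are assumptions,
// not the documented Dog encoding.
final class TreeIds {
    private static final AtomicInteger COUNTER = new AtomicInteger();

    // Assumed encoding: each IPv4 octet as two lowercase hex digits.
    static String encodeIp(String ipv4) {
        StringBuilder sb = new StringBuilder(8);
        for (String octet : ipv4.split("\\.")) {
            sb.append(String.format("%02x", Integer.parseInt(octet)));
        }
        return sb.toString();
    }

    // Builds ${appName}-${encode(ip)}-${minute}-${incrementalId}.
    static String next(String appName, String ipv4, long epochMillis) {
        long minute = epochMillis / 60_000L;
        return appName + "-" + encodeIp(ipv4) + "-" + minute + "-" + COUNTER.incrementAndGet();
    }
}
```

Because an application name may itself contain hyphens, a reader of such an id would need to split it from the right rather than from the left.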
Dog Server Design
The server consumes Tree objects from Kafka, flattens each tree into a flat list of messages, and dispatches them to two Disruptor pipelines (each single‑producer, single‑consumer) for high‑performance processing. Separate processors handle transactions, samples, problems, heartbeats, full message trees, and business data.
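The step that turns a nested tree into a flat structure can be illustrated with a plain depth‑first traversal. This sketch leaves out Kafka and the Disruptor entirely, and MessageNode and TreeFlattener are hypothetical names standing in for whatever types the Dog server uses.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical node type standing in for a message inside a Dog Tree.
final class MessageNode {
    final String type;
    final String name;
    final List<MessageNode> children = new ArrayList<>();

    MessageNode(String type, String name) {
        this.type = type;
        this.name = name;
    }

    MessageNode add(MessageNode child) {
        children.add(child);
        return this;
    }
}

final class TreeFlattener {
    // Depth-first, parent-before-child traversal without recursion,
    // so deeply nested trees cannot overflow the call stack.
    static List<MessageNode> flatten(MessageNode root) {
        List<MessageNode> flat = new ArrayList<>();
        Deque<MessageNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            MessageNode node = stack.pop();
            flat.add(node);
            // Push children in reverse so they are visited in their original order.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }
        return flat;
    }
}
```

Once flattened, each message can be handed to the downstream processors independently of its position in the tree.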
Transaction Processor
Aggregates statistics per minute, keyed by appName, ip, type, name, and status. Stores count, failCount, min, max, avg, and percentile estimates (using Apache DataSketches) in ClickHouse.
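A minimal sketch of this per‑minute aggregation follows. It covers count, failCount, min, max, and avg only; the percentile estimates and the ClickHouse write are left out. The class names and the key layout are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-minute aggregation; percentile sketches are omitted.
final class TransactionStats {
    long count;
    long failCount;
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    long sum;

    void record(long durationMillis, boolean success) {
        count++;
        if (!success) {
            failCount++;
        }
        min = Math.min(min, durationMillis);
        max = Math.max(max, durationMillis);
        sum += durationMillis;
    }

    double avg() {
        return count == 0 ? 0.0 : (double) sum / count;
    }
}

final class MinuteAggregator {
    private final Map<String, TransactionStats> buckets = new HashMap<>();

    // Key layout (minute|appName|ip|type|name|status) is an assumption.
    void record(long epochMillis, String appName, String ip, String type,
                String name, String status, long durationMillis, boolean success) {
        long minute = epochMillis / 60_000L;
        String key = minute + "|" + appName + "|" + ip + "|" + type + "|" + name + "|" + status;
        buckets.computeIfAbsent(key, k -> new TransactionStats()).record(durationMillis, success);
    }

    Map<String, TransactionStats> buckets() {
        return buckets;
    }
}
```

The plain HashMap is not thread‑safe. That is acceptable only if a single consumer thread owns the aggregator, which is what a single‑consumer pipeline would provide.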
Sample Processor
Keeps up to five successful, five failed, and five slow samples per minute for each type+name+status combination, enabling detailed trace inspection.
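One way to bound the retained samples is sketched below. The article does not say how "slow" is decided or which samples are kept when more than five arrive, so this sketch assumes the first five of each outcome are kept and "slow" means the five longest durations seen in the bucket. SampleBucket and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: one bucket per minute and per type+name+status key.
final class SampleBucket {
    static final int LIMIT = 5;

    static final class Sample {
        final String treeId;
        final long durationMillis;

        Sample(String treeId, long durationMillis) {
            this.treeId = treeId;
            this.durationMillis = durationMillis;
        }
    }

    final List<Sample> success = new ArrayList<>();
    final List<Sample> failed = new ArrayList<>();
    // Min-heap on duration: the head is the fastest of the retained slow samples.
    final PriorityQueue<Sample> slowest =
            new PriorityQueue<>(Comparator.comparingLong((Sample s) -> s.durationMillis));

    void offer(String treeId, long durationMillis, boolean ok) {
        Sample sample = new Sample(treeId, durationMillis);
        List<Sample> target = ok ? success : failed;
        if (target.size() < LIMIT) {
            target.add(sample); // assumption: keep the first five of each outcome
        }
        slowest.add(sample);
        if (slowest.size() > LIMIT) {
            slowest.poll(); // evict the fastest so only the five slowest remain
        }
    }
}
```

Storing only the treeId with each sample keeps the bucket small; the full tree can then be looked up separately when a sample is inspected.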
Problem Processor
Collects all messages with success=false into a problem list for error statistics and sample storage.
Heartbeat Processor
Aggregates system health metrics (CPU load, memory, thread counts, GC stats, heap usage) into a Map<String, Double> and stores them in ClickHouse for dashboard visualisation.
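On the client side, a heartbeat of this shape could be collected with the JDK's standard java.lang.management beans. The metric names below are made up for illustration, and the article does not state that Dog gathers its metrics this way.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: metric names are invented; the JMX calls are standard JDK APIs.
final class HeartbeatCollector {
    static Map<String, Double> collect() {
        Map<String, Double> metrics = new LinkedHashMap<>();

        // Negative on platforms where the load average is unavailable.
        metrics.put("system.load.average",
                ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage());

        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        metrics.put("heap.used.bytes", (double) heap.getUsed());
        metrics.put("heap.committed.bytes", (double) heap.getCommitted());

        metrics.put("thread.count",
                (double) ManagementFactory.getThreadMXBean().getThreadCount());

        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount and getCollectionTime return -1 when undefined.
            metrics.put("gc." + gc.getName() + ".count", (double) gc.getCollectionCount());
            metrics.put("gc." + gc.getName() + ".time.ms", (double) gc.getCollectionTime());
        }
        return metrics;
    }
}
```

A flat Map<String, Double> makes it possible to add new metrics without changing the storage schema, at the cost of losing type information per metric.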
MessageTree Processor
Persists the full tree (including trace data) in Cassandra as a gzipped blob, keyed by treeId, to support complete trace reconstruction across services.
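The gzip step can be sketched with java.util.zip alone. The serialization format of the tree is not described in the article, so this example simply treats the tree as an already serialized string; the Cassandra write is not shown. TreeBlobCodec is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch only: assumes the tree has already been serialized to a string.
final class TreeBlobCodec {
    static byte[] compress(String serializedTree) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the GZIP stream writes the trailer, so read the bytes afterwards.
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(serializedTree.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    static String decompress(byte[] blob) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(blob))) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The resulting byte array would be stored as the blob value under the treeId key, and decompressed only when someone opens that trace.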
Business Processor
Stores optional business‑specific data (userId, bizId, ext1‑ext3, extVal1‑extVal2) in ClickHouse, allowing downstream analytics without the APM system needing to understand the semantics.
Conclusion
Dog combines ideas from existing open‑source APM tools (Cat, Pinpoint, SkyWalking) with a lightweight, Metrics‑first approach. It demonstrates how to design a scalable monitoring pipeline using Kafka, Disruptor, ClickHouse, and Cassandra, while providing enough tracing capability for most debugging scenarios.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
