How to Build a Scalable APM System: Inside the Dog Architecture
This article explains what an APM system is, compares logs, traces and metrics, reviews popular tools, and then details the design and implementation of the in‑house Dog APM platform, including the client data model, Kafka ingestion, the server-side processing pipelines, storage in ClickHouse and Cassandra, and UI visualizations.
APM Introduction
APM (Application Performance Management) systems monitor logs, traces and metrics of distributed Java web services. The article uses a fictional system named Dog to illustrate design choices.
APM Overview
Dog aims to ingest data from most company applications, handling 500‑1000 MB/s overall at ≈100 MB/s per node (which implies roughly 5 to 10 processing nodes) on ordinary AWS EC2 instances. The article assumes Java services deployed as micro‑services, with instances running on different IPs.
Logs
The logging part of an APM system collects application‑generated log lines and makes them searchable. Centralised storage (e.g., ELK: Elasticsearch, Logstash, Kibana) replaces manual log inspection over SSH. Logstash can receive logs via FileBeat or directly from Kafka. Each log entry includes IP, thread, class, timestamp, traceId, message, etc., and is retained for at least a week.
Traces
Traces record the call chain of a request across services, showing latency per node and exceptions. OpenTracing defines a standard API; Jaeger and SkyWalking follow it, while Zipkin and Pinpoint do not. Visualisations (e.g., sequence diagrams) help pinpoint slow or failing nodes.
Metrics
Metrics aggregate statistical data such as request counts, P90 latency, and error rates. The open‑source Cat system (by Dianping) provides rich metric dashboards.
Dog Overview
Dog focuses on metrics with optional tracing. Automatic instrumentation covers HTTP entry points, MySQL (MyBatis interceptor), Redis (RedisTemplate enhancement), cross‑service calls (Feign/Dubbo/gRPC proxies), HTTP client calls, and log error reporting. Manual instrumentation is also supported via a plugin.
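To make the proxy-based instrumentation of cross-service calls more concrete, here is a minimal sketch that wraps a client interface in a JDK dynamic proxy and records the outcome and duration of every call. It does not use Feign, Dubbo or gRPC hooks, and the names `UserClient`, `CallRecord` and `TracingProxy` (as well as the "Call" type label) are hypothetical; a real Dog client would presumably report a Transaction rather than append to an in-memory list.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical remote client interface, standing in for a Feign/Dubbo/gRPC stub. */
interface UserClient {
    String findName(long userId);
}

/** Hypothetical record of one instrumented call. */
final class CallRecord {
    final String type;
    final String name;
    final boolean success;
    final long durationNanos;

    CallRecord(String type, String name, boolean success, long durationNanos) {
        this.type = type;
        this.name = name;
        this.success = success;
        this.durationNanos = durationNanos;
    }
}

final class TracingProxy {
    /** Collected records (not thread-safe; a real client would hand them to a reporter). */
    static final List<CallRecord> RECORDS = new ArrayList<>();

    /** Returns a proxy that times every call on {@code target} and records whether it threw. */
    @SuppressWarnings("unchecked")
    static <T> T wrap(Class<T> iface, T target) {
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(),
                new Class<?>[] {iface},
                (proxy, method, args) -> {
                    long start = System.nanoTime();
                    boolean ok = false;
                    try {
                        Object result = method.invoke(target, args);
                        ok = true;
                        return result;
                    } catch (InvocationTargetException e) {
                        throw e.getCause(); // surface the callee's own exception
                    } finally {
                        String name = iface.getSimpleName() + "." + method.getName();
                        RECORDS.add(new CallRecord("Call", name, ok, System.nanoTime() - start));
                    }
                });
    }
}
```

Calling `TracingProxy.wrap(UserClient.class, realClient)` returns an object that behaves like `realClient` while adding one `CallRecord` per invocation, so application code does not need to change.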
Client Data Model
Each Message has type, name and status fields, enabling aggregation by type, by type+name, or by type+name+status. Two subclasses exist:
Event – counts occurrences.
Transaction – includes a duration and can nest children to form a trace tree.
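The model above could be laid out roughly as follows. This is only a sketch of the shape of the classes, not Dog's actual source: the field names, the `aggregationKey()` helper and the `addChild` method are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Base message: type, name and status are the aggregation dimensions. */
abstract class Message {
    final String type;
    final String name;
    private String status;

    Message(String type, String name, String status) {
        this.type = type;
        this.name = name;
        this.status = status;
    }

    void setStatus(String status) { this.status = status; }
    String status() { return status; }

    /** Hypothetical helper: key for aggregating by type + name + status. */
    String aggregationKey() { return type + "|" + name + "|" + status; }
}

/** An Event marks that something happened; it is counted, not timed. */
final class Event extends Message {
    Event(String type, String name, String status) { super(type, name, status); }
}

/** A Transaction is timed and may hold child messages, which forms the tree. */
final class Transaction extends Message {
    private final List<Message> children = new ArrayList<>();
    private final long startNanos = System.nanoTime();
    private long durationNanos = -1; // -1 until finish() is called
    private boolean success;

    Transaction(String type, String name) { super(type, name, ""); }

    void addChild(Message child) { children.add(child); }
    List<Message> children() { return Collections.unmodifiableList(children); }

    void setSuccess(boolean success) { this.success = success; }
    boolean isSuccess() { return success; }

    /** Stops the clock; the duration is measured from construction. */
    void finish() { durationNanos = System.nanoTime() - startNanos; }
    long durationNanos() { return durationNanos; }
}
```

Keeping `Event` free of timing fields keeps counting cheap, while only `Transaction` pays for a clock read and a child list.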
Sample code demonstrates creating nested transactions and events:
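public void test() {
    Transaction transaction = Dog.newTransaction("URL", "/test/user");
    try {
        Dog.logEvent("User", "name-xxx", "status-yyy");
        // do something
        Transaction sql = Dog.newTransaction("SQL", "UserMapper.insert");
        // run the SQL here, then close the nested transaction so its duration is recorded
        sql.setSuccess(true);
        sql.finish();
        transaction.setStatus("xxxx");
        transaction.setSuccess(true);
    } catch (Throwable throwable) {
        // mark the failure and attach the stack trace (Throwables is Guava's com.google.common.base.Throwables)
        transaction.setSuccess(false);
        transaction.setData(Throwables.getStackTraceAsString(throwable));
        throw throwable;
    } finally {
        // always finish the root transaction, on success or failure
        transaction.finish();
    }
}
Dog Server Design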
Clients send Tree objects (root transaction + environment info) to Kafka. On the server, a single consumer thread reads messages in batches, flattens each tree into a flat structure, and dispatches it to two Disruptor pipelines (single‑producer, single‑consumer) for high‑performance processing.
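The flattening step can be sketched as a depth-first walk that turns the nested tree into an ordered list, which is the kind of structure that could then be published to the processing pipelines. The sketch below covers only the flattening; it does not use Kafka or the Disruptor, and `TreeNode`, `FlatMessage` and `TreeFlattener` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical in-memory form of one node of a received tree. */
final class TreeNode {
    final String type;
    final String name;
    final boolean success;
    final long durationMillis;
    final List<TreeNode> children = new ArrayList<>();

    TreeNode(String type, String name, boolean success, long durationMillis) {
        this.type = type;
        this.name = name;
        this.success = success;
        this.durationMillis = durationMillis;
    }

    TreeNode add(TreeNode child) {
        children.add(child);
        return this;
    }
}

/** One row of the flattened tree; depth keeps track of the nesting level. */
final class FlatMessage {
    final String type;
    final String name;
    final boolean success;
    final long durationMillis;
    final int depth;

    FlatMessage(String type, String name, boolean success, long durationMillis, int depth) {
        this.type = type;
        this.name = name;
        this.success = success;
        this.durationMillis = durationMillis;
        this.depth = depth;
    }
}

final class TreeFlattener {
    /** Iterative pre-order traversal: parents come before their children. */
    static List<FlatMessage> flatten(TreeNode root) {
        List<FlatMessage> out = new ArrayList<>();
        Deque<TreeNode> nodes = new ArrayDeque<>();
        Deque<Integer> depths = new ArrayDeque<>();
        nodes.push(root);
        depths.push(0);
        while (!nodes.isEmpty()) {
            TreeNode node = nodes.pop();
            int depth = depths.pop();
            out.add(new FlatMessage(node.type, node.name, node.success, node.durationMillis, depth));
            // Push children in reverse so they are popped in their original order.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                nodes.push(node.children.get(i));
                depths.push(depth + 1);
            }
        }
        return out;
    }
}
```

An explicit stack is used instead of recursion so that an unusually deep tree cannot overflow the consumer thread's call stack.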
Processors
Transaction Processor – aggregates per‑minute statistics (count, failCount, min, max, avg, P90/P95/P99) keyed by time, host, type/name/status, and stores results in ClickHouse.
Sample Processor – retains up to 5 successful, 5 failed, and 5 slow samples per minute for each key.
Problem Processor – collects messages with success=false for error reporting.
Heartbeat Processor – gathers system health metrics (CPU load, GC stats, heap usage, thread counts) and stores them in ClickHouse.
Business Processor – persists optional business data (userId, bizId, extensible fields) for downstream analysis.
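As an illustration of the per-minute aggregation performed by the Transaction Processor, the sketch below groups durations by minute, host, type, name and status, and derives count, failCount, min, max, avg and percentiles. It keeps every duration in memory and uses the nearest-rank percentile method; both are simplifying assumptions, since the article does not say how Dog computes percentiles. `MinuteStats` and `TransactionAggregator` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Statistics for one key within one minute. */
final class MinuteStats {
    long count;
    long failCount;
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    long sum;
    private final List<Long> durations = new ArrayList<>();

    void record(long durationMillis, boolean success) {
        count++;
        if (!success) failCount++;
        min = Math.min(min, durationMillis);
        max = Math.max(max, durationMillis);
        sum += durationMillis;
        durations.add(durationMillis);
    }

    double avg() {
        return count == 0 ? 0.0 : (double) sum / count;
    }

    /** Nearest-rank percentile, e.g. p = 90 for P90. */
    long percentile(int p) {
        if (durations.isEmpty()) throw new IllegalStateException("no samples");
        List<Long> sorted = new ArrayList<>(durations);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.size() / 100.0);
        return sorted.get(Math.max(rank, 1) - 1);
    }
}

final class TransactionAggregator {
    private final Map<String, MinuteStats> byKey = new HashMap<>();

    /** Builds the aggregation key; the timestamp is truncated to the minute. */
    static String key(long epochMillis, String host, String type, String name, String status) {
        long minute = epochMillis / 60_000L;
        return minute + "|" + host + "|" + type + "|" + name + "|" + status;
    }

    void record(long epochMillis, String host, String type, String name, String status,
                long durationMillis, boolean success) {
        byKey.computeIfAbsent(key(epochMillis, host, type, name, status), k -> new MinuteStats())
             .record(durationMillis, success);
    }

    /** Returns the stats for a key, or null if nothing was recorded for it. */
    MinuteStats get(long epochMillis, String host, String type, String name, String status) {
        return byKey.get(key(epochMillis, host, type, name, status));
    }
}
```

At the end of each minute the entries for that minute would be written out as rows and dropped from the map. A production system would more likely replace the raw duration list with fixed histogram buckets to bound memory.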
Storage
Metrics and samples are written to ClickHouse in bulk (≥10 000 rows per batch). Full trace trees are compressed (gzip) and stored in Cassandra keyed by treeId to enable later reconstruction of request flows.
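Both storage ideas, bulk writes and gzip-compressed trees, can be sketched with the JDK alone. The example below does not talk to ClickHouse or Cassandra: `BatchBuffer` hands full batches to a caller-supplied sink where a real implementation would issue a bulk insert, and `TreeCodec` assumes the tree has already been serialized to a string. Both class names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Gzip round trip for an already-serialized tree. */
final class TreeCodec {
    static byte[] compress(String serializedTree) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(serializedTree.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static String decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}

/** Collects rows and passes them to the sink once batchSize rows are pending. */
final class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private List<T> pending = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        this.batchSize = batchSize;
        this.sink = sink;
    }

    void add(T row) {
        pending.add(row);
        if (pending.size() >= batchSize) flush();
    }

    /** Sends whatever is pending; call this on a timer and on shutdown as well. */
    void flush() {
        if (pending.isEmpty()) return;
        List<T> batch = pending;
        pending = new ArrayList<>();
        sink.accept(batch);
    }
}
```

With `new BatchBuffer<>(10_000, rows -> ...)` the sink is invoked once per 10 000 rows, which matches the batch size quoted above; the explicit `flush()` covers quiet periods in which a batch would otherwise never fill.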
Conclusion
Dog mirrors many concepts from Cat but implements its own pipeline, leveraging Kafka, Disruptor, ClickHouse, and Cassandra to provide a lightweight, metrics‑centric APM solution with optional tracing capabilities.