How to Build a TB‑Scale Log Monitoring System with the ELK Stack
This article explains how to design and implement a centralized log monitoring platform using the ELK stack, Filebeat, Elastic APM, Prometheus, and Kafka Streams to collect, filter, visualize, and alert on terabyte‑scale logs across thousands of microservices.
Our Solution
We built a log monitoring system that centralizes log collection, cleaning, and filtering, so that operations and development teams can get actionable data out of terabyte‑scale logs.
On top of that unified pipeline, the system provides visual dashboards, monitoring, alerts, and searchable logs.
Functional Flow Overview
Our Architecture
① Filebeat runs as the log collection agent on each service node. The backend UI lets us configure one Filebeat instance per machine, with services mapped to topics one‑to‑one or many‑to‑one depending on log volume. In addition to business logs, we collect MySQL slow‑query and error logs, as well as third‑party logs such as Nginx. Deployment is automated through our CI platform, which starts the Filebeat processes.
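The one‑to‑one versus many‑to‑one topic mapping can be pictured with the minimal sketch below. The volume threshold, the topic naming scheme, and the function name are assumptions made for illustration, not details of our platform: services with heavy log volume get a dedicated topic, and low‑volume services share one.

```python
from typing import Dict

# Hypothetical threshold: services at or above this daily volume get their own topic.
DEDICATED_TOPIC_THRESHOLD_GB = 50.0
# Assumed name for the shared (many-to-one) topic.
SHARED_TOPIC = "logs-shared"


def assign_topics(daily_volume_gb: Dict[str, float]) -> Dict[str, str]:
    """Map each service name to a topic name based on its daily log volume."""
    topics: Dict[str, str] = {}
    for service, volume in daily_volume_gb.items():
        if volume >= DEDICATED_TOPIC_THRESHOLD_GB:
            topics[service] = f"logs-{service}"  # one-to-one mapping
        else:
            topics[service] = SHARED_TOPIC       # many-to-one mapping
    return topics


if __name__ == "__main__":
    print(assign_topics({"order": 120.0, "coupon": 3.5, "nginx": 80.0}))
```

In practice the resulting mapping would be written into each agent's output configuration by the deployment tooling. That step is omitted here.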
② We use Elastic APM as a sidecar to gather call stacks, traces, process metrics, and SQL without modifying application code. APM captures HTTP call chains, internal method stacks, SQL statements, and CPU and memory usage. It has limits, however: it does not support all languages (e.g., C), it cannot collect the non‑error logs surrounding an error, and it cannot distinguish custom business exceptions from system exceptions, which may cause noisy alerts.
③ The APM agent is extended to collect detailed GC, heap, memory, and thread information.
④ Server‑level metrics are collected by Prometheus.
⑤ As a SaaS provider with many services, we cannot enforce a unified logging format across legacy services, and forcing code changes is impractical. Many logs are low‑value debug statements that increase storage costs.
To address resource constraints, we filter, clean, and dynamically adjust log collection priorities. All logs are first ingested into a Kafka cluster with a short retention window (typically one hour), allowing us to free up storage on the original services.
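As a rough illustration of adjusting collection priorities under load, the sketch below raises the minimum log level that is kept as the pipeline gets busier. The load thresholds, the priority table, and the function names are all assumptions made for this example. The actual policy is configured per service and is not shown in this article.

```python
# Assumed priority ordering: higher numbers are kept longer under load.
LEVEL_PRIORITY = {"ERROR": 3, "WARN": 2, "INFO": 1, "DEBUG": 0}


def min_priority_for_load(load_ratio: float) -> int:
    """Return the lowest priority still collected at the given load (0.0 to 1.0).

    The cut-off values are hypothetical and would be tuned per cluster.
    """
    if load_ratio >= 0.9:
        return 3  # near capacity: errors only
    if load_ratio >= 0.7:
        return 2  # busy: warnings and errors
    if load_ratio >= 0.5:
        return 1  # moderate: drop debug logs
    return 0      # idle: collect everything


def should_collect(level: str, load_ratio: float) -> bool:
    """Decide whether a log line of this level is collected at the current load."""
    # Unknown levels are treated as the lowest priority.
    priority = LEVEL_PRIORITY.get(level.upper(), 0)
    return priority >= min_priority_for_load(load_ratio)


if __name__ == "__main__":
    for level in ("ERROR", "INFO", "DEBUG"):
        print(level, should_collect(level, load_ratio=0.75))
```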
⑥ Log Streams is our log filtering and cleaning service. It uses Kafka Streams for ETL processing, with a UI for dynamic rule configuration. Sample rules include:
Collect all error‑level logs by default.
Open a time window around error events to also collect surrounding info‑level logs.
Each service can configure up to 100 key logs for full collection.
Filter slow SQL based on business categories.
Real‑time aggregation of business SQL for DBA optimization.
Dynamic cleaning based on peak load, log level, and per‑service limits.
Adjust time windows dynamically.
Generate indices per service and log level with date suffixes for familiar developer access.
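Two of the rules above, the time window around error events and the per‑service index naming, can be sketched as follows. This is plain Python over an in‑memory batch, not the Kafka Streams API the service actually uses. The `LogEvent` fields, the 30‑second window in the usage example, and the exact index name pattern are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class LogEvent:
    service: str
    level: str
    timestamp: datetime
    message: str


def select_with_error_window(events: List[LogEvent], window: timedelta) -> List[LogEvent]:
    """Keep every ERROR event, plus INFO events of the same service that fall
    within `window` before or after any of that service's ERROR events."""
    error_times: Dict[str, List[datetime]] = {}
    for event in events:
        if event.level == "ERROR":
            error_times.setdefault(event.service, []).append(event.timestamp)

    kept: List[LogEvent] = []
    for event in events:
        if event.level == "ERROR":
            kept.append(event)
        elif event.level == "INFO":
            times = error_times.get(event.service, [])
            if any(abs(event.timestamp - t) <= window for t in times):
                kept.append(event)
    return kept


def index_name(event: LogEvent) -> str:
    """Build an index name from service, log level, and a date suffix.

    The pattern is an assumed example; names are lowercased because
    Elasticsearch index names must be lowercase.
    """
    return f"{event.service.lower()}-{event.level.lower()}-{event.timestamp:%Y.%m.%d}"


if __name__ == "__main__":
    base = datetime(2024, 1, 1, 12, 0, 0)
    batch = [
        LogEvent("order", "INFO", base - timedelta(seconds=120), "too early"),
        LogEvent("order", "INFO", base - timedelta(seconds=10), "just before"),
        LogEvent("order", "ERROR", base, "failure"),
        LogEvent("order", "INFO", base + timedelta(seconds=20), "just after"),
    ]
    for kept_event in select_with_error_window(batch, timedelta(seconds=30)):
        print(index_name(kept_event), kept_event.message)
```

A streaming implementation has to buffer recent info‑level logs so that the ones preceding an error are still available when the error arrives. The batch version above sidesteps that by scanning the input twice.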
Log Visualization
We use Grafana for dashboards, which seamlessly integrates with Prometheus and Elasticsearch. Kibana is used for APM visual analysis.
We hope this design helps you build a scalable, cost‑effective log monitoring platform.
Java High-Performance Architecture
Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.
