Building a Scalable TB‑Level Log Monitoring System with ELK Stack
This article explains how to design and implement a TB‑scale log monitoring solution using the ELK stack, FileBeat, Elastic APM, Kafka Streams, Prometheus and Grafana, detailing architecture, data collection, filtering, visualization, and the trade‑offs of resource usage in large‑scale microservice environments.
Our Solution
To handle TB‑level logs in a micro‑service environment, we unified log output to a central system, performed filtering and cleaning, and generated visual dashboards, alerts, and searchable indices for operations and development teams.
Key capabilities include:
Unified log collection and filtering.
Visual monitoring, alerting, and log search.
Function Flow Overview
Real‑time log collection from each service node.
Unified log collection service performs filtering, cleaning, and generates visual dashboards and alerts.
Our Architecture
We use FileBeat as the log‑file collector: each machine runs a FileBeat instance that ships the collected logs to designated Kafka topics. In addition to business logs, we also collect MySQL slow‑query and error logs, as well as logs from third‑party services such as Nginx.
Elastic APM agents collect call stacks, tracing, process metrics, and SQL usage without requiring code changes. However, APM does not support all languages (e.g., C) and cannot capture non‑error or custom business logs, which is why FileBeat remains necessary.
We have extended the APM agent to gather detailed GC, heap, memory, and thread information.
Server‑side metrics are collected with Prometheus.
Because our SaaS platform hosts many services, including legacy systems, we cannot enforce a single log format everywhere, so collection has to be flexible.
To reduce resource consumption, we first ingest the full log stream into a Kafka cluster with a short retention window (typically one hour), then filter and clean it before anything is stored downstream.
Log Streams is our log‑filtering and cleaning service. We apply dynamic filtering rules using Kafka Streams as an ETL processor, allowing configuration of log levels, time‑windowed collection around error events, and business‑specific key logs.
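As a rough illustration of what a dynamically configurable rule might look like, here is a minimal sketch in plain Python. It is not the Log Streams implementation, which runs on Kafka Streams; the `ServiceRule` class, its field names, the `order-service` entry, and the `payment_id` pattern are all hypothetical. It only shows the idea of a per-service rule that keeps logs at or above a level threshold plus any log matching a configured key pattern.

```python
import re
from dataclasses import dataclass, field

# Numeric ranks so that levels can be compared.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}


@dataclass
class ServiceRule:
    """Hypothetical per-service rule; names are illustrative only."""
    min_level: str = "ERROR"                           # keep this level and above
    key_patterns: list = field(default_factory=list)   # regexes marking key logs

    def keep(self, level: str, message: str) -> bool:
        # Unknown levels rank as 0, so they pass only via a key pattern.
        if LEVELS.get(level.upper(), 0) >= LEVELS[self.min_level]:
            return True
        return any(re.search(p, message) for p in self.key_patterns)


# Rules live in a plain dict so they can be replaced at runtime,
# for example after a configuration change.
DEFAULT_RULE = ServiceRule()
rules = {"order-service": ServiceRule(key_patterns=[r"payment_id=\d+"])}


def should_keep(service: str, level: str, message: str) -> bool:
    """Apply the rule for `service`, falling back to the default rule."""
    return rules.get(service, DEFAULT_RULE).keep(level, message)
```

Under these assumptions, `should_keep("order-service", "INFO", "payment_id=42 accepted")` is kept because it matches a key pattern, while an ordinary Info line from the same service is dropped.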
Typical filtering rules include:
Default full collection of Error‑level logs.
Windowed collection of surrounding Info‑level logs around error timestamps.
Per‑service configuration of up to 100 key logs with full collection.
Business‑specific slow‑SQL filtering and aggregation.
Real‑time statistics of business SQL for DBA optimization.
Dynamic weighting and limiting of logs based on peak periods, log level, and service quotas.
Adjustable time windows for different periods.
Index naming per service and log level with date suffixes to match developer habits.
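Two of the rules above, windowed collection around errors and per-service index naming, can be sketched as follows. This is a simplified single-process Python illustration rather than the Kafka Streams code used in production. The record layout `(timestamp, level, message)`, the five-second default windows, and the exact index name format are assumptions made for the example.

```python
from collections import deque
from datetime import date


def windowed_filter(records, before=5.0, after=5.0):
    """Yield ERROR records plus the INFO records surrounding them.

    `records` is an iterable of (timestamp_seconds, level, message)
    tuples, assumed to arrive in timestamp order.
    """
    pending = deque()              # recent non-error records, not yet emitted
    emit_until = float("-inf")     # end of the current post-error window
    for ts, level, msg in records:
        # Discard buffered records too old to fall inside any future window.
        while pending and pending[0][0] < ts - before:
            pending.popleft()
        if level == "ERROR":
            # Flush the context that preceded the error, then the error itself.
            while pending:
                yield pending.popleft()
            yield (ts, level, msg)
            emit_until = ts + after
        elif ts <= emit_until:
            # Still inside the window that follows the last error.
            yield (ts, level, msg)
        else:
            pending.append((ts, level, msg))


def index_name(service: str, level: str, day: date) -> str:
    """Build a per-service, per-level index name with a date suffix.

    The format is an assumption; Elasticsearch does require
    index names to be lowercase.
    """
    return f"{service}-{level}-{day:%Y.%m.%d}".lower()
```

Making `before` and `after` parameters reflects the adjustable time windows mentioned in the list: a scheduler could pass narrower windows during peak periods to cut volume.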
For visualization we primarily use Grafana, which integrates seamlessly with Prometheus and Elasticsearch, while Kibana is used for APM visual analysis.
Log Visualization
MaGe Linux Operations
