How to Build a TB‑Scale Log Monitoring System with ELK Stack

This article explains how to design and implement a TB‑level log monitoring platform for micro‑service environments using ELK Stack, Filebeat, Elastic APM, Kafka Streams, Prometheus, and Grafana, covering data collection, filtering, storage, and visualization while addressing cost and resource constraints.

IT Architects Alliance

Oct 14, 2021

In large micro‑service deployments, thousands of services generate massive amounts of logs that are essential for troubleshooting, performance tuning, and business analysis. Storing logs locally on each node makes it difficult to locate relevant data and to extract business value.

The proposed solution centralizes log collection, processing, and visualization using the ELK Stack (Elasticsearch, Logstash, Kibana) together with complementary tools such as Filebeat, Elastic APM, Kafka Streams, Prometheus, and Grafana.

Solution Overview

The architecture consists of the following key steps:

Uniform log collection and cleansing.

Generation of visual dashboards, alerts, and searchable indices.

Log Collection Architecture

Each service node runs a Filebeat instance that is configured through a backend UI. Filebeat forwards logs to Kafka topics; the service-to-topic mapping can be one-to-one or many-to-one, depending on log volume. In addition to business logs, MySQL slow-query logs, error logs, and third-party service logs (e.g., Nginx) are also collected.
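As a rough illustration of the volume-based mapping described above, the following Python sketch assigns each service either a dedicated topic (one-to-one) or a shared topic (many-to-one). The threshold value, the topic names, and the `assign_topics` function are all assumptions made for this example; the article does not state how the mapping is actually decided.

```python
from typing import Dict

# Hypothetical cut-off: services logging at or above this rate get their own topic.
DEDICATED_TOPIC_THRESHOLD_MB_PER_MIN = 50.0
# Hypothetical name of the topic shared by low-volume services.
SHARED_TOPIC = "logs-shared"


def assign_topics(
    volume_mb_per_min: Dict[str, float],
    threshold: float = DEDICATED_TOPIC_THRESHOLD_MB_PER_MIN,
) -> Dict[str, str]:
    """Map each service name to a Kafka topic name based on its log volume."""
    mapping: Dict[str, str] = {}
    for service, volume in volume_mb_per_min.items():
        if volume >= threshold:
            # High-volume service: one-to-one mapping to a dedicated topic.
            mapping[service] = f"logs-{service}"
        else:
            # Low-volume service: many-to-one mapping onto the shared topic.
            mapping[service] = SHARED_TOPIC
    return mapping


if __name__ == "__main__":
    print(assign_topics({"order": 120.0, "coupon": 3.5, "nginx": 80.0}))
```

In practice the resulting mapping would be pushed into each Filebeat configuration by the backend UI; how that is done is not covered in the article.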

Elastic APM agents are deployed to capture HTTP call traces, method call stacks, SQL statements, and CPU and memory metrics without modifying application code. Although APM helps detect more than 80% of incidents, it does not support every language (for example, C) and does not capture non-error or custom business logs, so Filebeat remains necessary.

Additional components include:

Custom extensions to the APM agent for detailed GC, heap, thread, and memory metrics.

Prometheus for server‑level metrics collection.

Log Filtering and Storage Strategy

Because retaining every log indefinitely is infeasible at this volume, logs are first ingested into a Kafka cluster with a short retention window (one hour by default). A custom service called Log Streams, built on Kafka Streams, then performs ETL filtering by applying dynamic rules such as:

Default full collection of Error‑level logs.

Windowed collection of the logs surrounding each error: a configurable N minutes around the error timestamp, at INFO level by default.

Per‑service configuration of up to 100 key logs, collected in full.

Business‑specific slow‑SQL filtering and re‑aggregation.

Real‑time statistics of business SQL for DBA optimization.

Dynamic weighting and quota limits per service during peak periods.

Time‑based window adjustments.

Index naming per service and log level (debug, info, error, custom) with date suffixes.
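The first three rules above (full Error collection, a window around each error, and per-service key logs) can be sketched in plain Python. This is a batch approximation for illustration only, not the Kafka Streams implementation the article refers to. The `LogEvent` type, the `filter_logs` function, the per-service scoping of the error window, and the modelling of "key logs" as message substrings are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Mapping, Sequence

# Numeric ranks so that levels can be compared.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}


@dataclass(frozen=True)
class LogEvent:
    service: str
    timestamp: float  # seconds since epoch
    level: str
    message: str


def filter_logs(
    events: Iterable[LogEvent],
    window_seconds: float = 60.0,
    window_min_level: str = "INFO",
    key_phrases: Mapping[str, Sequence[str]] = {},
) -> List[LogEvent]:
    """Keep errors, key logs, and logs near an error of the same service."""
    ordered = sorted(events, key=lambda e: e.timestamp)

    # Collect error timestamps per service (assumed scope of the window).
    error_times: Dict[str, List[float]] = {}
    for e in ordered:
        if e.level == "ERROR":
            error_times.setdefault(e.service, []).append(e.timestamp)

    kept: List[LogEvent] = []
    for e in ordered:
        # Rule 1: Error-level logs are always collected.
        if e.level == "ERROR":
            kept.append(e)
            continue
        # Rule 3: configured key logs are collected in full.
        if any(p in e.message for p in key_phrases.get(e.service, ())):
            kept.append(e)
            continue
        # Rule 2: logs within the window around an error, at or above
        # the minimum level (INFO by default).
        near_error = any(
            abs(e.timestamp - t) <= window_seconds
            for t in error_times.get(e.service, [])
        )
        if near_error and LEVELS.get(e.level, 0) >= LEVELS[window_min_level]:
            kept.append(e)
    return kept
```

A streaming version would need to buffer recent events so that logs preceding an error can still be emitted once the error arrives; the one-hour Kafka retention window is what makes that kind of look-back possible.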

Filtered logs are stored in Elasticsearch, while metrics are stored in Prometheus. Grafana visualizes both data sources, and Kibana is used for APM‑specific analysis.
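The index naming rule mentioned earlier (per service and log level, with a date suffix) might look like the sketch below. The exact pattern `<service>-<level>-<YYYY.MM.DD>`, the `index_name` function, and the choice to fold unrecognized levels into `custom` are assumptions; the article only states which components the name contains.

```python
from datetime import datetime, timezone

# Level buckets named in the article.
ALLOWED_LEVELS = {"debug", "info", "error", "custom"}


def index_name(service: str, level: str, ts: datetime) -> str:
    """Build an index name from service, level, and the UTC date of ts.

    ts is expected to be a timezone-aware datetime.
    """
    level = level.lower()
    if level not in ALLOWED_LEVELS:
        # Assumption: anything outside the named levels goes to "custom".
        level = "custom"
    day = ts.astimezone(timezone.utc).strftime("%Y.%m.%d")
    # Elasticsearch index names must be lowercase.
    return f"{service.lower()}-{level}-{day}"


if __name__ == "__main__":
    when = datetime(2021, 10, 14, 8, 30, tzinfo=timezone.utc)
    print(index_name("order-service", "ERROR", when))
```

Date-suffixed indices make retention straightforward, since old data can be removed by deleting whole indices rather than individual documents.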

Visualization

The final dashboards provide searchable log interfaces, alerting rules, and performance charts, enabling developers and operations teams to quickly pinpoint issues and monitor system health.