
Building a TB‑Scale Log Monitoring System with ELK Stack

This article explains how to design and implement a TB‑scale log monitoring system for microservice environments using the ELK stack. It covers log collection with Filebeat, tracing with Elastic APM, resource‑efficient processing with Kafka Streams, and visualization with Grafana and Kibana.

Architecture Digest

In large‑scale microservice environments, logs are essential for troubleshooting, performance tuning, and business analysis. When they are scattered across the local disks of many nodes, however, retrieving them is difficult.

Solution Overview: Centralize log collection, filtering, and cleaning with the ELK stack so that operations and development teams get actionable data.

Architecture:

Log collection agents: Filebeat is deployed on each service node and configured through a backend UI. It handles both application logs and auxiliary logs such as MySQL slow‑query logs and Nginx error logs.

Tracing and metrics: Elastic APM gathers HTTP call chains, method stacks, SQL statements, CPU and memory usage without code changes, though it lacks support for some languages (e.g., C) and cannot capture custom non‑error logs.

Extended agents: Custom modifications collect detailed GC, heap, thread, and memory information.

Server‑side metrics: Prometheus monitors host‑level indicators.

Log processing: All logs are streamed into a Kafka cluster with a short retention window (one hour). Kafka Streams implements ETL filtering, cleaning, and dynamic rule configuration via a UI.
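To see why a one‑hour retention window keeps the Kafka tier small, the arithmetic can be sketched as below. This is a rough estimate under stated assumptions: the 1 TB/day volume and the replication factor of 3 are illustrative figures, not numbers from this system, and the helper names are hypothetical. `retention.ms` is Kafka's topic‑level setting for time‑based retention.

```python
# Illustrative sizing only: the daily volume and replication factor used below
# are assumed figures, not measurements from the system described here.

HOUR_MS = 60 * 60 * 1000


def topic_retention_config(retention_hours):
    """Return a Kafka topic-level config dict for a time-based retention window."""
    return {"retention.ms": str(int(retention_hours * HOUR_MS))}


def buffered_bytes(daily_bytes, retention_hours, replication_factor):
    """Approximate bytes held in Kafka at any moment, assuming a steady ingest rate."""
    per_hour = daily_bytes / 24
    return per_hour * retention_hours * replication_factor


if __name__ == "__main__":
    TB = 10 ** 12
    GB = 10 ** 9
    print(topic_retention_config(1))
    size = buffered_bytes(daily_bytes=1 * TB, retention_hours=1, replication_factor=3)
    print(round(size / GB), "GB buffered")
```

Under these assumed figures the buffer holds roughly 125 GB. Real disk usage is typically somewhat higher, because Kafka removes expired data one whole log segment at a time.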

Filtering rules include: always collect error‑level logs, windowed collection around error timestamps, per‑service key‑log limits, slow‑SQL filtering, and time‑based dynamic thresholds.
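Three of these rules (always keep errors, keep a window of context around each error, keep slow SQL) can be sketched as a small streaming filter. This is a plain‑Python illustration of the logic, not the Kafka Streams topology itself. The `LogEvent` fields, the 5‑second window, and the 500 ms slow‑SQL threshold are all assumptions made for the example, and events are assumed to arrive in timestamp order per service.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogEvent:
    service: str
    timestamp: float                    # seconds since epoch
    level: str                          # e.g. "INFO" or "ERROR"
    message: str
    sql_millis: Optional[float] = None  # set only for SQL log entries


class WindowedErrorFilter:
    """Hypothetical filter: keeps errors, their surrounding context, and slow SQL."""

    def __init__(self, window_s=5.0, slow_sql_ms=500.0):
        self.window_s = window_s
        self.slow_sql_ms = slow_sql_ms
        self._recent = {}      # service -> deque of held-back non-error events
        self._keep_until = {}  # service -> time until which every event is kept

    def process(self, event):
        """Return the events to forward for one incoming event, in arrival order."""
        buf = self._recent.setdefault(event.service, deque())
        # Discard held-back events too old to fall inside any future error window.
        while buf and buf[0].timestamp < event.timestamp - self.window_s:
            buf.popleft()

        if event.level == "ERROR":
            # Errors are always collected, together with the context held before them.
            out = list(buf) + [event]
            buf.clear()
            self._keep_until[event.service] = event.timestamp + self.window_s
            return out

        if event.timestamp <= self._keep_until.get(event.service, float("-inf")):
            # Inside the trailing window that follows an error.
            return [event]

        if event.sql_millis is not None and event.sql_millis >= self.slow_sql_ms:
            # Slow SQL statements are kept regardless of log level.
            return [event]

        buf.append(event)  # held as possible context; dropped if no error follows
        return []


if __name__ == "__main__":
    flt = WindowedErrorFilter()
    stream = [
        LogEvent("orders", 0, "INFO", "a"),
        LogEvent("orders", 10, "INFO", "b"),
        LogEvent("orders", 12, "ERROR", "c"),
        LogEvent("orders", 14, "INFO", "d"),
        LogEvent("orders", 30, "INFO", "e"),
        LogEvent("orders", 31, "INFO", "f", sql_millis=800),
    ]
    kept = [out.message for ev in stream for out in flt.process(ev)]
    print(kept)  # prints ['b', 'c', 'd', 'f']
```

Holding back non‑error events in a short buffer is what makes the "before the error" half of the window possible: low‑value lines are forwarded only if an error follows within the window, and are dropped otherwise.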

Visualization: Grafana (integrated with Prometheus and Elasticsearch) and Kibana (for APM data) provide dashboards, alerts, and searchable interfaces.

The system balances resource consumption by filtering out low‑value logs, using dynamic windows and priority settings, thereby avoiding excessive storage costs while still delivering timely insights for operations and development teams.
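The per‑service limits, time‑based thresholds, and priority settings mentioned above could be combined roughly as follows. This is a minimal sketch under assumptions: the class name, the peak hours (09:00 to 17:59 UTC), the halving factor, and the way priority multiplies the cap are all hypothetical choices for illustration, not the rules the system actually uses.

```python
from datetime import datetime, timezone


class ServiceLogBudget:
    """Cap forwarded logs per service per time bucket; the cap shrinks in peak hours."""

    def __init__(self, base_limit, peak_hours=range(9, 18), peak_factor=0.5, bucket_s=60):
        self.base_limit = base_limit    # logs allowed per bucket at priority 1, off-peak
        self.peak_hours = peak_hours    # UTC hours treated as peak traffic (assumed)
        self.peak_factor = peak_factor  # multiplier applied to the cap in peak hours
        self.bucket_s = bucket_s        # bucket length in seconds
        self._state = {}                # service -> (bucket index, count so far)

    def limit_at(self, timestamp, priority=1):
        """Return the cap that applies at this time for a service of this priority."""
        hour = datetime.fromtimestamp(timestamp, tz=timezone.utc).hour
        factor = self.peak_factor if hour in self.peak_hours else 1.0
        return int(self.base_limit * priority * factor)

    def allow(self, service, timestamp, priority=1):
        """Return True if this log fits in the service's budget for the current bucket."""
        bucket = int(timestamp // self.bucket_s)
        last_bucket, count = self._state.get(service, (bucket, 0))
        if last_bucket != bucket:
            count = 0  # a new bucket started, so the counter resets
        if count >= self.limit_at(timestamp, priority):
            self._state[service] = (bucket, count)
            return False
        self._state[service] = (bucket, count + 1)
        return True


if __name__ == "__main__":
    budget = ServiceLogBudget(base_limit=4)
    ten_am = 10 * 3600  # 10:00 UTC on the epoch day, inside the assumed peak hours
    print([budget.allow("orders", ten_am + i) for i in range(4)])
    print([budget.allow("billing", i) for i in range(5)])
```

Keeping only one counter per service keeps memory flat no matter how long the job runs, at the cost of using fixed rather than sliding time buckets.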

Tags: ELK, Grafana, log monitoring, Kafka Streams, Filebeat, Elastic APM
