Building a TB‑Scale Log Monitoring System with ELK Stack and Kafka Streams

This article explains how to design and implement a terabyte‑level log monitoring platform using ELK Stack, FileBeat, Elastic APM, Kafka Streams, Prometheus, and Grafana, covering data collection, filtering, visualization, and resource‑efficient processing for large‑scale microservice environments.

Programmer DD

Jan 11, 2022

This article introduces a solution for building a TB‑scale log monitoring system based on the ELK Stack, aimed at enterprise microservice environments where hundreds of services generate massive volumes of logs.

Our Solution

The system unifies log collection, filtering, and cleaning, then provides visual dashboards, alerts, and searchable interfaces.

Unified log collection and cleaning.

Visual monitoring, alerting, and searchable logs.

Functional Flow Overview

Instrument each service node to collect logs in real time.

A centralized collection service filters and cleans the logs, then generates visual dashboards and alerts.

Architecture

① Log collection uses FileBeat; each machine runs a FileBeat instance, and its logs are mapped to Kafka topics one‑to‑one or many‑to‑one depending on log volume.

In addition to business logs, MySQL slow query/error logs and third‑party logs (e.g., Nginx) are also collected.

② Elastic APM agents gather call stacks, traces, process metrics, and SQL usage without modifying application code.

APM does not cover every language (e.g., C) and does not collect non‑error or custom business logs, so FileBeat remains necessary.

③ The APM agent is extended to collect detailed GC, heap, memory, and thread information.

④ Server‑side metrics are collected with Prometheus.

⑤ Because the SaaS platform hosts numerous services with heterogeneous log formats, a uniform logging standard is impractical without invasive code changes.

⑥ Logs are first ingested into a Kafka cluster with a short retention window (one hour) to limit resource usage; the Log Streams module, built on Kafka Streams, then filters and cleans them.

⑦ Visualization uses Grafana (integrated with Prometheus and Elasticsearch) and Kibana for APM analysis.
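
The topic mapping in ① and the one‑hour retention in ⑥ can be sketched roughly as follows. This is only an illustration, not the author's implementation: the function names, the `logs-` topic prefix, the shared topic name, and the 50 GB/day threshold are all hypothetical. The one real setting used is Kafka's topic‑level `retention.ms`, set here to one hour.

```python
# Assumption: one hour, matching the retention window stated in the article.
RETENTION_MS = 60 * 60 * 1000  # value for Kafka's topic-level `retention.ms`

# Hypothetical threshold: services at or above this volume get their own topic.
DEDICATED_TOPIC_MIN_GB_PER_DAY = 50.0


def plan_topics(daily_volume_gb, shared_topic="logs-shared"):
    """Map each service name to a Kafka topic name.

    High-volume services get a dedicated topic (one-to-one);
    all remaining services share one topic (many-to-one).
    """
    plan = {}
    for service, volume in daily_volume_gb.items():
        if volume >= DEDICATED_TOPIC_MIN_GB_PER_DAY:
            plan[service] = f"logs-{service}"
        else:
            plan[service] = shared_topic
    return plan


def topic_configs(plan):
    """Return the settings to apply to every topic that appears in the plan."""
    return {
        topic: {"retention.ms": str(RETENTION_MS)}
        for topic in set(plan.values())
    }
```

For example, calling `plan_topics({"order": 120.0, "mail": 0.8})` would place `order` on its own topic and `mail` on the shared one; the resulting names could then be used both in each FileBeat instance's Kafka output and when creating the topics.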

Log Filtering Rules (Kafka Streams)

Error‑level logs are collected in full by default.

Info‑level logs within a configurable time window around each error timestamp are collected as well.

Each service can configure up to 100 key logs for full collection.

Slow SQL queries are further filtered by business category.

Business SQL is counted in real time during peak periods to aid DBA optimization.

During high‑traffic windows, logs are cleaned dynamically according to service weight, log level, and per‑service quota.

Time windows are adjusted dynamically based on traffic patterns.
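
The first three rules above (all errors, Info logs near an error, and configured key logs) can be illustrated with a small batch sketch. It is written in plain Python rather than Kafka Streams, and it assumes the events of one service are already available as a list; a streaming implementation would need to buffer recent Info events in order to look backwards from an error. The `LogEvent` type, the `select_events` function, the 5‑second default window, and matching key logs by substring are all assumptions made for illustration; only the limit of 100 key logs comes from the article.

```python
from dataclasses import dataclass

MAX_KEY_LOGS = 100  # per-service limit stated in the article


@dataclass(frozen=True)
class LogEvent:
    service: str
    level: str        # e.g. "ERROR", "INFO"
    timestamp: float  # seconds since epoch
    message: str


def select_events(events, window_seconds=5.0, key_phrases=()):
    """Return the events to keep for one service, preserving input order."""
    if len(key_phrases) > MAX_KEY_LOGS:
        raise ValueError("a service may configure at most 100 key logs")

    error_times = [e.timestamp for e in events if e.level == "ERROR"]

    def near_error(ts):
        # True if ts lies within the window before or after any error.
        return any(abs(ts - t) <= window_seconds for t in error_times)

    kept = []
    for e in events:
        if e.level == "ERROR":
            kept.append(e)  # rule 1: errors are always kept
        elif e.level == "INFO" and near_error(e.timestamp):
            kept.append(e)  # rule 2: Info context around an error
        elif any(p in e.message for p in key_phrases):
            kept.append(e)  # rule 3: configured key logs
    return kept
```

The linear scan over `error_times` keeps the sketch short; with many errors per batch, a sorted list and binary search would be the more natural choice.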

Log indices are generated per service and log level with date suffixes to match developers' existing habits.
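
A naming helper for such indices might look like the sketch below. The exact pattern (`service-level-YYYY.MM.DD`) is an assumption, since the article does not spell out the format; the lowercasing is there because Elasticsearch index names must be lowercase.

```python
from datetime import date


def index_name(service, level, day):
    """Build an index name from service, log level, and date.

    Hypothetical format: <service>-<level>-<YYYY.MM.DD>, all lowercase.
    """
    return f"{service.lower()}-{level.lower()}-{day:%Y.%m.%d}"
```

With this scheme, a developer looking for yesterday's errors from one service only needs to query a single, predictably named index.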

The filtered logs are visualized in Grafana dashboards, providing comprehensive monitoring across the entire microservice landscape.

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
