Design and Implementation of a Distributed Real-Time Log Collection and Analysis System Using the ELK/EFK Stack
This article describes the background, requirements, architecture choices, performance testing, and lessons learned from building a large‑scale, distributed log collection and analysis platform at Hujiang using Elasticsearch, Logstash, Kibana, Filebeat, and Kafka to handle billions of log entries daily.
Hujiang, the largest online education platform in China, processes about 1 TB of logs per day (≈10⁹ entries) from multiple products, requiring a centralized system for efficient fault diagnosis, service monitoring, and data analysis.
The solution adopts the widely used ELK stack (Elasticsearch, Logstash, Kibana) and extends it to an EFK stack by adding Filebeat as a lightweight shipper. The stack versions are:
Elasticsearch 5.2.2
Logstash 5.2.2
Kibana 5.2.2
Filebeat 5.2.2
Kafka 2.10

Logstash acts as the data collection and processing engine, supporting inputs, filters, and outputs. Kibana provides visualization, while Elasticsearch offers distributed search and analytics. Filebeat replaces Logstash‑forwarder and runs without a Java runtime.
Simple Architecture
Logstash instances connect directly to Elasticsearch. Logstash reads logs via Input plugins (e.g., file, TCP), filters them (Grok, mutate, etc.), and writes to Elasticsearch via Output plugins.
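The simple architecture can be sketched as a single Logstash pipeline configuration. Paths, hosts, and the index name below are placeholders for illustration, not the production values:

```conf
input {
  file {
    path           => "/var/log/app/*.log"   # hypothetical application log path
    start_position => "beginning"
  }
}

filter {
  mutate {
    strip => ["message"]   # trim surrounding whitespace before further parsing
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"   # one index per day
  }
}
```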
Example Grok filter:
grok {
match => ["message", "(?m)\[%{LOGLEVEL:level}\] \[%{TIMESTAMP_ISO8601:timestamp}\] \[%{DATA:logger}\] \[%{DATA:threadId}\] \[%{DATA:requestId}\] %{GREEDYDATA:msgRawData}"]
}

Cluster Architecture
Multiple Elasticsearch nodes form a cluster; several Logstash indexer instances run in parallel, and a Logstash Shipper Agent is deployed on each application server to forward logs.
Drawbacks include high resource consumption on Logstash agents and potential data loss under high concurrency.
Introducing a Message Queue
To buffer traffic spikes, logs are sent from Logstash Shipper Agents to a Kafka cluster before reaching Elasticsearch, greatly reducing the risk of data loss under high concurrency. Kafka is preferred over Redis because it persists messages to disk and offers far higher storage capacity.
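With Kafka in the middle, the shipper and indexer sides each run one half of the pipeline. A minimal sketch using Logstash 5.x plugin options, with hypothetical broker addresses and topic name:

```conf
# Shipper side: publish events to the Kafka buffer
output {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092"   # hypothetical broker list
    topic_id          => "app-logs"                  # hypothetical topic
  }
}

# Indexer side: consume from Kafka and write to Elasticsearch
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092"
    topics            => ["app-logs"]
  }
}
output {
  elasticsearch {
    hosts => ["es1:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```

Because Kafka retains messages on disk, the indexers can fall behind during a spike and catch up later without the shippers dropping events.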
Multi‑Datacenter Deployment
Each datacenter runs its own independent Logstash, Elasticsearch, Kafka, and Kibana clusters, forming a closed loop that avoids cross‑datacenter traffic and latency.
Introducing Filebeat
Filebeat, written in Go, consumes far less CPU and memory than Logstash. Example Filebeat configuration:
# filebeat.yml
filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/nginx/access.log
  json.message_key:
output.elasticsearch:
  hosts: ["localhost"]
  index: "filebeat-nginx-%{+yyyy.MM.dd}"

Performance tests showed Filebeat using ~38% CPU versus Logstash's ~54% while processing logs roughly 7× faster.
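Since the cluster architecture buffers logs in Kafka, Filebeat can also ship directly to the brokers instead of Elasticsearch. A sketch of the alternative output section (broker addresses and topic name are assumptions, not the production values):

```yaml
# In filebeat.yml, replace output.elasticsearch with a Kafka output
output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092"]   # hypothetical broker list
  topic: "nginx-access"                   # hypothetical topic name
```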
Lessons Learned
Indexer processes may crash; use a supervisor to keep them running.
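One common supervisor choice is supervisord; a minimal program entry, assuming default Logstash install paths (adjust to your layout), might look like:

```ini
; /etc/supervisor/conf.d/logstash-indexer.conf (hypothetical path)
[program:logstash-indexer]
command=/usr/share/logstash/bin/logstash -f /etc/logstash/indexer.conf
autostart=true
autorestart=true
startretries=3
stderr_logfile=/var/log/supervisor/logstash-indexer.err.log
```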
Java exception stack traces span multiple lines; merge them into a single event with Logstash's multiline codec:
input {
stdin {
codec => multiline {
pattern => "^\["
negate => true
what => "previous"
}
}
}

Time‑zone mismatches can cause an 8‑hour offset: Kibana renders timestamps in the browser's time zone, so timestamps must be stored in UTC (or parsed with the correct source zone) in Elasticsearch.
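One way to avoid the offset is to parse the original timestamp with an explicit time zone so that Elasticsearch stores UTC. A sketch using the field name from the Grok example above (the source zone is an assumption):

```conf
filter {
  date {
    match    => ["timestamp", "ISO8601"]
    timezone => "Asia/Shanghai"   # assumed source zone (UTC+8)
    target   => "@timestamp"      # Kibana converts UTC back to the browser zone
  }
}
```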
Grok parse failures often stem from inconsistent log formats; ensure uniform logging and use online Grok debuggers.
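Grok tags events it cannot parse with `_grokparsefailure`, which makes format drift easy to spot. One option (a sketch with a simplified pattern, not the production one) is to handle those events explicitly rather than indexing half‑parsed documents:

```conf
filter {
  grok {
    match => ["message", "\[%{LOGLEVEL:level}\] %{GREEDYDATA:msg}"]   # simplified illustrative pattern
  }
  if "_grokparsefailure" in [tags] {
    drop { }   # or route to a separate index for inspection
  }
}
```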
Summary
The ELK/EFK‑based log solution offers high scalability (TB‑level daily data), ease of use through Kibana's visual interface, near‑real‑time query response, and an attractive dashboard, making it suitable for large‑scale log management in modern backend operations.
Hujiang Technology
We focus on the real-world challenges developers face, delivering authentic, practical content and a direct platform for technical networking among developers.