Mastering EFK: Complete Guide to Building a Scalable Log Management Solution
This comprehensive guide walks you through building a scalable EFK log management solution, covering architecture components, high‑availability design, environment preparation, detailed Elasticsearch, Fluentd and Kibana deployment steps, index optimization, monitoring, alerting, security hardening, troubleshooting and best‑practice recommendations for modern cloud‑native operations.
1. Overview
EFK (Elasticsearch, Fluentd, Kibana) is a complete log collection, storage, analysis and visualization stack widely used in micro‑service and cloud‑native environments.
1.1 Components
Elasticsearch : distributed search and analytics engine for storing and retrieving log data.
Fluentd : open‑source data collector that gathers, filters and forwards logs.
Kibana : web UI for visualizing, querying and monitoring logs.
1.2 Technical Advantages
Unified log management across distributed systems.
Near‑real‑time search and analysis.
Rich visual dashboards.
High availability with cluster deployment and failover.
Scalable architecture that grows with business needs.
2. System Architecture Design
2.1 Overall Flow
Application → Fluentd Agent → Kafka/Redis → Fluentd Aggregator → Elasticsearch → Kibana
2.2 Layered Structure
Data source layer : application logs, system logs, container logs, network device logs.
Collection layer : Fluentd agents on each node collect logs locally, perform initial filtering and support multiple input formats.
Buffer layer : Kafka or Redis provides buffering and back‑pressure handling.
Aggregation layer : Fluentd aggregator performs data cleaning, formatting and routing.
Storage layer : Elasticsearch cluster with index management, sharding and backup.
Presentation layer : Kibana visualizes data with dashboards, alerts and custom panels.
2.3 High‑Availability Design
Elasticsearch : multi‑node cluster, master‑eligible nodes, data nodes, replica configuration and automatic failover.
Fluentd : multiple agent instances, load‑balancing, health checks and retry mechanisms.
Kibana : multiple instances behind a load balancer with session persistence.
3. Environment Preparation and Deployment
3.1 System Requirements
CPU ≥ 8 cores, Memory ≥ 16 GB, SSD ≥ 500 GB, 1 Gbps network.
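OS: CentOS 7/8, Ubuntu 18.04/20.04; Java: OpenJDK 11 or later (the Elasticsearch 7.x tarball ships with a bundled JDK, so a separate install is only needed when overriding it); Docker ≥ 19.03 (optional); Kubernetes ≥ 1.18 (optional).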
3.2 Elasticsearch Deployment
Single‑node :
# Download and extract
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.15.0-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.15.0-linux-x86_64.tar.gz
cd elasticsearch-7.15.0/
# Configuration (config/elasticsearch.yml)
cluster.name: efk-cluster
node.name: es-node-1
network.host: 0.0.0.0
http.port: 9200
discovery.type: single-node
# Start
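./bin/elasticsearch
Cluster deployment – master and data node configuration snippets set cluster.name, the node role (node.master and node.data, or node.roles from 7.9 onward, where the boolean settings are deprecated), network.host and discovery.seed_hosts. A new multi‑node cluster also needs cluster.initial_master_nodes for its first bootstrap.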
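Performance tuning – JVM options (-Xms16g -Xmx16g -XX:+UseG1GC …) and system parameters (vm.max_map_count=262144, fs.file-max=65536, etc.). A 16 GB heap assumes a host with at least 32 GB of RAM: keep the heap at or below half of physical memory so the remainder is available to the filesystem cache, and set -Xms equal to -Xmx.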
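The heap sizing rule can be written down as a small calculation. The sketch below assumes the common guidance of half of physical RAM, capped a little below 32 GB so the JVM keeps using compressed object pointers. The function names and the 31 GB cap are illustrative choices, not values taken from the original article.

```python
from typing import List


def recommended_heap_gb(total_ram_gb: float, cap_gb: float = 31.0) -> int:
    """Return a heap size in whole GB: half of RAM, capped at cap_gb."""
    if total_ram_gb <= 0:
        raise ValueError("total_ram_gb must be positive")
    half_of_ram = total_ram_gb / 2
    # Round down to a whole number of GB, but never below 1 GB.
    return max(1, int(min(half_of_ram, cap_gb)))


def jvm_heap_options(total_ram_gb: float) -> List[str]:
    """Build matching -Xms/-Xmx flags so the heap is never resized at runtime."""
    heap = recommended_heap_gb(total_ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]
```

For the 16 GB minimum host listed in section 3.1, this yields an 8 GB heap rather than the 16 GB shown in the example flags.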
3.3 Fluentd Deployment
Installation via official script or gem, followed by required plugins ( fluent-plugin-elasticsearch, fluent-plugin-kubernetes_metadata_filter, fluent-plugin-rewrite-tag-filter).
Core fluent.conf example shows a tail source for JSON logs, a record transformer adding hostname, timestamp and environment, and an Elasticsearch match with index, logstash format and retry settings.
Docker compose example runs Fluentd container with volume mounts for configuration and logs, exposing port 24224.
3.4 Kibana Deployment
Download, extract and configure kibana.yml (port 5601, host 0.0.0.0, Elasticsearch hosts, index name, logging).
Docker compose runs Kibana container, linking to Elasticsearch and exposing port 5601.
4. Log Collection Strategies
4.1 Application Logs
File‑based collection using tail source, JSON or custom parsers, and optional regex parsing for Nginx logs.
4.2 Container Logs
Use the kubernetes_metadata filter (provided by the fluent-plugin-kubernetes_metadata_filter plugin) to enrich logs with pod, namespace and node information.
4.3 System Logs
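Syslog source on UDP 514, with record transformer adding source_type and hostname. Binding to port 514 requires root privileges or a port redirect; Fluentd's syslog input defaults to 5140 for that reason.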
4.4 Metrics Collection
Exec source runs vmstat every minute and emits fields such as CPU, memory, I/O.
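As an illustration of what such a collector has to do, here is a minimal Python sketch that turns vmstat output into a flat record. It assumes the standard procps layout, where a header row starting with "r" names the columns. The function names are hypothetical, and the original article does not show its own script.

```python
import subprocess
from typing import Dict


def parse_vmstat(output: str) -> Dict[str, int]:
    """Map vmstat column names (r, b, free, us, id, ...) to the last sample row."""
    rows = [line.split() for line in output.splitlines() if line.strip()]
    # The column header is the row whose first field is "r" (run queue).
    header = next((cols for cols in rows if cols[0] == "r"), None)
    if header is None:
        raise ValueError("vmstat header row not found")
    values = rows[-1]
    if len(values) != len(header):
        raise ValueError("vmstat header and sample row differ in length")
    return {name: int(value) for name, value in zip(header, values)}


def collect_vmstat() -> Dict[str, int]:
    """Run vmstat and parse it; requires vmstat to be installed on the host."""
    # "vmstat 1 2" prints two samples. The first is an average since boot,
    # so the last row is the one that reflects current load.
    result = subprocess.run(
        ["vmstat", "1", "2"], capture_output=True, text=True, check=True
    )
    return parse_vmstat(result.stdout)
```

A script like this could print the record as JSON, and a Fluentd exec source with a JSON parser could then run it on an interval.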
4.5 Parsing
JSON parser for structured logs; regex parser for Apache/Nginx logs; generic parser configuration examples.
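To make the regex approach concrete, the sketch below parses one line in Nginx's default "combined" access log format using named groups, much as a Fluentd regexp parser would. The pattern and field names are assumptions for illustration; a custom log_format needs its own expression.

```python
import re
from datetime import datetime
from typing import Any, Dict, Optional

# Nginx "combined" format:
# $remote_addr - $remote_user [$time_local] "$request" $status
# $body_bytes_sent "$http_referer" "$http_user_agent"
NGINX_COMBINED = re.compile(
    r'(?P<remote>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_nginx_line(line: str) -> Optional[Dict[str, Any]]:
    """Return a structured record, or None when the line does not match."""
    match = NGINX_COMBINED.match(line)
    if match is None:
        return None
    record: Dict[str, Any] = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    # Convert "10/Oct/2023:13:55:36 +0000" into an ISO 8601 timestamp.
    parsed = datetime.strptime(record.pop("time"), "%d/%b/%Y:%H:%M:%S %z")
    record["@timestamp"] = parsed.isoformat()
    return record
```

Returning None for lines that do not match lets the caller route them to a separate tag instead of dropping them silently.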
5. Index Management and Optimization
5.1 Index Templates
{
  "index_patterns": ["app-logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "30s",
      "index.codec": "best_compression"
    },
    "mappings": {
      "properties": {
        "@timestamp": {"type": "date"},
        "level": {"type": "keyword"},
        "message": {"type": "text", "analyzer": "standard"},
        "hostname": {"type": "keyword"}
      }
    }
  }
}
5.2 ILM Policy
{
  "policy": {
    "phases": {
      "hot": {"actions": {"rollover": {"max_size": "50gb", "max_age": "7d"}}},
      "warm": {"min_age": "7d", "actions": {"allocate": {"number_of_replicas": 0}}},
      "cold": {"min_age": "30d", "actions": {"allocate": {"number_of_replicas": 0}}},
      "delete": {"min_age": "90d", "actions": {"delete": {}}}
    }
  }
}
5.3 Performance Optimizations
Shard size 10‑50 GB, shard count = data nodes × 1‑3.
Replica count based on availability requirements.
Query optimization: put exact‑match and time‑range clauses (for example level = ERROR over a recent window) in filter context, which skips scoring and lets Elasticsearch cache the result.
Compression settings (index.codec: best_compression) and refresh interval.
Cache tuning (indices.memory.index_buffer_size: 30%, indices.fielddata.cache.size: 20%, indices.queries.cache.size: 10%).
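A query for recent ERROR logs might look like the following sketch, which builds the request body as a Python dictionary. It assumes the level and @timestamp fields from the index template in section 5.1. The function name and default window are illustrative.

```python
import json
from typing import Any, Dict


def recent_errors_query(minutes: int = 15, size: int = 100) -> Dict[str, Any]:
    """Build a search body for ERROR entries from the last `minutes` minutes."""
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                # Filter context: no relevance scoring, results are cacheable.
                "filter": [
                    {"term": {"level": "ERROR"}},
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m", "lte": "now"}}},
                ]
            }
        },
    }


def as_request_body(query: Dict[str, Any]) -> str:
    """Serialize the body for a POST to /app-logs-*/_search."""
    return json.dumps(query)
```

The term clause relies on level being mapped as keyword. On an analyzed text field the stored tokens would be lowercased and the exact match would fail.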
6. Monitoring and Alerting
6.1 System Monitoring
Elasticsearch health, node stats and index stats via curl commands.
Fluentd monitor_agent input on port 24220, which exposes per‑plugin metrics such as buffer queue length and retry count as JSON at /api/plugins.json.
6.2 Alert Configuration
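An Elasticsearch Watcher definition (Watcher is an Elasticsearch X‑Pack feature that Kibana provides a management UI for) sets a 1‑minute schedule, searches app-logs-* for ERROR‑level entries from the last 5 minutes, and sends an email when hits > 10.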
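The watch described above could be structured roughly as follows. This is a sketch of the body that would be sent to PUT _watcher/watch/<id>. The action name, recipient address and message text are placeholders, and sending mail also requires an email account configured in Elasticsearch and a license tier that includes Watcher.

```python
from typing import Any, Dict


def error_rate_watch(threshold: int = 10, recipient: str = "ops@example.com") -> Dict[str, Any]:
    """Build a watch that emails when recent ERROR hits exceed `threshold`."""
    return {
        # Run the watch once per minute.
        "trigger": {"schedule": {"interval": "1m"}},
        "input": {
            "search": {
                "request": {
                    "indices": ["app-logs-*"],
                    "body": {
                        "size": 0,  # only the hit count is needed
                        "query": {
                            "bool": {
                                "filter": [
                                    {"term": {"level": "ERROR"}},
                                    {"range": {"@timestamp": {"gte": "now-5m"}}},
                                ]
                            }
                        },
                    },
                }
            }
        },
        # Fire only when the count is strictly greater than the threshold.
        "condition": {"compare": {"ctx.payload.hits.total": {"gt": threshold}}},
        "actions": {
            "email_ops": {  # hypothetical action name
                "email": {
                    "to": recipient,
                    "subject": "High error rate in app-logs",
                    "body": "{{ctx.payload.hits.total}} ERROR entries in the last 5 minutes",
                }
            }
        },
    }
```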
6.3 Performance Metrics
Indexing rate (docs/sec).
Query latency (ms).
Heap usage %.
Disk usage %.
Network I/O.
Monitoring script es_monitor.sh checks heap and disk usage thresholds and prints warnings.
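The original es_monitor.sh is not reproduced here. The threshold logic it describes can be sketched in Python against the JSON returned by GET /_nodes/stats/jvm,fs. The 85% heap and 80% disk limits are hypothetical defaults, not values from the original script.

```python
from typing import Any, Dict, List


def check_nodes(stats: Dict[str, Any], heap_limit: int = 85, disk_limit: int = 80) -> List[str]:
    """Return one warning string per node metric that exceeds its limit."""
    warnings: List[str] = []
    for node in stats.get("nodes", {}).values():
        name = node.get("name", "unknown")
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        fs_total = node["fs"]["total"]
        used_bytes = fs_total["total_in_bytes"] - fs_total["available_in_bytes"]
        disk_pct = 100.0 * used_bytes / fs_total["total_in_bytes"]
        if heap_pct > heap_limit:
            warnings.append(f"{name}: heap {heap_pct}% > {heap_limit}%")
        if disk_pct > disk_limit:
            warnings.append(f"{name}: disk {disk_pct:.0f}% > {disk_limit}%")
    return warnings
```

Keeping the check as a pure function over the parsed response makes it possible to test the thresholds without a running cluster.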
7. Security Configuration
7.1 Access Control
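Enable X‑Pack security in elasticsearch.yml (xpack.security.enabled: true), configure the TLS keystore/truststore, and create users and roles: for example, a service account for the Kibana server holding the built‑in kibana_system role, and a log_reader role granting the read privilege on app-logs-*. The kibana_system role is meant for the Kibana server's own connection, not for human users.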
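As a rough illustration, the request bodies for a read‑only role and a user that holds it can be built as below. The endpoints named in the comments are the 7.x security APIs. The user name and function names are placeholders, and the password is passed in by the caller rather than hard‑coded.

```python
from typing import Any, Dict, List


def log_reader_role(index_pattern: str = "app-logs-*") -> Dict[str, Any]:
    """Body for PUT /_security/role/log_reader: read-only access to log indices."""
    return {
        "indices": [
            {
                "names": [index_pattern],
                # "read" allows search and get; "view_index_metadata" lets
                # Kibana list the indices and their field mappings.
                "privileges": ["read", "view_index_metadata"],
            }
        ]
    }


def user_body(password: str, roles: List[str]) -> Dict[str, Any]:
    """Body for PUT /_security/user/<username>."""
    if len(password) < 6:
        raise ValueError("Elasticsearch requires passwords of at least 6 characters")
    return {"password": password, "roles": list(roles)}
```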
7.2 Network Security
Open required ports (9200, 9300, 5601, 24224) via firewall-cmd; configure SSL/TLS for Kibana‑Elasticsearch communication.
7.3 Data Encryption
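Transport encryption settings for Elasticsearch clients and nodes (xpack.security.http.ssl.* and xpack.security.transport.ssl.*). Note that xpack.security.encryptionKey is a Kibana setting that protects session data, not stored indices. Elasticsearch itself has no built‑in switch for at‑rest encryption, so protect data on disk with volume‑ or filesystem‑level encryption such as dm-crypt.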
8. Troubleshooting
8.1 Common Issues
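Elasticsearch fails to start – check the files under logs/, validate config/elasticsearch.yml, confirm the bootstrap checks pass (vm.max_map_count, file descriptor limits) and verify the JVM.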
Fluentd does not collect – validate the config with fluentd --dry-run -c fluent.conf, check read permissions on the log files and the pos_file, and test network connectivity to Elasticsearch.
8.2 Performance Problems
Enable the search slow log (index.search.slowlog.threshold.* settings) to find expensive queries; raise indices.memory.index_buffer_size to improve indexing throughput.
8.3 Data Recovery
Create snapshot repository (FS type) and take snapshots; restore with /_snapshot/backup/snapshot_1/_restore.
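The snapshot workflow comes down to three REST calls, sketched below as (method, path, body) tuples. The repository location is a placeholder and must also be listed under path.repo in elasticsearch.yml on every node. The rename settings are one way to avoid clashing with an open index of the same name during restore.

```python
from typing import Any, Dict, List, Tuple

Request = Tuple[str, str, Dict[str, Any]]


def snapshot_requests(
    repo: str = "backup",
    snapshot: str = "snapshot_1",
    location: str = "/mnt/es-backups",  # hypothetical shared filesystem path
    indices: str = "app-logs-*",
) -> List[Request]:
    """Return the register, snapshot and restore calls in execution order."""
    return [
        # 1. Register a shared-filesystem repository.
        ("PUT", f"/_snapshot/{repo}", {"type": "fs", "settings": {"location": location}}),
        # 2. Take a snapshot of the log indices and wait until it finishes.
        (
            "PUT",
            f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true",
            {"indices": indices, "include_global_state": False},
        ),
        # 3. Restore under a new name so existing open indices are untouched.
        (
            "POST",
            f"/_snapshot/{repo}/{snapshot}/_restore",
            {
                "indices": indices,
                "include_global_state": False,
                "rename_pattern": "(.+)",
                "rename_replacement": "restored-$1",
            },
        ),
    ]
```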
9. Best Practices
9.1 Architecture Design
Separate collection, aggregation, storage and presentation layers.
Scale each layer independently based on load.
9.2 Capacity Planning
Estimate daily log volume, retention period and query concurrency.
Plan storage and shard count accordingly.
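A back‑of‑the‑envelope version of this planning step is sketched below. It assumes daily indices, a target primary shard size inside the 10‑50 GB range from section 5.3, and an index_overhead factor for the ratio of on‑disk index size to raw log size. That factor varies widely with mappings and compression, so measure it on real data. All names and defaults are illustrative.

```python
import math
from typing import Dict


def plan_capacity(
    daily_gb: float,
    retention_days: int,
    replicas: int = 1,
    target_shard_gb: float = 30.0,
    index_overhead: float = 1.0,
) -> Dict[str, float]:
    """Estimate cluster storage and primary shards per daily index."""
    indexed_per_day = daily_gb * index_overhead
    primary_storage = indexed_per_day * retention_days
    # Each replica is a full extra copy of the primary data.
    total_storage = primary_storage * (1 + replicas)
    shards_per_index = max(1, math.ceil(indexed_per_day / target_shard_gb))
    return {
        "primary_storage_gb": primary_storage,
        "total_storage_gb": total_storage,
        "primary_shards_per_daily_index": shards_per_index,
        "total_shards": shards_per_index * (1 + replicas) * retention_days,
    }
```

For example, 100 GB of raw logs per day with a 1.25 overhead, 30 days of retention and one replica comes to 7,500 GB of storage and 5 primary shards per daily index.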
9.3 Configuration Tuning
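Production Elasticsearch settings: bootstrap.memory_lock: true, indices.memory.index_buffer_size: 30%, thread pool sizes. On Elasticsearch 6.x and earlier, also set discovery.zen.minimum_master_nodes: 2 for a cluster with three master‑eligible nodes; 7.x ignores this setting and manages the voting quorum itself, using cluster.initial_master_nodes only for the first bootstrap.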
Fluentd system workers, file buffer with flush interval and chunk size.
9.4 Operational Standards
Standard JSON log schema (timestamp, level, service, message, userId, ip, traceId).
Index naming conventions (e.g., app-logs-YYYY.MM.DD).
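On the application side, these two conventions can be applied with a few lines of standard‑library Python. The sketch below emits the schema fields listed above and derives the daily index name. The class and function names are hypothetical, and optional fields are taken from the extra argument of the logging call.

```python
import json
import logging
from datetime import date, datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    OPTIONAL_FIELDS = ("userId", "ip", "traceId")

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        doc = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed via logger.error(..., extra={...}) become record attributes.
        for key in self.OPTIONAL_FIELDS:
            value = getattr(record, key, None)
            if value is not None:
                doc[key] = value
        return json.dumps(doc)


def index_name(day: date, prefix: str = "app-logs") -> str:
    """Return the daily index name, e.g. app-logs-2024.01.05."""
    return f"{prefix}-{day:%Y.%m.%d}"
```

A handler configured with JsonFormatter("checkout") and a call such as logger.error("payment failed", extra={"traceId": "abc123"}) then produces lines that a Fluentd JSON parser can ingest without further transformation.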
9.5 Monitoring and Alerting
Cluster health, node status, index health, query latency.
Alert on red/yellow cluster state, disk pressure, high query latency, abnormal log volume.
10. Conclusion
EFK provides a complete, scalable and secure log management platform. Proper architecture, configuration, monitoring and security practices enable reliable log collection, analysis and alerting that support modern cloud‑native applications.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
