Mastering EFK: Complete Guide to Building a Scalable Log Management Solution

This comprehensive guide walks you through building a scalable EFK log management solution, covering architecture components, high‑availability design, environment preparation, detailed Elasticsearch, Fluentd and Kibana deployment steps, index optimization, monitoring, alerting, security hardening, troubleshooting and best‑practice recommendations for modern cloud‑native operations.

Raymond Ops

1. Overview

EFK (Elasticsearch, Fluentd, Kibana) is a complete log collection, storage, analysis and visualization stack widely used in micro‑service and cloud‑native environments.

1.1 Components

Elasticsearch : distributed search and analytics engine for storing and retrieving log data.

Fluentd : open‑source data collector that gathers, filters and forwards logs.

Kibana : web UI for visualizing, querying and monitoring logs.

1.2 Technical Advantages

Unified log management across distributed systems.

Near‑real‑time search and analysis.

Rich visual dashboards.

High availability with cluster deployment and failover.

Scalable architecture that grows with business needs.

2. System Architecture Design

2.1 Overall Flow

Application → Fluentd Agent → Kafka/Redis → Fluentd Aggregator → Elasticsearch → Kibana

2.2 Layered Structure

Data source layer : application logs, system logs, container logs, network device logs.

Collection layer : Fluentd agents on each node collect logs locally and perform initial filtering; support multiple input formats.

Buffer layer : Kafka or Redis provides buffering and back‑pressure handling.

Aggregation layer : Fluentd aggregator performs data cleaning, formatting and routing.

Storage layer : Elasticsearch cluster with index management, sharding and backup.

Presentation layer : Kibana visualizes data with dashboards, alerts and custom panels.

2.3 High‑Availability Design

Elasticsearch : multi‑node cluster, master‑eligible nodes, data nodes, replica configuration and automatic failover.

Fluentd : multiple agent instances, load‑balancing, health checks and retry mechanisms.

Kibana : multiple instances behind a load balancer with session persistence.

3. Environment Preparation and Deployment

3.1 System Requirements

CPU ≥ 8 cores, Memory ≥ 16 GB, SSD ≥ 500 GB, 1 Gbps network.

OS: CentOS 7/8, Ubuntu 18.04/20.04; Java ≥ OpenJDK 11; Docker ≥ 19.03 (optional); Kubernetes ≥ 1.18 (optional).

3.2 Elasticsearch Deployment

Single‑node :

# Download and extract
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.15.0-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.15.0-linux-x86_64.tar.gz
cd elasticsearch-7.15.0/

# Configuration (config/elasticsearch.yml)
cluster.name: efk-cluster
node.name: es-node-1
network.host: 0.0.0.0
http.port: 9200
discovery.type: single-node

# Start
./bin/elasticsearch

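Cluster deployment – master and data node configuration snippets include cluster.name, node.roles (the replacement for the legacy node.master and node.data flags, which are deprecated since 7.9), network.host, discovery.seed_hosts and, for the very first cluster start only, cluster.initial_master_nodes.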

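Performance tuning – JVM options ( -Xms8g -Xmx8g -XX:+UseG1GC …) for a 16 GB host: keep the heap at no more than half of physical RAM so the remainder is available to the filesystem cache, and below roughly 32 GB so compressed object pointers stay enabled. System parameters ( vm.max_map_count=262144, fs.file-max=65536, etc.).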
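The heap sizing rule can be sketched as a small helper. This is a minimal illustration, assuming the common guidance of "half of RAM, capped below about 32 GB"; the function names and the 31 GB ceiling are choices made for this example, not values from the original article.

```python
def recommended_heap_gb(total_ram_gb: float, ceiling_gb: int = 31) -> int:
    """Return a heap size in whole GB: half of RAM, capped at ceiling_gb.

    The ceiling keeps the heap under the ~32 GB threshold above which the
    JVM can no longer use compressed object pointers (assumed value).
    """
    if total_ram_gb <= 0:
        raise ValueError("total_ram_gb must be positive")
    half = int(total_ram_gb // 2)
    # Never return less than 1 GB, never more than the ceiling.
    return max(1, min(half, ceiling_gb))


def jvm_heap_flags(total_ram_gb: float) -> str:
    """Format matching -Xms/-Xmx flags so the heap never resizes at runtime."""
    heap = recommended_heap_gb(total_ram_gb)
    return f"-Xms{heap}g -Xmx{heap}g"
```

For the 16 GB minimum listed in section 3.1, `jvm_heap_flags(16)` yields flags for an 8 GB heap; the output belongs in config/jvm.options or a file under config/jvm.options.d/.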

3.3 Fluentd Deployment

Installation via official script or gem, followed by required plugins ( fluent-plugin-elasticsearch, fluent-plugin-kubernetes_metadata_filter, fluent-plugin-rewrite-tag-filter).

Core fluent.conf example shows a tail source for JSON logs, a record transformer adding hostname, timestamp and environment, and an Elasticsearch match with index, logstash format and retry settings.

Docker compose example runs Fluentd container with volume mounts for configuration and logs, exposing port 24224.

3.4 Kibana Deployment

Download, extract and configure kibana.yml (port 5601, host 0.0.0.0, Elasticsearch hosts, index name, logging).

Docker compose runs Kibana container, linking to Elasticsearch and exposing port 5601.

4. Log Collection Strategies

4.1 Application Logs

File‑based collection using tail source, JSON or custom parsers, and optional regex parsing for Nginx logs.

4.2 Container Logs

Use the kubernetes_metadata filter (provided by fluent-plugin-kubernetes_metadata_filter) to enrich logs with pod, namespace and node information.

4.3 System Logs

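Syslog source on UDP 514, with record transformer adding source_type and hostname. Port 514 is privileged, so Fluentd must either run with sufficient privileges or receive traffic redirected to an unprivileged port such as the syslog input's default, 5140.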

4.4 Metrics Collection

Exec source runs vmstat every minute and emits fields such as CPU, memory, I/O.

4.5 Parsing

JSON parser for structured logs; regex parser for Apache/Nginx logs; generic parser configuration examples.
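To make the regex approach concrete, here is a minimal Python sketch that parses one line in Nginx's default "combined" access log format. The pattern and field names are written for this example and are not the expression from the original article; a Fluentd regexp parser would use an equivalent expression with named captures.

```python
import re
from datetime import datetime

# Named groups mirror the fields of Nginx's default "combined" log format.
NGINX_COMBINED = re.compile(
    r'(?P<remote>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<code>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_nginx_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = NGINX_COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["code"] = int(record["code"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    # Convert e.g. "10/Oct/2023:13:55:36 +0000" to ISO 8601 for @timestamp.
    parsed = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["time"] = parsed.isoformat()
    return record
```

Returning None for non-matching lines keeps malformed entries from silently producing half-filled records; a pipeline can route those to a separate tag for inspection.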

5. Index Management and Optimization

5.1 Index Templates

{
  "index_patterns": ["app-logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "30s",
      "index.codec": "best_compression"
    },
    "mappings": {
      "properties": {
        "@timestamp": {"type":"date"},
        "level": {"type":"keyword"},
        "message": {"type":"text","analyzer":"standard"},
        "hostname": {"type":"keyword"}
      }
    }
  }
}

5.2 ILM Policy

{
  "policy": {
    "phases": {
      "hot": {"actions": {"rollover": {"max_size":"50gb","max_age":"7d"}}},
      "warm": {"min_age":"7d","actions": {"allocate": {"number_of_replicas":0}}},
      "cold": {"min_age":"30d","actions": {"allocate": {"number_of_replicas":0}}},
      "delete": {"min_age":"90d","actions": {"delete": {}}}
    }
  }
}
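A policy like this is registered through the ILM API. The sketch below builds the PUT request with Python's standard library, without sending it. The cluster URL and the policy name app-logs-policy are assumptions for illustration, and the delete phase carries an explicit delete action, which ILM needs in order to actually remove indices.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed address of the cluster


def put_json(path, body):
    """Build (but do not send) a PUT request carrying a JSON body."""
    return urllib.request.Request(
        url=f"{ES_URL}{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


ilm_policy = {
    "policy": {
        "phases": {
            "hot": {"actions": {"rollover": {"max_size": "50gb", "max_age": "7d"}}},
            "warm": {"min_age": "7d", "actions": {"allocate": {"number_of_replicas": 0}}},
            "cold": {"min_age": "30d", "actions": {"allocate": {"number_of_replicas": 0}}},
            # Without an explicit delete action the phase does nothing.
            "delete": {"min_age": "90d", "actions": {"delete": {}}},
        }
    }
}

# "app-logs-policy" is a hypothetical policy name.
policy_request = put_json("/_ilm/policy/app-logs-policy", ilm_policy)
# urllib.request.urlopen(policy_request)  # run only against a reachable cluster
```

For the policy to take effect, the index template's settings also need to reference it through index.lifecycle.name, and rollover additionally requires a write alias named in index.lifecycle.rollover_alias.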

5.3 Performance Optimizations

Shard size 10‑50 GB, shard count = data nodes × 1‑3.

Replica count based on availability requirements.

Query optimization: prefer filter clauses, which skip relevance scoring and can be cached; the example JSON query retrieves recent ERROR logs.

Compression settings ( index.codec: best_compression) and refresh interval.

Cache tuning ( indices.memory.index_buffer_size: 30%, indices.fielddata.cache.size: 20%, indices.queries.cache.size: 10%).
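A query for recent ERROR logs might look like the following sketch, written as a Python function that returns the request body. The time window and result size are illustrative defaults, not values from the original article. Both conditions sit in filter context, so Elasticsearch does not compute relevance scores for them.

```python
def recent_errors_query(minutes=15, size=50):
    """Build a search body for ERROR entries from the last `minutes` minutes."""
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                # Filter clauses are not scored and are eligible for caching.
                "filter": [
                    {"term": {"level": "ERROR"}},
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
    }
```

The term filter relies on level being mapped as keyword, as in the index template in section 5.1; the body would be sent to app-logs-*/_search.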

6. Monitoring and Alerting

6.1 System Monitoring

Elasticsearch health, node stats and index stats via curl commands.

Fluentd monitoring agent on port 24220 with JSON log format.
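Fluentd's monitor_agent serves plugin metrics as JSON over HTTP, which makes buffer backlogs easy to check from a script. The sketch below assumes the agent listens on port 24220 as described above and that output plugins report buffer_queue_length and retry_count; the thresholds and function names are hypothetical.

```python
import json
import urllib.request


def fetch_plugin_metrics(host="localhost", port=24220, timeout=5.0):
    """Read plugin metrics from a running monitor_agent endpoint."""
    url = f"http://{host}:{port}/api/plugins.json"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def backlogged_outputs(metrics, max_queue_length=100, max_retries=0):
    """Return the IDs of plugins whose buffers are queueing or retrying."""
    flagged = []
    for plugin in metrics.get("plugins", []):
        # Input and filter plugins report no buffer metrics; treat as zero.
        queue = plugin.get("buffer_queue_length") or 0
        retries = plugin.get("retry_count") or 0
        if queue > max_queue_length or retries > max_retries:
            flagged.append(plugin.get("plugin_id", "unknown"))
    return flagged
```

A growing queue or a non-zero retry count on the Elasticsearch output usually means the storage layer is not keeping up, which is the signal the buffer layer in section 2.2 is meant to absorb.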

6.2 Alert Configuration

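The Watcher JSON (Watcher is an Elasticsearch X‑Pack feature managed from Kibana and is not included in the free Basic license) defines a 1‑minute schedule, searches app-logs-* for ERROR level in the last 5 minutes, and sends email when hits > 10.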
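The watch described above can be sketched as a Python function that returns the watch body. The structure follows the Watcher format (trigger, input, condition, actions), but the action name notify_ops, the subject line and the recipient are placeholders, and sending email also requires an email account configured in Elasticsearch.

```python
def error_spike_watch(recipient, threshold=10):
    """Build a watch that emails when recent ERROR hits exceed `threshold`."""
    return {
        "trigger": {"schedule": {"interval": "1m"}},
        "input": {
            "search": {
                "request": {
                    "indices": ["app-logs-*"],
                    "body": {
                        "size": 0,  # only the hit count is needed
                        "query": {
                            "bool": {
                                "filter": [
                                    {"term": {"level": "ERROR"}},
                                    {"range": {"@timestamp": {"gte": "now-5m"}}},
                                ]
                            }
                        },
                    },
                }
            }
        },
        "condition": {"compare": {"ctx.payload.hits.total": {"gt": threshold}}},
        "actions": {
            "notify_ops": {  # hypothetical action name
                "email": {
                    "to": recipient,
                    "subject": "ERROR spike in app-logs",
                    "body": "{{ctx.payload.hits.total}} ERROR entries in the last 5 minutes",
                }
            }
        },
    }
```

The resulting body would be registered with a PUT to _watcher/watch/<watch_id>. Because the schedule runs every minute over a five-minute window, the same burst can trigger several emails; a throttle period on the action is worth considering.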

6.3 Performance Metrics

Indexing rate (docs/sec).

Query latency (ms).

Heap usage %.

Disk usage %.

Network I/O.

Monitoring script es_monitor.sh checks heap and disk usage thresholds and prints warnings.
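The es_monitor.sh script itself is not reproduced here. As a rough Python equivalent of the same idea, the sketch below takes the parsed JSON from the _nodes/stats API and returns warning lines. The 85% thresholds are assumptions, since the article does not state the values the script uses.

```python
def check_node_stats(stats, heap_limit=85, disk_limit=85):
    """Return warning strings for nodes above the heap or disk thresholds."""
    warnings = []
    for node in stats.get("nodes", {}).values():
        name = node.get("name", "unknown")
        heap = node["jvm"]["mem"]["heap_used_percent"]
        fs_total = node["fs"]["total"]
        # Disk usage is derived from available vs. total bytes.
        used_fraction = 1 - fs_total["available_in_bytes"] / fs_total["total_in_bytes"]
        disk = 100 * used_fraction
        if heap > heap_limit:
            warnings.append(f"WARNING {name}: heap usage {heap}%")
        if disk > disk_limit:
            warnings.append(f"WARNING {name}: disk usage {disk:.0f}%")
    return warnings
```

The input would typically come from a GET on _nodes/stats/jvm,fs; printing the returned lines reproduces the "check thresholds and print warnings" behavior described above.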

7. Security Configuration

7.1 Access Control

Enable X‑Pack security in elasticsearch.yml ( xpack.security.enabled: true), configure TLS keystore/truststore, create users and roles (e.g., a kibana_user account with the kibana_system role, and a log_reader role with the read index privilege on app-logs-*).

7.2 Network Security

Open required ports (9200, 9300, 5601, 24224) via firewall-cmd; configure SSL/TLS for Kibana‑Elasticsearch communication.

7.3 Data Encryption

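Transport encryption settings for Elasticsearch clients. Note that xpack.security.encryptionKey is a Kibana setting that protects session data, not stored indices; Elasticsearch has no built‑in at‑rest encryption, so protect the data volumes with disk‑level encryption such as dm‑crypt.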

8. Troubleshooting

8.1 Common Issues

Elasticsearch fails to start – check the logs for failed bootstrap checks (such as vm.max_map_count or file descriptor limits), review elasticsearch.yml for syntax errors, verify JVM.

Fluentd does not collect – dry‑run config, check file permissions, test network connectivity.

8.2 Performance Problems

Enable slow‑query logging; adjust indices.memory.index_buffer_size to improve indexing throughput.

8.3 Data Recovery

Create snapshot repository (FS type) and take snapshots; restore with /_snapshot/backup/snapshot_1/_restore.
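The three snapshot steps map onto three REST calls. The sketch below only constructs the requests with Python's standard library; the cluster URL, the repository path and the restored index pattern are assumptions for illustration.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed address of the cluster


def es_request(method, path, body=None):
    """Build (but do not send) a JSON request against the cluster."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url=f"{ES_URL}{path}",
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )


def snapshot_requests(location, repo="backup", snapshot="snapshot_1"):
    """Return the register, snapshot and restore requests, in order."""
    return [
        # 1. Register a shared-filesystem repository.
        es_request("PUT", f"/_snapshot/{repo}",
                   {"type": "fs", "settings": {"location": location}}),
        # 2. Take a snapshot and wait until it finishes.
        es_request("PUT", f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true"),
        # 3. Restore selected indices from that snapshot.
        es_request("POST", f"/_snapshot/{repo}/{snapshot}/_restore",
                   {"indices": "app-logs-*"}),
    ]
```

Two prerequisites are easy to miss: the repository location must be listed under path.repo in elasticsearch.yml on every node, and a restore fails if an open index with the same name already exists.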

9. Best Practices

9.1 Architecture Design

Separate collection, aggregation, storage and presentation layers.

Scale each layer independently based on load.

9.2 Capacity Planning

Estimate daily log volume, retention period and query concurrency.

Plan storage and shard count accordingly.
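A back-of-the-envelope estimate can be sketched as follows. The overhead factor, the 30 GB target shard size and the node count are assumptions chosen for this example; the shard check reuses the "data nodes × 1‑3" guideline from section 5.3.

```python
import math


def plan_capacity(daily_gb, retention_days, replicas=1, overhead=1.15,
                  target_shard_gb=30, data_nodes=3):
    """Estimate storage and primary shards for daily log indices.

    overhead approximates indexing overhead on top of raw log size
    (assumed factor); target_shard_gb sits inside the 10-50 GB range.
    """
    primary_gb = daily_gb * retention_days * overhead
    total_gb = primary_gb * (1 + replicas)
    shards_per_index = max(1, math.ceil(daily_gb * overhead / target_shard_gb))
    return {
        "primary_storage_gb": round(primary_gb, 1),
        "total_storage_gb": round(total_gb, 1),
        "storage_per_node_gb": round(total_gb / data_nodes, 1),
        "primary_shards_per_daily_index": shards_per_index,
        # Guideline from section 5.3: shard count = data nodes x 1-3.
        "within_shard_guideline": shards_per_index <= data_nodes * 3,
    }
```

Calling `plan_capacity(daily_gb=100, retention_days=30)` gives a first estimate to compare against the 500 GB per-node minimum from section 3.1; real numbers depend heavily on mappings and compression, so measure on a sample of production logs before committing to hardware.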

9.3 Configuration Tuning

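Production Elasticsearch settings: bootstrap.memory_lock: true, indices.memory.index_buffer_size: 30%, thread pool sizes. On Elasticsearch 7.x, discovery.zen.minimum_master_nodes is ignored because the cluster manages its own voting quorum; set cluster.initial_master_nodes when bootstrapping a new cluster instead.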

Fluentd system workers, file buffer with flush interval and chunk size.

9.4 Operational Standards

Standard JSON log schema (timestamp, level, service, message, userId, ip, traceId).

Index naming conventions (e.g., app-logs-YYYY.MM.DD).
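On the application side, both conventions can be sketched in a few lines of Python. The formatter emits the seven schema fields listed above as one JSON object per line, and a helper derives the daily index name. The class and function names are hypothetical, and fields missing from a log call are emitted as null.

```python
import json
import logging
from datetime import datetime, timezone


class JsonLogFormatter(logging.Formatter):
    """Format log records as single-line JSON using the standard schema."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            # Optional context, supplied through logging's `extra` argument.
            "userId": getattr(record, "userId", None),
            "ip": getattr(record, "ip", None),
            "traceId": getattr(record, "traceId", None),
        }
        return json.dumps(entry)


def daily_index_name(day, prefix="app-logs"):
    """Return an index name such as app-logs-2024.03.05."""
    return f"{prefix}-{day:%Y.%m.%d}"
```

A handler using this formatter writes lines that Fluentd's JSON parser can ingest without a regex, for example via `logger.error("payment failed", extra={"userId": "u-1", "traceId": "abc"})`.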

9.5 Monitoring and Alerting

Cluster health, node status, index health, query latency.

Alert on red/yellow cluster state, disk pressure, high query latency, abnormal log volume.

10. Conclusion

EFK provides a complete, scalable and secure log management platform. Proper architecture, configuration, monitoring and security practices enable reliable log collection, analysis and alerting that support modern cloud‑native applications.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
