
Master Kubernetes Log Collection: From Basics to Advanced EFK & Loki Solutions

This comprehensive guide explains why log management is critical for large Kubernetes clusters, outlines common pain points, presents full‑stack architectures, details EFK and Loki implementations with code samples, and offers performance, security, cost‑optimization, and future‑trend recommendations.

Ops Community

Understanding Kubernetes Cluster Log Collection and Analysis

Author: Senior operations engineer with 8 years of large‑scale distributed system experience.

Introduction

With micro‑services becoming the norm, Kubernetes is the de‑facto container orchestration platform, but as clusters grow, log management becomes a critical challenge. Traditional SSH log inspection cannot keep up with hundreds of pods across dozens of nodes, making a robust log collection and analysis system essential.

Kubernetes Log Management Pain Points

1. Log dispersion

Container logs stored in /var/lib/docker/containers/ (Docker runtime; containerd and CRI‑O write under /var/log/pods/)

System component logs (kubelet, kube‑proxy) scattered on each node

Application logs lost when pods are rescheduled

2. Log lifecycle issues

Logs are deleted with the pod when it is removed or evicted; an in‑place container restart keeps only the previous container's log

Node failures make historical logs inaccessible

Container crashes may not flush logs in time

3. Log volume

A single micro‑service can generate gigabytes of logs per day

A whole cluster may produce terabytes of logs

Storage cost and query performance must be considered
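The volume figures above translate into a storage budget with simple arithmetic. The sketch below is a rough estimate only: the function name, the service count, and the per‑service rate are hypothetical inputs, and it ignores compression and index overhead.

```python
def retention_storage_gb(services, gb_per_service_per_day, retention_days, copies=1):
    """Raw storage (GB) needed to keep logs for retention_days.

    copies is the number of stored copies of each log line
    (for example 2 for one primary plus one replica).
    """
    daily_gb = services * gb_per_service_per_day
    return daily_gb * retention_days * copies


# Hypothetical cluster: 200 services, 2 GB/day each, 30 days retention, 2 copies.
print(retention_storage_gb(200, 2.0, 30, copies=2), "GB")
```

Even modest per‑service rates reach tens of terabytes once retention and replication are included, which is why the retention and tiering choices later in this guide matter.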

Log Architecture Overview

┌─────────────────────────────────────────┐
│           Application Layer Logs        │
├─────────────────────────────────────────┤
│           Platform Layer Logs            │
├─────────────────────────────────────────┤
│          Infrastructure Layer Logs       │
└─────────────────────────────────────────┘

Application layer logs : business‑level logs

Platform layer logs : Kubernetes components such as kube‑apiserver, scheduler

Infrastructure layer logs : node system logs and container runtime logs

Core Solution: EFK Stack

The EFK stack (Elasticsearch + Fluentd + Kibana) is one of the most mature and widely deployed solutions for Kubernetes logging.

Architecture Diagram

Pod1 ──┐
Pod2 ──┼── Fluentd ── Elasticsearch ── Kibana
Pod3 ──┘   (DaemonSet)   (Cluster)      (Visualization)

Fluentd DaemonSet Deployment

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
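      serviceAccountName: fluentd-elasticsearch
      tolerations:
      # Older clusters taint control-plane nodes with .../master; Kubernetes 1.24+ uses .../control-plane
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule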
      containers:
      - name: fluentd-elasticsearch
        image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: config-volume
          mountPath: /etc/fluent/config.d
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: config-volume
        configMap:
          name: fluentd-config

Key Fluentd configuration:

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

<match kubernetes.**>
  @type elasticsearch
  host elasticsearch-logging
  port 9200
  logstash_format true
  logstash_prefix kubernetes
  <buffer>
    @type file
    path /var/log/fluentd-buffers/kubernetes.system.buffer
    flush_mode interval
    retry_type exponential_backoff
    flush_thread_count 2
    flush_interval 5s
    retry_forever true
    retry_max_interval 30
    chunk_limit_size 2M
    queue_limit_length 8
  </buffer>
</match>
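The buffer settings above determine how long Fluentd waits between retries and how much data it can hold while Elasticsearch is unavailable. The following sketch works through that arithmetic. It assumes Fluentd's default retry_wait of 1s and backoff base of 2 (neither is set in the config above), ignores the random jitter Fluentd applies by default, and assumes the buffer limit is chunk_limit_size multiplied by queue_limit_length. The function names are hypothetical.

```python
def retry_waits(attempts, retry_wait=1.0, base=2.0, max_interval=30.0):
    """Nominal wait before each retry: retry_wait * base**n, capped at max_interval."""
    return [min(retry_wait * base ** n, max_interval) for n in range(attempts)]


def max_buffered_bytes(chunk_limit_bytes, queue_limit_length):
    """Upper bound on buffered data, assuming limit = chunk size * queue length."""
    return chunk_limit_bytes * queue_limit_length


# Values from the config: retry_max_interval 30, chunk_limit_size 2M, queue_limit_length 8.
print(retry_waits(7))
print(max_buffered_bytes(2 * 1024 * 1024, 8) // (1024 * 1024), "MiB")
```

Under these assumptions the buffer holds roughly 16 MiB per node, so a long Elasticsearch outage on a busy node can fill it quickly; sizing the buffer against expected outage duration is worth doing before production use.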

Elasticsearch Cluster (StatefulSet)

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch-logging
  namespace: kube-system
spec:
  serviceName: elasticsearch-logging
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch-logging
  template:
    metadata:
      labels:
        app: elasticsearch-logging
    spec:
      containers:
      - name: elasticsearch-logging
        image: docker.elastic.co/elasticsearch/elasticsearch:7.9.0
        resources:
          limits:
            cpu: 1000m
            memory: 3Gi
          requests:
            cpu: 100m
            memory: 3Gi
        ports:
        - containerPort: 9200
          name: db
          protocol: TCP
        - containerPort: 9300
          name: transport
          protocol: TCP
        env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: node.name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: cluster.initial_master_nodes
          value: "elasticsearch-logging-0,elasticsearch-logging-1,elasticsearch-logging-2"
        - name: discovery.seed_hosts
          value: "elasticsearch-logging"
        - name: cluster.name
          value: "k8s-logs"
        - name: network.host
          value: "0.0.0.0"
        - name: ES_JAVA_OPTS
          value: "-Xms1536m -Xmx1536m"
        volumeMounts:
        - name: elasticsearch-logging
          mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
  - metadata:
      name: elasticsearch-logging
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "fast-ssd"
      resources:
        requests:
          storage: 100Gi

Kibana Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana-logging
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana-logging
  template:
    metadata:
      labels:
        app: kibana-logging
    spec:
      containers:
      - name: kibana-logging
        image: docker.elastic.co/kibana/kibana:7.9.0
        resources:
          limits:
            cpu: 1000m
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 1Gi
        env:
        - name: ELASTICSEARCH_HOSTS
          value: http://elasticsearch-logging:9200
        ports:
        - containerPort: 5601
          name: ui
          protocol: TCP

Lightweight Alternative: Loki + Promtail

For medium‑size clusters, Grafana Loki offers a lower‑cost solution.

Loki Advantages

Low storage cost – only labels are indexed

Cloud‑native design – integrates with Prometheus and Grafana

Simple deployment with fewer components

Promtail DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: monitoring
spec:
  selector:
    matchLabels:
      name: promtail
  template:
    metadata:
      labels:
        name: promtail
    spec:
      serviceAccountName: promtail
      containers:
      - name: promtail
        image: grafana/promtail:2.4.0
        args:
        - -config.file=/etc/promtail/config.yml
        - -client.url=http://loki:3100/loki/api/v1/push
        volumeMounts:
        - name: config
          mountPath: /etc/promtail
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: config
        configMap:
          name: promtail-config
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

Promtail Configuration (excerpt)

server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  pipeline_stages:
  - docker: {}
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_controller_name]
    regex: ([0-9a-z-.]+?)(-[0-9a-f]{8,10})?
    target_label: __tmp_controller_name
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name,__meta_kubernetes_pod_label_app,__tmp_controller_name,__meta_kubernetes_pod_name]
    regex: ^;*([^;]+)(;.*)?$
    target_label: app
    replacement: $1
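The two relabel rules above are easier to follow with concrete values. The sketch below mimics them in Python under the assumption that source label values are joined with ";" and that the regex must match the whole joined string, as in Prometheus‑style relabeling. The function names and sample pod names are hypothetical, and the character class is written as [0-9a-z.-], which matches the same characters as the one in the config.

```python
import re

# Rule 1: strip a trailing hash such as "-5d4f8c7b9d" from the controller name.
CONTROLLER_RE = re.compile(r"([0-9a-z.-]+?)(-[0-9a-f]{8,10})?")
# Rule 2: pick the first non-empty value from the ";"-joined source labels.
FIRST_NON_EMPTY_RE = re.compile(r"^;*([^;]+)(;.*)?$")


def controller_base_name(controller_name):
    match = CONTROLLER_RE.fullmatch(controller_name)
    return match.group(1) if match else ""


def app_label(source_values):
    joined = ";".join(source_values)
    match = FIRST_NON_EMPTY_RE.fullmatch(joined)
    return match.group(1) if match else None  # None: target label left unset


tmp = controller_base_name("web-5d4f8c7b9d")
# No app.kubernetes.io/name or app label set, so the controller name wins.
print(app_label(["", "", tmp, "web-5d4f8c7b9d-x2k9p"]))
```

The effect is a fallback chain: app.kubernetes.io/name, then app, then the controller name, then the pod name.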

Advanced Features

Structured Log Standardization

{
  "timestamp":"2024-01-15T10:30:00Z",
  "level":"INFO",
  "service":"user-service",
  "trace_id":"abc123def456",
  "span_id":"789xyz",
  "message":"User login successful",
  "user_id":"12345",
  "ip":"192.168.1.100"
}
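One way an application could emit records in this shape is a custom formatter for Python's standard logging module. This is a minimal sketch, not a prescribed implementation: the JsonFormatter class name and the choice to pass context through extra= are assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    CONTEXT_FIELDS = ("trace_id", "span_id", "user_id", "ip")

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Copy optional context passed through `extra=` on the logging call.
        for key in self.CONTEXT_FIELDS:
            value = getattr(record, key, None)
            if value is not None:
                entry[key] = value
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("user-service"))
logger = logging.getLogger("user-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("User login successful", extra={"trace_id": "abc123def456", "user_id": "12345"})
```

Writing one JSON object per line keeps each event on a single line, so the collector can parse it without multiline handling.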

Multiline Log Handling (Java stack traces)

<source>
  @type tail
  path /var/log/containers/*java*.log
  pos_file /var/log/fluentd-java.log.pos
  tag kubernetes.java.*
  read_from_head true
  <parse>
    @type multiline
    format_firstline /^\d{4}-\d{2}-\d{2}/
    format1 /^(?<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+(?<level>[^\s]+)\s+(?<message>.*)/
  </parse>
</source>
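The idea behind format_firstline is that a line matching the pattern starts a new event and every other line is appended to the current one. The sketch below reproduces that grouping logic in Python for illustration only; it is not Fluentd's implementation, and the function name and sample lines are hypothetical.

```python
import re

# Same first-line pattern as the config above: a line starting with YYYY-MM-DD.
FIRST_LINE = re.compile(r"^\d{4}-\d{2}-\d{2}")


def group_multiline(lines):
    """Group raw lines into events; a first-line match starts a new event."""
    events, current = [], []
    for line in lines:
        if FIRST_LINE.match(line) and current:
            events.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        events.append("\n".join(current))
    return events


raw = [
    "2024-01-15 10:30:00.123 ERROR Request failed",
    "java.lang.IllegalStateException: boom",
    "    at com.example.App.main(App.java:10)",
    "2024-01-15 10:30:01.456 INFO Recovered",
]
for event in group_multiline(raw):
    print(repr(event))
```

With this input the stack trace stays attached to the ERROR line, and the INFO line becomes a separate event.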

Performance Optimization & Best Practices

Log Rotation & Cleanup

apiVersion: v1
kind: ConfigMap
metadata:
  name: logrotate-config
data:
  logrotate.conf: |
    /var/log/containers/*.log {
      daily
      missingok
      rotate 7
      compress
      delaycompress
      copytruncate
    }

Resource Limits & Monitoring

resources:
  limits:
    cpu: "1"
    memory: "2Gi"
  requests:
    cpu: "0.5"
    memory: "1Gi"

Key metrics to monitor:

Fluentd buffer usage

Elasticsearch cluster health

Log loss rate

Query response time
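Of these metrics, log loss rate is the one least likely to be exposed directly. A possible definition, assumed here rather than taken from any particular tool, is the fraction of emitted events that never reach the index over a time window. The function name and the counters are hypothetical.

```python
def log_loss_rate(emitted, indexed):
    """Fraction of emitted events missing from the index (0.0 when nothing was emitted)."""
    if emitted <= 0:
        return 0.0
    return max(emitted - indexed, 0) / emitted


# Hypothetical window: the collector reports 1000 events read, 990 were indexed.
print(f"{log_loss_rate(1000, 990):.2%}")
```

Both counters need to cover the same window and the same set of sources, otherwise in‑flight buffered events show up as false losses.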

Index Lifecycle Management

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "5GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": {
            "number_of_replicas": 0
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Security & Compliance

Sensitive Data Masking

<filter kubernetes.**>
  @type record_transformer
  # enable_ruby is required for Ruby method calls such as gsub inside ${...}
  enable_ruby true
  <record>
    message ${record["message"].gsub(/password=\w+/, "password=***")}
  </record>
</filter>
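The masking pattern can be checked offline against sample lines before it is deployed. The sketch below applies the same regular expression in Python (with re.ASCII so that \w covers only letters, digits, and underscore) and also shows a stricter variant. The function names and the stricter \S+ pattern are illustrative assumptions, not part of the original configuration.

```python
import re

# Same pattern as the Fluentd filter above.
PASSWORD_RE = re.compile(r"password=\w+", re.ASCII)
# Hypothetical stricter variant: mask everything up to the next whitespace.
PASSWORD_STRICT_RE = re.compile(r"password=\S+")


def mask(message):
    return PASSWORD_RE.sub("password=***", message)


def mask_strict(message):
    return PASSWORD_STRICT_RE.sub("password=***", message)


print(mask("login user=bob password=hunter2 ok"))
print(mask("password=p@ssw0rd"))         # \w+ stops at '@', leaking the rest
print(mask_strict("password=p@ssw0rd"))
```

The second call shows a limitation of \w+: passwords containing punctuation are only partially masked, so the pattern is worth testing against realistic values.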

Access Control (RBAC)

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
- apiGroups: [""]
  resources: ["pods","namespaces"]
  verbs: ["get","list","watch"]

Troubleshooting Cases

Case 1: Lost Application Logs

Symptoms : Logs disappear after pod restart.

Steps :

Check Fluentd buffer configuration.

Verify Elasticsearch cluster status.

Inspect log rotation policy.

Fix :

<buffer>
  @type file
  path /var/log/fluentd-buffers/kubernetes.system.buffer
  flush_mode immediate
  retry_type exponential_backoff
  retry_forever true
  chunk_limit_size 8MB
  flush_thread_count 8
</buffer>

Case 2: Slow Kibana Queries

Symptoms : Query response exceeds 30 seconds.

Resolution :

Optimize Elasticsearch index mapping.

Apply index lifecycle management.

Adjust JVM heap settings.
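For the heap adjustment, a common rule of thumb is to give the JVM about half of the container's memory and leave the rest to the filesystem cache. The helper below is a sketch of that rule only: the function name is hypothetical, and the 31 GiB ceiling is an assumed safety cap intended to stay below the compressed object pointer threshold.

```python
def es_java_opts(container_memory_mib, max_heap_mib=31 * 1024):
    """Build ES_JAVA_OPTS with heap = 50% of container memory, capped at max_heap_mib."""
    heap_mib = min(container_memory_mib // 2, max_heap_mib)
    return f"-Xms{heap_mib}m -Xmx{heap_mib}m"


# The StatefulSet in this guide sets a 3Gi memory limit.
print(es_java_opts(3 * 1024))
```

For a 3Gi container this yields the same value used in the StatefulSet earlier, -Xms1536m -Xmx1536m; raising the heap therefore means raising the container memory limit as well.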

Cost Optimization

Storage Tiering

Hot data on SSD (1‑7 days)

Warm data on HDD (7‑30 days)

Cold data in object storage (>30 days)
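The tiering rule above amounts to a lookup by log age. The sketch below encodes it, with one assumption labeled in the code: the source lists day 7 under both hot and warm, and this version places data in warm once it is 7 days old. The function name is hypothetical.

```python
def storage_tier(age_days):
    """Map log age in days to a storage tier."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < 7:       # assumption: day 7 itself counts as warm
        return "hot"       # SSD
    if age_days <= 30:
        return "warm"      # HDD
    return "cold"          # object storage


for age in (0, 7, 30, 31):
    print(age, storage_tier(age))
```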

Log Sampling

<filter kubernetes.**>
  @type sampling
  sampling_rate 10
  tag sampled.kubernetes
</filter>
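To see what a sampling rate of 10 implies for data volume, the sketch below keeps one event out of every ten. This is an illustration of interval‑based sampling in general, assuming the rate means "keep every Nth event"; it does not reproduce the behavior of any specific Fluentd plugin, and the function name is hypothetical.

```python
def sample_every(events, interval):
    """Keep every interval-th event (the 10th, 20th, ... when interval is 10)."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return [event for i, event in enumerate(events, start=1) if i % interval == 0]


events = [f"event-{n}" for n in range(1, 31)]
print(sample_every(events, 10))
```

Because sampling discards data permanently, it is usually safer to apply it only to high‑volume, low‑value streams such as debug or access logs, and to leave error logs unsampled.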

Selective Field Indexing

{
  "mappings": {
    "properties": {
      "@timestamp": {"type":"date"},
      "level": {"type":"keyword"},
      "message": {"type":"text","index":false}
    }
  }
}

Monitoring & Alerting

groups:
- name: logging.rules
  rules:
  - alert: FluentdDown
    expr: up{job="fluentd"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Fluentd is down"
  - alert: ElasticsearchClusterRed
    expr: elasticsearch_cluster_health_status{color="red"} == 1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Elasticsearch cluster status is RED"

Future Trends

Unified Observability

Logs + Metrics + Traces on a single platform

OpenTelemetry standardization

AI‑Assisted Operations

Intelligent anomaly detection

Automated root‑cause analysis

Predictive maintenance

Edge‑Computing Adaptation

Lightweight log collectors for edge nodes

Collaborative processing between edge and cloud

Conclusion

Kubernetes log management is a complex system engineering task that requires careful consideration of architecture design, technology selection, performance tuning, security, and cost control.

Technology choice: EFK for large clusters, Loki for small to medium‑size ones.

Architecture: DaemonSet agents, centralized storage.

Performance: Proper buffering and index lifecycle policies.

Cost: Tiered storage, sampling, selective indexing.

Security: Data masking and RBAC.

Adopt a gradual rollout: pilot core services, expand to all workloads, and continuously refine configurations.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
