Master Kubernetes Log Collection: From Basics to Advanced EFK & Loki Solutions
This comprehensive guide explains why log management is critical for large Kubernetes clusters, outlines common pain points, presents full‑stack architectures, details EFK and Loki implementations with code samples, and offers performance, security, cost‑optimization, and future‑trend recommendations.
Understanding Kubernetes Cluster Log Collection and Analysis
Author: Senior operations engineer with 8 years of large‑scale distributed system experience.
Introduction
With micro‑services becoming the norm, Kubernetes is the de‑facto container orchestration platform, but as clusters grow, log management becomes a critical challenge. Traditional SSH log inspection cannot keep up with hundreds of pods across dozens of nodes, making a robust log collection and analysis system essential.
Kubernetes Log Management Pain Points
1. Log dispersion
Container logs stored in /var/lib/docker/containers/
System component logs (kubelet, kube‑proxy) scattered on each node
Application logs lost when pods are rescheduled
2. Log lifecycle issues
Logs disappear after pod restart
Node failures make historical logs inaccessible
Container crashes may not flush logs in time
3. Log volume
Single micro‑service can generate gigabytes of logs per day
Whole cluster may produce terabytes of logs
Storage cost and query performance must be considered
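To make the volume concern concrete, a back-of-the-envelope sizing calculation can help. The sketch below is only a rough estimate: the function name and every input (daily volume, retention, replica count, on-disk ratio) are hypothetical placeholders, not measurements from a real cluster.

```python
def estimate_storage_gib(daily_gib, retention_days, replicas=1, disk_ratio=1.0):
    """Rough storage footprint for retained logs, in GiB.

    daily_gib      -- raw log volume ingested per day
    retention_days -- how long logs are kept
    replicas       -- extra copies kept by the backend (0 = primary only)
    disk_ratio     -- on-disk size relative to raw size (assumed, backend-specific)
    """
    if daily_gib < 0 or retention_days < 0 or replicas < 0 or disk_ratio <= 0:
        raise ValueError("inputs must be non-negative and disk_ratio positive")
    copies = 1 + replicas
    return daily_gib * retention_days * copies * disk_ratio


# Hypothetical example: 50 services at 2 GiB/day each, kept 30 days, 1 replica.
total = estimate_storage_gib(daily_gib=50 * 2, retention_days=30, replicas=1)
print(f"{total:.0f} GiB")
```

Even with these modest made-up inputs the result lands in the terabyte range, which is why retention and replica settings deserve attention early.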
Log Architecture Overview
┌─────────────────────────────────────────┐
│ Application Layer Logs │
├─────────────────────────────────────────┤
│ Platform Layer Logs │
├─────────────────────────────────────────┤
│ Infrastructure Layer Logs │
└─────────────────────────────────────────┘
Application layer logs: business‑level logs
Platform layer logs: Kubernetes components such as kube‑apiserver and kube‑scheduler
Infrastructure layer logs: node system logs and container runtime logs
Core Solution: EFK Stack
The EFK stack (Elasticsearch + Fluentd + Kibana) is one of the most mature and widely deployed solutions for Kubernetes logging.
Architecture Diagram
Pod1 ──┐
Pod2 ──┼── Fluentd ── Elasticsearch ── Kibana
Pod3 ──┘ (DaemonSet)    (Cluster)   (Visualization)

Fluentd DaemonSet Deployment
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
      # serviceAccountName replaces the deprecated serviceAccount field
      serviceAccountName: fluentd-elasticsearch
      tolerations:
      # Allow scheduling onto tainted master nodes so their logs are collected too
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: fluentd-elasticsearch
        image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: config-volume
          mountPath: /etc/fluent/config.d
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: config-volume
        configMap:
          name: fluentd-config

Key Fluentd configuration:
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>
<filter kubernetes.**>
  @type kubernetes_metadata
</filter>
<match kubernetes.**>
  @type elasticsearch
  host elasticsearch-logging
  port 9200
  logstash_format true
  logstash_prefix kubernetes
  <buffer>
    @type file
    path /var/log/fluentd-buffers/kubernetes.system.buffer
    flush_mode interval
    retry_type exponential_backoff
    flush_thread_count 2
    flush_interval 5s
    retry_forever true
    retry_max_interval 30
    chunk_limit_size 2M
    queue_limit_length 8
  </buffer>
</match>

Elasticsearch Cluster (StatefulSet)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch-logging
  namespace: kube-system
spec:
  serviceName: elasticsearch-logging
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch-logging
  template:
    metadata:
      labels:
        app: elasticsearch-logging
    spec:
      containers:
      - name: elasticsearch-logging
        image: docker.elastic.co/elasticsearch/elasticsearch:7.9.0
        resources:
          limits:
            cpu: 1000m
            memory: 3Gi
          requests:
            cpu: 100m
            memory: 3Gi
        ports:
        - containerPort: 9200
          name: db
          protocol: TCP
        - containerPort: 9300
          name: transport
          protocol: TCP
        env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: node.name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: cluster.initial_master_nodes
          value: "elasticsearch-logging-0,elasticsearch-logging-1,elasticsearch-logging-2"
        - name: discovery.seed_hosts
          value: "elasticsearch-logging"
        - name: cluster.name
          value: "k8s-logs"
        - name: network.host
          value: "0.0.0.0"
        - name: ES_JAVA_OPTS
          value: "-Xms1536m -Xmx1536m"
        volumeMounts:
        - name: elasticsearch-logging
          mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
  - metadata:
      name: elasticsearch-logging
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "fast-ssd"
      resources:
        requests:
          storage: 100Gi

Kibana Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana-logging
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana-logging
  template:
    metadata:
      labels:
        app: kibana-logging
    spec:
      containers:
      - name: kibana-logging
        image: docker.elastic.co/kibana/kibana:7.9.0
        resources:
          limits:
            cpu: 1000m
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 1Gi
        env:
        - name: ELASTICSEARCH_HOSTS
          value: http://elasticsearch-logging:9200
        ports:
        - containerPort: 5601
          name: ui
          protocol: TCP

Lightweight Alternative: Loki + Promtail
For medium‑size clusters, Grafana Loki offers a lower‑cost solution.
Loki Advantages
Low storage cost – only labels are indexed
Cloud‑native design – integrates with Prometheus and Grafana
Simple deployment with fewer components
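The storage advantage follows from what gets indexed. The toy model below is a sketch of the label-only idea, not Loki's actual implementation: the class name, methods, and sample data are all hypothetical. Labels go into a small index, raw lines are stored unindexed, and a query first narrows by label and only then scans the matching streams.

```python
from collections import defaultdict


class TinyLabelStore:
    """Toy model of label-only indexing: labels are indexed, log lines are not."""

    def __init__(self):
        self.index = defaultdict(set)   # (label, value) -> set of stream ids
        self.streams = {}               # stream id -> list of raw log lines

    def push(self, labels, line):
        stream_id = tuple(sorted(labels.items()))
        for pair in labels.items():
            self.index[pair].add(stream_id)
        self.streams.setdefault(stream_id, []).append(line)

    def query(self, selector, contains=""):
        # Step 1: use the small label index to pick candidate streams.
        candidates = None
        for pair in selector.items():
            ids = self.index.get(pair, set())
            candidates = ids if candidates is None else candidates & ids
        # Step 2: scan only those streams for the substring (no full-text index).
        return [line
                for sid in sorted(candidates or [])
                for line in self.streams[sid]
                if contains in line]


store = TinyLabelStore()
store.push({"app": "web", "namespace": "prod"}, "GET /login 200")
store.push({"app": "web", "namespace": "prod"}, "GET /cart 500")
store.push({"app": "db", "namespace": "prod"}, "slow query 500 ms")
print(store.query({"app": "web"}, contains="500"))
```

The index grows with the number of distinct label sets rather than with the number of log lines, which is the trade-off behind the lower storage cost: queries that cannot be narrowed by label have to scan more data.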
Promtail DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: monitoring
spec:
  selector:
    matchLabels:
      name: promtail
  template:
    metadata:
      labels:
        name: promtail
    spec:
      # serviceAccountName replaces the deprecated serviceAccount field
      serviceAccountName: promtail
      containers:
      - name: promtail
        image: grafana/promtail:2.4.0
        args:
        - -config.file=/etc/promtail/config.yml
        - -client.url=http://loki:3100/loki/api/v1/push
        volumeMounts:
        - name: config
          mountPath: /etc/promtail
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: config
        configMap:
          name: promtail-config
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

Promtail Configuration (excerpt)
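The relabel rule in this excerpt that sets the app label is easy to misread: it joins four candidate label values with ";" and keeps the first non-empty one. A minimal Python sketch of that behavior follows; the helper name and sample values are hypothetical, and it assumes the Prometheus-style defaults of a ";" separator and a fully anchored regex.

```python
import re

# Same pattern as the `app` relabel rule in the excerpt.
FIRST_NON_EMPTY = re.compile(r"^;*([^;]+)(;.*)?$")


def pick_app_label(values):
    """Return the first non-empty value, or None when every value is empty."""
    joined = ";".join(values)          # source label values joined by ";"
    match = FIRST_NON_EMPTY.match(joined)
    return match.group(1) if match else None


# Order mirrors source_labels: app.kubernetes.io/name, app, controller name, pod name.
print(pick_app_label(["", "", "web", "web-7d9f8c6b5d-x2k4q"]))
print(pick_app_label(["checkout", "legacy-checkout", "", "checkout-0"]))
```

When no candidate has a value the pattern does not match, and the sketch returns None to represent the label being left unset.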
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  pipeline_stages:
  - docker: {}
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_controller_name]
    regex: ([0-9a-z-.]+?)(-[0-9a-f]{8,10})?
    target_label: __tmp_controller_name
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name,__meta_kubernetes_pod_label_app,__tmp_controller_name,__meta_kubernetes_pod_name]
    regex: ^;*([^;]+)(;.*)?$
    target_label: app
    replacement: $1

Advanced Features
Structured Log Standardization
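On the application side, one way to produce records in a standardized shape is a small JSON formatter for the language's logging library. The following is a minimal sketch using Python's standard logging module; the class name, the service name, and the set of optional context fields are illustrative assumptions rather than a prescribed schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Optional context fields passed via `extra=...` (names are illustrative).
        for key in ("trace_id", "span_id", "user_id", "ip"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


handler = logging.StreamHandler()  # container stdout/stderr is what node agents tail
handler.setFormatter(JsonFormatter(service="user-service"))
log = logging.getLogger("user-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("User login successful",
         extra={"trace_id": "abc123def456", "user_id": "12345"})
```

Keeping one JSON object per line means the collector can parse fields directly instead of relying on per-service regular expressions.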
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "INFO",
  "service": "user-service",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "message": "User login successful",
  "user_id": "12345",
  "ip": "192.168.1.100"
}

Multiline Log Handling (Java stack traces)
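The core idea of multiline handling is that a line starting with a timestamp opens a new event and every other line is appended to the current one. A minimal Python sketch of that grouping logic is shown here; the function name and sample lines are hypothetical, and a real collector additionally needs a flush timeout for the last event.

```python
import re

# A line beginning with a date such as 2024-01-15 starts a new event.
FIRST_LINE = re.compile(r"^\d{4}-\d{2}-\d{2}")


def group_multiline(lines):
    """Yield one event per timestamped line plus its continuation lines."""
    current = []
    for line in lines:
        if FIRST_LINE.match(line) and current:
            yield "\n".join(current)
            current = []
        current.append(line)
    if current:
        yield "\n".join(current)


sample = [
    "2024-01-15 10:30:00.123 ERROR Request failed",
    "java.lang.IllegalStateException: bad state",
    "\tat com.example.Foo.bar(Foo.java:42)",
    "2024-01-15 10:30:01.000 INFO Recovered",
]
for event in group_multiline(sample):
    print(repr(event))
```

With this input the stack trace stays attached to the ERROR line, so it is indexed as one event instead of three unrelated ones.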
<source>
  @type tail
  path /var/log/containers/*java*.log
  pos_file /var/log/fluentd-java.log.pos
  tag kubernetes.java.*
  read_from_head true
  <parse>
    @type multiline
    format_firstline /^\d{4}-\d{2}-\d{2}/
    format1 /^(?<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+(?<level>[^\s]+)\s+(?<message>.*)/
  </parse>
</source>

Performance Optimization & Best Practices
Log Rotation & Cleanup
apiVersion: v1
kind: ConfigMap
metadata:
  name: logrotate-config
data:
  logrotate.conf: |
    /var/log/containers/*.log {
      daily
      missingok
      rotate 7
      compress
      delaycompress
      copytruncate
    }

Resource Limits & Monitoring
resources:
  limits:
    cpu: "1"
    memory: "2Gi"
  requests:
    cpu: "0.5"
    memory: "1Gi"

Metrics to monitor:
Fluentd buffer usage
Elasticsearch cluster health
Log loss rate
Query response time
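Of these metrics, log loss rate is the one that usually has to be derived rather than read directly. A minimal sketch, assuming you can obtain a count of records emitted by the collectors and a count of records stored in the backend over the same window (both inputs and the function name are hypothetical):

```python
def log_loss_rate(emitted, stored):
    """Fraction of emitted records that never reached the backend (0.0 to 1.0)."""
    if emitted < 0 or stored < 0:
        raise ValueError("counts must be non-negative")
    if emitted == 0:
        return 0.0
    # Duplicates from retries can make stored exceed emitted; clamp at zero.
    return max(emitted - stored, 0) / emitted


# Hypothetical window: 1,000,000 records emitted, 998,500 stored.
print(f"{log_loss_rate(1_000_000, 998_500):.4%}")
```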
Index Lifecycle Management
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "5GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": {
            "number_of_replicas": 0
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Security & Compliance
Sensitive Data Masking
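Before deploying a masking rule, it is worth prototyping the pattern against sample messages. The sketch below mirrors the password=\w+ substitution used in this section in plain Python; the function name and test strings are hypothetical.

```python
import re

PASSWORD = re.compile(r"password=\w+")


def mask_message(message):
    """Replace password values with a fixed placeholder."""
    return PASSWORD.sub("password=***", message)


print(mask_message("login user=bob password=hunter2 ok"))
# \w+ stops at the first non-word character, so part of this value leaks:
print(mask_message("login user=bob password=p@ss!word"))
```

The second call shows a limitation of the pattern: values containing punctuation are only partially masked, so a stricter pattern such as `password=\S+` may be a better fit depending on the log format.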
<filter kubernetes.**>
  @type record_transformer
  # enable_ruby is required for Ruby method calls such as gsub inside ${...}
  enable_ruby true
  <record>
    message ${record["message"].gsub(/password=\w+/, "password=***")}
  </record>
</filter>

Access Control (RBAC)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "list", "watch"]

Troubleshooting Cases
Case 1: Lost Application Logs
Symptoms: Logs disappear after pod restart.
Steps:
Check Fluentd buffer configuration.
Verify Elasticsearch cluster status.
Inspect log rotation policy.
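For the first step, it helps to reason about how much data the buffer can hold and how long retries take. The sketch below uses the values from the EFK buffer settings earlier in this article (chunk_limit_size 2M, queue_limit_length 8, retry_max_interval 30). The 1-second initial wait and base of 2 are assumed defaults, jitter is ignored, and the function names are hypothetical.

```python
def buffer_capacity_bytes(chunk_limit_bytes, queue_limit_length):
    """Upper bound on queued data: chunk size times number of queued chunks."""
    return chunk_limit_bytes * queue_limit_length


def backoff_schedule(attempts, initial_wait=1.0, base=2.0, max_interval=30.0):
    """Capped exponential backoff waits in seconds (randomization ignored)."""
    return [min(initial_wait * base ** n, max_interval) for n in range(attempts)]


capacity = buffer_capacity_bytes(2 * 1024 * 1024, 8)
print(f"buffer capacity: {capacity // (1024 * 1024)} MiB")
print("retry waits (s):", backoff_schedule(7))
```

A 16 MiB buffer fills quickly on a busy node if Elasticsearch is unavailable, which is one way logs can be lost even though retries are configured.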
Fix:
<buffer>
  @type file
  path /var/log/fluentd-buffers/kubernetes.system.buffer
  flush_mode immediate
  retry_type exponential_backoff
  retry_forever true
  chunk_limit_size 8MB
  flush_thread_count 8
</buffer>

Case 2: Slow Kibana Queries
Symptoms: Query response exceeds 30 seconds.
Resolution:
Optimize Elasticsearch index mapping.
Apply index lifecycle management.
Adjust JVM heap settings.
Cost Optimization
Storage Tiering
Hot data on SSD (1‑7 days)
Warm data on HDD (7‑30 days)
Cold data in object storage (>30 days)
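The tiering rule above can be expressed as a simple age-to-tier mapping. This is a sketch only: the function name is hypothetical, and because the ranges in the list share a boundary at day 7, the code assumes day 7 still counts as hot.

```python
def storage_tier(age_days):
    """Map log age in days to a storage tier."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= 7:
        return "hot"    # SSD
    if age_days <= 30:
        return "warm"   # HDD
    return "cold"       # object storage


for age in (1, 7, 8, 30, 31):
    print(age, storage_tier(age))
```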
Log Sampling
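Sampling passes only a fraction of records downstream. The sketch below shows counter-based 1-in-N sampling in Python to illustrate the idea; the function name and the optional keep predicate (for example, always keeping errors) are hypothetical additions and not part of any collector's API.

```python
def sample_every_nth(records, interval, keep=lambda record: False):
    """Pass 1 of every `interval` records; always pass records where keep() is true."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    seen = 0
    for record in records:
        if keep(record):
            yield record          # kept records bypass the counter
            continue
        seen += 1
        if seen % interval == 0:
            yield record


levels = ["INFO"] * 25 + ["ERROR"]
print(list(sample_every_nth(levels, 10, keep=lambda r: r == "ERROR")))
```

Sampling trades completeness for cost, so it suits high-volume, low-value streams such as debug or access logs rather than audit or error logs.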
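<filter kubernetes.**>
  # Assumes the fluent-plugin-sampling-filter plugin; verify parameter names against its README.
  # A filter cannot retag records, so the tag line is dropped.
  @type sampling
  # Pass 1 of every 10 records
  interval 10
  sample_unit all
</filter>

Selective Field Indexing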
{
  "mappings": {
    "properties": {
      "@timestamp": {"type": "date"},
      "level": {"type": "keyword"},
      "message": {"type": "text", "index": false}
    }
  }
}

Monitoring & Alerting
groups:
- name: logging.rules
  rules:
  - alert: FluentdDown
    expr: up{job="fluentd"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Fluentd is down"
  - alert: ElasticsearchClusterRed
    expr: elasticsearch_cluster_health_status{color="red"} == 1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Elasticsearch cluster status is RED"

Future Trends
Unified Observability
Logs + Metrics + Traces on a single platform
OpenTelemetry standardization
AI‑Assisted Operations
Intelligent anomaly detection
Automated root‑cause analysis
Predictive maintenance
Edge‑Computing Adaptation
Lightweight log collectors for edge nodes
Collaborative processing between edge and cloud
Conclusion
Kubernetes log management is a complex system engineering task that requires careful consideration of architecture design, technology selection, performance tuning, security, and cost control.
Technology choice: EFK for large clusters, Loki for small and medium‑size ones.
Architecture: DaemonSet agents, centralized storage.
Performance: Proper buffering and index lifecycle policies.
Cost: Tiered storage, sampling, selective indexing.
Security: Data masking and RBAC.
Adopt a gradual rollout: pilot core services, expand to all workloads, and continuously refine configurations.
