Master Container Log Monitoring: From Basics to Enterprise‑Grade Solutions

This guide walks you through building a robust container log collection and monitoring system—covering challenges, ELK and Prometheus‑Grafana stacks, Loki, performance tuning, automation scripts, and future AIOps trends—to keep your services reliable and observable.

Liangxu Linux

Challenges in Containerized Logging

Logs are scattered across many containers and disappear when a container is removed or replaced.

Dynamic scaling changes log locations.

Root‑cause analysis is difficult without centralized collection.

Typical Architecture

Application containers → log collector (Fluentd/Filebeat) → processing (Logstash) → storage (Elasticsearch or Loki) → visualization (Kibana/Grafana) → alerting (Alertmanager).

Solution 1: ELK Stack

Docker‑Compose Deployment

# docker-compose.yml
version: '3.7'
services:
  elasticsearch:
    image: elasticsearch:7.17.0
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
    ports:
      - "9200:9200"
    volumes:
      - es_data:/usr/share/elasticsearch/data
  kibana:
    image: kibana:7.17.0
    container_name: kibana
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    depends_on:
      - elasticsearch
  logstash:
    image: logstash:7.17.0
    container_name: logstash
    ports:
      - "5000:5000"
    volumes:
      - ./logstash/config:/usr/share/logstash/pipeline
    depends_on:
      - elasticsearch
volumes:
  es_data:

Filebeat Configuration

# filebeat.yml
filebeat.inputs:
- type: container
  paths:
    - '/var/lib/docker/containers/*/*.log'
processors:
- add_docker_metadata:
    host: "unix:///var/run/docker.sock"
- decode_json_fields:
    fields: ["message"]
    target: ""
    overwrite_keys: true
output.logstash:
  hosts: ["logstash:5000"]
logging.level: info

Logstash Pipeline

# logstash.conf
input {
  beats { port => 5000 }
}
filter {
  if [container][image][name] =~ /nginx/ {
    # COMBINEDAPACHELOG ships with Logstash and matches nginx's default "combined" access log format
    grok { match => { "message" => "%{COMBINEDAPACHELOG}" } }
    date { match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"] }
  }
  if [container][image][name] =~ /app/ {
    json { source => "message" }
    mutate { add_field => { "log_type" => "application" } }
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
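
To make the nginx branch of the pipeline concrete, here is a minimal Python sketch of what the grok and date filters do to one access log line. It assumes nginx's default "combined" log format; the regular expression, the field names, and the sample line are illustrative and are not taken from Logstash itself.

```python
import re
from datetime import datetime

# Hand-written regex for nginx's default "combined" access log format (an approximation).
COMBINED = re.compile(
    r'(?P<clientip>\S+) \S+ (?P<auth>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_access_line(line):
    """Return a dict of named fields, or None when the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    event = match.groupdict()
    # Same layout as the Logstash date pattern dd/MMM/yyyy:HH:mm:ss Z.
    event["@timestamp"] = datetime.strptime(event["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    event["response"] = int(event["response"])
    return event


# Hypothetical sample line for illustration.
sample = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
          '"GET /index.html HTTP/1.1" 200 612 "-" "curl/8.0.1"')
event = parse_access_line(sample)
print(event["clientip"], event["verb"], event["response"], event["@timestamp"].isoformat())
```

Lines that do not match return None, which loosely corresponds to the _grokparsefailure tag Logstash adds when a pattern fails.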

Elasticsearch Performance Tuning

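# elasticsearch.yml
cluster.name: prod-logging
node.name: node-1
bootstrap.memory_lock: true
# Index-level settings (number_of_shards: 3, number_of_replicas: 1) are rejected in
# elasticsearch.yml since Elasticsearch 5.x; apply them through an index template.
# ILM is always enabled in 7.17, so the deprecated xpack.ilm.enabled flag is omitted.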
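
One way to apply the shard and replica counts per index, together with the lifecycle targets listed under Performance Optimizations (rollover at 10GB or 7d, shrink to 1 shard, freeze, delete after 90d), is an ILM policy plus an index template. The sketch below only builds the two request bodies. The names logs-policy and logs, and the min_age values for the warm and cold phases, are assumptions for illustration. Rollover also requires writing through an alias or data stream instead of the date-named indices used earlier.

```python
import json


def build_ilm_policy():
    """Body for PUT _ilm/policy/<name>, mirroring the lifecycle targets in this guide."""
    return {
        "policy": {
            "phases": {
                "hot": {"actions": {"rollover": {"max_size": "10gb", "max_age": "7d"}}},
                # min_age for warm and cold is assumed; the guide does not specify it.
                "warm": {"min_age": "7d", "actions": {"shrink": {"number_of_shards": 1}}},
                "cold": {"min_age": "30d", "actions": {"freeze": {}}},
                "delete": {"min_age": "90d", "actions": {"delete": {}}},
            }
        }
    }


def build_index_template(policy_name, alias):
    """Body for PUT _index_template/<name>; settings apply to new logs-* indices."""
    return {
        "index_patterns": ["logs-*"],
        "template": {
            "settings": {
                "number_of_shards": 3,
                "number_of_replicas": 1,
                "index.lifecycle.name": policy_name,
                "index.lifecycle.rollover_alias": alias,
            }
        },
    }


policy = build_ilm_policy()
template = build_index_template("logs-policy", "logs")  # hypothetical names
print(json.dumps(template, indent=2))
```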

Solution 2: Prometheus + Grafana

Prometheus Configuration (prometheus.yml)

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "alert_rules.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
scrape_configs:
  - job_name: 'docker'
    static_configs:
      - targets: ['cadvisor:8080']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'application'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        port: 8080
    relabel_configs:
      - source_labels: [__meta_docker_container_label_monitoring]
        regex: "true"
        action: keep

Alert Rules (alert_rules.yml)

# alert_rules.yml
groups:
  - name: container_alerts
    rules:
      - alert: ContainerHighCPU
        expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Container CPU usage high"
          description: "Container {{ $labels.name }} CPU > 80%"
      - alert: ContainerHighMemory
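        # The "> 0" filter drops containers without a memory limit, which would otherwise divide by zero
        expr: container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0) > 0.9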
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container memory usage high"
          description: "Container {{ $labels.name }} memory > 90%"
      - alert: ContainerDown
        # absent() fires only when no series exists at all and carries no container labels
        expr: time() - container_last_seen > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container stopped"
          description: "Container {{ $labels.name }} is not running"
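
The for: clause in these rules delays firing until the expression has stayed true for the whole duration. A rough Python model of that state machine, assuming one sample per evaluation and the 15s evaluation_interval configured earlier (the sample values are invented):

```python
def alert_states(samples, threshold, for_seconds, interval_seconds):
    """Return 'inactive', 'pending', or 'firing' for each evaluation of value > threshold."""
    states = []
    streak = 0  # consecutive evaluations for which the condition held
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak == 0:
            states.append("inactive")
        elif (streak - 1) * interval_seconds >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states


# CPU usage in cores: one calm sample, nine hot samples, then recovery.
cpu = [0.5] + [0.9] * 9 + [0.4]
states = alert_states(cpu, threshold=0.8, for_seconds=120, interval_seconds=15)
print(states)
```

With for: 2m and a 15s interval, the alert stays pending for eight evaluations and fires on the ninth; a single dip below the threshold resets it.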

Grafana Dashboard (JSON snippet)

{
  "dashboard": {
    "title": "Container Monitoring",
    "panels": [
      {"title": "CPU Usage","type": "timeseries","targets":[{"expr":"rate(container_cpu_usage_seconds_total[5m]) * 100","legendFormat":"{{name}}"}]},
      {"title": "Memory Usage","type": "stat","targets":[{"expr":"container_memory_usage_bytes / container_spec_memory_limit_bytes * 100"}]}
    ]
  }
}

Solution 3: Loki + Grafana

Loki Configuration (loki-config.yaml)

# loki-config.yaml
auth_enabled: false
server:
  http_listen_port: 3100
ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
  filesystem:
    directory: /loki/chunks
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
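
Clients deliver logs to this Loki instance through its HTTP push endpoint on port 3100. As a sketch of what such a request carries, the helper below builds a JSON push body: one stream identified by its labels, with each entry a pair of nanosecond timestamp string and log line. The function name and the label values are hypothetical; sending the request is left out.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a JSON body for POST /loki/api/v1/push with one stream."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    # Offset each line by 1 ns so entries keep their order within the stream.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


payload = build_push_payload({"container": "web", "stream": "stdout"},
                             ["service started", "listening on :8080"],
                             now_ns=1_700_000_000_000_000_000)
print(json.dumps(payload, indent=2))
```

Entries older than reject_old_samples_max_age (168h here) would be rejected by the limits configured above.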

Promtail Configuration (promtail-config.yaml)

# promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: containers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: 'stream'
    pipeline_stages:
      - json:
          expressions:
            level: level
            msg: message
      - labels:
          level: ''
      - output:
          source: msg

Spring Boot Logging Standardization

<!-- logback-spring.xml -->
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder">
      <providers>
        <timestamp/>
        <logLevel/>
        <loggerName/>
        <message/>
        <mdc/>
        <stackTrace/>
      </providers>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>

Performance Optimizations

Filebeat: set harvester_buffer_size: 16384 and max_bytes: 10485760, enable compression (compression_level: 3), and tune bulk settings (bulk_max_size: 3200, worker: 2).

Elasticsearch: use ILM policies to rollover indices at 10GB or 7d, shrink warm phase to 1 shard, freeze cold phase, and delete after 90d.

Prometheus: query 95th‑percentile latency with

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

and monitor JVM metrics via Spring Actuator endpoints.
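
To show how the 95th-percentile query arrives at a number, here is a simplified Python version of the interpolation histogram_quantile performs on cumulative buckets. It assumes classic histograms with an +Inf bucket and positive upper bounds, and it skips the edge cases Prometheus handles; the bucket counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs including +Inf."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        return buckets[-2][0]  # cannot interpolate into +Inf; use the last finite bound
    lower = buckets[index - 1][0] if index > 0 else 0.0
    prev_count = buckets[index - 1][1] if index > 0 else 0.0
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# 100 requests: 50 under 0.1s, 90 under 0.5s, all under 1s.
latency = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency)
print(round(p95, 3))
```

The result depends on bucket boundaries: the estimate can never be more precise than the width of the bucket the quantile falls into.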

Automation Example (auto‑scale.sh)

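#!/bin/bash
# Auto-scale based on the 5xx error ratio from Prometheus
# TOKEN must hold the DingTalk robot access token
QUERY='sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
# -G with --data-urlencode stops curl from globbing the {} and [] in the query
ERROR_RATE=$(curl -sG "http://prometheus:9090/api/v1/query" --data-urlencode "query=$QUERY" |
  jq -r '.data.result[0].value[1] // "0"')
# awk handles exponent notation (e.g. 1e-05), which bc cannot parse
if awk -v r="$ERROR_RATE" 'BEGIN { exit !(r + 0 > 0.05) }'; then
  echo "High error rate, scaling up..."
  docker service update --replicas 5 app_service
  # Optional DingTalk notification
  curl -X POST "https://oapi.dingtalk.com/robot/send?access_token=$TOKEN" \
    -H 'Content-Type: application/json' \
    -d "{\"msgtype\": \"text\", \"text\": {\"content\": \"⚠️ Auto-scale triggered, error rate: $ERROR_RATE\"}}"
fi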
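
The decision step of the script is easy to get wrong when Prometheus returns no series or NaN (for example, with no traffic). A small Python sketch of that step, assuming the standard instant-query response shape and the same 0.05 threshold; the function names and sample responses are hypothetical:

```python
import json
import math


def error_ratio(response_text):
    """Extract the first sample value from an instant-query response, or 0.0 if unusable."""
    body = json.loads(response_text)
    results = body.get("data", {}).get("result", [])
    if body.get("status") != "success" or not results:
        return 0.0
    value = float(results[0]["value"][1])  # sample values arrive as strings
    return 0.0 if math.isnan(value) else value


def should_scale(response_text, threshold=0.05):
    """True when the error ratio exceeds the threshold."""
    return error_ratio(response_text) > threshold


busy = '{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1700000000,"0.12"]}]}}'
idle = '{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1700000000,"NaN"]}]}}'
empty = '{"status":"success","data":{"resultType":"vector","result":[]}}'
print(should_scale(busy), should_scale(idle), should_scale(empty))
```

Treating missing data as "do not scale" is a design choice; a production setup might alert on missing metrics separately.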

Key Takeaways

Select a stack (ELK, Loki, or managed cloud) that matches scale and cost requirements.

Standardize log format (JSON) at the application level for easy parsing.

Configure meaningful alerts to avoid fatigue.

Apply storage lifecycle policies to control index size and cost.

Automate scaling and notification based on metric thresholds.
