
Practical Cloud‑Native Log Aggregation with Loki, Promtail & Grafana

This guide walks SREs and DevOps engineers through the challenges of log aggregation in containerized Kubernetes environments and shows how Loki, Promtail, and Grafana together provide a low‑cost, label‑based alternative to the ELK stack, covering architecture, deployment, query language, multi‑tenant security, performance tuning, alerting, and disaster recovery.

Ops Community

Problem Background

After moving workloads to containers and Kubernetes, traditional log aggregation becomes cumbersome:

Pod restarts lose stdout logs.

Each pod runs its own Filebeat/Flume agent.

Elasticsearch storage explodes (2 TB/day, half of it indexes).

Query latency is high (P99 = 8 s for a 3‑day range).

Logs are scattered across multiple clouds.

Compliance requires 6‑month retention (cost ≈ 2 M CNY at the current ES size).

Log‑based alerting with commercial ELK is expensive.

Why Loki + Promtail + Grafana

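Loki indexes only the labels attached to each log stream, not the full log content; the log lines themselves are compressed into chunks and stored as‑is. The much smaller index can reduce storage by an order of magnitude compared with Elasticsearch. Loki reuses Prometheus' label model, integrates tightly with Grafana for unified metrics & logs, and uses a push‑based ingestion model via Promtail.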

Index cost : Low (Loki) vs High (ELK)

Full‑text search : Weak – scans chunks (Loki) vs Strong – inverted index (ELK)

Resource usage : Low (Loki) vs Medium‑High (ELK)

K8s integration : Very strong (Loki) vs Strong (ECK)

Cost : Low (Loki) vs Medium‑High (ELK)

Core Design Philosophy

No log content indexing – only label sets are indexed.

Prometheus‑style labels – familiar to PromQL users.

Cheap storage – chunks are compressed and stored in object storage (S3, MinIO, OSS, etc.).

Push model – agents push logs to a distributor.

Grafana‑first – Grafana (Explore, dashboards) is the primary query UI; the same queries can also be run with logcli or directly against Loki's HTTP API.
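The first principle is easier to see in a toy model. The sketch below is not Loki code: `ToyLogStore` and its methods are hypothetical names, and real chunks are compressed and time‑ordered. It only illustrates the idea that the index holds label pairs while log content is scanned at query time.

```python
from collections import defaultdict


class ToyLogStore:
    """Toy model: index label sets only, scan chunk contents at query time."""

    def __init__(self):
        self.index = defaultdict(set)    # (label, value) -> set of stream ids
        self.chunks = defaultdict(list)  # stream id -> raw log lines (not indexed)

    def push(self, labels, line):
        # A stream is identified by its full, sorted label set.
        stream_id = tuple(sorted(labels.items()))
        for pair in stream_id:
            self.index[pair].add(stream_id)
        self.chunks[stream_id].append(line)

    def query(self, selector, contains):
        # Step 1: resolve the label selector through the small index.
        candidates = [self.index.get(pair, set()) for pair in selector.items()]
        streams = set.intersection(*candidates) if candidates else set()
        # Step 2: brute-force scan only the chunks of the matching streams.
        return [
            line
            for stream in sorted(streams)
            for line in self.chunks[stream]
            if contains in line
        ]


store = ToyLogStore()
store.push({"namespace": "prod", "app": "api"}, "ERROR timeout calling db")
store.push({"namespace": "prod", "app": "api"}, "INFO request served")
store.push({"namespace": "prod", "app": "nginx"}, "ERROR upstream reset")

hits = store.query({"namespace": "prod", "app": "api"}, "ERROR")
print(hits)
print(len(store.index))  # number of indexed label pairs, independent of log volume
```

The index grows with the number of distinct label pairs, not with the number of log lines, which is why full‑text search is the weak spot and index cost is the strong one.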

Architecture Overview

Write path

Promtail / Fluentd / Filebeat
    │ (HTTP push, protobuf / json)
    ▼
Distributor (stateless, load‑balanced)
    │
    ▼
Ingester (writes chunks, in‑memory sort)
    │
    ▼
Storage Backend (object store, filesystem)

Read path

Grafana (datasource, dashboards, alerts)
    │
    ▼
Query‑Frontend (caches, sharding)
    │
    ▼
Querier (fetches chunks, merges results)
    │
    ▼
Ingester (recent data) + Storage Backend (flushed chunks)

Background

Compactor (merge, compact, retain, delete) works directly on the Storage Backend.

Deployment Modes

Monolithic – all components in one process (small dev / test).

SimpleScalable – components grouped into independently scalable read and write targets (plus a backend target in newer releases); suits mid‑size clusters.

Microservices – each component independently scalable (large production).

Hardware Sizing

Single‑node :

Small – 2 CPU / 4 GB RAM / 20 GB SSD.

Medium – 4 CPU / 8 GB RAM / 50 GB SSD.

Large – 8 CPU / 16 GB RAM / 100 GB SSD + object store.

Distributed :

Distributor – 2 CPU / 4 GB (2‑3 replicas).

Ingester – 4 CPU / 8 GB (3 replicas).

Querier – 2 CPU / 4 GB (2‑3 replicas).

Query‑Frontend – 2 CPU / 4 GB (2 replicas).

Compactor – 1‑2 CPU / 2 GB (single running instance, optional standby).

Cache – recommended Redis cluster, deployed separately.

Installation Steps (Helm)

Choose Loki vs ELK – refer to the dimension list above.

Create namespace :

kubectl create namespace logging

Set up object storage (example MinIO) :

helm repo add minio https://charts.min.io/
helm install minio minio/minio \
  --namespace storage \
  --create-namespace \
  --set persistence.size=200Gi \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2

Add Grafana repo and fetch default values. The standalone loki-simple-scalable chart is deprecated; the grafana/loki chart deploys the simple scalable mode and is the chart that ships a 5.0.0 release, so it is used here :

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm show values grafana/loki --version 5.0.0 > loki-values.yaml

Customize loki-values.yaml – set storage.type: s3, bucket, credentials, schemaConfig, limits_config, replication factor, etc.

Install Loki :

helm install loki grafana/loki \
  --namespace logging \
  --values loki-values.yaml \
  --version 5.0.0

Install Promtail :

helm install promtail grafana/promtail \
  --namespace logging \
  --values promtail-values.yaml

Verify deployment :

kubectl -n logging get pods
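# Service names depend on the chart and deployment mode; list them with: kubectl -n logging get svc
kubectl -n logging port-forward svc/loki 3100:3100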
curl http://localhost:3100/ready
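Pods often need a short while before /ready returns 200, so a single curl can fail on a healthy deployment. The following is a minimal polling sketch, assuming the port‑forward above is active; `http_status` and `wait_ready` are illustrative helper names, and the demonstration uses a stub instead of a real endpoint so it runs without a cluster.

```python
import time
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except OSError:  # covers URLError/HTTPError (e.g. 503 while starting) and socket errors
        return None


def wait_ready(url, fetch=http_status, attempts=30, delay=2.0, sleep=time.sleep):
    """Poll url until fetch returns 200; return True on success, False on timeout."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Demonstration with a stub that becomes ready on the third poll (no network needed).
responses = iter([None, 503, 200])
ready = wait_ready(
    "http://localhost:3100/ready",
    fetch=lambda _url: next(responses),
    sleep=lambda _seconds: None,
)
print(ready)
```

Against a real deployment, calling `wait_ready("http://localhost:3100/ready")` with the defaults polls for up to about a minute.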

Promtail Configuration Highlights

scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    target_label: node
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod
  - source_labels: [__meta_kubernetes_pod_container_name]
    target_label: container
  - source_labels: [__meta_kubernetes_pod_label_app]
    target_label: app
  - source_labels: [__meta_kubernetes_pod_label_component]
    target_label: component
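  # Use the pod UID only to locate the log files on the node; as a stream
  # label it would add a new high-cardinality value on every pod restart.
  - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
    separator: /
    target_label: __path__
    replacement: /var/log/pods/*$1/*.log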
  - action: replace
    source_labels: [__meta_kubernetes_pod_label_app]
    regex: (.+)
    target_label: job
    replacement: $1
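To make the effect of these rules concrete, here is a simplified Python model of the `replace` relabel action. It is a sketch under assumptions, not Promtail's implementation: `apply_relabel` is a hypothetical helper, only `replace` is modeled, and the discovered pod metadata is made up.

```python
import re


def apply_relabel(meta, rules):
    """Apply 'replace'-style relabel rules to a dict of discovered labels."""
    labels = dict(meta)
    for rule in rules:
        # Source label values are joined with ';' by default.
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        # Relabel regexes are anchored, so the whole value must match.
        match = re.fullmatch(rule.get("regex", "(.*)"), value)
        if match is None:
            continue  # no match: the rule leaves the labels untouched
        # Translate Prometheus-style $1 into Python's \g<1> group reference.
        template = re.sub(r"\$(\d+)", r"\\g<\1>", rule.get("replacement", "$1"))
        labels[rule["target_label"]] = match.expand(template)
    # Internal "__" labels and empty values are dropped from the final stream labels.
    return {k: v for k, v in labels.items() if not k.startswith("__") and v}


meta = {  # hypothetical discovery result for one container
    "__meta_kubernetes_namespace": "prod",
    "__meta_kubernetes_pod_container_name": "api",
    "__meta_kubernetes_pod_label_app": "checkout",
}
rules = [
    {"source_labels": ["__meta_kubernetes_namespace"], "target_label": "namespace"},
    {"source_labels": ["__meta_kubernetes_pod_container_name"], "target_label": "container"},
    {"source_labels": ["__meta_kubernetes_pod_label_component"], "target_label": "component"},
    {"source_labels": ["__meta_kubernetes_pod_label_app"], "regex": "(.+)",
     "target_label": "job", "replacement": "$1"},
]

stream_labels = apply_relabel(meta, rules)
print(stream_labels)
```

The pod has no `component` label, so that rule produces an empty value and the label is omitted from the stream.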

LogQL Basics

Log queries use label selectors and pipeline operators:

{namespace="prod", app="api"} |= "ERROR"
{app="nginx"} | regexp `(?P<method>\w+) (?P<path>\S+)` | line_format "{{.method}} {{.path}}"
rate({app="api"} |= "error" [5m])
sum by (status_code) (rate({app="api"} | json | status_code=~"5.." [5m]))
quantile_over_time(0.99, {app="api"} | json | latency_ms != "" | unwrap latency_ms [5m])
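Outside Grafana, the same log queries can be sent to Loki's HTTP API. The sketch below only builds the request URL for the `/loki/api/v1/query_range` endpoint; the base URL and the helper name `query_range_url` are assumptions for illustration, and the timestamps are arbitrary.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def query_range_url(base_url, logql, start_s, end_s, limit=100):
    """Build a GET URL for Loki's /loki/api/v1/query_range endpoint."""
    params = {
        "query": logql,
        "start": int(start_s) * 1_000_000_000,  # nanosecond Unix epoch
        "end": int(end_s) * 1_000_000_000,
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


logql = '{namespace="prod", app="api"} |= "ERROR"'
url = query_range_url("http://localhost:3100", logql, 1700000000, 1700003600)
print(url)

# Round-trip check: the encoded query decodes back to the original LogQL.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

URL‑encoding matters here because LogQL selectors contain braces, quotes, and pipes that are not safe in a raw query string.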

Performance Tuning

Keep label cardinality low – avoid high‑cardinality labels such as user_id, request_id, timestamps, or long strings.

Set limits_config.ingestion_rate_mb and ingestion_burst_size_mb according to available bandwidth.

Configure query limits: max_entries_limit_per_query, max_query_series, max_query_parallelism to protect the querier.

Enable query‑frontend cache (embedded or Redis) to reduce repeated chunk scans.

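Enable the WAL in ingesters (ingester.wal.enabled: true) to avoid data loss on restarts.

Set replication_factor to 3 in production for high availability; a value of 1 is only suitable for single‑instance or test setups. Because writes must reach a majority of replicas, a factor of 2 does not tolerate the loss of an ingester.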
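The cardinality advice comes down to simple arithmetic: the number of possible streams is bounded by the product of the distinct values of each label. A quick sketch, with made‑up label counts:

```python
from math import prod


def max_streams(label_values):
    """Upper bound on stream count: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical counts of distinct values per label.
bounded = {"namespace": 5, "app": 40, "container": 3}
with_user_id = {**bounded, "user_id": 100_000}

print(max_streams(bounded))       # 600
print(max_streams(with_user_id))  # 60000000
```

One unbounded label turns a few hundred streams into tens of millions, each with its own index entries and in‑memory chunk. Values like user_id are better left in the log line and filtered at query time.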

Multi‑Tenant Security

Enable multi‑tenant mode with auth_enabled: true. Loki does not authenticate users itself, so an auth gateway (nginx + auth_request, Ambassador, Pomerium, etc.) verifies the user and then injects the X-Scope-OrgID header.

Per‑tenant limits (e.g., retention_period, ingestion_rate_mb, max_streams_per_user) go in an overrides file keyed by tenant ID, referenced by limits_config.per_tenant_override_config (or the newer runtime_config.file).
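The gateway's job can be sketched in a few lines. This is an illustration of the logic only, not a real proxy: the user‑to‑tenant mapping and the function name `inject_tenant` are hypothetical, and authentication is assumed to have happened already.

```python
# Hypothetical mapping from authenticated user to Loki tenant ID.
USER_TENANTS = {"alice": "team-a", "bob": "team-b"}


def inject_tenant(headers, user):
    """Return upstream headers with X-Scope-OrgID set from the verified user."""
    tenant = USER_TENANTS.get(user)
    if tenant is None:
        raise PermissionError(f"user {user!r} has no tenant")
    # Drop any client-supplied tenant header (header names are case-insensitive)
    # so a caller cannot pick another tenant by setting it themselves.
    upstream = {k: v for k, v in headers.items() if k.lower() != "x-scope-orgid"}
    upstream["X-Scope-OrgID"] = tenant
    return upstream


incoming = {"Accept": "application/json", "x-scope-orgid": "team-b"}
forwarded = inject_tenant(incoming, "alice")
print(forwarded)
```

The important design choice is that the tenant ID is derived from the verified identity and any value sent by the client is discarded; Loki itself trusts whatever header it receives.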

Alerting

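Example Loki Ruler rule that fires when lines containing "error" arrive at more than 0.1 per second (averaged over a 5‑minute window) for 5 minutes. With local rule storage the ruler expects one directory per tenant, e.g. /etc/loki/rules/<tenant>/alerts.yaml (the tenant is fake when auth_enabled is false):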

groups:
- name: loki_alerts
  interval: 30s
  rules:
  - alert: HighErrorRate
    expr: sum by (namespace, app) (rate({app=~".+"} |= "error" [5m])) > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.app }} error rate too high"
      description: "Namespace={{ $labels.namespace }} app={{ $labels.app }} rate={{ $value }}"

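Grafana's Unified Alerting (introduced in Grafana 8, so also available in 10+) can evaluate the same LogQL expression as a Grafana‑managed alert rule against the Loki data source; Explore is useful for testing the expression first.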
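The 0.1 threshold is easy to misread as a ratio. `rate(... [5m])` is the number of matching lines divided by the window length in seconds, which the arithmetic sketch below spells out (the helper names are illustrative):

```python
WINDOW_SECONDS = 5 * 60  # the [5m] range in the rule
THRESHOLD = 0.1          # lines per second, from the rule expression


def rate_per_second(matching_lines, window_seconds=WINDOW_SECONDS):
    """Per-second rate of matching log lines over the window."""
    return matching_lines / window_seconds


def exceeds_threshold(matching_lines):
    """True when the rule condition `rate > 0.1` holds for this window."""
    return rate_per_second(matching_lines) > THRESHOLD


for lines in (10, 30, 31, 300):
    print(lines, rate_per_second(lines), exceeds_threshold(lines))
```

So the condition becomes true once more than 30 matching lines arrive within five minutes for a given namespace/app pair, and `for: 5m` requires it to stay true for another five minutes before the alert fires.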

Troubleshooting Checklist

Check /ready on Loki and Promtail.

Verify Promtail targets via curl http://promtail:3101/targets.

Inspect Loki metrics (e.g., loki_distributor_lines_received_total, loki_ingester_chunks_created_total).

Look for component logs indicating rate limiting, label cardinality errors, or WAL issues.

Confirm object‑store connectivity (bucket, credentials, region).

Review limits_config and label cardinality.

Use logcli or Grafana Explore to run sample queries.
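When checking the metrics mentioned above by hand, it helps to total a counter across its label sets. A minimal sketch that parses Prometheus text‑format output; the sample text, its label names, and its values are invented, and the parser assumes samples carry no trailing timestamp.

```python
def sum_metric(metrics_text, name):
    """Sum all samples of one metric across label sets (Prometheus text format)."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if series.split("{", 1)[0] == name:
            total += float(value)
    return total


# Invented sample of what a /metrics response might contain.
SAMPLE = """\
# TYPE loki_distributor_lines_received_total counter
loki_distributor_lines_received_total{tenant="team-a"} 1200
loki_distributor_lines_received_total{tenant="team-b"} 300
# TYPE loki_ingester_chunks_created_total counter
loki_ingester_chunks_created_total 42
"""

received = sum_metric(SAMPLE, "loki_distributor_lines_received_total")
print(received)
```

If the distributor counter keeps rising while the ingester counters stay flat, the problem is more likely between distributor and ingester than in Promtail.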

Backup & Rollback

Back up configuration files (loki.yaml, promtail.yaml), ruler rule files, and Grafana dashboard JSON.

Helm rollback: helm history loki -n logging then helm rollback loki <revision> -n logging.

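schema_config entries that have already taken effect should not be edited; schema changes belong in a new period entry with a future from date. If an existing entry was changed, restore the previous schema_config before restarting Loki; otherwise data written under the old schema may become unreadable and a full migration is required.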

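For an emergency read‑only mode, block writes at the gateway (or stop Promtail) and reduce chunk_idle_period so ingesters flush idle chunks to storage sooner. On older releases that still support it, setting ingester.max_transfer_retries to 0 makes ingesters flush on shutdown instead of handing chunks off.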

Integration with Observability Stack

Grafana can add Prometheus, Loki, Tempo, Mimir, and Pyroscope as data sources, enabling unified dashboards and one‑click trace linking via derived fields (e.g., trace_id → Tempo). Loki (logs), Grafana (visualization), Tempo (traces), and Mimir (metrics) form the LGTM stack, with Pyroscope adding profiles.

Key Takeaways

Select deployment mode based on cluster size.

Use object storage for cheap, durable log chunks.

Control label cardinality to avoid memory explosion.

Enable WAL and appropriate replication for reliability.

Leverage LogQL pipelines for powerful log parsing and metric extraction.

Configure multi‑tenant isolation and secure ingress.

Monitor Loki’s own metrics and set alerts for ingestion rate, storage pressure, and component health.

Maintain backup of configuration and ruler rules; use Helm rollback for quick recovery.
