
How Loki + S3 Cuts Log Storage Costs by Up to 90% at PB Scale

This article explains how Grafana Loki, paired with S3 object storage, can cut PB-scale log storage costs by 80-90% while simplifying operations, improving query performance, and meeting compliance requirements. It covers architecture, configuration, deployment, and real-world case studies.

MaGe Linux Operations

Introduction

In the cloud-native era, log volume grows rapidly: a medium-size internet company can generate several terabytes of logs per day. Traditional ELK stacks become prohibitively expensive and complex at PB scale. Grafana Loki, which indexes only labels and keeps log data in object storage, can lower storage and compute costs dramatically.

Technical Background

Problems with Traditional Log Solutions

Elasticsearch's full-text indexing inflates storage to 3-5× the raw log size and requires substantial memory (roughly 1/20 to 1/30 of the data size), which pushes monthly costs into the tens of thousands of dollars at this scale.

Loki Design Philosophy

Loki indexes only metadata (labels) and stores compressed log chunks in object storage, achieving three core benefits: drastically reduced storage cost (1.2‑1.5× raw size), lower operational complexity, and independent horizontal scaling of compute and storage.

S3 Object Storage Advantages

Amazon S3 and S3-compatible services (MinIO, Alibaba Cloud OSS, Tencent Cloud COS) offer low-cost, highly durable storage. Standard storage costs roughly 0.15-0.25 CNY/GB/month, versus 1-2 CNY/GB/month for SSD cloud disks. Lifecycle-based tiering lowers costs further by moving data from hot to warm to cold storage classes as it ages.
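To make the ranges above concrete, here is a rough back-of-the-envelope sketch in Python. The footprint multipliers and unit prices are the ones quoted in this article; the daily volume (1 TB) and the 30-day retention window are illustrative assumptions, and compute, request, and transfer costs are ignored.

```python
# Rough monthly storage cost for a fixed daily log volume.
# Multipliers and unit prices come from the ranges quoted in the article;
# the volume and retention below are illustrative assumptions.

RAW_GB_PER_DAY = 1000   # assumption: 1 TB/day, decimal units
RETENTION_DAYS = 30     # assumption


def monthly_cost(footprint_multiplier: float, cny_per_gb_month: float) -> float:
    """Steady-state monthly cost once RETENTION_DAYS of logs are stored."""
    stored_gb = RAW_GB_PER_DAY * RETENTION_DAYS * footprint_multiplier
    return stored_gb * cny_per_gb_month


# Elasticsearch on SSD cloud disks: 3-5x footprint, 1-2 CNY/GB/month
es_low, es_high = monthly_cost(3.0, 1.0), monthly_cost(5.0, 2.0)
# Loki on S3 standard storage: 1.2-1.5x footprint, 0.15-0.25 CNY/GB/month
loki_low, loki_high = monthly_cost(1.2, 0.15), monthly_cost(1.5, 0.25)

print(f"Elasticsearch + SSD: {es_low:,.0f} - {es_high:,.0f} CNY/month")
print(f"Loki + S3:           {loki_low:,.0f} - {loki_high:,.0f} CNY/month")
```

The gap comes from two multiplied factors: a smaller on-disk footprint and a cheaper price per GB.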

Core Content

Loki Architecture

Loki follows a micro‑service model with four main components:

Distributor: receives log streams, validates labels, and balances load across Ingester instances.

Ingester: builds compressed chunks, maintains in-memory indexes, uploads chunks to object storage, and writes index records.

Querier: handles LogQL queries by looking up matching chunk references in the index, downloading and decompressing those chunks, and then filtering and aggregating the results.

Query Frontend: provides query caching, request splitting, rate limiting, and tenant isolation.
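The way a Distributor spreads streams across Ingesters can be pictured as a hash ring. The Python sketch below is a simplified illustration of that idea, not Loki's actual implementation: the `Ring` class, the `token` helper, the SHA-256 hash, and the 64 tokens per ingester are all assumptions made for the example.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a 32-bit ring position (hash choice is illustrative)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


class Ring:
    """Hypothetical, simplified hash ring of ingesters."""

    def __init__(self, ingesters, tokens_per_ingester=64):
        # Each ingester owns several tokens so that load spreads evenly.
        self._points = sorted(
            (token(f"{name}-{i}"), name)
            for name in ingesters
            for i in range(tokens_per_ingester)
        )
        self._tokens = [t for t, _ in self._points]
        self._size = len(set(ingesters))

    def replicas(self, tenant, labels, replication_factor=3):
        """Walk clockwise from the stream's token, collecting distinct ingesters."""
        # A stream is identified by its tenant plus its sorted label set.
        stream_key = tenant + "|" + ",".join(
            f"{k}={v}" for k, v in sorted(labels.items())
        )
        start = bisect.bisect_right(self._tokens, token(stream_key))
        want = min(replication_factor, self._size)
        chosen = []
        for offset in range(len(self._points)):
            name = self._points[(start + offset) % len(self._points)][1]
            if name not in chosen:
                chosen.append(name)
            if len(chosen) == want:
                break
        return chosen


ring = Ring(["ingester-1", "ingester-2", "ingester-3", "ingester-4"])
labels = {"app": "api-gateway", "namespace": "production"}
print(ring.replicas("tenant-001", labels))
```

Because the key is built from the sorted label set, the same stream always lands on the same ingesters, which is what lets each ingester build chunks per stream.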

Loki+S3 Core Configuration

# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info

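common:
  path_prefix: /loki
  storage:
    s3:
      endpoint: s3.amazonaws.com
      bucketnames: loki-logs-prod
      region: us-east-1
      # ${...} is substituted only when Loki starts with -config.expand-env=true
      access_key_id: ${S3_ACCESS_KEY}
      secret_access_key: ${S3_SECRET_KEY}
      s3forcepathstyle: false
      insecure: false
  # replication_factor and ring are direct children of common, not of s3
  replication_factor: 3
  ring:
    kvstore:
      store: memberlist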

memberlist:
  join_members:
    - loki-1:7946
    - loki-2:7946
    - loki-3:7946

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v12
      index:
        prefix: loki_index_
        period: 24h

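storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
    shared_store: s3
  # aws is a child of storage_config; it overlaps with common.storage.s3 above,
  # so in practice keep only one of the two S3 definitions
  aws:
    s3: s3://loki-logs-prod
    sse_encryption: true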

ingester:
  chunk_idle_period: 30m
  chunk_block_size: 262144
  chunk_target_size: 1572864
  chunk_retain_period: 5m
  max_transfer_retries: 0
  lifecycler:
    ring:
      replication_factor: 3

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 50
  ingestion_burst_size_mb: 100
  max_query_series: 10000
  max_query_parallelism: 32
  max_streams_per_user: 0
  max_global_streams_per_user: 100000
  max_query_lookback: 720h
  retention_period: 2160h   # read by compactor retention (retention_enabled: true below)

chunk_store_config:
  max_look_back_period: 720h
  chunk_cache_config:
    enable_fifocache: true
    fifocache:
      max_size_bytes: 2GB
      ttl: 24h

query_range:
  align_queries_with_step: true
  cache_results: true
  max_retries: 5
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        max_size_bytes: 1GB
        ttl: 24h

compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150

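# table_manager retention applies only to legacy table-based index stores.
# With the TSDB index used here, the compactor enforces retention using
# limits_config.retention_period, so the 2160h value belongs there as well.
table_manager:
  retention_deletes_enabled: true
  retention_period: 2160h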
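The ingester settings in the configuration above decide when an in-memory chunk is written to S3. The following Python sketch is a simplified model of that decision, not Loki's code: the `Chunk` class and `flush_reason` function are hypothetical, and the 2-hour maximum chunk age is an assumed value that the configuration above does not set explicitly.

```python
from dataclasses import dataclass

CHUNK_TARGET_SIZE = 1_572_864   # bytes, from chunk_target_size in the config
CHUNK_IDLE_PERIOD = 30 * 60     # seconds, from chunk_idle_period: 30m
MAX_CHUNK_AGE = 2 * 60 * 60     # seconds; assumed, not set in the config


@dataclass
class Chunk:
    compressed_bytes: int
    first_entry_at: float   # unix seconds of the oldest entry
    last_entry_at: float    # unix seconds of the newest entry


def flush_reason(chunk: Chunk, now: float):
    """Return why the chunk should be flushed, or None to keep it in memory."""
    if chunk.compressed_bytes >= CHUNK_TARGET_SIZE:
        return "full"        # reached the target compressed size
    if now - chunk.last_entry_at >= CHUNK_IDLE_PERIOD:
        return "idle"        # the stream stopped receiving entries
    if now - chunk.first_entry_at >= MAX_CHUNK_AGE:
        return "max_age"     # bounded so slow streams still get flushed
    return None


print(flush_reason(Chunk(2_000_000, 0.0, 60.0), now=120.0))
```

Larger target sizes mean fewer, bigger S3 objects (cheaper requests, better compression), while shorter idle periods reduce the amount of unflushed data held in memory.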

Docker‑Compose Deployment

# docker-compose.yml
version: '3.8'

services:
  loki:
    image: grafana/loki:2.9.3
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/loki-config.yaml
      - loki-data:/loki
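    # -config.expand-env=true lets Loki substitute environment variables in the config file
    command: -config.file=/etc/loki/loki-config.yaml -config.expand-env=true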
    environment:
      - S3_ACCESS_KEY=${S3_ACCESS_KEY}
      - S3_SECRET_KEY=${S3_SECRET_KEY}
    networks:
      - loki
    restart: unless-stopped

  promtail:
    image: grafana/promtail:2.9.3
    container_name: promtail
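    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      # docker_sd_configs in promtail-config.yaml talks to the Docker socket
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./promtail-config.yaml:/etc/promtail/promtail-config.yaml
    # -config.expand-env=true expands environment variables such as HOSTNAME in the config
    command: -config.file=/etc/promtail/promtail-config.yaml -config.expand-env=true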
    networks:
      - loki
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.2.0
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123   # example only; change before exposing Grafana
      - GF_USERS_ALLOW_SIGN_UP=false
    networks:
      - loki
    restart: unless-stopped

volumes:
  loki-data:
  grafana-data:

networks:
  loki:
    driver: bridge

Promtail Collection Configuration

# promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push
    batchwait: 1s
    batchsize: 1048576
    timeout: 10s

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          host: ${HOSTNAME}
          __path__: /var/log/*log

  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: stream
      - source_labels: ['__meta_docker_container_label_com_docker_compose_project']
        target_label: project
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: service

  - job_name: nginx
    static_configs:
      - targets: [localhost]
        labels:
          job: nginx
          host: ${HOSTNAME}
          __path__: /var/log/nginx/*.log
    pipeline_stages:
      - regex:
          expression: '^(?P<remote_addr>[\w\.]+) - (?P<remote_user>[^ ]*) \[(?P<time_local>.*)\] "(?P<method>[^ ]*) (?P<request>[^ ]*) (?P<protocol>[^ ]*)" (?P<status>[\d]+) (?P<body_bytes_sent>[\d]+) "(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
      - labels:
          method:
          status:
      - metrics:
          nginx_request_total:
            type: Counter
            description: "Total nginx requests"
            source: status
            config:
              action: inc

S3 Lifecycle Policy Configuration

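# Create S3 lifecycle policy JSON
# "fake/" is the tenant directory Loki uses when auth_enabled is false.
# Objects in DEEP_ARCHIVE must be restored before Loki can read them, so keep
# that transition later than the period you still need to query.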
cat > s3-lifecycle-policy.json <<'EOF'
{
  "Rules": [
    {
      "Id": "LokiHotDataRule",
      "Status": "Enabled",
      "Filter": {"Prefix": "fake/"},
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER_IR"},
        {"Days": 180, "StorageClass": "DEEP_ARCHIVE"}
      ],
      "Expiration": {"Days": 730}
    },
    {
      "Id": "LokiIndexRule",
      "Status": "Enabled",
      "Filter": {"Prefix": "index/"},
      "Transitions": [{"Days": 90, "StorageClass": "STANDARD_IA"}],
      "Expiration": {"Days": 730}
    }
  ]
}
EOF

# Apply lifecycle policy
aws s3api put-bucket-lifecycle-configuration \
  --bucket loki-logs-prod \
  --lifecycle-configuration file://s3-lifecycle-policy.json

# Enable server‑side encryption
aws s3api put-bucket-encryption \
  --bucket loki-logs-prod \
  --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'

# Enable versioning (optional)
aws s3api put-bucket-versioning \
  --bucket loki-logs-prod \
  --versioning-configuration Status=Enabled
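The chunk rule in the lifecycle policy above can be read as a simple age-to-tier mapping. The Python sketch below restates it under that reading; the `storage_class_at` helper is hypothetical, and it ignores the fact that S3 applies lifecycle actions asynchronously, so real transitions may lag the configured day counts.

```python
# (minimum age in days, storage class) from the LokiHotDataRule transitions
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER_IR"), (180, "DEEP_ARCHIVE")]
EXPIRATION_DAYS = 730


def storage_class_at(age_days: int):
    """Storage class of a chunk object at a given age, or None once expired."""
    if age_days >= EXPIRATION_DAYS:
        return None
    current = "STANDARD"
    for min_age, storage_class in TRANSITIONS:
        if age_days >= min_age:
            current = storage_class
    return current


for age in (0, 30, 90, 180, 730):
    print(age, storage_class_at(age))
```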

MinIO as S3‑Compatible Storage

# Deploy a single-node MinIO instance quickly (for testing; this is not a cluster)
docker run -d \
  --name minio \
  -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=admin \
  -e MINIO_ROOT_PASSWORD=Admin@123456 \
  -v /data/minio:/data \
  minio/minio server /data --console-address ":9001"

# Create a private bucket for Loki logs (Loki authenticates with access keys,
# so no anonymous access policy is needed)
docker exec minio mc alias set local http://localhost:9000 admin Admin@123456
docker exec minio mc mb local/loki-logs

# Add a lifecycle rule to MinIO
# (a remote tier named WARM must already be configured; newer mc releases
# use `mc ilm rule add` in place of `mc ilm add`)
docker exec minio mc ilm add local/loki-logs \
  --transition-days 30 \
  --transition-tier "WARM" \
  --expiry-days 730

Practical Cases

Case 1: E‑commerce Platform Log Architecture Upgrade

The platform generated 15 TB of raw logs daily, requiring a 24‑node Elasticsearch cluster (32 CPU × 128 GB RAM per node, 10 TB SSD each) and incurring ~180 000 CNY/month. After migrating to a 3‑node Loki cluster (8 CPU × 32 GB, 500 GB SSD cache) with S3 storage, monthly cost dropped to ~28 000 CNY, a saving of 84 %.

Node count reduced by 87.5 % (24 → 3); total CPU cores fell from 768 to 24.

Hot SSD storage reduced from 240 TB to 1.5 TB (99.4 % saving).

Monthly cost reduced from 180 k CNY to 28 k CNY.

Query P95 latency improved from 15‑30 s to 2‑5 s (≈83 % faster).
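The percentages in this case follow directly from the before and after figures. A short Python check, using only the numbers stated above (the `reduction_pct` helper is introduced here for illustration):

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction from before to after, rounded to one decimal."""
    return round(100 * (before - after) / before, 1)


nodes = reduction_pct(24, 3)                 # Elasticsearch nodes vs Loki nodes
cpu_cores = reduction_pct(24 * 32, 3 * 8)    # total CPU cores
hot_ssd_tb = reduction_pct(24 * 10, 3 * 0.5) # 240 TB vs 1.5 TB of SSD
monthly_cny = reduction_pct(180_000, 28_000) # monthly cost
p95_worst = reduction_pct(30, 5)             # upper bounds of the P95 ranges

for name, value in [
    ("nodes", nodes),
    ("CPU cores", cpu_cores),
    ("hot SSD", hot_ssd_tb),
    ("monthly cost", monthly_cny),
    ("P95 latency (worst case)", p95_worst),
]:
    print(f"{name}: {value} % lower")
```

Note that the 87.5 % figure corresponds to node count; measured in CPU cores the reduction is larger.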

Case 2: Financial Industry Compliance Log Storage

A city commercial bank must retain 5 years of audit logs (~500 TB/year). Loki is configured with a 5-year retention period, S3 cross-region replication, server-side encryption, and object lock for immutability. Using deep-archive storage (≈0.03 CNY/GB/month), the monthly cost is ~75 000 CNY (2 500 TB × 0.03 CNY/GB), versus more than 1 000 000 CNY for an equivalent Elasticsearch solution, a reduction of about 92.5 %.

# Loki compliance config snippet
compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 2h
  retention_enabled: true
  retention_delete_delay: 24h
  retention_delete_worker_count: 50

limits_config:
  retention_period: 1825d   # 5 years
  max_query_lookback: 1825d

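# S3 cross‑region replication config (save as s3-replication-config.json;
# versioning must be enabled on both the source and destination buckets)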
{
  "Role": "arn:aws:iam::ACCOUNT-ID:role/s3-replication-role",
  "Rules": [{
    "Status": "Enabled",
    "Priority": 1,
    "Filter": {"Prefix": ""},
    "Destination": {"Bucket": "arn:aws:s3:::loki-logs-backup-region"},
    "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
    "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}},
    "DeleteMarkerReplication": {"Status": "Enabled"}
  }]
}

aws s3api put-bucket-replication \
  --bucket loki-logs-prod \
  --replication-configuration file://s3-replication-config.json
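The ~75 000 CNY/month figure in this case is the steady-state cost once a full five years of logs are stored; the bill ramps up over time before that. A small Python sketch of the ramp, assuming for simplicity that all retained data sits in the deep-archive tier at the quoted price (the `monthly_cost_after` helper is hypothetical):

```python
TB_PER_YEAR = 500          # from the case description
RETENTION_YEARS = 5        # from the case description
CNY_PER_GB_MONTH = 0.03    # deep-archive price quoted in the case


def monthly_cost_after(years_elapsed: int) -> float:
    """Monthly storage cost after N years, capped by the retention window."""
    stored_years = min(years_elapsed, RETENTION_YEARS)
    stored_gb = TB_PER_YEAR * stored_years * 1000   # decimal TB to GB
    return stored_gb * CNY_PER_GB_MONTH


for year in range(1, 7):
    print(f"after year {year}: {monthly_cost_after(year):,.0f} CNY/month")
```

After year 5 the cost plateaus, because retention deletes the oldest year as each new one arrives.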

Case 3: Multi‑tenant SaaS Platform Log Isolation

Loki runs in multi-tenant mode (auth_enabled: true) with per-tenant rate limits. An Nginx reverse proxy derives the tenant ID from each request and injects it into the X-Scope-OrgID header, and separate limits are defined for each tenant.

# Loki multi‑tenant config (loki-config.yaml)
auth_enabled: true
server:
  http_listen_port: 3100

limits_config:
  split_queries_by_interval: 24h
  max_query_parallelism: 32
  max_streams_per_user: 10000
  ingestion_rate_strategy: global
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20

# Per-tenant overrides are not read from limits_config; they live in a
# separate runtime file that Loki reloads periodically.
runtime_config:
  file: /etc/loki/runtime-config.yaml   # assumed path

# Contents of /etc/loki/runtime-config.yaml
overrides:
  "tenant-001":
    ingestion_rate_mb: 50
    ingestion_burst_size_mb: 100
    max_query_parallelism: 64
  "tenant-002":
    ingestion_rate_mb: 20
    ingestion_burst_size_mb: 40
  "tenant-vip":
    ingestion_rate_mb: 200
    ingestion_burst_size_mb: 400
    max_query_parallelism: 128

# Nginx reverse‑proxy snippet
upstream loki_backend {
  least_conn;
  server loki-1:3100 max_fails=3 fail_timeout=30s;
  server loki-2:3100 max_fails=3 fail_timeout=30s;
  server loki-3:3100 max_fails=3 fail_timeout=30s;
}

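map $http_authorization $tenant_id {
  default "anonymous";
  # Capture the full "tenant-<id>" so it matches the override keys (tenant-001, tenant-vip).
  # Assumes tokens shaped like "tenant-<id>-<secret>".
  ~^Bearer\ (?<tid>tenant-[^-]+)- $tid;
}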

server {
  listen 80;
  server_name loki.example.com;

  location /loki/api/v1/push {
    proxy_pass http://loki_backend;
    proxy_set_header X-Scope-OrgID $tenant_id;
    limit_req zone=tenant_limit burst=20 nodelay;
    client_max_body_size 10M;
    proxy_read_timeout 300s;
    proxy_connect_timeout 60s;
  }

  location /loki/api/v1/query_range {
    proxy_pass http://loki_backend;
    proxy_set_header X-Scope-OrgID $tenant_id;
    limit_req zone=query_limit burst=10 nodelay;
    proxy_read_timeout 600s;
  }
}

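# limit_req_zone is only valid in the http context, next to the upstream, map, and server blocks above
limit_req_zone $tenant_id zone=tenant_limit:10m rate=100r/s;
limit_req_zone $tenant_id zone=query_limit:10m rate=10r/s;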
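For testing tenant isolation without the proxy, it can help to see what a push request looks like once the tenant header is in place. The Python sketch below builds (but does not send) such a request for Loki's push API. The `build_push` helper and the default base URL (one of the upstream servers from the snippet above) are assumptions for illustration.

```python
import json


def build_push(tenant_id, labels, lines, base_url="http://loki-1:3100"):
    """Return (url, headers, body) for a Loki push request.

    lines is an iterable of (unix_nanoseconds, text) pairs.
    """
    payload = {
        "streams": [
            {
                "stream": labels,
                # Loki expects timestamps as strings of nanoseconds since epoch.
                "values": [[str(ts_ns), text] for ts_ns, text in lines],
            }
        ]
    }
    headers = {
        "Content-Type": "application/json",
        "X-Scope-OrgID": tenant_id,   # selects the tenant in multi-tenant mode
    }
    return f"{base_url}/loki/api/v1/push", headers, json.dumps(payload).encode()


url, headers, body = build_push(
    "tenant-001", {"app": "api"}, [(1700000000000000000, "hello")]
)
print(url, headers["X-Scope-OrgID"], len(body))
```

The returned triple can be handed to any HTTP client; requests that go through the proxy do not need to set the header themselves, since the proxy injects it.
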
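
# LogQL best‑practice examples
# Narrow by labels first, then filter lines, then parse and compare fields
{namespace="production", app="api-gateway"} |= "error" | json | status >= 500
# A line filter right after the selector discards non-matching lines before any parsing
{job="varlogs"} |= "error"
# Range selectors must be wrapped in a function; keep the window short
count_over_time({app="nginx"}[5m])
# Per-status request rate (status must be an indexed label)
sum by (status) (rate({app="api"}[5m]))
# Filter on parsed fields before reformatting the line
{app="nginx"} | json | status >= 400 | line_format "{{.method}} {{.path}} {{.status}}"

# Query Loki's own logs through the HTTP API
curl -G -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={job="loki"} |= "query_range_seconds"' | jq '.data.result[].values[][1]'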
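The same query_range call can be prepared from a script. This Python sketch only builds the request URL, with an explicit time window and result limit so the query stays bounded; the `query_range_url` helper and the one-hour window are assumptions for illustration.

```python
import time
from urllib.parse import parse_qs, urlencode, urlsplit


def query_range_url(base, query, start_ns, end_ns, limit=100):
    """Build a /loki/api/v1/query_range URL with a bounded time window."""
    params = {
        "query": query,
        "start": str(start_ns),   # nanoseconds since epoch
        "end": str(end_ns),
        "limit": str(limit),      # cap the number of returned entries
    }
    return f"{base}/loki/api/v1/query_range?{urlencode(params)}"


now_ns = time.time_ns()
one_hour_ns = 3600 * 10**9
url = query_range_url(
    "http://localhost:3100",
    '{job="loki"} |= "query_range_seconds"',
    now_ns - one_hour_ns,
    now_ns,
)
print(url)
```

`urlencode` takes care of escaping the braces, quotes, and pipe characters that LogQL uses, which is the role `--data-urlencode` plays in the curl command.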

Monitoring and Alerting

# Prometheus scrape config for Loki
scrape_configs:
  - job_name: 'loki'
    static_configs:
      - targets: ['loki-1:3100', 'loki-2:3100', 'loki-3:3100']

# Alerting rules (example)
groups:
  - name: loki-alerts
    interval: 30s
    rules:
      - alert: LokiRequestErrors
        expr: 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[5m])) by (job) /
              sum(rate(loki_request_duration_seconds_count[5m])) by (job) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Loki high error rate"
          description: "{{ $labels.job }} error rate exceeds 5%"
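      - alert: LokiIngesterUnhealthy
        # increase() reacts to new failures only; a bare counter > 0 would never resolve.
        # Confirm the metric name against your Loki version's /metrics output.
        expr: increase(loki_ingester_flush_failed_chunks_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Loki Ingester flush failures"
      - alert: LokiS3UploadFailure
        # Uses the S3 client request metric; label values may differ between Loki versions.
        expr: sum(rate(loki_s3_request_duration_seconds_count{operation="S3.PutObject",status_code!~"2.."}[5m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "S3 upload failure"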
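The LokiRequestErrors rule combines a ratio with a `for:` clause, which is easy to misread. The Python sketch below models the arithmetic under simplifying assumptions: the `error_rate_pct` and `should_fire` helpers are hypothetical, evaluations are assumed to happen exactly every 30 seconds, and Prometheus's pending/firing state machine is reduced to "the condition held at every evaluation across the 5-minute window".

```python
def error_rate_pct(errors_5xx_per_s: float, total_per_s: float) -> float:
    """100 * (5xx request rate) / (total request rate), as in the alert expr."""
    if total_per_s == 0:
        # PromQL would yield NaN here, which also fails the "> 5" comparison.
        return 0.0
    return 100 * errors_5xx_per_s / total_per_s


def should_fire(samples, threshold_pct=5.0, for_seconds=300, interval_seconds=30):
    """samples: error percentages, oldest first, one per rule evaluation.

    The alert fires once the condition has held from a first evaluation
    through one that is at least for_seconds later, i.e. 11 consecutive
    evaluations at a 30 s interval.
    """
    needed = for_seconds // interval_seconds + 1
    recent = samples[-needed:]
    return len(recent) == needed and all(s > threshold_pct for s in recent)


print(error_rate_pct(6, 100), should_fire([6.0] * 11))
```

A single evaluation below the threshold resets the window, which is what keeps short error spikes from paging anyone.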

Summary and Outlook

Loki + S3 achieves 80‑90 % cost reduction for PB‑scale log storage while preserving or improving query performance. Its label‑only indexing, decoupled compute/storage, and native multi‑tenant support make it a compelling alternative to traditional ELK stacks.

Cost Optimization: At the prices quoted above, S3 standard storage costs roughly 1/4 to 1/13 as much per GB as SSD cloud disks, and archive tiers cost less still.

Simplified Operations: No need to manage shards, replicas, or index rebuilds; object storage provides high availability.

Flexible Scaling: Compute and storage can be scaled independently.

Long-term Retention: Low-cost tiers enable years of log retention for compliance.

Multi-tenant Support: Native tenant isolation and quota management.

Best‑practice recommendations:

Design a low‑cardinality label schema.

Use S3 lifecycle policies for automated tiered storage.

Tune chunk size and retention periods to balance write performance and storage efficiency.

Deploy a dedicated Query Frontend for caching and rate‑limiting.

Implement comprehensive monitoring and alerting for ingestion, storage, and query health.
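On the first recommendation, the reason label cardinality matters is that the worst-case number of streams is the product of the distinct values of every label. A minimal Python sketch with made-up cardinalities (the label names and counts are assumptions; the 100 000 limit is the max_global_streams_per_user value from the configuration earlier in this article):

```python
from math import prod

MAX_GLOBAL_STREAMS_PER_USER = 100_000   # from limits_config in the article


def max_streams(label_cardinalities: dict) -> int:
    """Upper bound on stream count: every label combination is its own stream."""
    return prod(label_cardinalities.values())


# Hypothetical schemas: a handful of bounded labels vs. one unbounded label.
bounded = {"namespace": 5, "app": 40, "level": 4}
unbounded = dict(bounded, user_id=100_000)

for name, schema in (("bounded", bounded), ("with user_id", unbounded)):
    streams = max_streams(schema)
    verdict = "ok" if streams <= MAX_GLOBAL_STREAMS_PER_USER else "exceeds limit"
    print(f"{name}: up to {streams:,} streams ({verdict})")
```

High-cardinality values such as user IDs or request IDs are better left in the log line and filtered at query time.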

Future trends include AI‑driven log analysis, real‑time stream processing, eBPF‑based zero‑intrusion collection, unified observability (logs + metrics + traces), and edge‑node log processing.
