How Loki + S3 Cuts Log Storage Costs by Up to 90% at PB Scale
This article explains how the cloud‑native Loki logging system combined with S3 object storage can reduce PB‑level log storage expenses by 80‑90%, while simplifying operations, improving query performance, and meeting compliance requirements through detailed architecture, configuration, deployment, and real‑world case studies.
Introduction
In the cloud‑native era, log data grows exponentially; a medium‑size internet company can generate several terabytes of logs daily. Traditional ELK stacks become prohibitively expensive and complex at PB scale, while Grafana Loki, using label‑only indexing and object‑storage back‑ends, can lower storage and compute costs dramatically.
Technical Background
Problems with Traditional Log Solutions
Elasticsearch’s full‑text indexing inflates storage to 3‑5× raw size and requires large memory allocations (1/20‑1/30 of data size), leading to monthly costs of tens of thousands of dollars.
Loki Design Philosophy
Loki indexes only metadata (labels) and stores compressed log chunks in object storage, achieving three core benefits: drastically reduced storage cost (1.2‑1.5× raw size), lower operational complexity, and independent horizontal scaling of compute and storage.
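The label-only idea can be illustrated with a toy model. The sketch below is not Loki's implementation; the sample entries and the helper names `stream_key` and `query` are hypothetical, and it matches selectors by exact label set, whereas real LogQL selectors match any stream containing the given labels. It only shows that the "index" holds one entry per label set while the log text itself is stored unindexed and scanned at query time.

```python
from collections import defaultdict


def stream_key(labels: dict[str, str]) -> str:
    """Canonical, sorted label string that identifies one log stream."""
    return "{" + ", ".join(f'{k}="{v}"' for k, v in sorted(labels.items())) + "}"


# Hypothetical sample entries: (labels, log line).
entries = [
    ({"app": "api", "env": "prod"}, "GET /users 200"),
    ({"app": "api", "env": "prod"}, "GET /orders 500 error: timeout"),
    ({"app": "web", "env": "prod"}, "GET / 200"),
]

# One "chunk" (here: a plain list) per stream. Only the stream key is indexed;
# the words inside each line are never tokenized.
chunks: dict[str, list[str]] = defaultdict(list)
for labels, line in entries:
    chunks[stream_key(labels)].append(line)


def query(selector: dict[str, str], needle: str) -> list[str]:
    """Select a stream by its labels, then brute-force scan its lines."""
    return [line for line in chunks.get(stream_key(selector), []) if needle in line]


index_size = len(chunks)  # number of streams, independent of line count
matches = query({"app": "api", "env": "prod"}, "error")
print(index_size, matches)
```

The index grows with the number of distinct label sets, not with log volume, which is why the article later recommends a low-cardinality label schema.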
S3 Object Storage Advantages
Amazon S3 and compatible services (MinIO, OSS, COS) offer low‑cost, highly durable storage. Standard storage costs ~0.15‑0.25 CNY/GB/month, while SSD cloud disks cost 1‑2 CNY/GB/month. Tiered storage further reduces costs for hot, warm, and cold data.
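As a rough check, the expansion factors and unit prices quoted above can be combined into a storage-only estimate. The 100 TB volume and the function name `monthly_cost_cny` are assumptions for illustration; compute, request, and transfer charges are left out.

```python
def monthly_cost_cny(raw_tb: float, expansion: float, price_per_gb: float) -> float:
    """Monthly storage cost: raw volume x on-disk expansion x unit price."""
    return raw_tb * 1000 * expansion * price_per_gb  # 1 TB taken as 1000 GB


RAW_TB = 100  # hypothetical retained raw log volume

# Elasticsearch on SSD cloud disks: 3-5x expansion at 1-2 CNY/GB/month.
es_low = monthly_cost_cny(RAW_TB, 3.0, 1.0)
es_high = monthly_cost_cny(RAW_TB, 5.0, 2.0)

# Loki on standard object storage: 1.2-1.5x expansion at 0.15-0.25 CNY/GB/month.
loki_low = monthly_cost_cny(RAW_TB, 1.2, 0.15)
loki_high = monthly_cost_cny(RAW_TB, 1.5, 0.25)

print(f"Elasticsearch + SSD: {es_low:,.0f} to {es_high:,.0f} CNY/month")
print(f"Loki + S3:           {loki_low:,.0f} to {loki_high:,.0f} CNY/month")
print(f"Storage saving (low vs low):   {(1 - loki_low / es_low) * 100:.0f}%")
print(f"Storage saving (high vs high): {(1 - loki_high / es_high) * 100:.0f}%")
```

Storage alone comes out above 90 % under these inputs; the end-to-end savings reported in the case studies are lower because they also include compute and cache nodes.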
Core Content
Loki Architecture
Loki follows a micro‑service model with four main components:
Distributor: receives log streams, validates labels, and balances load across Ingester instances.
Ingester: builds compressed chunks, maintains in‑memory indexes, uploads chunks to object storage, and writes index records.
Querier: handles LogQL queries by looking up matching chunk references in the index, downloading and decompressing those chunks, then filtering and aggregating the lines.
Query Frontend: provides query caching, request splitting, rate‑limiting, and tenant isolation.
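The Distributor's load-balancing step can be sketched as hashing each stream to a set of Ingesters. This is a simplification under stated assumptions: Loki uses a token-based hash ring, while the sketch below uses a plain modulo over a hypothetical static list `INGESTERS`, and the function name `pick_ingesters` is illustrative.

```python
import hashlib

# Hypothetical, static ingester list; a real ring changes as instances join and leave.
INGESTERS = ["ingester-0", "ingester-1", "ingester-2", "ingester-3"]


def pick_ingesters(tenant: str, stream_labels: str, replication_factor: int = 3) -> list[str]:
    """Map a (tenant, stream) pair to `replication_factor` distinct ingesters."""
    digest = hashlib.sha256(f"{tenant}/{stream_labels}".encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(INGESTERS)
    # Walk the list from the hashed position, wrapping around at the end.
    return [INGESTERS[(start + i) % len(INGESTERS)] for i in range(replication_factor)]


targets = pick_ingesters("tenant-001", '{app="api", env="prod"}')
print(targets)
```

Because the hash depends only on the tenant and the label set, every line of a given stream lands on the same replicas, which is what lets an Ingester build contiguous chunks per stream.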
Loki+S3 Core Configuration
# loki-config.yaml
auth_enabled: false
server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info
common:
  path_prefix: /loki
  storage:
    s3:
      endpoint: s3.amazonaws.com
      bucketnames: loki-logs-prod
      region: us-east-1
      access_key_id: ${S3_ACCESS_KEY}
      secret_access_key: ${S3_SECRET_KEY}
      s3forcepathstyle: false
      insecure: false
  replication_factor: 3
  ring:
    kvstore:
      store: memberlist
memberlist:
  join_members:
    - loki-1:7946
    - loki-2:7946
    - loki-3:7946
schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v12
      index:
        prefix: loki_index_
        period: 24h
storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
    shared_store: s3
  aws:
    s3: s3://loki-logs-prod
    sse_encryption: true
ingester:
  chunk_idle_period: 30m
  chunk_block_size: 262144
  chunk_target_size: 1572864
  chunk_retain_period: 5m
  max_transfer_retries: 0
  lifecycler:
    ring:
      replication_factor: 3
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 50
  ingestion_burst_size_mb: 100
  max_query_series: 10000
  max_query_parallelism: 32
  max_streams_per_user: 0
  max_global_streams_per_user: 100000
  max_query_lookback: 720h
chunk_store_config:
  max_look_back_period: 720h
  chunk_cache_config:
    enable_fifocache: true
    fifocache:
      max_size_bytes: 2GB
      ttl: 24h
query_range:
  align_queries_with_step: true
  cache_results: true
  max_retries: 5
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        max_size_bytes: 1GB
        ttl: 24h
compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
table_manager:
  retention_deletes_enabled: true
  retention_period: 2160h
Docker‑Compose Deployment
# docker-compose.yml
version: '3.8'
services:
  loki:
    image: grafana/loki:2.9.3
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/loki-config.yaml
      - loki-data:/loki
    # -config.expand-env=true is required for ${S3_ACCESS_KEY}-style references in the config file
    command: -config.file=/etc/loki/loki-config.yaml -config.expand-env=true
    environment:
      - S3_ACCESS_KEY=${S3_ACCESS_KEY}
      - S3_SECRET_KEY=${S3_SECRET_KEY}
    networks:
      - loki
    restart: unless-stopped
  promtail:
    image: grafana/promtail:2.9.3
    container_name: promtail
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      # Needed by docker_sd_configs in promtail-config.yaml
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./promtail-config.yaml:/etc/promtail/promtail-config.yaml
    # -config.expand-env=true is required for ${HOSTNAME} in the config file
    command: -config.file=/etc/promtail/promtail-config.yaml -config.expand-env=true
    networks:
      - loki
    restart: unless-stopped
  grafana:
    image: grafana/grafana:10.2.0
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    networks:
      - loki
    restart: unless-stopped
volumes:
  loki-data:
  grafana-data:
networks:
  loki:
    driver: bridge
Promtail Collection Configuration
# promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
    batchwait: 1s
    batchsize: 1048576
    timeout: 10s
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          host: ${HOSTNAME}
          __path__: /var/log/*log
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: stream
      - source_labels: ['__meta_docker_container_label_com_docker_compose_project']
        target_label: project
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: service
  - job_name: nginx
    static_configs:
      - targets: [localhost]
        labels:
          job: nginx
          host: ${HOSTNAME}
          __path__: /var/log/nginx/*.log
    pipeline_stages:
      - regex:
          expression: '^(?P<remote_addr>[\w\.]+) - (?P<remote_user>[^ ]*) \[(?P<time_local>.*)\] "(?P<method>[^ ]*) (?P<request>[^ ]*) (?P<protocol>[^ ]*)" (?P<status>[\d]+) (?P<body_bytes_sent>[\d]+) "(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
      - labels:
          method:
          status:
      - metrics:
          nginx_request_total:
            type: Counter
            description: "Total nginx requests"
            source: status
            config:
              action: inc
S3 Lifecycle Policy Configuration
# Create S3 lifecycle policy JSON
# "fake/" is the tenant prefix Loki uses for chunks when auth_enabled is false.
# Objects moved to DEEP_ARCHIVE must be restored before Loki can read them again.
cat > s3-lifecycle-policy.json <<'EOF'
{
  "Rules": [
    {
      "Id": "LokiHotDataRule",
      "Status": "Enabled",
      "Filter": {"Prefix": "fake/"},
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER_IR"},
        {"Days": 180, "StorageClass": "DEEP_ARCHIVE"}
      ],
      "Expiration": {"Days": 730}
    },
    {
      "Id": "LokiIndexRule",
      "Status": "Enabled",
      "Filter": {"Prefix": "index/"},
      "Transitions": [{"Days": 90, "StorageClass": "STANDARD_IA"}],
      "Expiration": {"Days": 730}
    }
  ]
}
EOF
# Apply lifecycle policy
aws s3api put-bucket-lifecycle-configuration \
  --bucket loki-logs-prod \
  --lifecycle-configuration file://s3-lifecycle-policy.json
# Enable server‑side encryption
aws s3api put-bucket-encryption \
  --bucket loki-logs-prod \
  --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
# Enable versioning (optional)
aws s3api put-bucket-versioning \
  --bucket loki-logs-prod \
  --versioning-configuration Status=Enabled
MinIO as S3‑Compatible Storage
# Deploy a single‑node MinIO instance quickly
docker run -d \
  --name minio \
  -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=admin \
  -e MINIO_ROOT_PASSWORD=Admin@123456 \
  -v /data/minio:/data \
  minio/minio server /data --console-address ":9001"
# Create a bucket for Loki logs
docker exec minio mc alias set local http://localhost:9000 admin Admin@123456
docker exec minio mc mb local/loki-logs
# Optional: Loki authenticates with access keys and does not need this; it makes the bucket publicly readable
docker exec minio mc anonymous set download local/loki-logs
# Add a lifecycle rule to MinIO (the "WARM" tier must already be configured as a remote tier)
docker exec minio mc ilm add local/loki-logs \
  --transition-days 30 \
  --transition-tier "WARM" \
  --expiry-days 730
Practical Cases
Case 1: E‑commerce Platform Log Architecture Upgrade
The platform generated 15 TB of raw logs daily, requiring a 24‑node Elasticsearch cluster (32 CPU × 128 GB RAM per node, 10 TB SSD each) and incurring ~180 000 CNY/month. After migrating to a 3‑node Loki cluster (8 CPU × 32 GB, 500 GB SSD cache) with S3 storage, monthly cost dropped to ~28 000 CNY, a saving of 84 %.
Node count reduced by 87.5 % (24 to 3); total CPU cores fell from 768 to 24 (≈96.9 %).
Hot SSD storage reduced from 240 TB to 1.5 TB (99.4 % saving).
Monthly cost reduced from 180 k CNY to 28 k CNY.
Query P95 latency improved from 15‑30 s to 2‑5 s (≈83 % faster).
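The percentages in this case follow directly from the before/after figures. A short check, using only numbers quoted above (the helper name `pct_reduction` is illustrative):

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


nodes = pct_reduction(24, 3)                # node count
cores = pct_reduction(24 * 32, 3 * 8)       # total CPU cores
ssd_tb = pct_reduction(24 * 10, 3 * 0.5)    # hot SSD capacity in TB
cost = pct_reduction(180_000, 28_000)       # monthly cost in CNY

for name, value in [("nodes", nodes), ("cores", cores), ("ssd", ssd_tb), ("cost", cost)]:
    print(f"{name}: {value:.1f}% lower")
```

The 87.5 % figure corresponds to the node count; measured in CPU cores the drop is closer to 97 %.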
Case 2: Financial Industry Compliance Log Storage
A city commercial bank must retain 5 years of audit logs (~500 TB/year, about 2.5 PB at steady state). Loki is configured with a 5‑year retention period, S3 cross‑region replication, server‑side encryption, and object lock for immutability. Using deep‑archive storage (≈0.03 CNY/GB/month), the monthly cost at full retention is ~75 000 CNY, versus >1 000 000 CNY for an equivalent Elasticsearch solution, a reduction of at least 92.5 %.
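The monthly figure can be reproduced from the stated volume and unit price. The sketch assumes all retained data sits in the deep-archive tier and that 1 TB is 1000 GB; the year-by-year ramp is an illustration, not a number from the case.

```python
TB_PER_YEAR = 500
RETENTION_YEARS = 5
PRICE_PER_GB = 0.03          # CNY/GB/month, deep-archive tier
ES_MONTHLY_CNY = 1_000_000   # lower bound quoted for the Elasticsearch alternative

# Storage bill grows each year until the retention window is full.
monthly_by_year = [
    round(year * TB_PER_YEAR * 1000 * PRICE_PER_GB)
    for year in range(1, RETENTION_YEARS + 1)
]
steady_state = monthly_by_year[-1]
reduction = round((ES_MONTHLY_CNY - steady_state) / ES_MONTHLY_CNY * 100, 1)

print(monthly_by_year)
print(f"steady state: {steady_state} CNY/month, {reduction}% below the Elasticsearch estimate")
```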
# Loki compliance config snippet
compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 2h
  retention_enabled: true
  retention_delete_delay: 24h
  retention_delete_worker_count: 50
limits_config:
  retention_period: 1825d # 5 years
  max_query_lookback: 1825d
# S3 cross‑region replication JSON (example), saved as s3-replication-config.json
{
  "Role": "arn:aws:iam::ACCOUNT-ID:role/s3-replication-role",
  "Rules": [{
    "Status": "Enabled",
    "Priority": 1,
    "Filter": {"Prefix": ""},
    "Destination": {"Bucket": "arn:aws:s3:::loki-logs-backup-region"},
    "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
    "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}},
    "DeleteMarkerReplication": {"Status": "Enabled"}
  }]
}
aws s3api put-bucket-replication \
  --bucket loki-logs-prod \
  --replication-configuration file://s3-replication-config.json
Case 3: Multi‑tenant SaaS Platform Log Isolation
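Loki runs with auth_enabled: true, which turns on multi‑tenancy: every request must carry a tenant ID in the X‑Scope‑OrgID header. Loki does not authenticate callers itself, so an Nginx reverse proxy in front of it derives the tenant ID from the request and injects the header. Separate ingestion and query limits are then defined for each tenant.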
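To make the header's role concrete, the sketch below builds the kind of push request that reaches Loki once a tenant ID has been attached. The host name, tenant ID, labels, and the function name `build_push_request` are assumptions for illustration; the request is only constructed, not sent, so the snippet runs offline.

```python
import json
import time
import urllib.request


def build_push_request(base_url: str, tenant_id: str,
                       labels: dict[str, str], lines: list[str]) -> urllib.request.Request:
    """Build a POST to Loki's push endpoint scoped to one tenant."""
    now_ns = str(time.time_ns())  # Loki expects nanosecond timestamps as strings
    payload = {
        "streams": [
            {"stream": labels, "values": [[now_ns, line] for line in lines]}
        ]
    }
    return urllib.request.Request(
        url=f"{base_url}/loki/api/v1/push",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Scope-OrgID": tenant_id,  # selects the tenant and its limits
        },
        method="POST",
    )


req = build_push_request(
    "http://loki.example.com",            # hypothetical endpoint
    "tenant-001",
    {"app": "billing", "env": "prod"},
    ["invoice created", "payment failed"],
)
# urllib.request.urlopen(req) would send it; left out so nothing leaves the machine.
print(req.get_method(), req.full_url)
```

Two requests that differ only in the header value are stored and rate-limited separately, which is why the proxy, rather than the client, should be the component that sets it.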
# Loki multi‑tenant config
auth_enabled: true
server:
  http_listen_port: 3100
limits_config:
  split_queries_by_interval: 24h
  max_query_parallelism: 32
  max_streams_per_user: 10000
  ingestion_rate_strategy: global
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
# Per‑tenant overrides are loaded from a separate runtime config file (example path)
runtime_config:
  file: /etc/loki/runtime-config.yaml
# runtime-config.yaml
overrides:
  "tenant-001":
    ingestion_rate_mb: 50
    ingestion_burst_size_mb: 100
    max_query_parallelism: 64
  "tenant-002":
    ingestion_rate_mb: 20
    ingestion_burst_size_mb: 40
  "tenant-vip":
    ingestion_rate_mb: 200
    ingestion_burst_size_mb: 400
    max_query_parallelism: 128
# Nginx reverse‑proxy snippet
# The map, limit_req_zone, and upstream blocks belong in the http context
# Tokens are assumed to look like "tenant-<id>-<secret>"; the capture keeps the
# "tenant-" prefix so the tenant ID matches the override keys (e.g. "tenant-001")
map $http_authorization $tenant_id {
    default "anonymous";
    "~^Bearer (?<tid>tenant-[^-]+)-.+" $tid;
}
limit_req_zone $tenant_id zone=tenant_limit:10m rate=100r/s;
limit_req_zone $tenant_id zone=query_limit:10m rate=10r/s;
upstream loki_backend {
    least_conn;
    server loki-1:3100 max_fails=3 fail_timeout=30s;
    server loki-2:3100 max_fails=3 fail_timeout=30s;
    server loki-3:3100 max_fails=3 fail_timeout=30s;
}
server {
    listen 80;
    server_name loki.example.com;
    location /loki/api/v1/push {
        proxy_pass http://loki_backend;
        proxy_set_header X-Scope-OrgID $tenant_id;
        limit_req zone=tenant_limit burst=20 nodelay;
        client_max_body_size 10M;
        proxy_read_timeout 300s;
        proxy_connect_timeout 60s;
    }
    location /loki/api/v1/query_range {
        proxy_pass http://loki_backend;
        proxy_set_header X-Scope-OrgID $tenant_id;
        limit_req zone=query_limit burst=10 nodelay;
        proxy_read_timeout 600s;
    }
}
# LogQL best‑practice examples
{namespace="production", app="api-gateway"} |= "error" | json | status >= 500
{job="varlogs"} |= "error" # narrow by label first, then filter lines, to avoid a full scan
count_over_time({app="nginx"}[5m]) # a range selector needs a wrapping function; keep the window short
sum by (status) (rate({app="api"}[5m]))
{app="nginx"} | json | status >= 400 | line_format "{{.method}} {{.path}} {{.status}}"
curl -G -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={job="loki"} |= "query_range_seconds"' | jq '.data.result[].values[][1]'
Monitoring and Alerting
# Prometheus scrape config for Loki
scrape_configs:
  - job_name: 'loki'
    static_configs:
      - targets: ['loki-1:3100', 'loki-2:3100', 'loki-3:3100']
# Alerting rules (example); metric and label names vary by Loki version, so verify them against /metrics
groups:
  - name: loki-alerts
    interval: 30s
    rules:
      - alert: LokiRequestErrors
        expr: |
          100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[5m])) by (job)
            /
          sum(rate(loki_request_duration_seconds_count[5m])) by (job) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Loki high error rate"
          description: "{{ $labels.job }} error rate exceeds 5%"
      - alert: LokiIngesterUnhealthy
        # Alert on recent growth of the counter, not on its lifetime total
        expr: increase(loki_ingester_flush_failed_chunks_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Loki Ingester flush failures"
      - alert: LokiS3UploadFailure
        expr: sum(rate(loki_s3_request_duration_seconds_count{operation="S3.PutObject",status_code!~"2.."}[5m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "S3 upload failure"
Summary and Outlook
In the cases above, Loki + S3 cut log storage costs by roughly 84‑92 % at PB scale while preserving or improving query performance. Its label‑only indexing, decoupled compute/storage, and native multi‑tenant support make it a compelling alternative to traditional ELK stacks.
Cost Optimization: At the prices cited earlier, standard object storage costs roughly 1/4 to 1/13 as much per GB as SSD cloud disks, and archive tiers cost less still.
Simplified Operations: No need to manage shards, replicas, or index rebuilds; object storage provides high availability.
Flexible Scaling: Compute and storage can be scaled independently.
Long‑term Retention: Low‑cost tiers enable years of log retention for compliance.
Multi‑tenant Support: Native tenant isolation and quota management.
Best‑practice recommendations:
Design a low‑cardinality label schema.
Use S3 lifecycle policies for automated tiered storage.
Tune chunk size and retention periods to balance write performance and storage efficiency.
Deploy a dedicated Query Frontend for caching and rate‑limiting.
Implement comprehensive monitoring and alerting for ingestion, storage, and query health.
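The first recommendation can be quantified: the worst-case number of streams is the product of the distinct values of every label. The label names and cardinalities below are hypothetical; the limit of 100 000 is the max_global_streams_per_user value from the configuration earlier in the article.

```python
from math import prod

MAX_GLOBAL_STREAMS_PER_USER = 100_000  # from limits_config in loki-config.yaml


def max_streams(cardinalities: dict[str, int]) -> int:
    """Upper bound on stream count: product of distinct values per label."""
    return prod(cardinalities.values())


# Hypothetical label schemas.
low_cardinality = {"env": 3, "app": 50, "level": 5}
with_user_id = {**low_cardinality, "user_id": 100_000}

for name, schema in [("low-cardinality", low_cardinality), ("with user_id", with_user_id)]:
    streams = max_streams(schema)
    verdict = "within" if streams <= MAX_GLOBAL_STREAMS_PER_USER else "over"
    print(f"{name}: up to {streams:,} streams ({verdict} the limit)")
```

High-cardinality values such as user or request IDs are better kept in the log line and filtered at query time than promoted to labels.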
Future trends include AI‑driven log analysis, real‑time stream processing, eBPF‑based zero‑intrusion collection, unified observability (logs + metrics + traces), and edge‑node log processing.
MaGe Linux Operations
