High‑Availability Log Platform for Ten‑Million‑Level Concurrency
This article presents a comprehensive, production‑grade design for a log collection and analysis pipeline that can reliably handle ten‑million‑level concurrent writes by combining Filebeat, Redis, Logstash, and Elasticsearch with careful buffering, scaling, and observability strategies.
Problem Overview
In high‑traffic production environments, the simple pipeline
application → Filebeat → Logstash → Elasticsearch → Kibana
often collapses. The failure points are:
Application‑side synchronous writes block request threads.
Filebeat output slows down, memory queues overflow and log files grow.
Logstash filter chains (Grok, regex, JSON decode, DNS, Ruby) consume CPU and become a bottleneck.
Elasticsearch write jitter (excessive shards, frequent refresh, heavy merge, disk I/O saturation, mapping explosion, hot‑index concentration) propagates back‑pressure to Logstash and Filebeat.
Missing or inconsistent trace/context fields make root‑cause analysis impossible.
The resulting phenomenon is called a "log black hole": logs are silently dropped or delayed when the system is under stress.
Architecture Overview
The production‑grade stack that decouples collection from processing and provides peak‑shaving is:
Application / Nginx: writes local JSON logs.
Filebeat: lightweight sidecar/DaemonSet collector.
Redis: fast in‑memory buffer between Filebeat and Logstash.
Logstash: parsing, enrichment, routing, and noise reduction.
Elasticsearch: storage and search.
Kibana: visualization and dashboards.
Prometheus + Grafana: metrics, monitoring, and alerts.
Why Insert Redis Between Filebeat and Logstash?
Without a buffer the chain Filebeat → Logstash → Elasticsearch amplifies any downstream slowdown back to the collector. Adding Redis changes the flow to Filebeat → Redis → Logstash → Elasticsearch and provides four benefits:
Decouples collection from processing.
Absorbs short‑term spikes from downstream jitter.
Enables horizontal scaling of Logstash consumers.
Limits fault propagation from Elasticsearch to the application.
Redis vs. Kafka
Both can act as a buffer, but they differ in operational complexity and capabilities.
Redis List: simple, high‑performance, easy to adopt; lacks consumer groups, acknowledgments, and long‑term replay.
Kafka: supports consumer groups, offsets, and replay; requires more deployment and operational effort.
For medium‑scale systems where quick rollout and low operational overhead are priorities, Redis List is sufficient. Kafka becomes attractive for multi‑tenant, long‑term retention, or strong replay requirements.
Filebeat Production Configuration
Key configuration points:
Use filestream input for reliable rotation handling.
Set max_retries: -1 for infinite retry.
Adjust bulk_max_size based on average log size and network latency.
Limit harvester_limit to avoid resource exhaustion.
filebeat.inputs:
  - type: filestream
    id: nginx-access
    enabled: true
    paths:
      - /var/log/nginx/access_json.log
    parsers:
      - ndjson:
          overwrite_keys: true
          add_error_key: true
          expand_keys: true
    fields:
      log_source: nginx
      cluster: prod-gateway
    fields_under_root: true
    ignore_older: 24h
    close.on_state_change.inactive: 10m
    prospector.scanner.check_interval: 10s
    clean_inactive: 25h
    harvester_limit: 256
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_fields:
      target: ''
      fields:
        pipeline_version: v1
queue.mem:
  events: 8192
  flush.min_events: 2048
  flush.timeout: 1s
output.redis:
  hosts: ["10.0.1.11:6379", "10.0.1.12:6379", "10.0.1.13:6379"]
  key: "logqueue:nginx:access"
  data_type: list
  loadbalance: true
  worker: 4
  timeout: 5s
  backoff.init: 1s
  backoff.max: 30s
  max_retries: -1
  bulk_max_size: 1024
  compression_level: 3
  password: "${REDIS_PASSWORD}"
logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644
Redis Buffer Design
Redis acts only as a fast, observable backlog. It should not be used for long‑term storage or complex queries.
Capacity Estimation
Example: 50 000 QPS at 1.2 KB per request ≈ 60 MB/s. To tolerate 5 minutes of downstream slowdown: 60 MB/s × 300 s = 18 GB of buffer.
Recommended configuration (single master with Sentinel or Cluster):
bind 0.0.0.0
protected-mode yes
port 6379
requirepass yourStrongPassword
masterauth yourStrongPassword
appendonly yes
appendfsync everysec
tcp-keepalive 60
timeout 0
maxmemory 16gb
maxmemory-policy noeviction
repl-backlog-size 512mb
client-output-buffer-limit normal 0 0 0
Important settings:
appendonly yes – provides AOF persistence to reduce data loss.
maxmemory-policy noeviction – prevents silent log dropping; raise an alert instead.
Logstash Production Design
Enable a persisted local queue to survive temporary Elasticsearch outages.
node.name: logstash-prod-01
pipeline.workers: 8
pipeline.batch.size: 1250
pipeline.batch.delay: 50
queue.type: persisted
queue.max_bytes: 16gb
queue.checkpoint.writes: 1024
dead_letter_queue.enable: true
path.dead_letter_queue: /var/lib/logstash/dead_letter_queue
xpack.monitoring.enabled: true
Sample pipeline (JSON input from Redis, enrichment, and Elasticsearch output):
input {
redis {
host => "10.0.1.11"
port => 6379
password => "${REDIS_PASSWORD}"
key => "logqueue:nginx:access"
data_type => "list"
threads => 4
batch_count => 500
codec => json
}
}
filter {
if ![@timestamp] {
mutate { add_field => { "ingest_warning" => "missing_timestamp" } }
}
date { match => ["@timestamp", "ISO8601"] timezone => "Asia/Shanghai" }
mutate { rename => { "request_time" => "[event][duration_seconds]" } }
ruby {
code => 'v = event.get("[event][duration_seconds]"); event.set("[event][duration]", (v.to_f * 1_000_000_000).to_i) if v'
}
geoip { source => "[client][ip]" target => "[client][geo]" database => "/usr/share/logstash/GeoLite2-City.mmdb" }
useragent { source => "[user_agent][original]" target => "[user_agent][parsed]" }
mutate { remove_field => ["[event][duration_seconds]", "@version", "agent", "ecs", "[host][architecture]"] }
}
output {
elasticsearch {
hosts => ["http://10.0.2.11:9200", "http://10.0.2.12:9200", "http://10.0.2.13:9200"]
index => "logs-nginx-access-%{+yyyy.MM.dd}"
user => "${ES_USER}"
password => "${ES_PASSWORD}"
ilm_enabled => true
ilm_rollover_alias => "logs-nginx-access"
ilm_policy => "logs_hot_warm_delete_policy"
manage_template => false
retry_initial_interval => 2
retry_max_interval => 64
pool_max => 2000
pool_max_per_route => 200
}
if "_geoip_lookup_failure" in [tags] or "_jsonparsefailure" in [tags] {
file { path => "/var/log/logstash/dlq-fallback.log" codec => json_lines }
}
}
Performance tips:
Prefer JSON at the source (see the sample log line after this list); avoid heavy Grok/regex in Logstash.
Trim unnecessary fields early.
Keep GeoIP and user‑agent enrichment in Logstash only for entry logs.
Scale Logstash horizontally before adding more CPU‑intensive filters.
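To make the "prefer JSON at the source" tip concrete, here is a hypothetical Nginx access‑log line for /var/log/nginx/access_json.log; the field names follow the pipeline and index template in this article, and the values are purely illustrative:
{"@timestamp":"2024-06-18T12:00:01+08:00","trace_id":"b9c3f2a1d4e5","client.ip":"203.0.113.42","http.request.method":"GET","http.response.status_code":200,"url.path":"/api/orders","user_agent.original":"Mozilla/5.0","request_time":"0.412","message":"GET /api/orders HTTP/1.1 200"}
Because the Filebeat ndjson parser above sets expand_keys: true, the dotted keys are expanded into the nested client, http, and url objects that the Logstash filter and the index template expect.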
Elasticsearch Index Template & ILM
Explicit mappings for the key fields prevent type conflicts and help contain mapping explosion.
{
"index_patterns": ["logs-nginx-access-*"],
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"refresh_interval": "10s",
"codec": "best_compression"
},
"mappings": {
"dynamic": true,
"properties": {
"@timestamp": { "type": "date" },
"trace_id": { "type": "keyword" },
"env": { "type": "keyword" },
"region": { "type": "keyword" },
"message": { "type": "text" },
"service": { "properties": { "name": { "type": "keyword" }, "instance": { "properties": { "id": { "type": "keyword" } } } },
"client": { "properties": { "ip": { "type": "ip" }, "geo": { "properties": { "location": { "type": "geo_point" } } } },
"http": { "properties": { "request": { "properties": { "method": { "type": "keyword" } } }, "response": { "properties": { "status_code": { "type": "integer" } } } },
"event": { "properties": { "duration": { "type": "long" } } },
"url": { "properties": { "path": { "type": "keyword" } } }
}
}
}
}
ILM policy separates hot, warm, and delete phases to control storage cost.
{
"policy": {
"phases": {
"hot": { "actions": { "rollover": { "max_primary_shard_size": "30gb", "max_age": "1d" } } },
"warm": { "min_age": "3d", "actions": { "allocate": { "require": { "data": "warm" } }, "forcemerge": { "max_num_segments": 1 } } },
"delete": { "min_age": "30d", "actions": { "delete": {} } }
}
}
}
Performance, Scaling & Capacity Planning
Four dimensions guide capacity planning:
Write rate (TPS and average log size).
Peak‑to‑baseline ratio.
Acceptable downstream jitter duration.
Hot‑data query frequency.
Formulas (rough estimates):
Redis buffer size = peak log rate × tolerated seconds × safety factor.
Logstash throughput: total capacity (nodes × per‑node TPS) should exceed peak TPS × 1.2.
Elasticsearch bulk write rate should stay below 70 % of the cluster’s stable write capacity.
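Worked example (using the flash‑sale numbers from the case study below, plus two stated assumptions): at 68 000 QPS and 1.5 KB per log (≈102 MB/s), tolerating 5 minutes of downstream jitter with a 1.5× safety factor needs roughly 102 MB/s × 300 s × 1.5 ≈ 46 GB of Redis buffer; if each Logstash node sustains about 20 000 events/s (an assumed figure; benchmark your own pipeline), you need 68 000 × 1.2 ÷ 20 000 ≈ 4.1, i.e. at least 5 nodes.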
Noise Reduction & Sampling
Retention policy by log level:
ERROR: keep all.
WARN: keep high‑value services fully, sample low‑value services.
INFO: keep only audit/transaction logs.
DEBUG: discard by default; enable temporarily when needed.
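A minimal Logstash sketch of the WARN sampling rule follows. It assumes events carry [log][level] and [service][name] fields and that keeping roughly 10 % of WARN logs from low‑value services is acceptable; the field names, service list, and percentage are placeholders to adapt (the drop filter's percentage option discards that share of matching events):
filter {
  if [log][level] == "WARN" and [service][name] in ["cms", "static-site"] {
    # Keep a ~10% sample by dropping ~90% of matching events.
    drop { percentage => 90 }
  }
}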
Drop health‑check, metrics, and static‑resource logs directly in Filebeat or Logstash:
filter {
if [url][path] =~ "^/health$" or [url][path] =~ "^/metrics$" {
drop { }
}
}
Monitoring & Alerting
Key metrics to monitor:
Filebeat: number of watched files, output failures, publish latency, registry updates, in‑memory queue depth.
Redis: used_memory, used_memory_peak, connected_clients, instantaneous_ops_per_sec, master_link_status, backlog length, command latency.
Logstash: pipeline events in/out, persisted‑queue size, filter latency, JVM heap usage, dead‑letter‑queue growth.
Elasticsearch: bulk write latency, rejected count, indexing latency, merge pressure, shard count, disk usage.
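Filebeat can expose these runtime counters (queue depth, output failures, publish rates) over a local HTTP endpoint; a minimal filebeat.yml fragment is sketched below, assuming the default stats port, after which the figures are available under /stats and can be scraped by a Prometheus exporter or collected via Elastic's own monitoring:
# filebeat.yml – local stats endpoint (default port assumed)
http.enabled: true
http.host: localhost
http.port: 5066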
Suggested alert thresholds (example):
Redis backlog grows > 1 minute of traffic → warning.
Redis backlog > 5 minutes of traffic → critical.
Logstash persisted‑queue usage > 70 % → warning.
Elasticsearch bulk rejections exceed threshold in 5 minutes → critical.
Log collection latency > 2 minutes → warning.
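A hedged Prometheus rule sketch for the two Redis backlog alerts, assuming redis_exporter runs with --check-single-keys=logqueue:nginx:access so the list length is exported as redis_key_size, and assuming steady traffic of about 50 000 logs/s (so one minute of backlog is roughly 3 million entries); adjust the thresholds to your own rates:
groups:
  - name: log-pipeline
    rules:
      - alert: RedisLogBacklogWarning
        # Backlog above ~1 minute of traffic (assumed 50k logs/s).
        expr: redis_key_size{key="logqueue:nginx:access"} > 3000000
        for: 2m
        labels:
          severity: warning
      - alert: RedisLogBacklogCritical
        # Backlog above ~5 minutes of traffic.
        expr: redis_key_size{key="logqueue:nginx:access"} > 15000000
        for: 2m
        labels:
          severity: critical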
Real‑World Case Study: E‑Commerce Flash Sale
During a promotion the gateway reached 68,000 QPS at 1.5 KB per log (≈102 MB/s). The original pipeline suffered Elasticsearch write jitter, Logstash blockage, and Filebeat backlog, causing query delays of more than 15 minutes.
Remediation steps:
Add Redis buffer.
Enable Logstash persistent queue.
Separate access and error logs into different Redis keys/pipelines.
Drop health‑check and static‑resource logs.
Split indices by business line and enable ILM rollover.
Run GeoIP and user‑agent parsing only on entry logs.
Create alerts linking Redis backlog growth to ES rejections.
Results:
Redis absorbed a 6‑minute spike without data loss.
Logstash horizontal scaling reduced recovery time from 18 min to 4 min.
End‑to‑end log visibility in Kibana stabilized at 1‑2 minutes.
Elasticsearch write rejections dropped dramatically.
Low‑value log volume decreased ~37 % and storage cost fell ~28 %.
Evolution Path
Upgrade Redis to Kafka when long‑term retention, multi‑consumer groups, or replay become required.
Replace Filebeat with OpenTelemetry Collector to unify logs, metrics, and traces.
Adopt a dual‑storage model: Elasticsearch for real‑time search, ClickHouse for large‑scale analytics.
Deploy all components as native Kubernetes workloads (DaemonSet for Filebeat, Deployment with HPA for Logstash, Redis Sentinel or managed Redis, Elasticsearch Operator).
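For the Kubernetes step, a trimmed Filebeat DaemonSet sketch is shown below; the namespace, image tag, and mounted paths are assumptions to adapt, and the referenced ConfigMap is expected to contain a filebeat.yml similar to the one earlier in this article (RBAC, the registry hostPath, and resource tuning are omitted for brevity):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: logging                      # assumed namespace
spec:
  selector:
    matchLabels:
      app: filebeat
  template:
    metadata:
      labels:
        app: filebeat
    spec:
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:8.13.4   # assumed version
          args: ["-c", "/etc/filebeat.yml", "-e"]
          volumeMounts:
            - name: config
              mountPath: /etc/filebeat.yml
              subPath: filebeat.yml
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: filebeat-config          # assumed ConfigMap holding filebeat.yml
        - name: varlog
          hostPath:
            path: /var/log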
Conclusion
True production‑grade logging is not just a collection of components; it is an engineered data pipeline that guarantees stability, scalability, governance, and observability even under extreme traffic, downstream failures, and cost constraints.