High‑Availability Log Platform for Ten‑Million‑Level Concurrency
This article presents a comprehensive, production‑grade design for a log collection and analysis pipeline that can reliably handle ten‑million‑level concurrent writes by combining Filebeat, Redis, Logstash, and Elasticsearch with careful buffering, scaling, and observability strategies.
Problem Overview
In high‑traffic production environments, the simple pipeline
application → Filebeat → Logstash → Elasticsearch → Kibana
often collapses. The failure points are:
Application‑side synchronous writes block request threads.
Filebeat output slows down, memory queues overflow and log files grow.
Logstash filter chains (Grok, regex, JSON decode, DNS, Ruby) consume CPU and become a bottleneck.
Elasticsearch write jitter (excessive shards, frequent refresh, heavy merge, disk I/O saturation, mapping explosion, hot‑index concentration) propagates back‑pressure to Logstash and Filebeat.
Missing or inconsistent trace/context fields make root‑cause analysis impossible.
The resulting phenomenon is called a "log black hole": logs are silently dropped or delayed when the system is under stress.
Architecture Overview
The production‑grade stack that decouples collection from processing and provides peak‑shaving is:
Application / Nginx: writes local JSON logs.
Filebeat: lightweight sidecar/DaemonSet collector.
Redis: fast in‑memory buffer between Filebeat and Logstash.
Logstash: parsing, enrichment, routing, and noise reduction.
Elasticsearch: storage and search.
Kibana: visualization and dashboards.
Prometheus + Grafana: metrics, monitoring, and alerts.
Why Insert Redis Between Filebeat and Logstash?
Without a buffer the chain Filebeat → Logstash → Elasticsearch amplifies any downstream slowdown back to the collector. Adding Redis changes the flow to Filebeat → Redis → Logstash → Elasticsearch and provides four benefits:
Decouples collection from processing.
Absorbs short‑term spikes from downstream jitter.
Enables horizontal scaling of Logstash consumers.
Limits fault propagation from Elasticsearch to the application.
Redis vs. Kafka
Both can act as a buffer, but they differ in operational complexity and capabilities.
Redis List: simple, high‑performance, easy to adopt; lacks consumer groups, acknowledgments, and long‑term replay.
Kafka: supports consumer groups, offsets, and replay; requires more deployment and operational effort.
For medium‑scale systems where quick rollout and low operational overhead are priorities, Redis List is sufficient. Kafka becomes attractive for multi‑tenant, long‑term retention, or strong replay requirements.
Filebeat Production Configuration
Key configuration points:
Use filestream input for reliable rotation handling.
Set max_retries: -1 for infinite retry.
Adjust bulk_max_size based on average log size and network latency.
Limit harvester_limit to avoid resource exhaustion.
filebeat.inputs:
  - type: filestream
    id: nginx-access
    enabled: true
    paths:
      - /var/log/nginx/access_json.log
    parsers:
      - ndjson:
          overwrite_keys: true
          add_error_key: true
          expand_keys: true
    fields:
      log_source: nginx
      cluster: prod-gateway
    fields_under_root: true
    ignore_older: 24h
    close.on_state_change.inactive: 10m
    prospector.scanner.check_interval: 10s
    clean_inactive: 25h
    harvester_limit: 256
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_fields:
      target: ''
      fields:
        pipeline_version: v1
queue.mem:
  events: 8192
  flush.min_events: 2048
  flush.timeout: 1s
output.redis:
  hosts: ["10.0.1.11:6379", "10.0.1.12:6379", "10.0.1.13:6379"]
  key: "logqueue:nginx:access"
  data_type: list
  loadbalance: true
  worker: 4
  timeout: 5s
  backoff.init: 1s
  backoff.max: 30s
  max_retries: -1
  bulk_max_size: 1024
  compression_level: 3
  password: "${REDIS_PASSWORD}"
logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644
Redis Buffer Design
Redis acts only as a fast, observable backlog. It should not be used for long‑term storage or complex queries.
Capacity Estimation
Example: 50 000 QPS at 1.2 KB per request ≈ 60 MB/s. To tolerate 5 minutes of downstream slowdown: 60 MB/s × 300 s = 18 GB of buffer.
Recommended configuration (single master with Sentinel or Cluster):
bind 0.0.0.0
protected-mode yes
port 6379
requirepass yourStrongPassword
masterauth yourStrongPassword
appendonly yes
appendfsync everysec
tcp-keepalive 60
timeout 0
maxmemory 16gb
maxmemory-policy noeviction
repl-backlog-size 512mb
client-output-buffer-limit normal 0 0 0
Important settings:
appendonly yes – provides AOF persistence to reduce data loss.
maxmemory-policy noeviction – prevents silent log dropping; raise an alert instead.
Logstash Production Design
Enable a persisted local queue to survive temporary Elasticsearch outages.
node.name: logstash-prod-01
pipeline.workers: 8
pipeline.batch.size: 1250
pipeline.batch.delay: 50
queue.type: persisted
queue.max_bytes: 16gb
queue.checkpoint.writes: 1024
dead_letter_queue.enable: true
path.dead_letter_queue: /var/lib/logstash/dead_letter_queue
xpack.monitoring.enabled: true
Sample pipeline (JSON input from Redis, enrichment, and Elasticsearch output):
input {
redis {
host => "10.0.1.11"
port => 6379
password => "${REDIS_PASSWORD}"
key => "logqueue:nginx:access"
data_type => "list"
threads => 4
batch_count => 500
codec => json
}
}
filter {
if ![@timestamp] {
mutate { add_field => { "ingest_warning" => "missing_timestamp" } }
}
date { match => ["@timestamp", "ISO8601"] timezone => "Asia/Shanghai" }
mutate { rename => { "request_time" => "[event][duration_seconds]" } }
ruby {
code => 'v = event.get("[event][duration_seconds]"); event.set("[event][duration]", (v.to_f * 1_000_000_000).to_i) if v'
}
geoip { source => "[client][ip]" target => "[client][geo]" database => "/usr/share/logstash/GeoLite2-City.mmdb" }
useragent { source => "[user_agent][original]" target => "[user_agent][parsed]" }
mutate { remove_field => ["[event][duration_seconds]", "@version", "agent", "ecs", "[host][architecture]"] }
}
output {
elasticsearch {
hosts => ["http://10.0.2.11:9200", "http://10.0.2.12:9200", "http://10.0.2.13:9200"]
index => "logs-nginx-access-%{+yyyy.MM.dd}"
user => "${ES_USER}"
password => "${ES_PASSWORD}"
ilm_enabled => true
ilm_rollover_alias => "logs-nginx-access"
ilm_policy => "logs_hot_warm_delete_policy"
manage_template => false
retry_initial_interval => 2
retry_max_interval => 64
pool_max => 2000
pool_max_per_route => 200
}
if "_geoip_lookup_failure" in [tags] or "_jsonparsefailure" in [tags] {
file { path => "/var/log/logstash/dlq-fallback.log" codec => json_lines }
}
}
Performance tips:
Prefer JSON at the source (see the sample log line after this list); avoid heavy Grok/regex in Logstash.
Trim unnecessary fields early.
Keep GeoIP and user‑agent enrichment in Logstash only for entry logs.
Scale Logstash horizontally before adding more CPU‑intensive filters.
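To make the "prefer JSON at the source" tip concrete, here is a hypothetical Nginx access‑log line for /var/log/nginx/access_json.log; the field names follow the pipeline and index template in this article, and the values are purely illustrative:
{"@timestamp":"2024-06-18T12:00:01+08:00","trace_id":"b9c3f2a1d4e5","client.ip":"203.0.113.42","http.request.method":"GET","http.response.status_code":200,"url.path":"/api/orders","user_agent.original":"Mozilla/5.0","request_time":"0.412","message":"GET /api/orders HTTP/1.1 200"}
Because the Filebeat ndjson parser above sets expand_keys: true, the dotted keys are expanded into the nested client, http, and url objects that the Logstash filter and the index template expect.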
Elasticsearch Index Template & ILM
Explicit mappings for the key fields prevent type conflicts and help contain mapping explosion.
{
"index_patterns": ["logs-nginx-access-*"],
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"refresh_interval": "10s",
"codec": "best_compression"
},
"mappings": {
"dynamic": true,
"properties": {
"@timestamp": { "type": "date" },
"trace_id": { "type": "keyword" },
"env": { "type": "keyword" },
"region": { "type": "keyword" },
"message": { "type": "text" },
"service": { "properties": { "name": { "type": "keyword" }, "instance": { "properties": { "id": { "type": "keyword" } } } },
"client": { "properties": { "ip": { "type": "ip" }, "geo": { "properties": { "location": { "type": "geo_point" } } } },
"http": { "properties": { "request": { "properties": { "method": { "type": "keyword" } } }, "response": { "properties": { "status_code": { "type": "integer" } } } },
"event": { "properties": { "duration": { "type": "long" } } },
"url": { "properties": { "path": { "type": "keyword" } } }
}
}
}
}
ILM policy separates hot, warm, and delete phases to control storage cost.
{
"policy": {
"phases": {
"hot": { "actions": { "rollover": { "max_primary_shard_size": "30gb", "max_age": "1d" } } },
"warm": { "min_age": "3d", "actions": { "allocate": { "require": { "data": "warm" } }, "forcemerge": { "max_num_segments": 1 } } },
"delete": { "min_age": "30d", "actions": { "delete": {} } }
}
}
}
Performance, Scaling & Capacity Planning
Four dimensions guide capacity planning:
Write rate (TPS and average log size).
Peak‑to‑baseline ratio.
Acceptable downstream jitter duration.
Hot‑data query frequency.
Formulas (rough estimates):
Redis buffer size = peak log rate × tolerated seconds × safety factor.
Logstash throughput: total capacity (nodes × per‑node TPS) should exceed peak TPS × 1.2.
Elasticsearch bulk write rate should stay below 70 % of the cluster’s stable write capacity.
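Worked example (using the flash‑sale numbers from the case study below, plus two stated assumptions): at 68 000 QPS and 1.5 KB per log (≈102 MB/s), tolerating 5 minutes of downstream jitter with a 1.5× safety factor needs roughly 102 MB/s × 300 s × 1.5 ≈ 46 GB of Redis buffer; if each Logstash node sustains about 20 000 events/s (an assumed figure; benchmark your own pipeline), you need 68 000 × 1.2 ÷ 20 000 ≈ 4.1, i.e. at least 5 nodes.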
Noise Reduction & Sampling
Retention policy by log level:
ERROR: keep all.
WARN: keep high‑value services fully, sample low‑value services.
INFO: keep only audit/transaction logs.
DEBUG: discard by default; enable temporarily when needed.
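A minimal Logstash sketch of the WARN sampling rule follows. It assumes events carry [log][level] and [service][name] fields and that keeping roughly 10 % of WARN logs from low‑value services is acceptable; the field names, service list, and percentage are placeholders to adapt (the drop filter's percentage option discards that share of matching events):
filter {
  if [log][level] == "WARN" and [service][name] in ["cms", "static-site"] {
    # Keep a ~10% sample by dropping ~90% of matching events.
    drop { percentage => 90 }
  }
}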
Drop health‑check, metrics, and static‑resource logs directly in Filebeat or Logstash:
filter {
if [url][path] =~ "^/health$" or [url][path] =~ "^/metrics$" {
drop { }
}
}
Monitoring & Alerting
Key metrics to monitor:
Filebeat: number of watched files, output failures, publish latency, registry updates, in‑memory queue depth.
Redis: used_memory, used_memory_peak, connected_clients, instantaneous_ops_per_sec, master_link_status, backlog length, command latency.
Logstash: pipeline events in/out, persisted‑queue size, filter latency, JVM heap usage, dead‑letter‑queue growth.
Elasticsearch: bulk write latency, rejected count, indexing latency, merge pressure, shard count, disk usage.
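Filebeat can expose these runtime counters (queue depth, output failures, publish rates) over a local HTTP endpoint; a minimal filebeat.yml fragment is sketched below, assuming the default stats port, after which the figures are available under /stats and can be scraped by a Prometheus exporter or collected via Elastic's own monitoring:
# filebeat.yml – local stats endpoint (default port assumed)
http.enabled: true
http.host: localhost
http.port: 5066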
Suggested alert thresholds (example):
Redis backlog grows > 1 minute of traffic → warning.
Redis backlog > 5 minutes of traffic → critical.
Logstash persisted‑queue usage > 70 % → warning.
Elasticsearch bulk rejections exceed threshold in 5 minutes → critical.
Log collection latency > 2 minutes → warning.
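A hedged Prometheus rule sketch for the two Redis backlog alerts, assuming redis_exporter runs with --check-single-keys=logqueue:nginx:access so the list length is exported as redis_key_size, and assuming steady traffic of about 50 000 logs/s (so one minute of backlog is roughly 3 million entries); adjust the thresholds to your own rates:
groups:
  - name: log-pipeline
    rules:
      - alert: RedisLogBacklogWarning
        # Backlog above ~1 minute of traffic (assumed 50k logs/s).
        expr: redis_key_size{key="logqueue:nginx:access"} > 3000000
        for: 2m
        labels:
          severity: warning
      - alert: RedisLogBacklogCritical
        # Backlog above ~5 minutes of traffic.
        expr: redis_key_size{key="logqueue:nginx:access"} > 15000000
        for: 2m
        labels:
          severity: critical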
Real‑World Case Study: E‑Commerce Flash Sale
During a promotion the gateway reached 68,000 QPS at 1.5 KB per log (≈102 MB/s). The original pipeline suffered Elasticsearch write jitter, Logstash blockage, and Filebeat backlog, causing query delays of more than 15 minutes.
Remediation steps:
Add Redis buffer.
Enable Logstash persistent queue.
Separate access and error logs into different Redis keys/pipelines.
Drop health‑check and static‑resource logs.
Split indices by business line and enable ILM rollover.
Run GeoIP and user‑agent parsing only on entry logs.
Create alerts linking Redis backlog growth to ES rejections.
Results:
Redis absorbed a 6‑minute spike without data loss.
Logstash horizontal scaling reduced recovery time from 18 min to 4 min.
End‑to‑end log visibility in Kibana stabilized at 1‑2 minutes.
Elasticsearch write rejections dropped dramatically.
Low‑value log volume decreased ~37 % and storage cost fell ~28 %.
Evolution Path
Upgrade Redis to Kafka when long‑term retention, multi‑consumer groups, or replay become required.
Replace Filebeat with OpenTelemetry Collector to unify logs, metrics, and traces.
Adopt a dual‑storage model: Elasticsearch for real‑time search, ClickHouse for large‑scale analytics.
Deploy all components as native Kubernetes workloads (DaemonSet for Filebeat, Deployment with HPA for Logstash, Redis Sentinel or managed Redis, Elasticsearch Operator).
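For the Kubernetes step, a trimmed Filebeat DaemonSet sketch is shown below; the namespace, image tag, and mounted paths are assumptions to adapt, and the referenced ConfigMap is expected to contain a filebeat.yml similar to the one earlier in this article (RBAC, the registry hostPath, and resource tuning are omitted for brevity):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: logging                      # assumed namespace
spec:
  selector:
    matchLabels:
      app: filebeat
  template:
    metadata:
      labels:
        app: filebeat
    spec:
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:8.13.4   # assumed version
          args: ["-c", "/etc/filebeat.yml", "-e"]
          volumeMounts:
            - name: config
              mountPath: /etc/filebeat.yml
              subPath: filebeat.yml
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: filebeat-config          # assumed ConfigMap holding filebeat.yml
        - name: varlog
          hostPath:
            path: /var/log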
Conclusion
True production‑grade logging is not just a collection of components; it is an engineered data pipeline that guarantees stability, scalability, governance, and observability even under extreme traffic, downstream failures, and cost constraints.