
Real‑World Challenges and Solutions When Scaling ELK for Log Management

This article compiles the Q&A from an operations talk in which practitioners discuss ELK cluster sizing, performance tuning, schema handling, comparison with Hadoop and Splunk, and practical tips for managing massive log streams in production environments.

Efficient Ops

This article compiles the insights and discussions from the “Efficient Operations” talk series, featuring guest speakers from leading Chinese tech companies, focusing on practical experiences with the ELK stack for log management.

Key Participants

Rao Chenlin @ Sina – Beijing

Huang Wei @ Kuaiqian – Shanghai

Zhao Jianpeng @ Cheetah

Wei Zhenhua @ Tencent – Shenzhen

Wu Xiaogang @ Ctrip – Shanghai

Dong Yaosen – AIX operations

Editors

Xu Wenhui @ 21v – Beijing (content collection)

Wen Guobing @ Spacewalk – Guangzhou (article editing & publishing)

Guest Introduction

Rao Chenlin is a system architect in the Technology Assurance Department at Sina, a Perl programmer, and the author of "Website Operations Technology and Practice". He previously worked at Yun KuaiXian and Renren, focusing on CDN and operations automation, and has recently been researching log processing and monitoring.

Part 1: Current Situation and Scale

Q1: How many machines can your cluster handle? A1: Our cluster currently has 26 data nodes. Ctrip runs at a similar scale, while JD has about 100 nodes split across clusters. Internationally, the largest reported ELK deployments are 1,000 and 200 nodes, often on EC2.

Q2: At what size do clusters start to show issues? A2: Users report that at around 100‑200 nodes, cluster state propagation and fault detection begin to degrade, which prompts splitting the cluster.

Q3: How many log records per second can a single ES node ingest? A3: Our Nginx logs (~600 B) reach about 8 k records/s per node, peaking at 13 k during stress. IIS logs (200 B) achieve 10‑15 k records/s.
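To put those rates in bandwidth terms, a quick back-of-the-envelope calculation can help. The sketch below only multiplies the figures quoted in A3; the function name is illustrative, and it assumes 1 MB = 10^6 bytes and that the quoted sizes are raw record sizes.

```python
def ingest_mb_per_sec(records_per_sec: int, avg_record_bytes: int) -> float:
    """Raw log volume arriving at one node, in MB/s (1 MB = 10**6 bytes)."""
    return records_per_sec * avg_record_bytes / 1_000_000


# Figures taken from A3 above.
nginx_typical = ingest_mb_per_sec(8_000, 600)   # sustained Nginx rate
nginx_peak = ingest_mb_per_sec(13_000, 600)     # Nginx rate under stress
iis_low = ingest_mb_per_sec(10_000, 200)        # lower bound for IIS
iis_high = ingest_mb_per_sec(15_000, 200)       # upper bound for IIS

print(nginx_typical, nginx_peak, iis_low, iis_high)
```

Under these assumptions a node absorbs roughly 2 to 8 MB/s of raw log text, which is a useful baseline when estimating how many nodes a given log stream needs.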

Part 2: Comparison with Hadoop and Splunk

Q4: How does ELK differ from Hadoop for log processing? A4: Splunk builds charts through its query language (e.g., *** | timechart), while in ELK charts are built through mouse‑driven selections in Kibana; ES queries have a lower learning curve.

Q5: What are the main differences between Splunk and ELK? A5: Splunk performs many real‑time transformations after ingestion (e.g., rex), whereas ELK encourages preprocessing in Logstash, Fluentd, or rsyslog before indexing; post‑ingest actions can be done with Groovy scripting, though with performance and security trade‑offs.

Q6: ELK retains data for only 7 days—how do you handle historical data? A6: ELK is not designed for long‑term offline analysis; for data older than a week you must export to Hadoop or HDFS for deeper queries.
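A seven-day window is usually enforced by dropping daily indices once they age out. The sketch below shows one way to pick the candidates from a list of index names; the `logstash-` prefix and the `yyyy.mm.dd` suffix are assumptions about the naming scheme, and the function only returns names. Exporting to Hadoop and deleting the indices are left to the caller.

```python
from datetime import date, datetime, timedelta


def expired_indices(index_names, today, keep_days=7, prefix="logstash-"):
    """Return daily indices whose date suffix is older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not one of our daily log indices
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date; leave the index alone
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


names = ["logstash-2015.06.01", "logstash-2015.06.09", "kibana-int", "logstash-tmp"]
print(expired_indices(names, today=date(2015, 6, 10)))
```

Indices returned here are the ones to export before removal if they are still needed for offline analysis.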

Expect indexed data to be larger than the raw logs: with the default Logstash template, the index can be about three times the raw size. Our smallest logs average 510 B after compression.

Q7: Do you store ELK data on HDFS? A7: No. ELK and HDFS are separate; we snapshot indices to HDFS or use es_hadoop.jar with Spark to move data, but indices are not built directly on HDFS.
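For the snapshot route, the requests involved look roughly like the following. This is a sketch that only builds the request tuples and does not send them; it assumes the HDFS snapshot repository plugin is installed on the cluster, and the repository name, namenode URI, and path are placeholders.

```python
import json


def hdfs_repository_request(repo_name, namenode_uri, path):
    """Build (method, url_path, body) to register an HDFS snapshot repository."""
    body = {"type": "hdfs", "settings": {"uri": namenode_uri, "path": path}}
    return "PUT", f"/_snapshot/{repo_name}", json.dumps(body)


def snapshot_request(repo_name, snapshot_name, indices):
    """Build (method, url_path, body) to snapshot the given indices."""
    body = {"indices": ",".join(indices)}
    url = f"/_snapshot/{repo_name}/{snapshot_name}?wait_for_completion=false"
    return "PUT", url, json.dumps(body)


# Placeholder values; replace with your own repository and namenode.
print(hdfs_repository_request("hdfs_backup", "hdfs://namenode:8020", "/es/snapshots"))
print(snapshot_request("hdfs_backup", "snap-2015.06.10", ["logstash-2015.06.10"]))
```

The snapshot copies finished index files into HDFS for archival; searches still run against the indices stored on the ES data nodes.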

Part 3: ELK Operational Challenges

Q8: What is the biggest challenge in operating ELK? A8: The master node. It only maintains cluster state, but it comes under heavy pressure when mappings change frequently. Ingesting JSON is schema‑less in name only: a field that arrives with conflicting types still causes a mapping conflict, and the affected writes fail.

Q9: How to mitigate rapid index growth? A9: Disable the default _all field if not needed, customize templates to avoid unnecessary sub‑fields, and turn off unused analyzers to reduce bloat.
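A trimmed-down template along the lines of A9 might look like the sketch below. It uses the mapping syntax of ES 1.x/2.x (the `string` type, `not_analyzed`, `_default_`), which is an assumption about the version in use; later releases changed this syntax. The index pattern is a placeholder.

```python
import json


def lean_log_template(index_pattern="logstash-*"):
    """Index template that disables _all and maps strings without analysis."""
    return {
        "template": index_pattern,
        "mappings": {
            "_default_": {
                # _all duplicates every field into one big analyzed field.
                "_all": {"enabled": False},
                "dynamic_templates": [
                    {
                        "strings_not_analyzed": {
                            "match_mapping_type": "string",
                            # One un-analyzed field, no extra sub-field.
                            "mapping": {
                                "type": "string",
                                "index": "not_analyzed",
                                "doc_values": True,
                            },
                        }
                    }
                ],
            }
        },
    }


print(json.dumps(lean_log_template(), indent=2))
```

Fields that genuinely need full-text search can still be mapped explicitly as analyzed; the point is to make that the exception rather than the default.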

Q10: Principles for tuning ES memory? A10: Use a version that supports doc_values, which keeps field data on disk instead of in the JVM heap and so reduces memory pressure.

Q11: What is your merge strategy? A11: Default merge settings work well if refresh and flush intervals are tuned; monitor segment merge speed versus indexing rate and adjust indices.store.throttle.max_bytes_per_sec if needed.
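The knobs mentioned in A11 are plain settings updates. The sketch below builds the request bodies only; the values shown ("30s", "1gb", "100mb") are illustrative assumptions rather than recommendations from the talk, and the store throttle setting applies to the older ES releases that still expose it.

```python
import json


def index_settings_body(refresh_interval="30s", flush_threshold_size="1gb"):
    """Body for PUT /<index>/_settings: refresh and flush less often."""
    return json.dumps({
        "index.refresh_interval": refresh_interval,
        "index.translog.flush_threshold_size": flush_threshold_size,
    })


def merge_throttle_body(max_bytes_per_sec="100mb"):
    """Body for PUT /_cluster/settings: adjust the store throttle used by merges."""
    return json.dumps({
        "transient": {
            "indices.store.throttle.max_bytes_per_sec": max_bytes_per_sec
        }
    })


print(index_settings_body())
print(merge_throttle_body())
```

A longer refresh interval produces fewer, larger segments, so merging has less work to do for the same indexing rate.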

Q12: What happens if data nodes have mismatched disk capacities? A12: ES stops allocating shards to a path once it reaches ~85 % usage, preventing overload on smaller disks.
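The 85 % figure corresponds to the default of the `cluster.routing.allocation.disk.watermark.low` setting. The check it implies is simple arithmetic, sketched below with made-up disk sizes; the function name is hypothetical.

```python
def can_allocate(used_bytes: int, total_bytes: int, low_watermark: float = 0.85) -> bool:
    """True if disk usage is still below the low watermark."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes < low_watermark


GB = 10**9
# Same amount of data on a small and a large disk (illustrative sizes).
small_disk_ok = can_allocate(430 * GB, 500 * GB)    # 86 % used
large_disk_ok = can_allocate(430 * GB, 2000 * GB)   # 21.5 % used
print(small_disk_ok, large_disk_ok)
```

With mixed capacities, the smaller disks simply stop receiving new shards earlier while the larger ones keep filling.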

Part 4: Logstash and Kibana Details

Q13: How long does optimizing historical data take? A13: Optimizing a day's worth of data can take more than six hours; it is a slow process.

Q14: How do you decide the number of shards? A14: Perform single‑node, single‑index benchmarks; increase shards when performance degrades. Our largest index uses a single 50 GB shard.

Q15: How to improve insert performance without Logstash? A15: Adjust refresh_interval and bulk size (≈10‑15 MB per bulk), handle 429 errors, and test until a performance breakpoint is reached.
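A custom indexer following A15 needs two pieces: batching documents into bulk bodies of a bounded size, and backing off when the cluster answers 429. The sketch below is a minimal version under several assumptions: `send` is a caller-supplied function that posts a body to `/_bulk` and returns the HTTP status, the `_type` field reflects older ES versions, and only whole-request 429 responses are handled (per-item rejections inside a 200 response are not).

```python
import json
import time


def bulk_bodies(docs, index, doc_type="logs", max_bytes=10 * 1024 * 1024):
    """Yield newline-delimited bulk bodies of at most about max_bytes each."""
    lines, size = [], 0
    for doc in docs:
        action = json.dumps({"index": {"_index": index, "_type": doc_type}})
        entry = action + "\n" + json.dumps(doc) + "\n"
        entry_size = len(entry.encode("utf-8"))
        # Close the current batch before it would exceed the limit.
        # A single document larger than max_bytes still goes out alone.
        if lines and size + entry_size > max_bytes:
            yield "".join(lines)
            lines, size = [], 0
        lines.append(entry)
        size += entry_size
    if lines:
        yield "".join(lines)


def send_with_backoff(send, body, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send(body); on HTTP 429 retry with exponential backoff."""
    for attempt in range(max_retries + 1):
        status = send(body)
        if status != 429:
            return status
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)
    return 429
```

To find the breakpoint mentioned above, rerun the same data set while varying `max_bytes` and the refresh interval, and watch where throughput stops improving or 429 responses start to appear.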

Q16: Which ES nodes does Logstash push logs to? A16: Typically to client nodes; some prefer data nodes for lower latency, but network topology may dictate the choice.

Q17: Can ELK integrate with monitoring systems for alerts? A17: Yes; we use Zabbix with cron jobs that query 200 aggregation endpoints every five minutes and trigger alerts based on response times.
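The core of such a cron job is timing a query and comparing the result with a threshold. The sketch below shows that step only; `run_query` stands for whatever callable issues the aggregation request, the threshold is up to the operator, and reporting the result to Zabbix is left out.

```python
import time


def timed_check(run_query, threshold_seconds, clock=time.monotonic):
    """Run one query and flag it if it fails or is slower than the threshold."""
    start = clock()
    try:
        run_query()
        failed = False
    except Exception:
        # A query that errors out should alert as well.
        failed = True
    elapsed = clock() - start
    return {
        "elapsed": elapsed,
        "failed": failed,
        "alert": failed or elapsed > threshold_seconds,
    }


# Example with a stand-in query that just sleeps briefly.
result = timed_check(lambda: time.sleep(0.01), threshold_seconds=2.0)
print(result)
```

Scheduling this every five minutes over a list of queries gives the response-time signal described in A17.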

Q18: How to generate function‑call chain visualizations? A18: See the diagram below.

Q19: Where to implement GeoIP for user map visualizations? A19: Parse GeoIP in Logstash, then visualize in Kibana; we also built a custom plugin using MaxMind DB for better performance.

Q20: How to ensure unique index names? A20: Define naming conventions in Logstash/Flume/etc., e.g., name-yyyy.mm.dd.
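When writing to ES from a custom shipper rather than Logstash, the same convention is easy to reproduce. The helper below is a small sketch; the function name is hypothetical, and it lowercases the prefix because ES index names must be lowercase.

```python
from datetime import date


def daily_index_name(name: str, day: date) -> str:
    """Build an index name of the form name-yyyy.mm.dd."""
    return f"{name.lower()}-{day:%Y.%m.%d}"


print(daily_index_name("nginx", date(2015, 6, 10)))
```

In Logstash the equivalent is a date pattern in the output's index option, such as `name-%{+YYYY.MM.dd}`.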

Q21: What are the fundamental performance limits of ES? A21: Since ES is built on Lucene, some bottlenecks are inherent; upcoming versions aim to reduce memory usage by removing the IndexWriter for historical data.

--- End of Q&A ---

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
