Real‑World Challenges and Solutions When Scaling ELK for Log Management
This article compiles the Q&A from a high‑level operations talk where experts discuss ELK cluster sizing, performance tuning, schema handling, integration with Hadoop and Splunk, and practical tips for managing massive log streams in production environments.
This article compiles the insights and discussions from the “Efficient Operations” talk series, featuring guest speakers from leading Chinese tech companies, focusing on practical experiences with the ELK stack for log management.
Key Participants
Rao Chenlin @ Sina – Beijing
Huang Wei @ Kuaiqian – Shanghai
Zhao Jianpeng @ Cheetah
Wei Zhenhua @ Tencent – Shenzhen
Wu Xiaogang @ Ctrip – Shanghai
Dong Yaosen – AIX operations
Editors
Xu Wenhui @ 21v – Beijing (content collection)
Wen Guobing @ Spacewalk – Guangzhou (article editing & publishing)
Guest Introduction
Rao Chenlin is a system architect in Sina's Technology Assurance Department, a Perl programmer, and the author of "Website Operations Technology and Practice". He previously worked at Yun KuaiXian and Renren, focusing on CDN and operations automation, and has recently been researching log processing and monitoring.
Part 1: Current Situation and Scale
Q1: How many machines can your cluster handle? A1: Our cluster currently has 26 data nodes. Ctrip operates at a similar scale, while JD has about 100 nodes split across several clusters. Internationally, the largest reported ELK deployments have 1,000 and 200 nodes, often on EC2.
Q2: At what size do clusters start to show issues? A2: Users report that at around 100‑200 nodes, cluster state propagation and fault detection begin to degrade, which prompts splitting the cluster.
Q3: How many log records per second can a single ES node ingest? A3: Our Nginx logs (~600 B) reach about 8 k records/s per node, peaking at 13 k during stress. IIS logs (200 B) achieve 10‑15 k records/s.
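To put the record rates above in bandwidth terms, the arithmetic can be sketched as follows. The function name is hypothetical, and the calculation covers only raw log bytes, not the indexed size on disk.

```python
def ingest_mb_per_s(records_per_s: int, avg_record_bytes: int) -> float:
    """Raw log volume a node absorbs, in MB/s (1 MB = 10**6 bytes)."""
    return records_per_s * avg_record_bytes / 1_000_000


# Figures quoted in A3: Nginx ~600 B per record, IIS ~200 B per record.
nginx_steady = ingest_mb_per_s(8_000, 600)
nginx_peak = ingest_mb_per_s(13_000, 600)
iis_low = ingest_mb_per_s(10_000, 200)
iis_high = ingest_mb_per_s(15_000, 200)

print(f"Nginx: {nginx_steady:.1f} MB/s steady, {nginx_peak:.1f} MB/s peak")
print(f"IIS:   {iis_low:.1f} to {iis_high:.1f} MB/s")
```

By this estimate a single node absorbs only a few megabytes of raw log text per second, which suggests the limit is per-document indexing cost rather than network bandwidth.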
Part 2: Comparison with Hadoop and Splunk
Q4: How does ELK differ from Hadoop for log processing? A4: Splunk generates charts via query language (e.g., *** | timechart), while ELK relies on mouse‑driven selections; ES queries have a lower learning curve.
Q5: What are the main differences between Splunk and ELK? A5: Splunk performs many real‑time transformations after ingestion (e.g., rex), whereas ELK encourages preprocessing in Logstash, Fluentd, or rsyslog before indexing; post‑ingest actions can be done with Groovy scripting, though with performance and security trade‑offs.
Q6: ELK retains data for only 7 days—how do you handle historical data? A6: ELK is not designed for long‑term offline analysis; for data older than a week you must export to Hadoop or HDFS for deeper queries.
Log size inflation is typical; the default Logstash template can triple raw size. Our smallest logs average 510 B after compression.
Q7: Do you store ELK data on HDFS? A7: No. ELK and HDFS are separate; we snapshot indices to HDFS or use es_hadoop.jar with Spark to move data, but indices are not built directly on HDFS.
Part 3: ELK Operational Challenges
Q8: What is the biggest challenge in operating ELK? A8: The master node. Although it only maintains cluster state, it comes under pressure when mappings change frequently. Ingestion is nominally schema‑less, but JSON fields can still produce mapping conflicts, which lead to write failures.
Q9: How to mitigate rapid index growth? A9: Disable the default _all field if not needed, customize templates to avoid unnecessary sub‑fields, and turn off unused analyzers to reduce bloat.
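A minimal sketch of such a template, built as a Python dictionary and serialized to JSON. It assumes an older Elasticsearch release (before 5.0) where the _all field, the string type, and the _default_ mapping still exist. The function name and the nginx-* pattern are hypothetical, not taken from the talk.

```python
import json


def build_log_template(index_pattern: str) -> dict:
    """Index template body that disables _all and stores strings unanalyzed."""
    return {
        "template": index_pattern,
        "mappings": {
            "_default_": {
                # Skip the catch-all field when nobody searches across all fields.
                "_all": {"enabled": False},
                "dynamic_templates": [
                    {
                        "strings_not_analyzed": {
                            "match_mapping_type": "string",
                            # One unanalyzed field instead of an analyzed
                            # field plus an extra raw sub-field.
                            "mapping": {"type": "string", "index": "not_analyzed"},
                        }
                    }
                ],
            }
        },
    }


template = build_log_template("nginx-*")
print(json.dumps(template, indent=2))
```

Fields that do need full-text search would still require an explicit analyzed mapping; this sketch only covers the case where every string is treated as an exact value.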
Q10: Principles for tuning ES memory? A10: Use versions that support doc_values, which reduces memory pressure.
Q11: What is your merge strategy? A11: Default merge settings work well if refresh and flush intervals are tuned; monitor segment merge speed versus indexing rate and adjust indices.store.throttle.max_bytes_per_sec if needed.
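The two knobs named above can be expressed as request bodies, sketched here as Python dictionaries. The values 30s and 100mb are illustrative assumptions, not figures from the talk, and indices.store.throttle.max_bytes_per_sec exists only in older Elasticsearch releases, so check it against your version.

```python
import json


def refresh_settings(refresh_interval: str) -> dict:
    """Body for an index settings update (PUT /<index>/_settings)."""
    return {"index": {"refresh_interval": refresh_interval}}


def store_throttle_settings(max_bytes_per_sec: str) -> dict:
    """Body for a transient cluster settings update (PUT /_cluster/settings)."""
    return {
        "transient": {
            "indices.store.throttle.max_bytes_per_sec": max_bytes_per_sec
        }
    }


# Illustrative values only: refresh less often, let merges use more I/O.
print(json.dumps(refresh_settings("30s")))
print(json.dumps(store_throttle_settings("100mb")))
```

A transient setting is lost on a full cluster restart, which makes it a reasonable choice while experimenting with merge throughput.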
Q12: What happens if data nodes have mismatched disk capacities? A12: ES stops allocating shards to a path once it reaches ~85 % usage, preventing overload on smaller disks.
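The 85 % figure corresponds to the default low disk watermark (cluster.routing.allocation.disk.watermark.low). A simplified model of that rule, with hypothetical disk sizes, shows why a smaller disk stops receiving shards first. Real allocation also weighs other factors, so treat this as a sketch.

```python
LOW_WATERMARK = 0.85  # default low watermark, expressed as a fraction


def accepts_new_shards(used_bytes: int, total_bytes: int,
                       low_watermark: float = LOW_WATERMARK) -> bool:
    """True while disk usage on a path is still below the low watermark."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes < low_watermark


GB = 10**9
# Hypothetical nodes holding the same 450 GB of data on different disks.
small_disk_ok = accepts_new_shards(450 * GB, 500 * GB)    # 90 % used
large_disk_ok = accepts_new_shards(450 * GB, 2000 * GB)   # 22.5 % used
print(small_disk_ok, large_disk_ok)
```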
Part 4: Logstash and Kibana Details
Q13: How long does optimizing historical data take? A13: Optimizing a day's worth of data can take more than six hours; it is a slow process.
Q14: How do you decide the number of shards? A14: Perform single‑node, single‑index benchmarks; increase shards when performance degrades. Our largest index uses a single 50 GB shard.
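Once a benchmark has shown how large a single shard can grow before performance degrades, the shard count follows from simple division. The sketch below assumes that limit is supplied as max_shard_gb. The function name is hypothetical, and the 50 GB value is reused from the answer above purely as an example input.

```python
import math


def shards_for_index(expected_index_gb: float, max_shard_gb: float) -> int:
    """Smallest shard count that keeps each shard at or under max_shard_gb."""
    if expected_index_gb < 0 or max_shard_gb <= 0:
        raise ValueError("sizes must be non-negative, shard limit positive")
    return max(1, math.ceil(expected_index_gb / max_shard_gb))


print(shards_for_index(50, 50))    # fits in a single shard
print(shards_for_index(120, 50))   # needs three shards
```

The benchmark remains the source of truth; this only turns its result into a number for the index settings.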
Q15: How to improve insert performance without Logstash? A15: Adjust refresh_interval and bulk size (≈10‑15 MB per bulk), handle 429 errors, and test until a performance breakpoint is reached.
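A rough sketch of the client side of this advice: group documents into bulk bodies capped by size, and back off when the cluster answers 429. All names are hypothetical, and the send callable is an assumption standing in for whatever HTTP client is in use. A real client would also inspect per-item statuses inside the bulk response, which this sketch does not do.

```python
import json
import time
from typing import Callable, Iterable, Iterator, List, Optional


def bulk_bodies(docs: Iterable[dict], index: str, max_bytes: int,
                doc_type: Optional[str] = None) -> Iterator[str]:
    """Yield newline-delimited bulk bodies, each at most max_bytes when possible.

    A single document larger than max_bytes is still sent, alone in its body.
    """
    meta = {"_index": index}
    if doc_type is not None:  # older ES versions expect a _type as well
        meta["_type"] = doc_type
    action = json.dumps({"index": meta})
    lines: List[str] = []
    size = 0
    for doc in docs:
        entry = action + "\n" + json.dumps(doc) + "\n"
        entry_size = len(entry.encode("utf-8"))
        if lines and size + entry_size > max_bytes:
            yield "".join(lines)
            lines, size = [], 0
        lines.append(entry)
        size += entry_size
    if lines:
        yield "".join(lines)


def send_with_backoff(send: Callable[[str], int], body: str,
                      max_retries: int = 5, base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send(body); on HTTP 429 wait with exponential backoff and retry."""
    status = send(body)
    for attempt in range(max_retries):
        if status != 429:
            break
        sleep(base_delay * 2 ** attempt)
        status = send(body)
    return status
```

For the range quoted above, max_bytes would be somewhere around 10 * 1024 * 1024 to 15 * 1024 * 1024, tuned by testing until throughput stops improving.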
Q16: Which ES nodes does Logstash push logs to? A16: Typically to client nodes; some prefer data nodes for lower latency, but network topology may dictate the choice.
Q17: Can ELK integrate with monitoring systems for alerts? A17: Yes; we use Zabbix with cron jobs that query 200 aggregation endpoints every five minutes and trigger alerts based on response times.
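The check a cron job performs on each run can be sketched as a plain threshold comparison. The answer does not say whether the response times are values returned by the aggregations or the latency of the queries themselves, so this sketch takes the measured numbers as input either way. Endpoint names and the threshold are hypothetical.

```python
from typing import Dict, List


def endpoints_to_alert(observed_ms: Dict[str, float],
                       threshold_ms: float) -> List[str]:
    """Names of endpoints whose observed response time exceeds the threshold."""
    return sorted(name for name, ms in observed_ms.items() if ms > threshold_ms)


# Hypothetical measurements collected during one five-minute cycle.
observed = {"api/login": 120.0, "api/search": 950.0, "api/cart": 310.0}
alerts = endpoints_to_alert(observed, threshold_ms=500.0)
print(alerts)
```

The resulting list would then be handed to the monitoring system, for example as one item value per endpoint.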
Q18: How to generate function‑call chain visualizations? A18: The speakers answered with a diagram, which is not reproduced in this text.
Q19: Where to implement GeoIP for user map visualizations? A19: Parse GeoIP in Logstash, then visualize in Kibana; we also built a custom plugin using MaxMind DB for better performance.
Q20: How to ensure unique index names? A20: Define naming conventions in Logstash/Flume/etc., e.g., name-yyyy.mm.dd.
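A date-based convention like this is easy to generate in any shipper. As a small sketch, with a hypothetical helper name, the prefix is lowercased because Elasticsearch index names must be lowercase.

```python
from datetime import date


def daily_index_name(prefix: str, day: date) -> str:
    """Build an index name of the form prefix-yyyy.mm.dd."""
    return f"{prefix.lower()}-{day:%Y.%m.%d}"


name = daily_index_name("Nginx", date(2015, 6, 1))
print(name)  # nginx-2015.06.01
```

Giving each log source its own prefix keeps names unique across sources, while the date suffix keeps them unique across days.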
Q21: What are the fundamental performance limits of ES? A21: Since ES is built on Lucene, some bottlenecks are inherent; upcoming versions aim to reduce memory usage by removing the IndexWriter for historical data.
--- End of Q&A ---
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
