Unlock Real-Time Log Analysis with ELK: From Basics to Advanced Practices
This article explores how the ELK stack turns large‑scale log processing into fast, flexible, and interactive analysis for troubleshooting, security auditing, and monitoring. It shares practical examples, common pitfalls, and best‑practice recommendations from real‑world deployments at Sina.
ELK Usage Scenarios
This article introduces the ELK suite for log handling, starting with an overview of its components and common use cases.
Why Logs Matter
Problem diagnosis, the foundation of data‑driven operations.
Security auditing.
Monitoring.
Monitoring aggregates health and performance data, events, and their relationships, and presents them through an interface that gives a holistic view of the system's state, making failures easier to understand and address.
Effective log analysis must go beyond simply storing logs; it should enable rapid, interactive investigation.
Guest Introduction
Rao Chenlin is a system architect at Sina's Technical Assurance department, a Perl programmer, and the author of "Website Operations Technology and Practice". Formerly at YunKuaiXian and Renren, he focuses on CDN operations and automation, and has recently been researching log processing and monitoring.
Application Examples
Typical application log (image omitted for brevity).
Logstash configuration example:
Kibana 3 dashboard screenshot:
Kibana 4 dashboard screenshot (improved performance, color tweaks):
These examples show how a few dozen Logstash lines can power diverse visualizations such as time‑series histograms and top‑N term charts.
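The original configuration is shown only as a screenshot, so as an illustration, a minimal Logstash pipeline of the kind described might look like the sketch below. The file path, index name, and field layout are hypothetical, and the exact option names (for example `hosts` vs. `host` in the elasticsearch output) vary by Logstash version:

```conf
# Hypothetical sketch: read an application access log, parse it, ship to Elasticsearch.
input {
  file {
    path => "/var/log/app/access.log"   # hypothetical path
    start_position => "beginning"
  }
}

filter {
  grok {
    # Parse a combined-format access log line into named fields.
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    # Use the log's own timestamp as the event time.
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logstash-%{+YYYY.MM.dd}"  # daily indices, as in the deployment described below
  }
}
```

Once the fields are parsed, Kibana's histograms and top‑N term panels work without further configuration, which is why a few dozen lines go such a long way.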
Using ELK for PHP slow‑log analysis:
Resulting Kibana dashboard (clickable host filter):
Interactive filtering lets operators pinpoint problematic hosts and trace slow function calls.
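PHP‑FPM slow logs are multi‑line records: a timestamped header followed by a stack trace. A sketch of how such records could be folded into single events and parsed is below; the path and field names are hypothetical, not the configuration used at Sina:

```conf
# Hypothetical sketch: join PHP-FPM slow-log records into one event per trace.
input {
  file {
    path => "/var/log/php-fpm/slow.log"        # hypothetical path
    codec => multiline {
      # Records start with a header like "[21-Nov-2013 20:22:55]";
      # any line not starting with one belongs to the previous event.
      pattern => "^\[\d{2}-%{MONTH}-\d{4}"
      negate => true
      what => "previous"
    }
  }
}

filter {
  grok {
    # Apply both patterns: pool/pid from the header line,
    # the slow script's path from the second line.
    break_on_match => false
    match => {
      "message" => [
        "\[pool %{WORD:pool}\] pid %{NUMBER:pid}",
        "script_filename = %{UNIXPATH:script}"
      ]
    }
  }
}
```

With `pool`, `pid`, and `script` as separate fields, a clickable host or script filter in Kibana narrows the dashboard to the offending traces.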
Multi‑dimensional analysis of Nginx error logs:
ELK also supports crash‑log analysis, allowing developers to filter out system functions and focus on application‑specific stack traces.
Adding a version filter refines top‑N results for new releases.
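A multi‑dimensional view of Nginx error logs depends on splitting each line into fields that Kibana can facet on. A hedged sketch of such a grok filter follows; the field names are illustrative, and trailing details like `client:` and `server:` would need additional patterns since Nginx only emits them for some error types:

```conf
# Hypothetical sketch: split an Nginx error-log line such as
#   2015/06/10 12:00:00 [error] 1234#0: *5678 open() "..." failed ...
# into timestamp, level, pid, and message fields.
filter {
  grok {
    match => {
      "message" => "(?<timestamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[%{LOGLEVEL:level}\] %{NUMBER:pid}#%{NUMBER:tid}: (\*%{NUMBER:connection} )?%{GREEDYDATA:errmsg}"
    }
  }
  date {
    match => [ "timestamp", "yyyy/MM/dd HH:mm:ss" ]
  }
}
```

Each extracted field (level, pid, connection, message) then becomes its own dimension for filtering and top‑N charts, and a version field from the application side refines the same dashboards for new releases.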
Best Practices
At Sina, the ELK deployment handles 65 billion log entries over seven days across 26 data nodes (42 GB RAM, 2.4 TB SAS, 8‑core CPUs). Key lessons include:
Enable doc_values to materialize fielddata on disk and avoid memory spikes.
Adjust recovery and relocation settings; default conservative parameters can make a node restart take days.
Disable multicast discovery in public clouds to prevent false‑positive scans.
Control shard allocation per node for newly created daily indices to prevent I/O overload on a single node.
Be aware that Elasticsearch is schema‑less, not "no‑schema": mismatched field types across indices can break searches, and the default ignore_above: 256 may truncate long stack‑trace fields.
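The tuning points above can be sketched as configuration for an Elasticsearch 1.x‑era cluster with daily logstash-* indices. All values here are illustrative assumptions, not the settings used at Sina:

```conf
# ---- elasticsearch.yml (illustrative values) ----
discovery.zen.ping.multicast.enabled: false        # unicast only; avoids false-positive scans on public clouds
indices.recovery.max_bytes_per_sec: 100mb          # raise from the conservative default so restarts don't take days
cluster.routing.allocation.node_concurrent_recoveries: 4

# ---- index template fragment for logstash-* indices ----
# doc_values on not_analyzed fields keeps fielddata on disk;
# a larger ignore_above keeps long stack traces indexed;
# total_shards_per_node spreads each new daily index's write I/O.
{
  "template": "logstash-*",
  "settings": { "index.routing.allocation.total_shards_per_node": 2 },
  "mappings": {
    "_default_": {
      "properties": {
        "host":  { "type": "string", "index": "not_analyzed", "doc_values": true },
        "trace": { "type": "string", "index": "not_analyzed", "ignore_above": 8192 }
      }
    }
  }
}
```

Applying field types through a template also avoids the schema‑mismatch problem, since every daily index inherits the same mapping.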
Recommended Reading
"Elasticsearch Service Development (2nd Edition)"
"Zabbix Monitoring System Deep Dive"
"Log Management and Analysis Authority Guide"
"The Charm of Data: Open‑Source Data Analysis"
"Website Operations: Secrets to Real‑Time Data"
"The Art of Web Capacity Planning"
"Large‑Scale Web Service Development Techniques"
Code as Craft
PerfPlanet Calendar
Kibana Logstash Site
Conclusion
Elasticsearch’s scoring can be used to compare time‑series anomalies, and the newer Watcher plugin adds alerting capabilities. Some users even replace storage systems like GlusterFS with Elasticsearch for image storage and automatic thumbnail generation.
If a newbie has a bad time, it’s a bug. – Jordan Sissel, Logstash author
Efficient Ops
This public account is maintained by Xiaotianguo and friends, and regularly publishes widely read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together.