Log Collection and Analysis: Architectures Using Flume, Kafka, Storm, Elasticsearch, and MongoDB
This article discusses various log collection and analysis architectures, comparing solutions such as Flume‑Kafka‑Storm pipelines, Sentry, MongoDB, ELK stack, and Hadoop, and shares practical experiences, advantages, drawbacks, and deployment tips from multiple engineers.
Topic: Log collection and analysis – contributions from several engineers.
1. Flume + Kafka + Storm + MySQL : Multiple web servers send logs via Flume agents using Avro to Kafka; Storm processes the data in real time and stores results in MySQL or HBase. Storm and Kafka share a Zookeeper cluster, and both Flume and Kafka can be load‑balanced across several servers.
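As a sketch of the Storm-side processing in this pipeline (log format and field names here are hypothetical, not from the discussion), a bolt typically parses each raw log line into structured fields before a downstream bolt persists them to MySQL or HBase:

```python
import re
from datetime import datetime

# Hypothetical access-log format: '1.2.3.4 [2024-01-15 10:30:00] GET /index.html 200'
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \[(?P<ts>[^\]]+)\] (?P<method>\S+) (?P<path>\S+) (?P<status>\d+)'
)

def parse_log_line(line):
    """Parse one raw log line into a dict; a Storm bolt would emit this
    as a tuple for a downstream bolt to write to MySQL/HBase."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None  # malformed lines are dropped (or routed to an error stream)
    record = m.groupdict()
    record['status'] = int(record['status'])
    record['ts'] = datetime.strptime(record['ts'], '%Y-%m-%d %H:%M:%S')
    return record
```

Keeping the parse step pure like this makes it easy to unit-test outside the cluster; the Kafka and Flume hops only ever see opaque byte payloads.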
2. Sentry for asynchronous error reporting : Important errors are sent to Sentry, which is easy to deploy and supports many languages, including front‑end JavaScript. Newer Sentry versions report over HTTP only (UDP has been dropped); payloads are larger because they carry environment variables and trace data, but Sentry remains a small, elegant solution.
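To make the HTTP-reporting point concrete, here is a sketch of the kind of event payload a client assembles and POSTs to Sentry. Field names follow the general shape of Sentry's event schema, but treat this as illustrative, not the exact wire format of any particular Sentry version:

```python
import json
import time
import uuid

def build_sentry_event(message, level="error", environment="production"):
    """Assemble a minimal Sentry-style event payload (a sketch, not the
    exact schema of any specific Sentry release)."""
    return {
        "event_id": uuid.uuid4().hex,
        "timestamp": time.time(),
        "level": level,
        "message": message,
        "environment": environment,
        # Real clients also attach stack traces, request data, and tags,
        # which is why HTTP payloads are larger than the old UDP ones.
    }

event = build_sentry_event("DB connection refused")
payload = json.dumps(event)  # a real client would POST this to the project's store endpoint
```

The upside of the larger HTTP payload is precisely this extra context (environment, trace data), which UDP reporting kept minimal.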
3. MongoDB for log storage : Logs are stored in one collection per day, which makes querying convenient and archiving easy; an open‑source web UI is available for browsing the data.
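The one-collection-per-day scheme can be sketched as a naming helper (prefix and format are illustrative assumptions): archiving then becomes a cheap dump-and-drop of whole old collections rather than a slow delete of old documents.

```python
from datetime import date

def daily_collection_name(day, prefix="logs"):
    """Name of the per-day collection, e.g. 'logs_20240115'.
    Writing each day's logs into its own collection keeps indexes small
    and lets archiving dump and drop whole collections at once."""
    return f"{prefix}_{day.strftime('%Y%m%d')}"

# With pymongo (requires a running server, hence commented out):
# db[daily_collection_name(date.today())].insert_one(log_doc)
```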
4. Real‑time vs. batch processing : When real‑time analysis is needed, Storm handles the stream; when it is not, logs can be collected and processed in batches with Hadoop.
5. MongoDB storage concerns : Large storage size is mitigated by regular archiving; no immediate bottleneck observed.
6. Infobright as a data‑warehouse option : Similar to MySQL but offers better space savings and query performance; suitable for long‑term log storage, though the community edition limits thread count.
7. UDP and plain files : Some prefer UDP reporting for its simplicity; lower‑volume logs can simply be stored as plain files and grepped, with Logstash, Elasticsearch, and Kibana brought in occasionally for visualization and statistical analysis.
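The "just grep the files" approach is as simple as it sounds; a Python equivalent of `grep -c` over log lines (sample lines below are invented for illustration):

```python
def grep_count(lines, needle):
    """Count lines containing needle — the plain-file approach for
    low-volume logs, before reaching for Logstash/Elasticsearch.
    Shell equivalent: grep -c ERROR app.log"""
    return sum(1 for line in lines if needle in line)

logs = [
    "2024-01-15 10:00:01 INFO request ok",
    "2024-01-15 10:00:02 ERROR db timeout",
    "2024-01-15 10:00:03 ERROR db timeout",
]
```

Once ad-hoc greps become daily rituals, that is usually the signal to move the same queries into Elasticsearch/Kibana.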
8. Typical log metrics : Page views, service stability, and other statistics are collected without heavy custom code.
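Metrics like page views indeed need no heavy custom code; given parsed log records (field names assumed for illustration), a standard-library counter suffices:

```python
from collections import Counter

def page_views(records):
    """Aggregate page views per path from parsed log records —
    the kind of lightweight statistic the discussion mentions."""
    return Counter(r["path"] for r in records)

records = [
    {"path": "/index.html"},
    {"path": "/index.html"},
    {"path": "/about"},
]
```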
9. ELK stack usage : Elasticsearch stores logs for fast search; Kibana provides visual dashboards; Logstash is optional as other agents can feed logs into Elasticsearch. The stack is widely used for online issue diagnosis, log statistics, and alerting.
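Because Logstash is optional, any agent that can speak Elasticsearch's bulk format can feed logs in directly. A sketch of building a `_bulk` request body (the index name is an illustrative assumption; the newline-delimited action/source format and trailing newline are what the bulk API expects):

```python
import json

def bulk_index_body(index, docs):
    """Build the newline-delimited body for Elasticsearch's _bulk API:
    each document is preceded by an action line naming the target index,
    and the body must end with a newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

body = bulk_index_body("logs-2024.01.15", [{"msg": "db timeout", "level": "error"}])
```

Batching documents this way, rather than indexing one at a time, is what keeps ingestion cheap enough for log volumes.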
10. Alternative pipelines : Some teams use Flume‑Storm‑Hadoop with log4j configuration to forward logs without code changes; others combine Scribe with Storm for real‑time regional PV statistics.
11. Redis in ELK : Redis can serve as a simple log event queue, though Kafka offers higher throughput and persistence features.
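The Redis-as-queue pattern is just a list: producers LPUSH events and Logstash BRPOPs them. The in-memory stand-in below illustrates the semantics without requiring a server (with redis-py the calls would be `r.lpush('logs', event)` and `r.brpop('logs')`):

```python
from collections import deque

class SimpleLogQueue:
    """In-memory stand-in for the Redis list used as a log buffer in
    ELK setups. Unlike Kafka, a plain Redis list offers no replay,
    partitioning, or consumer groups — hence the trade-off noted above."""
    def __init__(self):
        self._items = deque()

    def lpush(self, event):
        """Producer side: push a new event onto the head of the list."""
        self._items.appendleft(event)

    def rpop(self):
        """Consumer side: take the oldest event from the tail, or None."""
        return self._items.pop() if self._items else None
```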
12. MongoDB for game logs : Developers appreciate MongoDB’s flexible schema for storing arrays and varied fields, though write performance can be a drawback compared to relational databases.
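The schema flexibility the game developers value looks like this in practice: two events of the same type can carry different fields and still land in one collection (field names below are invented for illustration):

```python
# Two "battle" log events with different shapes; MongoDB accepts both in
# the same collection with no schema migration.
event_a = {
    "type": "battle_end",
    "player": "p1",
    "rewards": ["gold", "potion"],             # array field
}
event_b = {
    "type": "battle_end",
    "player": "p2",
    "rewards": [],
    "boss": {"name": "dragon", "hp_left": 0},  # extra nested field only here
}
# With pymongo (requires a running server, hence commented out):
# db.game_logs.insert_many([event_a, event_b])
```

A relational schema would force either a wide sparse table or a separate join table for each of these variations, which is the flexibility/write-performance trade-off the item describes.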
Overall, the discussion highlights practical experiences, trade‑offs, and tool choices for building scalable, real‑time log collection and analysis systems.
Nightwalker Tech
[Nightwalker Tech] is the tech sharing channel of "Nightwalker", focusing on AI and large model technologies, internet architecture design, high‑performance networking, and server‑side development (Golang, Python, Rust, PHP, C/C++).