Design and Implementation of an Integrated Log Collection, Analysis, and Monitoring System
This article describes how a rapidly growing technical team built a unified log system. The system consolidates program logs, web access logs, and slow logs; collects them through host agents and process agents; uses Kafka, Elasticsearch, and Storm for high‑throughput processing; and adds monitoring, alerting, and reporting to improve reliability and operational efficiency.
1. Background
The dealer business expanded quickly, growing the technical team from 20 to 170 members in three years, which increased the number and complexity of systems. Several pain points emerged: delayed issue detection, inefficient debugging, lack of runtime visibility, and insufficient log monitoring for micro‑services.
2. Solution
A comprehensive log system was identified as the natural solution. The goal was to create a platform that integrates log collection, query analysis, and alerting, improving both debugging efficiency and overall observability.
3. Selection
The team evaluated existing options (the Elasticsearch and Kibana stack, commercial SaaS products) and decided to build its own system, with ease of use and extensibility as the priorities. Kibana is powerful, but its query language has a steep learning curve; a lightweight web UI with form‑based queries is a better fit for internal developers.
4. Results
Daily averages are now 1.3 billion access logs, 180 million program logs, and 3 million slow logs. The system receives about 7,000 visits per day and is used by roughly 50% of the development team. It monitors 800 metrics and runs 380,000 checks per day, which addresses the pain points described above.
5. Log Categories
Program logs (captured via log4j, NLog, etc.)
Web access logs (nginx, IIS, Tomcat)
Program slow logs (method‑level latency)
Call‑trace logs (APM tools such as Zipkin, SkyWalking)
6. Log System Features
Detailed queries for program, slow, and web access logs
Slow‑method analysis with threshold statistics
HTTP service performance and error‑rate analysis
Daily aggregated domain metrics with automated email reports
Site ranking based on static code and runtime indicators
Monitoring and alerting (program, access, slow logs, HTTP calls, custom metrics)
Permission control for departmental log isolation
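The slow‑method analysis with threshold statistics listed above can be illustrated with a small sketch. The threshold values and the record shape (`method`, `elapsed_ms`) are assumptions made for this example; the article does not give the real ones.

```python
from collections import Counter, defaultdict

# Hypothetical latency thresholds in milliseconds (not taken from the article).
THRESHOLDS_MS = (100, 500, 1000)


def bucket_for(elapsed_ms):
    """Return the label of the highest threshold the call reached."""
    label = f"<{THRESHOLDS_MS[0]}ms"
    for threshold in THRESHOLDS_MS:
        if elapsed_ms >= threshold:
            label = f">={threshold}ms"
    return label


def threshold_stats(slow_logs):
    """Count calls per method and per threshold bucket."""
    stats = defaultdict(Counter)
    for record in slow_logs:
        stats[record["method"]][bucket_for(record["elapsed_ms"])] += 1
    return {method: dict(counts) for method, counts in stats.items()}


# Illustrative records; method names are made up.
sample = [
    {"method": "OrderService.create", "elapsed_ms": 120},
    {"method": "OrderService.create", "elapsed_ms": 1500},
    {"method": "UserService.find", "elapsed_ms": 40},
]
print(threshold_stats(sample))
```

In a production system this aggregation would more likely run as an Elasticsearch query than in application code; the sketch only shows the bucketing logic.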
7. System Architecture Diagram
In the architecture diagram, the left side shows log collection (host‑agent and process‑agent). The right side shows the server side, where logs are buffered in Kafka, consumed by Storm, stored in Elasticsearch, and then exposed to the query and monitoring services.
8. Log Format Design
A unified log schema was defined (23 fields for program logs, IIS‑style fields for web logs) to simplify ingestion, storage, and API development across multiple languages.
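As a rough illustration of what a unified schema buys, the sketch below models a program log entry as one typed record that serializes to a single JSON line. The field names are a hypothetical subset chosen for the example; the article states that the real program‑log schema has 23 fields but does not list them.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ProgramLog:
    # Hypothetical subset of fields; the real 23-field schema is not given in the article.
    timestamp: str      # ISO 8601 time of the event
    department: str     # owning department, useful for isolation and routing
    app: str            # application or service name
    host: str           # machine that produced the log
    level: str          # e.g. INFO, WARN, ERROR
    logger: str         # logger or class name
    message: str
    exception: str = ""  # stack trace text, empty when there is none

    def to_json(self):
        """Serialize to one JSON line, the same shape for every language's client."""
        return json.dumps(asdict(self), ensure_ascii=False)


entry = ProgramLog(
    timestamp="2024-01-01T08:00:00+08:00",
    department="dealer",
    app="order-api",
    host="web-01",
    level="ERROR",
    logger="OrderController",
    message="payment timeout",
)
line = entry.to_json()
print(line)
```

With one agreed record shape, collectors written for different languages can emit the same JSON, and the storage mapping and query APIs only need to be defined once.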
9. Log Collection Principles
9.1 Host‑Agent Collection
Agents installed on each host (using nxlog) read text logs and forward them to the server, offering high reliability without code intrusion.
9.2 Process‑Agent Collection
Embedded collectors (log4j‑appender extensions) send logs directly from the application process, providing low‑cost integration at the expense of some reliability.
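The reliability trade‑off of in‑process collection can be sketched as follows. The article's collectors are log4j‑appender extensions; this is only a Python analogue built on the standard `logging` module, and the class name, buffer capacity, and `drain` method are assumptions for illustration.

```python
import logging
import queue


class ProcessAgentHandler(logging.Handler):
    """Buffers formatted records in memory and never blocks the application thread."""

    def __init__(self, capacity=1000):
        super().__init__()
        self.buffer = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def emit(self, record):
        try:
            self.buffer.put_nowait(self.format(record))
        except queue.Full:
            # The trade-off: when the buffer is full, the log line is lost
            # rather than slowing down the application.
            self.dropped += 1

    def drain(self, send):
        """Forward everything buffered so far; a background thread would normally call this."""
        sent = 0
        while True:
            try:
                line = self.buffer.get_nowait()
            except queue.Empty:
                return sent
            send(line)
            sent += 1


handler = ProcessAgentHandler(capacity=2)
logger = logging.getLogger("process_agent_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

for i in range(3):
    logger.info("event %d", i)  # the third record overflows the buffer

received = []
handler.drain(received.append)
print(received, "dropped:", handler.dropped)
```

Logs held only in process memory are also lost if the process crashes, which is one reason a file‑based host agent is the more reliable of the two approaches.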
9.3 Log Types Collected
Program logs via host‑agent
Web access logs: IIS and nginx via host‑agent; Tomcat via a Spring MVC filter and process‑agent
Program slow logs via AOP‑based interception and process‑agent
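Method‑level slow logging through interception can be approximated in Python with a decorator, which plays the role that an AOP aspect plays in the article's stack. The threshold, the record fields, and the in‑memory `slow_logs` sink are assumptions for this sketch; in the described system the record would be handed to the process‑agent instead.

```python
import functools
import time

slow_logs = []  # stand-in sink; a real collector would forward these records


def slow_log(threshold_ms, sink=slow_logs.append, clock=time.perf_counter):
    """Wrap a function and emit a slow-log record when a call reaches threshold_ms."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = clock()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs for both normal returns and exceptions.
                elapsed_ms = (clock() - start) * 1000
                if elapsed_ms >= threshold_ms:
                    sink({"method": func.__qualname__, "elapsed_ms": round(elapsed_ms, 1)})
        return wrapper
    return decorator


@slow_log(threshold_ms=10)
def load_report():
    time.sleep(0.02)  # simulate slow work, about 20 ms
    return "ok"


@slow_log(threshold_ms=10_000)
def fast_lookup():
    return 42


load_report()
fast_lookup()
print(slow_logs)  # only load_report is recorded
```

Because the wrapper sits outside the business method, application code does not need to change beyond adding the annotation, which matches the low‑cost integration goal of the process‑agent approach.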
10. Server‑Side Collection
10.1 Log Reception
Clients connect to a domain name fronted by LVS for load balancing; the service receives logs via Flume (chosen over Logstash for lower resource usage).
10.2 Role of Kafka
Kafka buffers high‑throughput log streams, decoupling producers from Elasticsearch and providing resilience during cluster outages.
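The decoupling effect can be shown with a toy model. The code below is not Kafka's API; `BufferedLog` is a hypothetical in‑memory stand‑in for one partition with a single committed consumer offset, used only to show why producers keep working while the storage side is down.

```python
class BufferedLog:
    """Toy append-only log with one consumer offset (a stand-in for a Kafka partition)."""

    def __init__(self):
        self.records = []
        self.committed = 0  # index of the next record the consumer still has to process

    def append(self, record):
        self.records.append(record)

    def poll(self, max_records):
        return self.records[self.committed:self.committed + max_records]

    def commit(self, count):
        self.committed += count


def consume(log, write, batch_size=2):
    """Write pending records downstream; stop without committing if the write fails."""
    while True:
        batch = log.poll(batch_size)
        if not batch:
            return True
        try:
            write(batch)
        except ConnectionError:
            return False  # offset unchanged, so the batch is retried later
        log.commit(len(batch))


def failing_write(batch):
    raise ConnectionError("storage unavailable")


log = BufferedLog()
stored = []

for i in range(3):
    log.append(f"log-{i}")

ok_during_outage = consume(log, failing_write)   # storage is down, nothing is committed
log.append("log-3")                              # producers are unaffected by the outage
ok_after_recovery = consume(log, stored.extend)  # backlog is written once storage returns
print(ok_during_outage, ok_after_recovery, stored)
```

The key property is that the offset only advances after a successful write, so an outage delays ingestion instead of losing data, up to the retention limit of the buffer.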
10.3 Log Consumption and Ingestion
Storm processes logs in real time, batching writes to Elasticsearch (100 logs per bulk request) and adjusting index refresh intervals to 5 seconds for efficiency.
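The batching step might look roughly like the sketch below, which groups documents into Elasticsearch bulk request bodies of 100 logs each. The batch size comes from the article; the class name, the index name, and the `send` callback are assumptions, and no HTTP call is made here. The body follows the newline‑delimited format of the bulk API: one action line followed by one source line per document.

```python
import json

BULK_SIZE = 100  # from the article: 100 logs per bulk request


class BulkBatcher:
    """Collect documents and hand them to `send` as one bulk request body per batch."""

    def __init__(self, index, send, size=BULK_SIZE):
        self.index = index
        self.send = send      # hypothetical callback that would POST the body to /_bulk
        self.size = size
        self.pending = []

    def add(self, doc):
        self.pending.append(doc)
        if len(self.pending) >= self.size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        lines = []
        for doc in self.pending:
            lines.append(json.dumps({"index": {"_index": self.index}}))  # action line
            lines.append(json.dumps(doc))                                # source line
        self.send("\n".join(lines) + "\n")  # the bulk body must end with a newline
        self.pending = []


requests = []
batcher = BulkBatcher(index="programlog-demo", send=requests.append)

for i in range(250):
    batcher.add({"message": f"log {i}"})
batcher.flush()  # push the final partial batch of 50

print(len(requests))  # number of bulk requests issued
```

A streaming consumer would typically also flush on a timer so that a slow trickle of logs does not sit in the buffer indefinitely; that part is omitted here.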
10.4 Storage Optimization
Indices use the best_compression codec, keyword types for non‑analyzed fields, and are partitioned by department to reduce shard count during queries. Older data is rotated out after a month.
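These storage choices could be expressed roughly as below. The `best_compression` codec, the 5‑second refresh interval, keyword fields, per‑department partitioning, and one‑month retention come from the article. The field names, the index naming pattern, the daily granularity, and the 30‑day figure are assumptions made for this sketch.

```python
from datetime import date, timedelta

# Index creation body; the mapping layout follows the typeless form used by Elasticsearch 7+.
INDEX_SETTINGS = {
    "settings": {
        "index": {
            "codec": "best_compression",  # smaller stored fields at some CPU cost
            "refresh_interval": "5s",     # fewer refreshes than the 1s default
        }
    },
    "mappings": {
        "properties": {
            # Hypothetical field names.
            "app": {"type": "keyword"},    # exact-match only, not analyzed
            "level": {"type": "keyword"},
            "message": {"type": "text"},   # full-text searchable
        }
    },
}

RETENTION_DAYS = 30  # the article says "a month"; the exact number is assumed


def index_name(department, day):
    """One index per department per day, so a query touches only that department's shards."""
    return f"programlog-{department}-{day:%Y.%m.%d}"


def expired_indices(department, days, today):
    """Return the names of indices older than the retention window."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return [index_name(department, d) for d in days if d < cutoff]


print(index_name("sales", date(2024, 1, 5)))
print(expired_indices("sales", [date(2024, 1, 1), date(2024, 2, 10)], date(2024, 2, 15)))
```

Dropping whole indices is much cheaper than deleting individual documents, which is why time‑based index names pair naturally with a fixed retention period.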
11. Log Applications
The system provides data APIs for backend services and a dedicated monitoring scheduler; these components are outside the core logging technology scope.
12. Experience Summary
Rapid prototyping and iterative development
Focus on user‑driven requirements
Design for sufficiency—balancing simplicity and extensibility
Overall, the project demonstrates how a small team can build a scalable, extensible log platform that improves operational visibility and supports continuous improvement.
