Design and Implementation of an Integrated Log Collection, Analysis, and Monitoring System

This article describes how a rapidly growing technical team built a unified log system. The system consolidates program logs, web access logs, and slow logs; collects them through host‑agent and process‑agent mechanisms; processes them at high throughput with Kafka, Storm, and Elasticsearch; and adds monitoring, alerting, and reporting to improve reliability and operational efficiency.

HomeTech

Sep 25, 2018

1. Background

The dealer business expanded quickly, growing the technical team from 20 to 170 members in three years, which increased the number and complexity of systems. Several pain points emerged: delayed issue detection, inefficient debugging, lack of runtime visibility, and insufficient log monitoring for micro‑services.

2. Solution

A comprehensive log system was identified as the natural solution. The goal was to create a platform that integrates log collection, query analysis, and alerting, improving both debugging efficiency and overall observability.

3. Selection

The team evaluated existing solutions (Kibana, Elasticsearch, commercial SaaS) and decided to develop a custom system prioritizing ease of use and extensibility. While Kibana offers powerful features, its query language has a steep learning curve; a lightweight web UI with form‑based queries better fits internal developers' needs.

4. Results

Daily averages now include 1.3 billion access logs, 180 million program logs, and 3 million slow logs. The system serves about 7 000 daily users (≈50% of the development team), monitors 800 metrics, and processes 380 000 checks per day, effectively addressing the earlier pain points.

5. Log Categories

Program logs (captured via log4j, NLog, etc.)

Web access logs (nginx, IIS, Tomcat)

Program slow logs (method‑level latency)

Call‑trace logs (APM tools such as Zipkin, SkyWalking)

6. Log System Features

Detailed queries for program, slow, and web access logs

Slow‑method analysis with threshold statistics

HTTP service performance and error‑rate analysis

Daily aggregated domain metrics with automated email reports

Site ranking based on static code and runtime indicators

Monitoring and alerting (program, access, slow logs, HTTP calls, custom metrics)

Permission control for departmental log isolation
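
Two of the analyses listed above, slow‑method threshold statistics and HTTP error‑rate analysis, reduce to simple aggregations. The following is a minimal sketch only: the threshold values, the function names, and the choice to count 5xx responses as errors are assumptions made for illustration, not details from the article.

```python
from collections import Counter

# Hypothetical latency thresholds in milliseconds; the real values are not given in the article.
THRESHOLDS_MS = (100, 500, 1000)


def bucket_latencies(latencies_ms, thresholds=THRESHOLDS_MS):
    """Count how many calls took at least as long as each threshold."""
    counts = Counter()
    for latency in latencies_ms:
        for threshold in thresholds:
            if latency >= threshold:
                counts[threshold] += 1
    return {t: counts[t] for t in thresholds}


def error_rate(status_codes):
    """Share of responses with a 5xx status (assumed error definition); 0.0 for no data."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)
```

A monitoring check could then compare `error_rate(...)` or one of the bucket counts against a configured limit and raise an alert when the limit is exceeded.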

7. System Architecture Diagram

In the architecture diagram, the left side shows log collection (host‑agent and process‑agent). The right side shows the server side: Kafka buffers incoming logs, Storm consumes them, and Elasticsearch stores them, feeding the query and monitoring services.

8. Log Format Design

A unified log schema was defined (23 fields for program logs, IIS‑style fields for web logs) to simplify ingestion, storage, and API development across multiple languages.
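
To illustrate what a unified schema buys, here is a minimal sketch of a shared record type serialized to JSON. The field names are hypothetical; the article states that program logs have 23 fields but does not list them, so only a handful of plausible ones appear here.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ProgramLog:
    # All field names are illustrative assumptions, not the article's actual 23 fields.
    timestamp: str   # event time as an ISO 8601 string
    department: str  # owning department, useful for isolation and routing
    app: str         # application or service name
    host: str        # machine that produced the log
    level: str       # e.g. INFO, WARN, ERROR
    logger: str      # logger or class name
    message: str     # log message text
    trace_id: str = ""  # optional correlation id

    def to_json(self) -> str:
        """Serialize to one JSON line with stable key order."""
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)
```

Because every producer, whatever its language, emits the same keys, the ingestion pipeline and the query API can be written once against a single shape.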

9. Log Collection Principles

9.1 Host‑Agent Collection

Agents installed on each host (using nxlog) read text logs and forward them to the server, offering high reliability without code intrusion.
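
The team uses nxlog for this, so the following is not their implementation. It is a small Python sketch of the underlying idea: remember a byte offset per file, read only what was appended since the last pass, and leave a half‑written last line for the next pass. The function name and the truncation handling are assumptions.

```python
import os


def read_new_lines(path, offset):
    """Return (lines, new_offset) for complete lines appended after `offset`."""
    # If the file is now shorter than the saved offset (rotated or truncated), start over.
    if os.path.getsize(path) < offset:
        offset = 0
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Consume only up to the last newline so a partially written line is read next time.
    end = data.rfind(b"\n")
    if end == -1:
        return [], offset
    complete = data[: end + 1]
    lines = complete.decode("utf-8", errors="replace").splitlines()
    return lines, offset + len(complete)
```

Persisting the returned offset between passes is what lets an agent of this kind resume after a restart without losing or duplicating lines, which is one reason file‑based collection tends to be more reliable than in‑process sending.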

9.2 Process‑Agent Collection

Embedded collectors (log4j‑appender extensions) send logs directly from the application process, providing low‑cost integration at the expense of some reliability.
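
The article's collectors are log4j appender extensions. As a rough analogue, the sketch below uses Python's standard `logging.Handler` to show the same pattern: buffer records in memory and forward them in batches. The class name, the `send` callable, and the buffer sizes are hypothetical. The bounded buffer and the swallowed send errors make the reliability trade‑off visible: logs can be lost if the process dies or the transport fails.

```python
import logging
from collections import deque


class BufferedForwardHandler(logging.Handler):
    """Buffers formatted records in memory and passes them to `send` in batches."""

    def __init__(self, send, batch_size=100, max_buffer=10_000):
        super().__init__()
        self.send = send  # hypothetical transport: a callable taking a list of strings
        self.batch_size = batch_size
        # Bounded buffer: when full, the oldest entries are silently dropped.
        self.buffer = deque(maxlen=max_buffer)

    def emit(self, record):
        self.buffer.append(self.format(record))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        batch = list(self.buffer)
        self.buffer.clear()
        try:
            self.send(batch)
        except Exception:
            # A logging failure must not break the application; this batch is lost.
            pass
```

Attaching the handler with `logger.addHandler(BufferedForwardHandler(send))` is the only integration step, which is what makes this style cheap to adopt.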

9.3 Log Types Collected

Program logs via host‑agent

Web access logs: IIS and nginx via host‑agent; Tomcat via a Spring MVC filter and process‑agent

Program slow logs via AOP‑based interception and process‑agent
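
AOP‑based interception wraps a method, times it, and emits a slow log when the call exceeds a threshold. A Python decorator expresses the same idea; this is a sketch under assumptions, with a hypothetical `sink` callable standing in for the process‑agent and an injectable `clock` added only to make the behavior easy to test.

```python
import functools
import time


def slow_log(threshold_ms, sink, clock=time.perf_counter):
    """Decorator that reports calls taking at least `threshold_ms` to `sink`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = clock()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on both normal return and exception, so failing calls are timed too.
                elapsed_ms = (clock() - start) * 1000.0
                if elapsed_ms >= threshold_ms:
                    sink({"method": func.__qualname__, "elapsed_ms": elapsed_ms})
        return wrapper
    return decorator
```

Applied as `@slow_log(500, sink)` on a method, the business code stays unchanged and only calls slower than the threshold produce a record.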

10. Server‑Side Collection

10.1 Log Reception

Clients connect to a domain name fronted by LVS for load balancing; the service receives logs via Flume (chosen over Logstash for lower resource usage).

10.2 Role of Kafka

Kafka buffers high‑throughput log streams, decoupling producers from Elasticsearch and providing resilience during cluster outages.

10.3 Log Consumption and Ingestion

Storm consumes logs from Kafka in real time and writes them to Elasticsearch in bulk requests of 100 logs each. The index refresh interval is raised to 5 seconds to reduce indexing overhead.
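
The batching step can be sketched independently of Storm. The example below groups documents into chunks of 100 and builds the newline‑delimited JSON body that the Elasticsearch `_bulk` endpoint accepts (an action line followed by a source line per document). The helper names and the index name are assumptions, and sending the body over HTTP is left out.

```python
import json


def chunked(items, size=100):
    """Yield consecutive slices of at most `size` items from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def bulk_body(index, docs):
    """Build the newline-delimited JSON body for one _bulk request."""
    if not docs:
        return ""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                            # document source line
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

One request per 100 documents amortizes the per‑request overhead, which is the main reason bulk writes outperform indexing documents one at a time.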

10.4 Storage Optimization

Indices use the best_compression codec and keyword types for fields that do not need full‑text analysis. They are partitioned by department so that a query touches fewer shards. Data older than one month is rotated out.
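
These choices map onto index settings and a naming scheme. The sketch below is an assumption‑laden illustration: `index.codec` and `index.refresh_interval` are real Elasticsearch settings, but the typeless mapping layout shown is that of Elasticsearch 7 and later (older versions nest mappings under a type name), and the daily `logs-<department>-<date>` naming and 30‑day retention check are hypothetical ways to realize "partitioned by department" and "rotated out after a month".

```python
from datetime import date, timedelta


def index_settings(keyword_fields):
    """Request body for creating one log index."""
    return {
        "settings": {
            "index": {
                "codec": "best_compression",  # smaller stored fields at some CPU cost
                "refresh_interval": "5s",     # refresh less often than the 1s default
            }
        },
        "mappings": {
            # keyword fields are stored as exact values and are not analyzed
            "properties": {name: {"type": "keyword"} for name in keyword_fields}
        },
    }


def index_name(department, day):
    """Hypothetical naming scheme: one index per department per day."""
    return f"logs-{department}-{day:%Y.%m.%d}"


def is_expired(index_day, today, retention_days=30):
    """True when an index is older than the retention window."""
    return (today - index_day) > timedelta(days=retention_days)
```

With names like these, a query for one department over a date range only needs to target that department's indices, and cleanup becomes a matter of deleting whole indices rather than individual documents.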

11. Log Applications

The system provides data APIs for backend services and a dedicated monitoring scheduler; these components are outside the core logging technology scope.

12. Experience Summary

Rapid prototyping and iterative development

Focus on user‑driven requirements

Design for sufficiency—balancing simplicity and extensibility

Overall, the project demonstrates how a small team can build a scalable, extensible log platform that improves operational visibility and supports continuous improvement.
