Design and Optimization of Large‑Scale Log Systems
This article examines the challenges of handling massive log data in high‑traffic e‑commerce platforms and presents a comprehensive architecture, optimization strategies, and practical implementations—including Rsyslog, Kafka, Fluentd, and the ELK stack—to improve scalability, performance, and reliability of log management systems.
Log data is one of the most common forms of massive data; during events such as Double‑11 sales, an e‑commerce platform can generate billions of log entries per hour, creating severe challenges for technical teams.
The article first outlines a baseline log system architecture, comparing various designs and describing the need for horizontal and vertical scaling, clustering, data partitioning, and rewriting data pipelines to meet business requirements.
It then details four major optimization directions:
Basic optimization: memory allocation, garbage collection, caching, and locking; network serialization, compression, and protocol selection; CPU utilization via multithreading; and disk management through file merging and service pruning.
Platform scaling: adding or removing capacity as needed, via vertical expansion (adding disk and memory) and horizontal expansion using distributed clusters.
Data partitioning: classifying logs by dimension (error, info, debug), handling data hotspots by separating high‑traffic periods, and applying tiered storage.
System degradation: defining downgrade strategies that disable non‑essential functions during overload.
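The data-partitioning direction above can be sketched in code. This is an illustrative example, not the article's implementation: the class and method names (`LogPartitioner`, `partitionFor`) and the per-level retention values are assumptions.

```java
import java.util.EnumMap;
import java.util.Map;

// Sketch: route log entries into per-level partitions so each level can get
// its own retention window and storage tier (errors kept longest, debug
// pruned quickly). All names and values here are illustrative.
class LogPartitioner {
    enum LogLevel { ERROR, INFO, DEBUG }

    // Assumed retention in days per level.
    private static final Map<LogLevel, Integer> RETENTION_DAYS = new EnumMap<>(LogLevel.class);
    static {
        RETENTION_DAYS.put(LogLevel.ERROR, 30);
        RETENTION_DAYS.put(LogLevel.INFO, 7);
        RETENTION_DAYS.put(LogLevel.DEBUG, 1);
    }

    // Derive a per-level, per-day index name, e.g. "logs-error-2024.11.11",
    // so hot high-traffic days can be isolated and aged out independently.
    static String partitionFor(LogLevel level, String day) {
        return "logs-" + level.name().toLowerCase() + "-" + day;
    }

    static int retentionDays(LogLevel level) {
        return RETENTION_DAYS.get(level);
    }
}
```

Partitioning by both level and day is what makes tiered storage practical: whole low-value partitions (e.g. old debug indices) can be dropped or moved without touching error data.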
The practical implementation adopts an open‑source ELK stack: Elasticsearch for centralized storage and search, Beats agents (Filebeat, Topbeat) for lightweight log collection, and Kibana for visualization. Additional components include Rsyslog, Kafka, Fluentd, and HBase for persistent storage.
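A collection pipeline like this is typically wired together in configuration rather than code. The fragment below is a minimal sketch of a Filebeat input shipping into Kafka; the paths, topic name, and broker hosts are placeholders, not values from the article.

```yaml
# Illustrative filebeat.yml fragment: tail application logs and publish
# them to a Kafka topic for downstream consumers (Fluentd, Elasticsearch).
filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/*.log        # placeholder path
    fields:
      app: order-service          # assumed tag for routing

output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092"]   # placeholder brokers
  topic: "app-logs"                        # placeholder topic
  compression: gzip                        # trade CPU for network bandwidth
```

Putting Kafka between the collectors and Elasticsearch buffers traffic spikes (such as Double‑11 peaks) so that ingestion pressure does not translate directly into indexing pressure.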
Key problems encountered were low CPU utilization on transmission servers, frequent full GC in Ruby processes, storage spikes, high‑water‑mark triggers, and cluster hangs when a node fails. Solutions involved adjusting Kafka topics, optimizing Fluentd host polling, increasing per‑core throughput, and implementing dynamic configuration for Elasticsearch.
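One lever behind the per-core throughput gains mentioned above is batching: flushing many log lines in one downstream call instead of one call per line. The sketch below is an assumption about the general technique, not the article's code; `BatchingSink` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: amortize per-call overhead (syscalls, RPCs, bulk-index requests)
// by buffering log lines and flushing once per batch.
class BatchingSink {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private int flushCount = 0; // number of downstream calls actually made

    BatchingSink(int batchSize) { this.batchSize = batchSize; }

    void append(String line) {
        buffer.add(line);
        if (buffer.size() >= batchSize) flush();
    }

    void flush() {
        if (buffer.isEmpty()) return;
        // In a real pipeline this would be one bulk write, e.g. an
        // Elasticsearch _bulk request or a Kafka producer batch.
        flushCount++;
        buffer.clear();
    }

    int flushCount() { return flushCount; }
}
```

With a batch size of 500, sending 5,000 lines costs 10 downstream calls rather than 5,000, which is the kind of change that moves per-core throughput by an order of magnitude.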
Two major refactorings are described:
Storage reduction: shortening retention to one day, offloading older data to Hadoop, and keeping only recent data in Elasticsearch.
Data partitioning: removing Kafka in small clusters, assigning dedicated tags to applications, using Fluentd for direct ES ingestion, and compressing data in HBase.
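The storage-reduction rule above reduces to a simple tiering decision per daily index. The sketch below is hypothetical (the `RetentionPolicy` class and tier labels are illustrative); it only captures the idea of keeping the hot window in Elasticsearch and routing everything older to Hadoop.

```java
import java.time.LocalDate;

// Sketch of the hot/cold split described above: recent indices stay in
// Elasticsearch for fast search; older indices are offloaded to Hadoop.
class RetentionPolicy {
    private final int hotDays; // days of data kept in Elasticsearch

    RetentionPolicy(int hotDays) { this.hotDays = hotDays; }

    // Returns "elasticsearch" for data inside the hot window, "hadoop" otherwise.
    String tierFor(LocalDate indexDay, LocalDate today) {
        return indexDay.isAfter(today.minusDays(hotDays)) ? "elasticsearch" : "hadoop";
    }
}
```

With a one-day hot window (the article's shortened retention), only today's index stays in Elasticsearch, which is what yields the storage savings reported below.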
Optimization results include a 15% reduction in server storage usage, per‑core processing increasing from 3,000 to 15,000–18,000 log entries per second, fewer Elasticsearch protection triggers, and extended data retention beyond the original seven‑day limit.
The article concludes with best practices: storing low‑frequency log data in cheap storage, using sequential disk writes, leveraging SSDs for Elasticsearch, and establishing standardized log formats (UUID, timestamp, host) to facilitate tracing and analysis.
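The standardized log format in the best practices above (UUID, timestamp, host) can be sketched as follows. The field order and the pipe separator are assumptions; the article only specifies which fields to include.

```java
import java.time.Instant;
import java.util.UUID;

// Sketch of a standardized log line carrying a trace UUID, an ISO-8601
// timestamp, and the originating host, so entries from many services can
// be correlated during tracing and analysis.
class LogLine {
    // Separator and field order are assumptions, not from the article.
    static String format(UUID traceId, Instant ts, String host, String message) {
        return traceId + "|" + ts + "|" + host + "|" + message;
    }
}
```

The point of fixing the format is that every downstream consumer (Fluentd parsers, Elasticsearch mappings, Kibana queries) can rely on the same fields in the same positions.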
Architecture Digest