Design and Optimization of Large‑Scale Log Systems
This article examines the challenges of handling massive log data in high‑traffic e‑commerce platforms and presents a comprehensive architecture, optimization strategies, and practical implementations—including Rsyslog, Kafka, Fluentd, and the ELK stack—to improve scalability, performance, and reliability of log management systems.
Log data is one of the most common forms of massive data; during events such as Double‑11 sales, an e‑commerce platform can generate billions of log entries per hour, creating severe challenges for technical teams.
The article first outlines a baseline log system architecture, comparing various designs and describing the need for horizontal and vertical scaling, clustering, data partitioning, and rewriting data pipelines to meet business requirements.
It then details four major optimization directions:
Basic optimization: memory allocation, garbage collection, caching, and locking; network serialization, compression, and protocol selection; CPU utilization via multithreading; and disk management through file merging and service pruning.
Platform scaling: adding or removing capacity as needed, via vertical expansion (adding disk and memory) and horizontal expansion using distributed clusters.
Data partitioning: classifying logs by dimension (error, info, debug), handling data hotspots by separating high‑traffic periods, and applying tiered storage.
System degradation: defining downgrade strategies that disable non‑essential functions during overload.
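The data-partitioning direction above can be sketched in code. This is an illustrative example, not the article's implementation: the class and method names (`LogPartitioner`, `partitionFor`) and the per-level retention values are assumptions.

```java
import java.util.EnumMap;
import java.util.Map;

// Sketch: route log entries into per-level partitions so each level can get
// its own retention window and storage tier (errors kept longest, debug
// pruned quickly). All names and values here are illustrative.
class LogPartitioner {
    enum LogLevel { ERROR, INFO, DEBUG }

    // Assumed retention in days per level.
    private static final Map<LogLevel, Integer> RETENTION_DAYS = new EnumMap<>(LogLevel.class);
    static {
        RETENTION_DAYS.put(LogLevel.ERROR, 30);
        RETENTION_DAYS.put(LogLevel.INFO, 7);
        RETENTION_DAYS.put(LogLevel.DEBUG, 1);
    }

    // Derive a per-level, per-day index name, e.g. "logs-error-2024.11.11",
    // so hot high-traffic days can be isolated and aged out independently.
    static String partitionFor(LogLevel level, String day) {
        return "logs-" + level.name().toLowerCase() + "-" + day;
    }

    static int retentionDays(LogLevel level) {
        return RETENTION_DAYS.get(level);
    }
}
```

Partitioning by both level and day is what makes tiered storage practical: whole low-value partitions (e.g. old debug indices) can be dropped or moved without touching error data.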
The practical implementation adopts an open‑source ELK stack: Elasticsearch for centralized storage and search, Beats agents (Filebeat, Topbeat) for lightweight log collection, and Kibana for visualization. Additional components include Rsyslog, Kafka, Fluentd, and HBase for persistent storage.
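A collection pipeline like this is typically wired together in configuration rather than code. The fragment below is a minimal sketch of a Filebeat input shipping into Kafka; the paths, topic name, and broker hosts are placeholders, not values from the article.

```yaml
# Illustrative filebeat.yml fragment: tail application logs and publish
# them to a Kafka topic for downstream consumers (Fluentd, Elasticsearch).
filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/*.log        # placeholder path
    fields:
      app: order-service          # assumed tag for routing

output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092"]   # placeholder brokers
  topic: "app-logs"                        # placeholder topic
  compression: gzip                        # trade CPU for network bandwidth
```

Putting Kafka between the collectors and Elasticsearch buffers traffic spikes (such as Double‑11 peaks) so that ingestion pressure does not translate directly into indexing pressure.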
Key problems encountered were low CPU utilization on transmission servers, frequent full GC in Ruby processes, storage spikes, high‑water‑mark triggers, and cluster hangs when a node fails. Solutions involved adjusting Kafka topics, optimizing Fluentd host polling, increasing per‑core throughput, and implementing dynamic configuration for Elasticsearch.
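One lever behind the per-core throughput gains mentioned above is batching: flushing many log lines in one downstream call instead of one call per line. The sketch below is an assumption about the general technique, not the article's code; `BatchingSink` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: amortize per-call overhead (syscalls, RPCs, bulk-index requests)
// by buffering log lines and flushing once per batch.
class BatchingSink {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private int flushCount = 0; // number of downstream calls actually made

    BatchingSink(int batchSize) { this.batchSize = batchSize; }

    void append(String line) {
        buffer.add(line);
        if (buffer.size() >= batchSize) flush();
    }

    void flush() {
        if (buffer.isEmpty()) return;
        // In a real pipeline this would be one bulk write, e.g. an
        // Elasticsearch _bulk request or a Kafka producer batch.
        flushCount++;
        buffer.clear();
    }

    int flushCount() { return flushCount; }
}
```

With a batch size of 500, sending 5,000 lines costs 10 downstream calls rather than 5,000, which is the kind of change that moves per-core throughput by an order of magnitude.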
Two major refactorings are described:
Storage reduction: shortening retention to one day, offloading older data to Hadoop, and keeping only recent data in Elasticsearch.
Data partitioning: removing Kafka in small clusters, assigning dedicated tags to applications, using Fluentd for direct ES ingestion, and compressing data in HBase.
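The storage-reduction rule above reduces to a simple tiering decision per daily index. The sketch below is hypothetical (the `RetentionPolicy` class and tier labels are illustrative); it only captures the idea of keeping the hot window in Elasticsearch and routing everything older to Hadoop.

```java
import java.time.LocalDate;

// Sketch of the hot/cold split described above: recent indices stay in
// Elasticsearch for fast search; older indices are offloaded to Hadoop.
class RetentionPolicy {
    private final int hotDays; // days of data kept in Elasticsearch

    RetentionPolicy(int hotDays) { this.hotDays = hotDays; }

    // Returns "elasticsearch" for data inside the hot window, "hadoop" otherwise.
    String tierFor(LocalDate indexDay, LocalDate today) {
        return indexDay.isAfter(today.minusDays(hotDays)) ? "elasticsearch" : "hadoop";
    }
}
```

With a one-day hot window (the article's shortened retention), only today's index stays in Elasticsearch, which is what yields the storage savings reported below.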
Optimization results include a 15% reduction in server storage usage, per‑core processing increasing from 3,000 to 15,000–18,000 log entries per second, fewer Elasticsearch protection triggers, and extended data retention beyond the original seven‑day limit.
The article concludes with best practices: storing low‑frequency log data in cheap storage, using sequential disk writes, leveraging SSDs for Elasticsearch, and establishing standardized log formats (UUID, timestamp, host) to facilitate tracing and analysis.
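The standardized log format in the best practices above (UUID, timestamp, host) can be sketched as follows. The field order and the pipe separator are assumptions; the article only specifies which fields to include.

```java
import java.time.Instant;
import java.util.UUID;

// Sketch of a standardized log line carrying a trace UUID, an ISO-8601
// timestamp, and the originating host, so entries from many services can
// be correlated during tracing and analysis.
class LogLine {
    // Separator and field order are assumptions, not from the article.
    static String format(UUID traceId, Instant ts, String host, String message) {
        return traceId + "|" + ts + "|" + host + "|" + message;
    }
}
```

The point of fixing the format is that every downstream consumer (Fluentd parsers, Elasticsearch mappings, Kibana queries) can rely on the same fields in the same positions.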
Architecture Digest