
Design and Architecture of Youzan Unified Log Platform

The article describes the design, components, and implementation details of Youzan's unified log platform, covering log ingestion via rsyslog, Logstash, and Flume, centralized processing with Kafka, real‑time analysis using Storm/Spark, and storage in HDFS, Elasticsearch, and Hawk, while also discussing challenges and future improvements.

Architect

1. Introduction

Since its founding, Youzan has grown rapidly and now generates massive volumes of system and business logs: about 11,000 logs per second on average and 15,000 at peak, which comes to roughly 900 million logs per day (≈2.4 TB). To support monitoring, maintenance, and optimization, a unified log platform was built to collect, aggregate, and analyze these logs across all services.

2. Overall Design

The platform collects logs from all systems, converts them into streaming data, and forwards them via Flume or Logstash to a Kafka cluster (the log center). Downstream consumers such as Track, Storm, and Spark process the streams in real time. Logs are also persisted to HDFS for offline analysis, indexed in Elasticsearch for querying, or sent to Hawk for alerting and metric monitoring.

3. Module Breakdown

3.1 Log Ingestion Layer

Two ingestion methods are used:

3.1.1 rsyslog + Logstash

Stable logs (e.g., system, Nginx, PHP‑FPM) are written by rsyslog to a local directory (local0); Logstash reads the incremental files and pushes them to the appropriate Kafka topic.

3.1.2 Flume‑NG

Flume‑NG provides a distributed, highly‑available pipeline with an Agent composed of Source, Channel, and Sink. Only the Agent layer is used in Youzan’s platform. Logs are formatted as follows:

<158>yyyy-MM-dd HH:mm:ss host/ip level[pid]: topic=track.**** {"type":"error","tag":"redis connection refused","platform":"java/go/php","level":"info/warn/error","app":"appName","module":"com.youzan.somemodule","detail":"any things you want here"}

PHP and Java services use custom SDKs (a PHP SDK and a Logback TrackAppender) to emit logs in this format, while services in other languages (Go, Node.js, Python) can assemble the log line manually and send it to Flume.
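For a service without an SDK, assembling the line by hand could look like the minimal Python sketch below. Only the layout is taken from the sample line above; the function name, its parameters, and the topic name track.demo are hypothetical placeholders, and how the line is then transported to Flume is not covered here.

```python
import json
import os
import socket
from datetime import datetime


def build_track_line(topic, payload, level="info", priority=158,
                     host=None, pid=None, now=None):
    """Assemble one log line in the layout shown in the sample above.

    Hypothetical helper: names and defaults are illustrative, not an SDK API.
    """
    host = host or socket.gethostname()
    pid = os.getpid() if pid is None else pid
    # yyyy-MM-dd HH:mm:ss, as in the sample line
    timestamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    # Compact JSON body; keys keep the insertion order of the dict
    body = json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
    return f"<{priority}>{timestamp} {host} {level}[{pid}]: topic={topic} {body}"


if __name__ == "__main__":
    # "track.demo" is a made-up topic used only for illustration
    print(build_track_line(
        "track.demo",
        {"type": "error", "tag": "redis connection refused", "app": "appName"},
        level="error",
    ))
```

Keeping the topic outside the JSON body means a router can pick the destination without decoding the payload.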

A custom TrackSink was implemented because the built‑in KafkaSink could not route logs to topics as required.
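The routing decision such a sink has to make can be illustrated with a short sketch: read the topic= field from each line and use it as the destination topic. This is not the actual TrackSink (which the article does not show); the regular expression, the route function, and the fallback topic track.unrouted are assumptions made for illustration.

```python
import json
import re

# Pattern derived from the sample log layout; field names are illustrative.
LINE_RE = re.compile(
    r"^<(?P<priority>\d+)>"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<level>\w+)\[(?P<pid>\d+)\]: "
    r"topic=(?P<topic>\S+) "
    r"(?P<body>\{.*\})$"
)


def route(line, default_topic="track.unrouted"):
    """Return (topic, payload) for a log line.

    Lines that do not match the layout, or whose body is not valid JSON,
    go to a hypothetical fallback topic with no payload.
    """
    match = LINE_RE.match(line)
    if match is None:
        return default_topic, None
    try:
        payload = json.loads(match.group("body"))
    except json.JSONDecodeError:
        return default_topic, None
    return match.group("topic"), payload


if __name__ == "__main__":
    sample = ('<158>2024-01-02 03:04:05 10.0.0.1 error[42]: '
              'topic=track.demo {"type":"error","app":"appName"}')
    print(route(sample))
```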

The Flume process is supervised, and it uses a fail‑over strategy: if the central Kafka cluster is unavailable, logs are written to local disk instead, so no data is lost.
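The idea behind this fail-over can be sketched in a few lines of Python: try the primary send, and spool to a local file when it fails. This is a conceptual sketch, not Flume's mechanism or configuration; FailoverWriter, the send callable, and the tab-separated spool format are all assumptions, and it assumes single-line log entries.

```python
from pathlib import Path


class FailoverWriter:
    """Send to a primary sink; spool to a local file when sending fails."""

    def __init__(self, send, spool_path):
        self._send = send              # callable(topic, line); raises on failure
        self._spool = Path(spool_path)

    def write(self, topic, line):
        """Return True if delivered, False if spooled locally."""
        try:
            self._send(topic, line)
            return True
        except Exception:
            with self._spool.open("a", encoding="utf-8") as f:
                f.write(f"{topic}\t{line}\n")
            return False

    def replay(self):
        """Re-send spooled entries in order; return how many were delivered."""
        if not self._spool.exists():
            return 0
        text = self._spool.read_text(encoding="utf-8")
        pending = [entry for entry in text.split("\n") if entry]
        delivered = 0
        for entry in pending:
            topic, line = entry.split("\t", 1)
            try:
                self._send(topic, line)
            except Exception:
                break                  # primary still down; keep the rest spooled
            delivered += 1
        remaining = pending[delivered:]
        if remaining:
            self._spool.write_text("\n".join(remaining) + "\n", encoding="utf-8")
        else:
            self._spool.unlink()
        return delivered
```

Calling replay() once the primary sink is reachable again drains the spool in the original order, which is what makes the local write a buffer and not a second destination.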

3.2 Log Center

The log center is a Kafka cluster that caches the most recent 24 hours of logs; older logs are flushed to HDFS. Kafka was chosen for its distributed architecture, high throughput (tens of thousands of messages per second on commodity hardware), persistence, topic‑based partitioning, and configurable retention.

3.3 Log Processing and Storage Layers

Processing includes aggregating logs, indexing them in Elasticsearch, generating alerts for abnormal logs, computing metrics, building call‑chain traces, and enabling user behavior analysis. Storage options include HDFS for offline batch processing and Elasticsearch for interactive queries.
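As an illustration of the alerting step, the sketch below counts error-level logs per application in a sliding time window and signals when a threshold is reached. It is not how Hawk or the platform's stream jobs are implemented; the class name, threshold, and window size are assumptions, and only the level and app fields come from the sample log format.

```python
from collections import defaultdict, deque


class ErrorRateAlert:
    """Signal when an app logs `threshold` errors within `window_seconds`."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window = window_seconds
        self._events = defaultdict(deque)   # app -> timestamps of recent errors

    def observe(self, payload, ts):
        """Feed one decoded log payload with its timestamp in seconds.

        Returns True when the app's error count in the window reaches
        the threshold, otherwise False.
        """
        if payload.get("level") != "error":
            return False
        recent = self._events[payload.get("app", "unknown")]
        recent.append(ts)
        # Drop timestamps that have fallen out of the window
        while recent and recent[0] <= ts - self.window:
            recent.popleft()
        return len(recent) >= self.threshold


if __name__ == "__main__":
    alert = ErrorRateAlert(threshold=3, window_seconds=60)
    for ts in (0, 10, 30):
        fired = alert.observe({"app": "appName", "level": "error"}, ts)
    print(fired)
```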

4. Issues Encountered and Future Work

Challenges include the need for per‑business log adapters, lack of documentation, high Elasticsearch memory usage, and the desire for a more user‑friendly operations console. Planned improvements cover SDK abstraction for log consumption, better testing environments, memory‑optimized Elasticsearch indices, UDP support in Flume, co‑location of HDFS with the log center, and advanced log mining.

5. Conclusion

The author shares this architecture to help others understand large‑scale log collection and processing, reflecting the practical difficulties of taking over a critical infrastructure component with limited hand‑over time.
