Industry Insights 10 min read

How Youzan Scaled Its Log Platform to Handle Billions of Daily Logs

This article details Youzan's evolution from a simple Flume‑based log collector to a multi‑tenant, Kafka‑buffered, Spark‑processed, HBase‑backed logging architecture that now handles hundreds of billions of log entries per day, highlighting challenges, design decisions, and future improvements.

Youzan Coder
Youzan Coder
Youzan Coder
How Youzan Scaled Its Log Platform to Handle Billions of Daily Logs

Overview

Log platform processes hundreds of billions of log entries per day (≈500k logs/s avg, 800k peak). Provides collection, transport, buffering, processing, storage, and query.

Original Architecture (2016)

Logs were collected via Flume or Logstash, sent to Kafka, processed by Track, Storm, Spark, stored in HDFS for offline analysis or Elasticsearch for real‑time search.

Problems

Inconsistent log formats increased parsing and query cost.

Multiple ingestion tools added operational overhead.

Large volume of user logs made error‑spike detection slow.

Elasticsearch default 3 primary + 3 replica shards caused hot nodes, bulk request rejections, and I/O concentration.

Bulk rejections were not retried, leading to data loss.

Seven‑day retention on SSDs became cost‑inefficient.

Lack of physical isolation made OOM in one index affect the whole cluster.

Current Pipeline

The pipeline consists of six stages: collection → transport → buffering → processing → storage → retrieval.

System architecture diagram
System architecture diagram

Log Ingestion

SDK ingestion : Language‑specific SDKs serialize logs into a unified protocol and send them over TCP to the rsyslog‑hub layer.

HTTP ingestion : Services that cannot use an SDK POST logs to a web portal, which forwards the payload to Kafka.

Transport / Collection

rsyslog‑hub and the web portal replace Flume. rsyslog runs on each host, forwards raw logs to a load‑balanced rsyslog‑hub cluster (LVS). The hub parses logs and determines the target Kafka topic.

Buffering

Kafka acts as a high‑throughput, fault‑tolerant buffer, decoupling producers from consumers and smoothing traffic spikes.

Processing

Spark Streaming jobs consume Kafka topics, allocate resources per business log volume via YARN, and write processed records to Elasticsearch. Alert rules defined in a management console are evaluated in‑memory; when a threshold is reached an alert is sent. Bulk request rejections from Elasticsearch are captured and the failed batches are re‑queued to Kafka for later retry.

Storage

Raw logs are persisted in HBase, reducing SSD pressure on Elasticsearch.

Searchable indices are created daily in Elasticsearch; shard count is calculated from historical volume to avoid hot shards.

Retention is limited to seven days; older partitions are archived.

Both HBase and Elasticsearch scale linearly as nodes are added.

Multi‑Tenant Isolation

Each tenant can have dedicated SDKs, optional private Kafka clusters, isolated YARN resource pools, and separate Elasticsearch/HBase deployments, preventing cross‑tenant resource contention.

Open Issues

End‑to‑end monitoring of log loss is missing, making precise alerting difficult.

One Kafka topic per log model with a fixed three‑partition layout causes load imbalance and unnecessary broker resources; the large topic count also increases latency.

The default Chinese IK‑max‑word analyzer is sub‑optimal for predominantly English logs, reducing search relevance.

Future Directions

Planned improvements include richer value extraction from logs, comprehensive loss monitoring, dynamic partitioning strategies for Kafka, and language‑agnostic analyzers to improve search quality.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsElasticsearchKafkaHBaseLog ManagementSparkYouzan
Youzan Coder
Written by

Youzan Coder

Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.