How Youzan Scaled Its Log Platform to Handle Billions of Daily Logs
This article details Youzan's evolution from a simple Flume‑based log collector to a multi‑tenant, Kafka‑buffered, Spark‑processed, HBase‑backed logging architecture that now handles hundreds of billions of log entries per day, highlighting challenges, design decisions, and future improvements.
Overview
Log platform processes hundreds of billions of log entries per day (≈500k logs/s avg, 800k peak). Provides collection, transport, buffering, processing, storage, and query.
Original Architecture (2016)
Logs were collected via Flume or Logstash, sent to Kafka, processed by Track, Storm, Spark, stored in HDFS for offline analysis or Elasticsearch for real‑time search.
Problems
Inconsistent log formats increased parsing and query cost.
Multiple ingestion tools added operational overhead.
Large volume of user logs made error‑spike detection slow.
Elasticsearch default 3 primary + 3 replica shards caused hot nodes, bulk request rejections, and I/O concentration.
Bulk rejections were not retried, leading to data loss.
Seven‑day retention on SSDs became cost‑inefficient.
Lack of physical isolation made OOM in one index affect the whole cluster.
Current Pipeline
The pipeline consists of six stages: collection → transport → buffering → processing → storage → retrieval.
Log Ingestion
SDK ingestion : Language‑specific SDKs serialize logs into a unified protocol and send them over TCP to the rsyslog‑hub layer.
HTTP ingestion : Services that cannot use an SDK POST logs to a web portal, which forwards the payload to Kafka.
Transport / Collection
rsyslog‑hub and the web portal replace Flume. rsyslog runs on each host, forwards raw logs to a load‑balanced rsyslog‑hub cluster (LVS). The hub parses logs and determines the target Kafka topic.
Buffering
Kafka acts as a high‑throughput, fault‑tolerant buffer, decoupling producers from consumers and smoothing traffic spikes.
Processing
Spark Streaming jobs consume Kafka topics, allocate resources per business log volume via YARN, and write processed records to Elasticsearch. Alert rules defined in a management console are evaluated in‑memory; when a threshold is reached an alert is sent. Bulk request rejections from Elasticsearch are captured and the failed batches are re‑queued to Kafka for later retry.
Storage
Raw logs are persisted in HBase, reducing SSD pressure on Elasticsearch.
Searchable indices are created daily in Elasticsearch; shard count is calculated from historical volume to avoid hot shards.
Retention is limited to seven days; older partitions are archived.
Both HBase and Elasticsearch scale linearly as nodes are added.
Multi‑Tenant Isolation
Each tenant can have dedicated SDKs, optional private Kafka clusters, isolated YARN resource pools, and separate Elasticsearch/HBase deployments, preventing cross‑tenant resource contention.
Open Issues
End‑to‑end monitoring of log loss is missing, making precise alerting difficult.
One Kafka topic per log model with a fixed three‑partition layout causes load imbalance and unnecessary broker resources; the large topic count also increases latency.
The default Chinese IK‑max‑word analyzer is sub‑optimal for predominantly English logs, reducing search relevance.
Future Directions
Planned improvements include richer value extraction from logs, comprehensive loss monitoring, dynamic partitioning strategies for Kafka, and language‑agnostic analyzers to improve search quality.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
