Design of a High-Performance Real-Time Data Processing System for Service Diagnosis
The paper presents a high‑performance real‑time pipeline for service diagnosis that collects, transports, preprocesses, and computes over service logs and metrics using Alibaba Logtail, LogHub, and an enhanced Flink engine (Blink), persisting root‑cause graphs in Lindorm. The system sustains tens of millions of events per second with sub‑3‑second end‑to‑end latency and cuts diagnosis time to about five seconds.
Background: Xianyu's production environment is increasingly complex, making rapid root‑cause diagnosis a challenge.
Requirements: real‑time data collection, analysis, complex computation, and persistence; support for diverse data (logs, metrics, traces); high reliability; latency under 3 s at tens of millions of events per second.
Input: service request logs (traceid, timestamps, client/server IPs, latency, status code, service name, method); environment monitoring metrics (CPU, JVM GC count/duration, DB metrics).
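The request-log fields above can be modeled as a small record type. A minimal parsing sketch follows; the JSON field names (`latency_ms`, `client_ip`, etc.) are assumptions for illustration, not the actual Logtail schema.

```python
import json
from dataclasses import dataclass

@dataclass
class RequestLog:
    """One service request log record (field names are assumed)."""
    traceid: str
    timestamp: int     # epoch milliseconds
    client_ip: str
    server_ip: str
    latency_ms: float
    status: int
    service: str
    method: str

def parse_request_log(line: str) -> RequestLog:
    """Parse one JSON-encoded request log line into a RequestLog."""
    d = json.loads(line)
    return RequestLog(
        traceid=d["traceid"],
        timestamp=d["timestamp"],
        client_ip=d["client_ip"],
        server_ip=d["server_ip"],
        latency_ms=d["latency_ms"],
        status=d["status"],
        service=d["service"],
        method=d["method"],
    )
```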
Output: root‑cause graph for service errors expressed as a directed acyclic graph.
Architecture: data flow divided into collection, transmission, preprocessing, computation, storage. Uses Alibaba Logtail + LogHub for collection and transport, Blink (enhanced Flink) for stream processing, TSDB for metrics, and Lindorm (enhanced HBase) for graph persistence.
Collection: Logtail with custom plugins (inputs, processors, aggregators, flushers) aggregates multiple metrics in a single goroutine to reduce I/O.
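The aggregation idea — buffer many metric records and flush them in a single write to cut I/O — can be sketched as follows. Logtail plugins are written in Go; this is a hedged Python simulation, and the class name, batch size, and age limit are invented for illustration.

```python
import time
from typing import Callable

class MetricAggregator:
    """Buffers metric records and flushes them in one write to reduce I/O.
    Simplified stand-in for the Logtail aggregator idea; names are assumptions."""

    def __init__(self, flush: Callable[[list], None],
                 max_batch: int = 100, max_age_s: float = 3.0):
        self.flush = flush          # downstream sink, called once per batch
        self.max_batch = max_batch
        self.max_age_s = max_age_s
        self.buf = []
        self.first_add = None

    def add(self, metric: dict) -> None:
        """Buffer one metric; flush when the batch is full or too old."""
        if self.first_add is None:
            self.first_add = time.monotonic()
        self.buf.append(metric)
        if (len(self.buf) >= self.max_batch
                or time.monotonic() - self.first_add >= self.max_age_s):
            self.drain()

    def drain(self) -> None:
        """Emit the whole buffer in a single I/O call."""
        if self.buf:
            self.flush(self.buf)
            self.buf = []
            self.first_add = None
```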
Transmission: LogHub acts as a stable pub/sub channel; partition count matches Blink parallelism.
Preprocessing: Blink performs stateful joins of error request logs with trace logs using two streams and a 1‑minute state TTL.
state.backend.type=niagara
state.backend.niagara.ttl.ms=60000
Performance tweaks: enable micro‑batch/mini‑batch processing, set latency and size limits, and use Dynamic‑Rebalance to avoid hotspots.
blink.miniBatch.join.enabled=true
blink.miniBatch.allowLatencyMs=5000
blink.miniBatch.size=20000
task.dynamic.rebalance.enabled=true
A custom output plugin writes aggregated trace data to an RDB and notifies downstream services via MetaQ, reducing traffic.
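The preprocessing step above, a stateful join of two streams keyed by traceid with a one-minute state TTL, can be sketched in miniature. This is not the Blink/Niagara implementation; the class and method names are assumptions, and state expiry is simplified to lazy cleanup on access.

```python
import time

class TtlJoin:
    """Sketch of a two-stream stateful join with a 60 s state TTL, mimicking
    the error-log/trace-log join described above (names are assumptions)."""

    def __init__(self, ttl_s: float = 60.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.errors = {}   # traceid -> (record, arrival time)
        self.traces = {}   # traceid -> (record, arrival time)

    def _expire(self, state: dict) -> None:
        """Lazily drop state entries older than the TTL."""
        now = self.clock()
        for k in [k for k, (_, t) in state.items() if now - t > self.ttl_s]:
            del state[k]

    def on_error(self, traceid, rec):
        return self._join(traceid, rec, self.errors, self.traces)

    def on_trace(self, traceid, rec):
        return self._join(traceid, rec, self.traces, self.errors)

    def _join(self, traceid, rec, own, other):
        """Emit a joined pair if the other stream already saw this traceid;
        otherwise buffer the record in this stream's keyed state."""
        self._expire(own)
        self._expire(other)
        if traceid in other:
            return (traceid, rec, other[traceid][0])
        own[traceid] = (rec, self.clock())
        return None
```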
Graph aggregation: node names are inserted into a Redis sorted set, and ZRANK returns each node's rank, which serves as its unique ID; node occurrence counts and edge pairs are kept in Redis sets so the error DAG can be reconstructed.
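The ID-assignment and edge-set bookkeeping can be illustrated without a live Redis. In the sketch below an in-memory dict stands in for the sorted set (insertion order replaces ZRANK's rank); the class and method names are invented for illustration.

```python
class GraphAggregator:
    """In-memory stand-in for the Redis-based graph aggregation: each node
    name gets a stable unique ID (Redis derives this from ZADD + ZRANK),
    and edges are kept in a set to reconstruct the error DAG."""

    def __init__(self):
        self.ids = {}       # node name -> unique id (stand-in for ZRANK)
        self.counts = {}    # node name -> occurrence count
        self.edges = set()  # (src id, dst id) pairs

    def node_id(self, name: str) -> int:
        """Assign a stable ID on first sight and bump the node's count."""
        if name not in self.ids:
            self.ids[name] = len(self.ids)
        self.counts[name] = self.counts.get(name, 0) + 1
        return self.ids[name]

    def add_edge(self, src: str, dst: str) -> None:
        """Record a directed edge between two (possibly new) nodes."""
        self.edges.add((self.node_id(src), self.node_id(dst)))

    def dag(self):
        """Return (id -> name mapping, edge set) for DAG reconstruction."""
        return {i: n for n, i in self.ids.items()}, self.edges
```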
Benefits: end‑to‑end latency under 3 s, diagnosis time reduced from minutes to ~5 s, supporting tens of millions of events per second.
Outlook: scaling to more Alibaba services, further data compression, in‑stream model analysis, multi‑tenant isolation.
Xianyu Technology
Official account of the Xianyu technology team