How Xianyu Built a Sub‑3‑Second Real‑Time Data Pipeline for Rapid Fault Diagnosis
Xianyu’s production environment grew complex, prompting the creation of a high‑performance, sub‑3‑second real‑time data processing pipeline that ingests logs and metrics, uses Alibaba’s Logtail, LogHub, and Blink (enhanced Flink) for collection, transport, pre‑processing, computation, and persistent graph‑based fault analysis.
Background
In Xianyu’s production environment the number of horizontally dependent services and vertically layered runtime environments grew rapidly, making root‑cause identification time‑consuming. When an incident occurred, locating the cause could exceed ten minutes, prompting the design of an automatic, high‑performance real‑time data processing pipeline.
Input and Output Definition
Input consists of:
Service request logs containing traceid, timestamp, client IP, server IP, latency, return code, service name and method name.
Environment monitoring metrics (CPU, JVM GC count, GC time, database metrics, etc.) with metric name, host IP, timestamp and value.
Output is, for a given time window, a directed acyclic graph (DAG) that represents the root cause of errors for each service. The root node is the observed error; leaf nodes are underlying causes such as external service failures or JVM exceptions.
Architecture Overview
Real‑time data flows through five stages analogous to water moving through pipes: collection, transport, pre‑processing, computation, and storage.
Collection uses Alibaba’s self‑developed SLS log service (Logtail + LogHub). Logtail provides high performance, reliability and a flexible plugin mechanism for custom data collection.
Transport relies on LogHub, a stable publish‑subscribe component similar to Kafka, which guarantees ordered, loss‑less delivery.
Pre‑processing is performed with Blink, Alibaba’s internally enhanced Flink version. Blink was chosen over JStorm and Spark‑Streaming for its state management, low latency and SQL support.
After pre‑processing, call‑chain logs are stored in a CEP/graph service and monitoring data in a time‑series database (TSDB). The aggregated call‑chain graph is persisted in Lindorm (an enhanced HBase) for online queries.
Detailed Design and Optimizations
1. Collection
Logtail uses four plugin types: inputs (data sources), processors (transformations), aggregators (grouping) and flushers (sinks). Custom input plugins batch multiple metric requests into a single goroutine, reducing I/O. Processors split the combined JSON array into individual records.
2. Transport
Logtail writes directly to LogHub; Blink consumes the data. The number of LogHub partitions must be at least the Blink parallelism to avoid idle tasks.
3. Pre‑processing
Two parallel streams are created:
Request‑entry logs are filtered for error requests.
Intermediate call‑chain logs are joined with the first stream on traceid to enrich error‑related data.
4. State Lifetime
State is stored in Niagara with a TTL of 60 seconds to balance memory usage and join completeness.
state.backend.type=niagara
state.backend.niagara.ttl.ms=600005. MicroBatch / MiniBatch
MicroBatch and MiniBatch batch records before processing, reducing state accesses and output volume.
blink.miniBatch.join.enabled=true
blink.miniBatch.allowLatencyMs=5000
blink.miniBatch.size=200006. Dynamic Rebalance
Dynamic Rebalance distributes load across sub‑partitions based on buffer occupancy, preventing hotspots.
task.dynamic.rebalance.enabled=true7. Custom Output Plugin
Instead of a message queue, a custom plugin writes aggregated request‑chain data to an RDB with expiration. Only the traceid is sent via MetaQ to downstream services, drastically reducing MetaQ traffic.
8. Graph Aggregation
Redis ZRANK assigns a global sequence number to each node. Each node receives a unique key per time bucket, enabling counting and edge accumulation via Redis sets, which allows reconstruction of the aggregated DAG.
Assign global sequence numbers to nodes.
Encode nodes (head: seq|timestamp|code; normal: |timestamp|code).
Count nodes using Redis keys.
Accumulate edges via Redis sets.
Record the root node to traverse the final graph.
Performance Results
The end‑to‑end latency stays below three seconds, and fault‑location time dropped from over ten minutes to about five seconds, greatly improving operational efficiency.
Future Work
Planned improvements include automatic data reduction/compression, moving complex model calculations into Blink, and adding multi‑tenant data isolation to support larger workloads.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
