Big Data 13 min read

How Xianyu Built a Sub‑3‑Second Real‑Time Data Pipeline for Rapid Fault Diagnosis

Xianyu’s production environment grew complex, prompting the creation of a high‑performance, sub‑3‑second real‑time data processing pipeline that ingests logs and metrics, uses Alibaba’s Logtail, LogHub, and Blink (enhanced Flink) for collection, transport, pre‑processing, computation, and persistent graph‑based fault analysis.

dbaplus Community

Aug 13, 2019

How Xianyu Built a Sub‑3‑Second Real‑Time Data Pipeline for Rapid Fault Diagnosis

Background

In Xianyu’s production environment the number of horizontally dependent services and vertically layered runtime environments grew rapidly, making root‑cause identification time‑consuming. When an incident occurred, locating the cause could exceed ten minutes, prompting the design of an automatic, high‑performance real‑time data processing pipeline.

Input and Output Definition

Input consists of:

Service request logs containing traceid, timestamp, client IP, server IP, latency, return code, service name and method name.

Environment monitoring metrics (CPU, JVM GC count, GC time, database metrics, etc.) with metric name, host IP, timestamp and value.

Output is, for a given time window, a directed acyclic graph (DAG) that represents the root cause of errors for each service. The root node is the observed error; leaf nodes are underlying causes such as external service failures or JVM exceptions.

Architecture Overview

Real‑time data flows through five stages analogous to water moving through pipes: collection, transport, pre‑processing, computation, and storage.

Collection uses Alibaba’s self‑developed SLS log service (Logtail + LogHub). Logtail provides high performance, reliability and a flexible plugin mechanism for custom data collection.

Transport relies on LogHub, a stable publish‑subscribe component similar to Kafka, which guarantees ordered, loss‑less delivery.

Pre‑processing is performed with Blink, Alibaba’s internally enhanced Flink version. Blink was chosen over JStorm and Spark‑Streaming for its state management, low latency and SQL support.

After pre‑processing, call‑chain logs are stored in a CEP/graph service and monitoring data in a time‑series database (TSDB). The aggregated call‑chain graph is persisted in Lindorm (an enhanced HBase) for online queries.

Detailed Design and Optimizations

1. Collection

Logtail uses four plugin types: inputs (data sources), processors (transformations), aggregators (grouping) and flushers (sinks). Custom input plugins batch multiple metric requests into a single goroutine, reducing I/O. Processors split the combined JSON array into individual records.

2. Transport

Logtail writes directly to LogHub; Blink consumes the data. The number of LogHub partitions must be at least the Blink parallelism to avoid idle tasks.

3. Pre‑processing

Two parallel streams are created:

Request‑entry logs are filtered for error requests.

Intermediate call‑chain logs are joined with the first stream on traceid to enrich error‑related data.

4. State Lifetime

State is stored in Niagara with a TTL of 60 seconds to balance memory usage and join completeness.

state.backend.type=niagara
state.backend.niagara.ttl.ms=60000

5. MicroBatch / MiniBatch

MicroBatch and MiniBatch batch records before processing, reducing state accesses and output volume.

blink.miniBatch.join.enabled=true
blink.miniBatch.allowLatencyMs=5000
blink.miniBatch.size=20000

6. Dynamic Rebalance

Dynamic Rebalance distributes load across sub‑partitions based on buffer occupancy, preventing hotspots.

task.dynamic.rebalance.enabled=true

7. Custom Output Plugin

Instead of a message queue, a custom plugin writes aggregated request‑chain data to an RDB with expiration. Only the traceid is sent via MetaQ to downstream services, drastically reducing MetaQ traffic.

8. Graph Aggregation

Redis ZRANK assigns a global sequence number to each node. Each node receives a unique key per time bucket, enabling counting and edge accumulation via Redis sets, which allows reconstruction of the aggregated DAG.

Assign global sequence numbers to nodes.

Encode nodes (head: seq|timestamp|code; normal: |timestamp|code).

Count nodes using Redis keys.

Accumulate edges via Redis sets.

Record the root node to traverse the final graph.

Performance Results

The end‑to‑end latency stays below three seconds, and fault‑location time dropped from over ten minutes to about five seconds, greatly improving operational efficiency.

Future Work

Planned improvements include automatic data reduction/compression, moving complex model calculations into Blink, and adding multi‑tenant data isolation to support larger workloads.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance Optimization blink Fault diagnosis logtail

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.