Design of a High-Performance Real-Time Data Processing System for Service Diagnosis
The paper presents a high‑performance real‑time pipeline for service diagnosis that collects, transports, preprocesses, and computes over service logs and metrics using Alibaba Logtail, LogHub, and an enhanced Flink engine (Blink), persisting root‑cause graphs in Lindorm. The system sustains tens of millions of events per second with sub‑3‑second end‑to‑end latency and cuts diagnosis time to about five seconds.
Background: Xianyu's production environment is increasingly complex, making rapid root‑cause diagnosis a challenge.
Requirements: real‑time data collection, analysis, complex computation, and persistence; support for diverse data (logs, metrics, traces); high reliability; latency under 3 s at tens of millions of events per second.
Input: service request logs (traceid, timestamps, client/server IPs, latency, status code, service name, method); environment monitoring metrics (CPU, JVM GC count/duration, DB metrics).
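The request-log fields above can be modeled as a small record type. A minimal parsing sketch follows; the JSON field names (`latency_ms`, `client_ip`, etc.) are assumptions for illustration, not the actual Logtail schema.

```python
import json
from dataclasses import dataclass

@dataclass
class RequestLog:
    """One service request log record (field names are assumed)."""
    traceid: str
    timestamp: int     # epoch milliseconds
    client_ip: str
    server_ip: str
    latency_ms: float
    status: int
    service: str
    method: str

def parse_request_log(line: str) -> RequestLog:
    """Parse one JSON-encoded request log line into a RequestLog."""
    d = json.loads(line)
    return RequestLog(
        traceid=d["traceid"],
        timestamp=d["timestamp"],
        client_ip=d["client_ip"],
        server_ip=d["server_ip"],
        latency_ms=d["latency_ms"],
        status=d["status"],
        service=d["service"],
        method=d["method"],
    )
```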
Output: root‑cause graph for service errors expressed as a directed acyclic graph.
Architecture: data flow divided into collection, transmission, preprocessing, computation, storage. Uses Alibaba Logtail + LogHub for collection and transport, Blink (enhanced Flink) for stream processing, TSDB for metrics, and Lindorm (enhanced HBase) for graph persistence.
Collection: Logtail with custom plugins (inputs, processors, aggregators, flushers) aggregates multiple metrics in a single goroutine to reduce I/O.
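The aggregation idea — buffer many metric records and flush them in a single write to cut I/O — can be sketched as follows. Logtail plugins are written in Go; this is a hedged Python simulation, and the class name, batch size, and age limit are invented for illustration.

```python
import time
from typing import Callable

class MetricAggregator:
    """Buffers metric records and flushes them in one write to reduce I/O.
    Simplified stand-in for the Logtail aggregator idea; names are assumptions."""

    def __init__(self, flush: Callable[[list], None],
                 max_batch: int = 100, max_age_s: float = 3.0):
        self.flush = flush          # downstream sink, called once per batch
        self.max_batch = max_batch
        self.max_age_s = max_age_s
        self.buf = []
        self.first_add = None

    def add(self, metric: dict) -> None:
        """Buffer one metric; flush when the batch is full or too old."""
        if self.first_add is None:
            self.first_add = time.monotonic()
        self.buf.append(metric)
        if (len(self.buf) >= self.max_batch
                or time.monotonic() - self.first_add >= self.max_age_s):
            self.drain()

    def drain(self) -> None:
        """Emit the whole buffer in a single I/O call."""
        if self.buf:
            self.flush(self.buf)
            self.buf = []
            self.first_add = None
```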
Transmission: LogHub acts as a stable pub/sub channel; partition count matches Blink parallelism.
Preprocessing: Blink performs stateful joins of error request logs with trace logs using two streams and a 1‑minute state TTL.
state.backend.type=niagara
state.backend.niagara.ttl.ms=60000
Performance tweaks: enable micro‑batch/mini‑batch processing, set latency and size limits, and use Dynamic‑Rebalance to avoid hotspots.
blink.miniBatch.join.enabled=true
blink.miniBatch.allowLatencyMs=5000
blink.miniBatch.size=20000
task.dynamic.rebalance.enabled=true
A custom output plugin writes aggregated trace data to an RDB and notifies downstream services via MetaQ, reducing traffic.
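The preprocessing step above, a stateful join of two streams keyed by traceid with a one-minute state TTL, can be sketched in miniature. This is not the Blink/Niagara implementation; the class and method names are assumptions, and state expiry is simplified to lazy cleanup on access.

```python
import time

class TtlJoin:
    """Sketch of a two-stream stateful join with a 60 s state TTL, mimicking
    the error-log/trace-log join described above (names are assumptions)."""

    def __init__(self, ttl_s: float = 60.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.errors = {}   # traceid -> (record, arrival time)
        self.traces = {}   # traceid -> (record, arrival time)

    def _expire(self, state: dict) -> None:
        """Lazily drop state entries older than the TTL."""
        now = self.clock()
        for k in [k for k, (_, t) in state.items() if now - t > self.ttl_s]:
            del state[k]

    def on_error(self, traceid, rec):
        return self._join(traceid, rec, self.errors, self.traces)

    def on_trace(self, traceid, rec):
        return self._join(traceid, rec, self.traces, self.errors)

    def _join(self, traceid, rec, own, other):
        """Emit a joined pair if the other stream already saw this traceid;
        otherwise buffer the record in this stream's keyed state."""
        self._expire(own)
        self._expire(other)
        if traceid in other:
            return (traceid, rec, other[traceid][0])
        own[traceid] = (rec, self.clock())
        return None
```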
Graph aggregation: node names are inserted into a Redis sorted set, and ZRANK returns each node's rank, which serves as its unique ID; node occurrence counts and edge pairs are kept in Redis sets so the error DAG can be reconstructed.
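The ID-assignment and edge-set bookkeeping can be illustrated without a live Redis. In the sketch below an in-memory dict stands in for the sorted set (insertion order replaces ZRANK's rank); the class and method names are invented for illustration.

```python
class GraphAggregator:
    """In-memory stand-in for the Redis-based graph aggregation: each node
    name gets a stable unique ID (Redis derives this from ZADD + ZRANK),
    and edges are kept in a set to reconstruct the error DAG."""

    def __init__(self):
        self.ids = {}       # node name -> unique id (stand-in for ZRANK)
        self.counts = {}    # node name -> occurrence count
        self.edges = set()  # (src id, dst id) pairs

    def node_id(self, name: str) -> int:
        """Assign a stable ID on first sight and bump the node's count."""
        if name not in self.ids:
            self.ids[name] = len(self.ids)
        self.counts[name] = self.counts.get(name, 0) + 1
        return self.ids[name]

    def add_edge(self, src: str, dst: str) -> None:
        """Record a directed edge between two (possibly new) nodes."""
        self.edges.add((self.node_id(src), self.node_id(dst)))

    def dag(self):
        """Return (id -> name mapping, edge set) for DAG reconstruction."""
        return {i: n for n, i in self.ids.items()}, self.edges
```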
Benefits: end‑to‑end latency under 3 s, diagnosis time reduced from minutes to ~5 s, supporting tens of millions of events per second.
Outlook: scaling to more Alibaba services, further data compression, in‑stream model analysis, multi‑tenant isolation.
Xianyu Technology
Official account of the Xianyu technology team