How TcpRT Enables Real‑Time Service Quality Monitoring for Massive Cloud Databases
TcpRT is a real‑time instrumentation and diagnostic system for Alibaba Cloud RDS that non‑intrusively collects TCP trace data, aggregates billions of records per day, applies statistical and Cauchy‑based anomaly detection, and pinpoints root causes across hosts, proxies, and network devices at massive scale.
Introduction
TcpRT is described in a paper accepted at SIGMOD 2018 that presents Alibaba Cloud RDS's real-time service-quality collection and diagnostic system. The work covers SLA data collection, metric computation, anomaly detection, root-cause analysis, and large-scale automated deployment for cloud database customers.
Contributions
Proposes a non‑intrusive, low‑cost kernel‑based method to collect per‑connection latency, bandwidth, and network quality metrics for relational databases, quantifying the impact of network loss and retransmission on service quality.
Develops a streaming ETL system that cleans, filters, aggregates, and analyzes raw trace data with horizontal scalability, fault tolerance, exactly‑once semantics, and interoperability with platforms like EMR and MaxCompute.
Introduces a novel algorithm that detects service‑quality anomalies and locates their root causes.
Problem
Cloud databases are critical for enterprise stability, and rapid detection and diagnosis of performance anomalies is challenging. TcpRT captures TCP/IP congestion‑control trace data to monitor database latency and network anomalies, performing large‑scale real‑time analysis and using Cauchy distribution‑based statistical methods to identify abnormal points and component‑level failure probabilities.
Architecture
The system consists of four main components:
Kernel module – collects raw trace metrics (query latency, proxy and DB connection metrics).
Local aggregator – aggregates trace data locally and pushes it to a Kafka queue.
Streaming ETL – cleans, aggregates, and analyzes time‑series data on a backend streaming platform, separating hot and cold data.
Online anomaly detection – fits anomaly models, performs real‑time event judgment, and uses network relationship graphs to compute component‑level anomaly probabilities.
TcpRT Kernel Module
The module monitors the full lifecycle of TCP connections (receive, handle, response) and calculates:
Upstream time = T1 - T0
Processing time = T2 - T1
Downstream time = T3 - T2
Query time = T3 - T0
RTT time = T2 - T2'
It leverages a modified Linux congestion‑control algorithm to obtain per‑packet ACK context without intrusive instrumentation, ensuring low overhead and high performance (less than 1% load impact in sysbench tests).
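The timing decomposition above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the kernel module itself: the QueryTrace class, its field names, and the use of integer microseconds are assumptions made for the example. RTT is left out because the text does not define T2'.

```python
from dataclasses import dataclass


@dataclass
class QueryTrace:
    """Hypothetical container for the four timestamps T0..T3 (microseconds)."""
    t0: int
    t1: int
    t2: int
    t3: int


def decompose(trace: QueryTrace) -> dict:
    """Split one query's lifetime into the intervals listed in the text."""
    return {
        "upstream": trace.t1 - trace.t0,
        "processing": trace.t2 - trace.t1,
        "downstream": trace.t3 - trace.t2,
        "query": trace.t3 - trace.t0,
    }


parts = decompose(QueryTrace(t0=0, t1=200, t2=700, t3=1000))
print(parts)
```

By construction, the upstream, processing, and downstream intervals always sum to the query time, which makes a convenient sanity check on collected traces.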
TcpRT Aggregator
The aggregator combines millions of trace records per second, writes the aggregated results to /dev/shm, and forwards them through Logagent to Kafka for downstream ETL processing.
TcpRT ETL
The ETL pipeline performs data conversion, association, aggregation, and storage, and it handles delayed and out-of-order records. It uses a "best-effort aggregation" window that emits partial results while still accepting late-arriving data, which enables multi-stage aggregation.
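One way to picture best-effort aggregation is a window that flushes whatever it has accumulated as a partial result and keeps accepting records for windows it has already flushed, leaving a later stage to merge the partials. The sketch below is a guess at the general shape under that assumption. BestEffortWindow, merge, and the (count, latency sum) aggregate are hypothetical names and choices, not taken from the paper.

```python
from collections import defaultdict


class BestEffortWindow:
    """Tumbling-window aggregator that emits partial results on every flush."""

    def __init__(self, window_seconds: int):
        self.window_seconds = window_seconds
        # (key, window_start) -> [record count, latency sum]
        self.pending = defaultdict(lambda: [0, 0])

    def add(self, key: str, timestamp: int, latency: int) -> None:
        window_start = timestamp - timestamp % self.window_seconds
        slot = self.pending[(key, window_start)]
        slot[0] += 1
        slot[1] += latency

    def flush(self) -> list:
        """Emit everything accumulated so far, then start over.

        A late record for an already-flushed window simply produces
        another partial for that window on the next flush.
        """
        partials = [(key, start, count, total)
                    for (key, start), (count, total) in self.pending.items()]
        self.pending.clear()
        return partials


def merge(partials: list) -> dict:
    """Second aggregation stage: combine partials for the same key and window."""
    merged = defaultdict(lambda: [0, 0])
    for key, start, count, total in partials:
        merged[(key, start)][0] += count
        merged[(key, start)][1] += total
    return {k: tuple(v) for k, v in merged.items()}


window = BestEffortWindow(window_seconds=60)
window.add("db1", 5, 10)
window.add("db1", 50, 20)
first = window.flush()        # partial result for window 0
window.add("db1", 30, 30)     # late record for window 0
window.add("db1", 65, 5)
second = window.flush()
print(merge(first + second))
```

Because counts and sums are associative, the merge stage yields the same totals regardless of how many partials a window was split into.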
Online Anomaly Monitoring
Traditional threshold‑based detection is brittle; TcpRT adopts adaptive statistical models. It replaces mean and standard deviation with median and MAD for robustness, fits a Cauchy distribution, and uses control‑chart‑like confidence intervals to flag anomalies without manual thresholds.
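A minimal sketch of this idea follows, assuming the Cauchy location is set to the sample median and the scale to the MAD (for a Cauchy distribution the quartiles sit exactly one scale unit from the location, so the MAD estimates the scale directly). The function names and the 0.95 coverage default are assumptions for illustration. The paper's actual fitting procedure and interval width may differ.

```python
import math
import statistics


def fit_cauchy(samples):
    """Robust fit: location = median, scale = median absolute deviation."""
    x0 = statistics.median(samples)
    gamma = statistics.median([abs(x - x0) for x in samples])
    return x0, gamma


def cauchy_interval(x0, gamma, coverage=0.95):
    """Central interval containing `coverage` of a Cauchy(x0, gamma).

    Derived from the Cauchy quantile function
    Q(p) = x0 + gamma * tan(pi * (p - 0.5)).
    """
    half_width = gamma * math.tan(math.pi * coverage / 2)
    return x0 - half_width, x0 + half_width


def is_anomalous(value, history, coverage=0.95):
    """Flag a value that falls outside the fitted interval.

    Note: if the MAD is zero (constant history), the interval collapses
    to a point and any different value is flagged.
    """
    x0, gamma = fit_cauchy(history)
    low, high = cauchy_interval(x0, gamma, coverage)
    return value < low or value > high


latencies = [10, 11, 9, 10, 12, 10, 9, 11, 10, 50]
print(fit_cauchy(latencies))
print(is_anomalous(50, latencies), is_anomalous(12, latencies))
```

The single outlier of 50 in the history barely moves the median or the MAD, whereas it would inflate a mean and standard deviation considerably. That insensitivity is the robustness argument made above.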
Host Anomaly Detection
By analyzing consistency trends across all instances on a host, TcpRT computes a trend‑consistency score (r). Significant positive or negative r values indicate host‑level issues such as IO hangs, with additional weighting from host resource metrics.
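The text does not give the formula for r, so the following is only one plausible reading: compare each instance's latency across two adjacent windows and take the fraction that rose minus the fraction that fell. The function name, the tolerance parameter, and the definition itself are assumptions for illustration, and the weighting by host resource metrics is omitted.

```python
def trend_consistency(previous: dict, current: dict, tolerance: float = 0.0) -> float:
    """Hypothetical trend-consistency score in [-1, 1].

    previous/current map instance id -> latency in two adjacent windows.
    +1 means every instance on the host got slower, -1 means every
    instance got faster, and values near 0 mean no shared trend.
    """
    common = previous.keys() & current.keys()
    if not common:
        return 0.0
    rising = sum(1 for i in common if current[i] - previous[i] > tolerance)
    falling = sum(1 for i in common if previous[i] - current[i] > tolerance)
    return (rising - falling) / len(common)


before = {"a": 10, "b": 20, "c": 30, "d": 40}
after = {"a": 15, "b": 25, "c": 35, "d": 39}
print(trend_consistency(before, after))
```

Under this reading, a score near either extreme suggests that something shared by all instances, such as the host itself, is driving the change, rather than one tenant's workload.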
Proxy Anomaly Detection
Proxy nodes handle thousands of instance requests. TcpRT measures proxy‑relay time (prT) by decomposing end‑to‑end timestamps, calculates the proportion of instances with excessive prT, and applies the same r‑based statistical detection, supplemented by absolute ratio thresholds for chronic issues.
Network Anomaly Detection
TcpRT captures TCP reorder, retransmission, RTT jitter, and RST counts to identify faulty TOR switches. It builds a bipartite graph of proxies and DB nodes, marks edges with abnormal events, and computes a weighted anomaly score (count^1.5/total) for each TOR pair.
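The scoring step can be sketched as follows, assuming each proxy and DB node is mapped to the TOR switch it sits behind and each proxy-to-DB edge carries a flag for whether abnormal events were observed on it. The function name, the edge tuple layout, and the tor_of mapping are hypothetical. Only the count^1.5/total formula comes from the text.

```python
from collections import defaultdict


def tor_pair_scores(edges, tor_of):
    """Score each (proxy TOR, DB TOR) pair as abnormal_count ** 1.5 / total_count.

    edges:  iterable of (proxy, db_node, is_abnormal) tuples
    tor_of: dict mapping every proxy and DB node to its TOR switch id
    """
    abnormal = defaultdict(int)
    total = defaultdict(int)
    for proxy, db_node, is_abnormal in edges:
        pair = (tor_of[proxy], tor_of[db_node])
        total[pair] += 1
        if is_abnormal:
            abnormal[pair] += 1
    return {pair: abnormal[pair] ** 1.5 / total[pair] for pair in total}


tor_of = {"p1": "torA", "p2": "torA", "d1": "torB", "d2": "torB", "d3": "torC"}
edges = [
    ("p1", "d1", True), ("p1", "d2", True),
    ("p2", "d1", True), ("p2", "d2", True),
    ("p1", "d3", False), ("p2", "d3", True),
]
for pair, score in sorted(tor_pair_scores(edges, tor_of).items()):
    print(pair, round(score, 3))
```

An exponent above 1 means that, at the same abnormal ratio, a pair with more abnormal edges scores higher: 4 abnormal out of 4 yields 2.0, while 1 out of 1 yields 1.0. This favors pairs with broader evidence over pairs with a single bad edge.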
Conclusion
TcpRT continuously processes 20 million trace records per second, handles billions of records daily, and detects anomalies within seconds, operating stably for three years. The system will be productized for RDS customers, and further algorithms are being explored to uncover additional abnormal behaviors.
Authors: Ming Song, Jian Chuan, Bing Bao, Zhong Ju, Qian Qing, Wang Lan, Ming Shu.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
