How DBSCAN Clustering and Bayesian Inference Enable Fast Root‑Cause Detection in Securities Trading Systems
This article details the challenges of root‑cause identification in high‑availability securities trading platforms and presents two intelligent‑operations solutions—DBSCAN‑based clustering and Bayesian inference—to quickly locate anomalies and improve recovery efficiency.
The securities industry demands extreme continuity and stability, making rapid identification of transaction system anomalies crucial for preserving customer experience. Traditional root‑cause methods struggle because of many‑to‑many machine architectures and the explosion of alerts from modern monitoring tools.
1. Emergence of Intelligent Operations
As fintech adoption, product complexity, and regulatory pressure grow, IT infrastructures become more intricate, with growing machine counts and tighter inter‑system dependencies. Human‑centric, rule‑based operations can no longer meet the requirement for uninterrupted service, prompting a shift toward data‑driven, AI‑enhanced operational strategies.
Since 2015, CICC Wealth Securities has built an intelligent‑operations framework based on big data and AI, covering prediction before an incident, real‑time detection during it, and analysis afterward. This framework improves system safety, reduces repetitive manual tasks, and creates a closed loop that links business needs, monitoring evolution, algorithm deployment, and automation.
2. Intelligent Root‑Cause Detection Approaches
Two complementary methods were developed to address the main pain points:
DBSCAN‑Based Clustering: When an anomaly is detected, the system clusters the time‑series metrics of the affected front‑end machine together with those of all potentially related back‑end machines. Back‑end machines that fall into the same dense cluster as the front‑end machine are the most likely root causes. This handles dynamic many‑to‑many relationships where traditional link tracing fails.
Bayesian Inference: Monitoring tools generate numerous alerts, many of them unrelated to the actual failure. Using the conditional probability P(Y|X) = P(X|Y) * P(Y) / P(X), where Y is an anomaly event and X an alert, the method ranks alerts by how likely an anomaly is given that the alert fired, filtering out noise.
3. DBSCAN Clustering Solution
The solution consists of four functions: anomaly detection, alert compression, feature processing, and clustering. Using a mobile‑trading channel (A‑system) as a case study, the following steps were implemented:
Baseline detection: an anomaly is declared when a metric violates its baseline N=3 times within a window of M=5 minutes.
Configuration‑change correction algorithm: compute the baseline mean (BLVS), derive the difference series (DSVS) between real values and baseline, filter anomalies to obtain a corrected series (ADVS), and adjust metrics by the bias.
Sliding t‑test to filter out false positives caused by baseline drift, which reduced alerts by more than 10%.
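The detection rule and the drift filter listed above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the production implementation: it reads the M/N parameters as "at least 3 violations among the last 5 one‑minute samples", stands in a fixed upper bound for the real baseline, and uses Welch's t statistic to compare two adjacent windows. The names flag_anomalies and welch_t and all sample values are hypothetical.

```python
from math import copysign, sqrt
from statistics import mean, variance


def flag_anomalies(values, upper, m=5, n=3):
    """Return indices where at least n of the last m samples exceed `upper`.

    Assumption: one sample per minute, so m samples span M minutes.
    """
    violations = [v > upper for v in values]
    flagged = []
    for i in range(len(values)):
        window = violations[max(0, i - m + 1): i + 1]
        if sum(window) >= n:
            flagged.append(i)
    return flagged


def welch_t(before, after):
    """Welch's t statistic for the difference in means of two windows."""
    diff = mean(after) - mean(before)
    se = sqrt(variance(before) / len(before) + variance(after) / len(after))
    if se == 0:
        # Both windows are constant: no shift, or an unambiguous one.
        return 0.0 if diff == 0 else copysign(float("inf"), diff)
    return diff / se


# Hypothetical response times in milliseconds, one sample per minute.
response_ms = [100, 100, 300, 100, 100, 300, 300, 100, 300, 100]
flagged = flag_anomalies(response_ms, upper=200)

# A persistent level shift (for example after a configuration change)
# produces a large |t| between the window before and the window after.
old_level = [10, 11, 9, 10, 10, 11, 9, 10]
new_level = [20, 21, 19, 20, 20, 21, 19, 20]
t_stat = welch_t(old_level, new_level)
```

A sliding version would call welch_t on each pair of adjacent windows along the series and treat a large absolute value as a sign that the baseline itself has moved, so that violations after that point can be re‑examined rather than raised as anomalies.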
To measure the distance between metric series, three distance measures were evaluated: Euclidean distance, Pearson correlation‑based distance, and Spearman correlation‑based distance. The Pearson and Spearman distances proved more accurate than Euclidean distance at detecting synchronous changes between front‑end and back‑end response times.
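The difference between the three measures is easiest to see on two series that move together but on different scales. The sketch below is illustrative only: it assumes the common definition of a correlation‑based distance as 1 minus the correlation coefficient, which the article does not spell out, and the sample values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def ranks(x):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    result = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def euclidean_distance(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    return 1.0 - pearson(x, y)


def spearman_distance(x, y):
    # Spearman correlation is Pearson correlation computed on the ranks.
    return 1.0 - pearson(ranks(x), ranks(y))


# Hypothetical response times: the back end spikes in step with the front end
# but at one tenth of the magnitude.
front = [100, 105, 98, 400, 420, 110]
back = [v / 10 for v in front]

d_euclid = euclidean_distance(front, back)
d_pearson = pearson_distance(front, back)
d_spearman = spearman_distance(front, back)
```

Here the Euclidean distance is large because the two machines respond at different magnitudes, while both correlation‑based distances are essentially zero because the series rise and fall together, which is the behavior of interest when looking for synchronous changes.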
DBSCAN clustering, which does not require a preset number of clusters, was applied. The workflow (see Fig. 2.4) clusters metric series, identifies the nearest back‑end machines, and pinpoints the root cause.
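The clustering step can be illustrated with a compact, dependency‑free DBSCAN over a Pearson correlation‑based distance. This is a sketch under assumptions, not the system described in the article: the machine names, sample values, and the eps and min_pts settings are all hypothetical, and a production system would more likely rely on an established library implementation.

```python
from math import sqrt


def pearson_distance(x, y):
    """1 - Pearson correlation; constant series are treated as uncorrelated."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 if sx == 0 or sy == 0 else 1.0 - cov / (sx * sy)


def dbscan(series, eps, min_pts):
    """Cluster named series; returns {name: cluster id}, with -1 meaning noise."""
    names = list(series)
    n = len(names)
    # Each point counts as its own neighbor, so min_pts includes the point itself.
    neighbors = [
        [j for j in range(n)
         if j == i or pearson_distance(series[names[i]], series[names[j]]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep expanding
    return dict(zip(names, labels))


# Hypothetical response-time series for one front-end and four back-end machines.
metrics = {
    "fe-1": [100, 102, 99, 350, 400, 120, 101],
    "be-1": [20, 21, 20, 80, 95, 26, 20],
    "be-2": [15, 15, 14, 60, 70, 18, 15],
    "be-3": [30, 28, 31, 29, 30, 32, 29],
    "be-4": [50, 49, 52, 50, 48, 51, 50],
}
labels = dbscan(metrics, eps=0.1, min_pts=2)
suspects = [name for name, label in labels.items()
            if name != "fe-1" and label != -1 and label == labels["fe-1"]]
```

In this toy data, be-1 and be-2 spike at the same time as fe-1 and end up in its cluster, while be-3 and be-4 are left as noise. No cluster count has to be chosen in advance; only the neighborhood radius and the minimum neighborhood size are set.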
In a back‑test using May–June 2020 data, 51 response‑time anomalies were detected, and root causes were successfully identified for 26 of them (about 51%). In one example, three back‑end machines were flagged as the cause of a front‑end latency spike.
4. Bayesian Inference Solution
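Some anomalies manifest only as textual alerts rather than metric deviations. To associate alerts with anomalies, the Bayesian method calculates the relative probability P(X|Y) / P(X) for each alert X that occurs within five minutes before an anomaly Y. By Bayes' rule this ratio equals P(Y|X) / P(Y), and because P(Y) is the same for every alert, ranking by it gives the same order as ranking by P(Y|X). The steps are: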
Count alerts occurring within five minutes before each historical anomaly to obtain P(X|Y).
Estimate the overall occurrence probability P(X) for each alert.
Compute the relative probability and rank alerts.
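The three steps above can be sketched as follows. This is a minimal illustration under assumptions the article does not state: P(X|Y) is estimated as the share of anomalies preceded by alert X within the window, and P(X) as the share of fixed five‑minute windows in the observation period that contain X. The function name rank_alerts, the alert names, and all timestamps are hypothetical.

```python
from collections import defaultdict

WINDOW = 300  # five minutes, in seconds


def rank_alerts(alerts, anomalies, start, end, window=WINDOW):
    """Rank alert names by P(X|Y) / P(X), highest first.

    alerts:    list of (timestamp, alert_name)
    anomalies: list of anomaly timestamps
    start/end: bounds of the observation period, in seconds
    """
    times = defaultdict(list)
    for ts, name in alerts:
        times[name].append(ts)

    total_windows = (end - start) // window
    scores = {}
    for name, ts_list in times.items():
        # Step 1: share of anomalies with this alert in the preceding window.
        hits = sum(any(y - window <= t < y for t in ts_list) for y in anomalies)
        p_x_given_y = hits / len(anomalies)
        # Step 2: share of all fixed windows in which this alert fired.
        occupied = {(t - start) // window for t in ts_list if start <= t < end}
        p_x = len(occupied) / total_windows
        # Step 3: relative probability.
        scores[name] = p_x_given_y / p_x if p_x > 0 else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: 6000 seconds of history with three anomalies.
anomalies = [1000, 3000, 5000]
alerts = [(900, "db_conn_pool_exhausted"),
          (2850, "db_conn_pool_exhausted"),
          (4990, "db_conn_pool_exhausted")]
# A routine alert that fires every ten minutes regardless of anomalies.
alerts += [(t, "disk_usage_warning") for t in range(0, 6000, 600)]

ranking = rank_alerts(alerts, anomalies, start=0, end=6000)
```

A score above 1 means the alert appears before anomalies more often than its overall frequency would suggest, and a score below 1 means it is mostly background noise. In this toy data the connection‑pool alert scores about 6.7 and the routine disk alert about 0.67.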
Applying this to A‑system data from February–June 2020, the top three alerts most likely to cause a front‑end latency anomaly were identified, while the least likely alert was also listed, demonstrating clear separation in relevance.
5. Conclusions and Outlook
CICC Wealth Securities pioneered big‑data‑driven automated operations in the securities sector, achieving early adoption of intelligent root‑cause analysis. While the presented clustering and Bayesian methods improve detection speed and accuracy, limitations remain in success rates and scalability. Future work will explore knowledge‑graph‑based and natural‑language‑processing approaches to further enhance root‑cause identification.
dbaplus Community
