How DBSCAN Clustering and Bayesian Inference Boost Root‑Cause Detection in Securities Trading Systems
This article describes how a Chinese securities firm applied big‑data‑driven clustering and Bayesian methods to automate root‑cause analysis of trading‑system anomalies, detailing the challenges, algorithmic designs, practical implementations, and evaluation results that demonstrate significant reductions in false alarms and faster recovery.
Problem Context
In high‑frequency trading systems, a spike in front‑end response time may be caused by one or more backend machines. The many‑to‑many machine topology makes traditional link‑based tracing ineffective, and the rapid growth in alarm volume makes it difficult to associate alarms with true anomalies.
Clustering‑Based Root‑Cause Method
The method consists of four stages: anomaly detection, configuration‑change bias correction, sliding‑window T‑test filtering, and DBSCAN clustering of metric time‑series.
Anomaly Detection and Baseline
Metrics are compared against a baseline built from historical data. An anomaly is declared when the metric exceeds the baseline N times within a sliding window of M minutes. In the experiments M=5 and N=3 were used.
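The N‑of‑M rule above can be sketched in a few lines. This is a minimal illustration, not the firm's implementation: it assumes one sample per minute, and the names `detect_anomalies`, `window_minutes`, and `min_exceedances` are hypothetical.

```python
from collections import deque

def detect_anomalies(samples, baseline, window_minutes=5, min_exceedances=3):
    """Return the minutes at which an anomaly is declared.

    samples:  list of (minute, value) pairs in time order (assumed format).
    baseline: function mapping a minute to its baseline value (assumed).
    """
    recent = deque()   # minutes of exceedances still inside the window
    flagged = []
    for minute, value in samples:
        if value > baseline(minute):
            recent.append(minute)
        # Drop exceedances that have fallen out of the last M minutes.
        while recent and recent[0] <= minute - window_minutes:
            recent.popleft()
        if len(recent) >= min_exceedances:
            flagged.append(minute)
    return flagged

# Toy data: a flat baseline of 100 and three exceedances at minutes 1, 3, 4.
values = [90, 120, 95, 130, 140, 90, 90, 90, 90, 90]
samples = list(enumerate(values))
flagged = detect_anomalies(samples, baseline=lambda minute: 100)
print(flagged)
```

With M=5 and N=3, the third exceedance triggers the anomaly, and the flag clears once the oldest exceedance leaves the window.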
Configuration‑Change Bias Correction
To mitigate false alarms caused by configuration changes, the algorithm computes the difference between the baseline median series and the observed series over the last 30 minutes, filters the difference, and derives a bias value (BIAS) that is subtracted from the raw metric.
BLVS = median baseline over last 30 min
DSVS = observed – BLVS
FDSV = filter(DSVS)
BIAS = mean(FDSV)
AdjustedMetric = observed – BIAS
This correction removes more than 98% of alarms caused by baseline drift.
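The pseudocode above translates almost directly into Python. The article does not say which filter is applied to the difference series, so the sketch below assumes a simple outlier filter based on the median absolute deviation; that choice, the `mad_factor` parameter, and the function name are assumptions made for illustration.

```python
from statistics import mean, median

def bias_corrected(observed, baseline_median, mad_factor=3.0):
    """Return (bias, adjusted series) for two equally long series.

    observed:        metric values over the last 30 minutes.
    baseline_median: baseline median values for the same minutes (BLVS).
    """
    # DSVS: pointwise difference between observation and baseline.
    dsvs = [o - b for o, b in zip(observed, baseline_median)]
    # FDSV: assumed filter, dropping differences far from the median.
    centre = median(dsvs)
    mad = median([abs(d - centre) for d in dsvs])
    fdsv = [d for d in dsvs if abs(d - centre) <= mad_factor * mad] or dsvs
    # BIAS: mean of the filtered differences, subtracted from the raw metric.
    bias = mean(fdsv)
    adjusted = [o - bias for o in observed]
    return bias, adjusted

# Toy data: a configuration change shifts the metric up by 20,
# and one genuine spike appears at index 10.
baseline = [100] * 30
observed = [120] * 30
observed[10] = 500
bias, adjusted = bias_corrected(observed, baseline)
print(bias, adjusted[0], adjusted[10])
```

Because the spike is filtered out before the mean is taken, the steady shift is removed while the spike itself stays visible in the adjusted series.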
Sliding‑Window T‑Test
A sliding‑window T‑test is applied to the adjusted metric series to detect abrupt changes. If no significant change is found, the event is discarded as a non‑anomalous deviation, which removes roughly a further 10% of alarms.
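One way to picture the sliding test is to compare each window of points with the window that follows it. The sketch below computes Welch's t statistic by hand and compares it with a fixed cutoff; the window length, the cutoff `t_threshold`, and the function names are assumptions, since the article does not give the exact test configuration.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Absolute Welch t statistic for two samples (each needs >= 2 points)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if mean(a) == mean(b) else float("inf")
    return abs(mean(a) - mean(b)) / se

def has_abrupt_change(series, window=5, t_threshold=4.0):
    """True if any pair of adjacent windows differs significantly in mean."""
    for i in range(window, len(series) - window + 1):
        before = series[i - window:i]
        after = series[i:i + window]
        if welch_t(before, after) > t_threshold:
            return True
    return False

# Toy data: one series with a level shift, one that only fluctuates.
step = [10, 11, 10, 9, 10, 11, 10, 30, 31, 29, 30, 31, 30]
flat = [10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 10, 9]
print(has_abrupt_change(step), has_abrupt_change(flat))
```

An event whose series behaves like `flat` would be discarded at this stage; a production version would more likely derive the cutoff from a significance level than hard‑code it.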
Distance Metrics for Time‑Series
Three distance measures were evaluated for clustering:
Euclidean distance: \(d_{E}(A,B)=\sqrt{\sum_{j=1}^{n}(A_j-B_j)^2}\)
Pearson‑correlation‑based distance: \(d_{P}(A,B)=1-\rho_{P}(A,B)\)
Spearman‑correlation‑based distance: \(d_{S}(A,B)=1-\rho_{S}(A,B)\)
Pearson‑based distance performed best for the front‑end vs. backend relationship.
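The three measures can be written out with the standard library alone. This is an illustrative sketch: the function names are hypothetical, ties in the Spearman ranking receive average ranks, and a constant series is treated as uncorrelated, which is an assumption the article does not address.

```python
from math import sqrt

def euclidean_distance(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation coefficient of two equally long series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: a constant series counts as uncorrelated
    return cov / (norm_a * norm_b)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson_distance(a, b):
    return 1 - pearson(a, b)

def spearman_distance(a, b):
    # Spearman correlation is the Pearson correlation of the ranks.
    return 1 - pearson(ranks(a), ranks(b))

front = [1, 2, 3, 4, 5]
scaled = [10, 20, 30, 40, 50]    # same shape, different scale
squared = [1, 4, 9, 16, 25]      # monotonic but not linear
print(euclidean_distance(front, scaled))
print(pearson_distance(front, scaled))
print(spearman_distance(front, squared))
```

The toy series show why a correlation‑based distance suits this problem: a backend metric on a different scale is far from the front‑end series in Euclidean terms, yet its Pearson distance is close to zero because the two move together.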
DBSCAN Clustering
Using the chosen distance metric, DBSCAN (density‑based spatial clustering) groups the anomalous front‑end series with all backend series whose distances fall within the epsilon neighbourhood. DBSCAN does not require a predefined number of clusters, which fits the dynamic topology of the trading system.
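To make the grouping step concrete, the sketch below is a compact, textbook‑style DBSCAN over time series with the Pearson‑based distance. It is not the firm's code: the machine names, the toy series, and the values `eps=0.1` and `min_samples=2` are all assumptions chosen for illustration.

```python
from math import sqrt

def pearson_distance(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = sqrt(sum((x - mean_a) ** 2 for x in a) *
                sum((y - mean_b) ** 2 for y in b))
    return 1 - cov / norm if norm else 1.0

def dbscan(points, distance, eps, min_samples):
    """Return one cluster label per point; -1 marks noise."""
    n = len(points)
    # Each neighbourhood includes the point itself (distance 0).
    neighbours = [[j for j in range(n)
                   if distance(points[i], points[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1              # noise for now, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep expanding
    return labels

# Hypothetical series: a front-end spike and four backend machines.
series = {
    "front-end": [100, 102, 101, 180, 175, 103, 100, 99],
    "backend-1": [20, 21, 20, 60, 58, 22, 20, 19],
    "backend-2": [50, 52, 51, 130, 128, 53, 50, 50],
    "backend-3": [30, 29, 31, 30, 29, 31, 30, 30],
    "backend-4": [5, 9, 3, 4, 8, 2, 7, 6],
}
names = list(series)
labels = dbscan([series[n] for n in names], pearson_distance,
                eps=0.1, min_samples=2)
front_label = labels[names.index("front-end")]
suspects = [n for n, l in zip(names, labels)
            if n != "front-end" and l == front_label and l != -1]
print(suspects)
```

Backends that land in the same cluster as the anomalous front‑end series become root‑cause candidates; machines labelled as noise are left out, and no cluster count has to be fixed in advance.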
Practical Results
Data from a mobile trading channel (referred to as system A) during May‑June 2020 were processed. In total, 51 response‑time spikes were detected, and root causes were successfully identified for 26 of them. In one case, clustering flagged three backend machines as the cause of a front‑end spike, which illustrates the method's effectiveness.
Bayesian Inference‑Based Root‑Cause Method
Some anomalies are signaled only by textual alerts rather than metric deviations. A Bayesian model ranks alerts by their conditional probability of causing an anomaly.
Mathematical Formulation
P(Y|X) = P(X|Y) * P(Y) / P(X)
# Since P(Y) is constant for a given anomaly, ranking can use the relative score:
Score(X) ∝ P(X|Y) / P(X)
Implementation Steps
For each historical anomaly Y, count alerts X that occurred within the preceding five minutes to estimate P(X|Y).
Estimate the overall occurrence probability P(X) from the full alert log.
Compute the relative score P(X|Y)/P(X); a higher score indicates a stronger association with the anomaly.
For a new anomaly, rank all alerts observed in the preceding five minutes by this score.
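The four steps above can be sketched as follows. The sketch assumes the alert log has already been cut into five‑minute windows, each represented as a set of alert names, and that the anomaly windows are a subset of all windows; that data layout, the alert names, and the function names are hypothetical.

```python
from collections import Counter

def alert_scores(anomaly_windows, all_windows):
    """Relative score P(X|Y) / P(X) for each alert X.

    anomaly_windows: alert sets seen in the five minutes before each
                     historical occurrence of anomaly Y.
    all_windows:     alert sets for every five-minute window in the log.
    """
    in_anomaly = Counter(a for window in anomaly_windows for a in window)
    overall = Counter(a for window in all_windows for a in window)
    scores = {}
    for alert, count in in_anomaly.items():
        if overall[alert] == 0:
            continue  # alert missing from the full log: cannot estimate P(X)
        p_x_given_y = count / len(anomaly_windows)
        p_x = overall[alert] / len(all_windows)
        scores[alert] = p_x_given_y / p_x
    return scores

def rank_alerts(current_alerts, scores):
    """Order the alerts seen before a new anomaly, highest score first."""
    return sorted(current_alerts, key=lambda a: scores.get(a, 0.0),
                  reverse=True)

# Hypothetical history: 4 occurrences of the anomaly in 20 windows.
anomaly_windows = [
    {"db_conn_pool_full", "heartbeat"},
    {"db_conn_pool_full", "heartbeat"},
    {"gc_pause", "heartbeat"},
    {"db_conn_pool_full"},
]
all_windows = (anomaly_windows
               + [{"heartbeat"}] * 14
               + [{"heartbeat", "db_conn_pool_full"}]
               + [{"gc_pause"}])
scores = alert_scores(anomaly_windows, all_windows)
ranking = rank_alerts({"heartbeat", "gc_pause", "db_conn_pool_full"}, scores)
print(ranking)
```

In this toy history the `heartbeat` alert precedes most anomalies but also fires in almost every other window, so dividing by P(X) pushes it to the bottom of the ranking. That is the point of the relative score: frequent, uninformative alerts are discounted.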
Practical Evaluation
Using data from system A covering February to June 2020, the most frequent front‑end response‑time anomaly was analyzed. The three highest‑scoring alerts were identified as the most likely causes, while the lowest‑scoring alert was deemed unlikely. The score distribution showed clear separation, demonstrating the utility of the ranking for recovery actions.
Conclusion
The clustering‑based approach, combined with configuration‑change bias correction and the sliding‑window T‑test, substantially reduces false alarms and enables rapid identification of the backend machines responsible for front‑end latency spikes. The Bayesian inference method provides a quantitative ranking of textual alerts, helping operators focus on the most relevant alarms. Both methods have been validated on real trading‑system data, but further research (e.g., knowledge‑graph and NLP‑based techniques) is planned to improve root‑cause success rates.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
