Evolution of Monitoring Architecture and Traffic Alert Algorithms at Tongcheng Travel
This article describes how Tongcheng Travel’s monitoring system evolved from a monolithic design to a distributed and big‑data‑based architecture, introducing real‑time processing with Storm, machine‑learning‑enhanced alerts, and a multivariate linear regression model that dramatically improves traffic anomaly detection accuracy.
On a weekend, Xiao Ming received a warning about a failing third‑party API, quickly isolated the issue, and restored services within three minutes, illustrating the need for rapid incident response.
1. Business Background
When the Tongcheng Travel app launched, it offered only hotels, scenic spots, and domestic flights. After the 2014 "ALL IN" wireless strategy, the product line expanded dramatically: the app passed a hundred million downloads and millions of daily active users, and interface traffic grew from millions of requests per day to tens of millions and eventually billions.
Such traffic pressure led to frequent failures in membership, payment, and order services, and engineers struggled to locate and resolve problems quickly, resulting in significant losses. A fast and effective monitoring and alert mechanism became essential for stable operation.
2. Architecture Evolution
2.1 Monolithic Architecture (daily data: millions)
Logs were written directly to a database and a job ran every five minutes to query recent data and decide whether to trigger an alarm. This worked while traffic stayed below a million requests per day, but performance degraded sharply as traffic grew, and the tight coupling with the database hindered scalability.
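The polling job described above can be sketched as follows. This is a minimal illustration, not the original implementation: the `request_log` table, its columns, and the "three errors per window" rule are all assumptions, and an in-memory SQLite database stands in for the production database.

```python
import sqlite3

WINDOW_SECONDS = 5 * 60   # the five-minute job interval described above
ERROR_LIMIT = 3           # hypothetical alarm rule, not from the article


def should_alarm(conn, now):
    """Count failed requests logged in the last window and compare to the limit."""
    (errors,) = conn.execute(
        "SELECT COUNT(*) FROM request_log WHERE ts > ? AND ts <= ? AND ok = 0",
        (now - WINDOW_SECONDS, now),
    ).fetchone()
    return errors >= ERROR_LIMIT


def demo():
    """Build a tiny log table and evaluate the rule at two points in time."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE request_log (ts INTEGER, ok INTEGER)")
    rows = [(100, 1), (650, 0), (700, 0), (800, 0), (850, 1)]
    conn.executemany("INSERT INTO request_log VALUES (?, ?)", rows)
    # Three failures fall in (600, 900]; none fall in (1000, 1300].
    return should_alarm(conn, now=900), should_alarm(conn, now=1300)
```

Because every check is a query against the same table that receives the log writes, both the write load and the query cost grow with traffic, which is the coupling problem noted above.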
Figure 1: Monolithic Architecture
2.2 Distributed Architecture (daily data: tens of millions)
RabbitMQ was introduced as a buffer: business services published their data to the message queue, several consumer nodes read from it and performed the calculations in real time, and the resulting alerts were stored in the database. This decoupled business services from the database and raised processing capacity to 15,000 to 20,000 requests per second.
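The buffering pattern can be illustrated with a small sketch. It is only an analogy: Python's standard `queue.Queue` stands in for RabbitMQ, threads stand in for consumer nodes, and the message shape (`{"api": ...}`) and the per-API counting are hypothetical.

```python
import queue
import threading
from collections import Counter


def run_consumers(messages, workers=3):
    """Push messages through a buffer and aggregate them on several consumers."""
    buffer = queue.Queue()          # stand-in for the RabbitMQ queue
    counts = Counter()              # per-API request counts (illustrative metric)
    lock = threading.Lock()

    def consume():
        while True:
            msg = buffer.get()
            if msg is None:         # sentinel: no more work for this consumer
                break
            with lock:
                counts[msg["api"]] += 1

    threads = [threading.Thread(target=consume) for _ in range(workers)]
    for t in threads:
        t.start()
    for msg in messages:            # producers only enqueue; they never touch storage
        buffer.put(msg)
    for _ in threads:               # one sentinel per consumer, queued after the data
        buffer.put(None)
    for t in threads:
        t.join()
    return dict(counts)
```

The producer side returns as soon as a message is queued, so a slow consumer or a slow database no longer blocks the business service, and capacity can be raised by adding consumers.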
However, the consumer nodes were built with .NET, making upgrades and scaling cumbersome. After six months, traffic surged again, affecting stability and timeliness, prompting a second redesign.
Figure 2: Distributed Architecture
2.3 Big‑Data‑Based Architecture (daily data: billions)
The system incorporated multiple data pipelines (RabbitMQ, Kafka, TurboMQ, Flume) and used Storm for real‑time computation, persisting results in Elasticsearch. This enabled the integration of PC, WeChat, and Touch clients and expanded monitoring dimensions.
Machine‑learning algorithms were later added to improve alert precision.
Figure 3.1: Big‑Data Platform Architecture
Figure 3.2: Storm Internal Processing Flow
3. Traffic Monitoring Algorithm Improvements
3.1 Traditional Approaches
Two common methods are: (a) letting users set static thresholds, which requires expert knowledge and constant manual tuning; and (b) system‑generated thresholds based on day‑over‑day or week‑over‑week comparisons, which struggle with seasonal variations such as holidays.
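As a rough illustration of the two approaches, the sketch below uses hypothetical function names, bounds, and a 30 % tolerance; none of these values come from the original system.

```python
def static_alert(value, low, high):
    """(a) Alert when traffic leaves a hand-tuned [low, high] band."""
    return not (low <= value <= high)


def relative_alert(value, baseline, tolerance=0.3):
    """(b) Alert when traffic differs from the same slot a day or week
    earlier by more than `tolerance`, expressed as a fraction of the baseline."""
    return abs(value - baseline) / baseline > tolerance


# Ordinary day: 1,100 requests against 1,000 a week earlier, so no alert.
ordinary = relative_alert(1100, baseline=1000)
# Holiday: traffic legitimately jumps to 1,900, yet the rule still fires.
holiday = relative_alert(1900, baseline=1000)
```

The holiday case shows the weakness of comparison-based thresholds: the baseline carries no knowledge of the seasonal pattern, so a legitimate surge is indistinguishable from an anomaly.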
3.2 Multivariate Linear Regression Model
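The new model predicts the expected traffic at each time point, using the previous four time points as independent variables, and treats a large gap between the observed and expected values as an anomaly. The regression formula is y = β0 + β1·x1 + … + βn·xn + ε, where y is the traffic being predicted, x1 … xn are the independent variables (here n = 4), β0 … βn are the fitted coefficients, and ε is the error term.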
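A minimal sketch of fitting such a model by ordinary least squares is shown below. It rests on several assumptions: the four independent variables are taken to be the four most recent observations of the same series, the coefficients are estimated from the normal equations, and the data is synthetic. The function names are illustrative and do not come from the production system.

```python
import random


def solve(a, b):
    """Solve the linear system a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit(series, lags=4):
    """Least-squares estimate of [β0, β1, ..., β4], where x1 is the most recent point."""
    rows = [[1.0] + series[t - lags:t][::-1] for t in range(lags, len(series))]
    ys = series[lags:]
    k = lags + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(beta, recent):
    """Expected next value; `recent` holds the last four observations, oldest first."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], recent[::-1]))


def deviation(actual, expected):
    """Relative gap between observed and expected traffic (1.95 means 195 % above)."""
    return (actual - expected) / expected


# Synthetic, steady traffic around 1,000 requests per point (not real data).
rng = random.Random(0)
series = [1000 + rng.gauss(0, 50) for _ in range(144)]
beta = fit(series)
expected = predict(beta, series[-4:])
```

An alert rule would then compare `deviation(actual, expected)` against a tolerance; the tolerance itself is not given in the article and would have to be chosen per project.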
Key steps include:
① Sample selection – typically the past 20 days.
② Sample filtering – remove historical anomalies and extreme values.
③ Compute related factors – calculate min, max, average, etc., for each of the 144 daily points (one point every 10 minutes if they are evenly spaced).
④ Data validation and adjustment.
⑤ Results – the new regression model generated 35 alerts with zero false alarms, whereas the traditional method produced 34 alerts, 13 of which (about 38 %) were false alarms, and missed 18 alerts.
Figure 3.2.5: Comparison of New and Old Traffic Prediction Models
⑥ Case study – on 20 Dec 2016 the new system accurately flagged traffic spikes across multiple iPhone projects (e.g., homepage traffic rose 195 % above expectation), while the old system missed the anomaly due to outdated thresholds.
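Steps ② and ③ above can be sketched as follows. The filtering rule is an assumption made for illustration (drop any sample that differs from the slot median by more than 50 %); the article does not state which rule the production system uses, and `slot_stats` is a hypothetical name.

```python
from statistics import mean, median


def slot_stats(days, max_dev=0.5):
    """Per-slot min, max, and average across days, after dropping extreme values.

    `days` is a list of daily series, each with the same number of points
    (144 in the article). `max_dev` is a hypothetical filtering tolerance.
    """
    stats = []
    for slot_values in zip(*days):       # all samples for one time slot
        mid = median(slot_values)
        kept = [v for v in slot_values if abs(v - mid) <= max_dev * abs(mid)]
        if not kept:                     # fall back rather than fail on odd data
            kept = list(slot_values)
        stats.append({"min": min(kept), "max": max(kept), "avg": mean(kept)})
    return stats


# Four days with two slots each; the 1000 in the first slot is a past anomaly.
history = [[100, 200], [110, 210], [90, 190], [1000, 205]]
summary = slot_stats(history)
```

Filtering before computing the factors keeps a past incident from inflating the baseline, which would otherwise hide a similar spike the next time it occurs.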
4. Future Development Trends
The unified monitoring platform will continue to enhance automation and intelligence, simplifying data integration for more projects. Beyond the current multivariate regression, additional intelligent algorithms will be incorporated to support diverse monitoring scenarios.
Tongcheng Travel Technology Center
Pursue excellence, start again with Tongcheng! More technical insights to help you along your journey and make development enjoyable.