
Ctrip Wireless APM Platform: Architecture, Metrics, and Technical Details

The article describes the evolution of Ctrip's wireless APM platform from the early UBT-based monitoring to a globally‑oriented, metric‑rich system that processes over 100 billion data points daily using Storm and Elasticsearch, detailing its design, key performance dimensions, data‑volume trade‑offs, and implementation choices.

Ctrip Technology

About the authors: Chen Haoran, Senior Director of Wireless Technology at Ctrip, leads the wireless committee and core engineering team. Liu Lifeng is Director of Wireless Infrastructure Services and Platform at Ctrip.

Ctrip Wireless APM History

The original wireless APM platform was built on the UBT (User Behaviour Tracking) system, initially designed for user‑behavior collection with a separate network channel. Performance data was later separated from behavior data and uploaded in real time.

Early performance monitoring focused on network metrics per service (see Figure 1) and aggregated hourly network performance (success rate, average latency, latency distribution) per version and service (Figure 2). Additional metrics such as request/response size, error‑type distribution, and detailed latency breakdown (Figures 3‑5) supported service‑level optimization.

Later, monitoring expanded to include location, startup, and page‑load metrics, providing comprehensive data for performance tuning and daily operations.

New APM Platform

In the second half of 2017, Ctrip’s rapid international expansion required a globally‑oriented APM solution. Inspired by commercial APM products, a new platform was built to cover eight core performance categories: network performance, crashes, startup/load, location, image, CRN, IM, and VoIP (Figure 6). The basic dimensions are system platform and app version.

Network Performance

Network performance is the top priority and is monitored from multiple angles:

Basic network performance – end‑to‑end success rate, average latency, and traffic volume, with breakdowns by hot countries and cities (Figure 7).

Composite analysis – combinations of country, city, carrier, and access method to detect multi‑dimensional issues (Figure 8).

Network entry performance – aggregated metrics for different Ctrip network entry points worldwide (Figure 9).

Global network diagnostics – on‑device probing (DNS, TCP, SSL handshake, traceroute, ping) for performance diagnosis.
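As a rough sketch of what such on-device probing can look like, the Python snippet below times DNS resolution, the TCP connect, and the SSL handshake for a single host. This is a simplified illustration, not Ctrip's client code; the real probes also cover traceroute and ping, and run on the device rather than a server.

```python
import socket
import ssl
import time

def probe(host: str, port: int = 443, timeout: float = 3.0, use_tls: bool = True) -> dict:
    """Time DNS resolution, TCP connect, and (optionally) the SSL handshake."""
    timings = {}

    t0 = time.monotonic()
    addr = socket.gethostbyname(host)                  # DNS resolution
    timings["dns_ms"] = (time.monotonic() - t0) * 1000

    t1 = time.monotonic()
    sock = socket.create_connection((addr, port), timeout=timeout)  # TCP handshake
    timings["tcp_ms"] = (time.monotonic() - t1) * 1000

    if use_tls:
        t2 = time.monotonic()
        ctx = ssl.create_default_context()
        sock = ctx.wrap_socket(sock, server_hostname=host)          # SSL handshake
        timings["ssl_ms"] = (time.monotonic() - t2) * 1000

    sock.close()
    return timings
```

Reporting each phase separately is what makes the diagnostics useful: a slow `dns_ms` with a fast `tcp_ms` points at the resolver, not the network path.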

Crash Monitoring

Crash data collection and analysis are mature in the industry; the platform captures crash events and provides detailed statistics (Figure 10).

Startup & Load

Monitors cold‑start time, first‑launch time after Android installation, and Android Bundle load time.

Location

Tracks latitude/longitude and city‑level location success rate, average latency, and request volume.

Image

Measures image download success rate, average latency, and download volume, with geographic breakdowns similar to network performance.

CRN

Monitors RN module load count and average latency for components built with Ctrip’s CRN framework.

IM and VoIP

Tracks success rate, average latency, and request volume for instant‑messaging and VoIP calls, also providing country‑ and city‑level breakdowns.

Overall, the new APM platform emphasizes global dimensions and core app‑experience metrics; after launch it has already identified performance hotspots in specific regions.

APM Technical Details

The platform currently supports more than 100 metric types, processes more than 100 billion data points daily, and writes over 100 GB to Elasticsearch each day. The processing pipeline (Figure 11) consists of five stages:

App client data collection.

Storm stream processing for noise filtering and aggregation.

Computation and further aggregation.

Storage in Elasticsearch.

Dashboard visualization of multi‑dimensional data.
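A minimal sketch of the aggregation step (stages 2 and 3 above): raw points are bucketed by metric name plus dimension tuple, and only per-bucket statistics reach storage, so Elasticsearch holds aggregates rather than the raw 100-billion-point stream. The metric names and dimensions here are illustrative, not Ctrip's actual schema.

```python
from collections import defaultdict

def aggregate(points):
    """Reduce raw (metric, dims, value) points to per-bucket statistics.

    Each bucket key is (metric name, dimension tuple); each bucket holds
    count/sum/min/max, with the average derived at the end.
    """
    buckets = defaultdict(lambda: {"count": 0, "sum": 0.0,
                                   "min": float("inf"), "max": float("-inf")})
    for metric, dims, value in points:
        b = buckets[(metric, dims)]
        b["count"] += 1
        b["sum"] += value
        b["min"] = min(b["min"], value)
        b["max"] = max(b["max"], value)
    # Derive the average once per bucket rather than once per point.
    return {k: {**v, "avg": v["sum"] / v["count"]} for k, v in buckets.items()}
```

In the real pipeline this reduction runs continuously inside Storm bolts over time windows; the sketch shows only the batch shape of the computation.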

Technical Choices

Although InfluxDB offers higher write speed and query performance for time‑series data, Ctrip chose Elasticsearch due to its mature operational ecosystem, complex query capabilities, and ability to handle the massive data volume.
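The "complex query capabilities" point is concrete: the multi-dimensional breakdowns described above map naturally onto Elasticsearch's nested aggregations. A hypothetical query body (the index and field names are invented for illustration) that averages latency per country for a single app version might look like:

```python
import json

# Hypothetical search body: filter to one app version, bucket by country,
# and compute the average latency inside each bucket.
query = {
    "size": 0,  # we only want the aggregations, not the raw hits
    "query": {"term": {"app_version": "8.0.1"}},
    "aggs": {
        "by_country": {
            "terms": {"field": "country", "size": 20},
            "aggs": {"avg_latency": {"avg": {"field": "latency_ms"}}},
        }
    },
}
body = json.dumps(query)  # ready to POST to a metrics index's _search endpoint
```

Adding another breakdown (carrier, access method) is just another level of sub-aggregation, which is hard to express in simpler time-series query languages.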

Trade‑offs at Large Scale

With multiple dimensions (client version, platform, network type, country, city, etc.), the combinatorial explosion can reach 16 million distinct series, resulting in roughly 300,000 writes per second. To mitigate pressure, two strategies are used:

Dimensional reduction – dropping less‑important dimensions for certain metrics.

Partial sampling – monitoring only the top 20% of hot cities based on the 80/20 rule.
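The arithmetic behind both strategies can be sketched as follows. The per-dimension cardinalities are illustrative values (not Ctrip's actual ones), chosen so the full product matches the 16 million figure above:

```python
from math import prod

# Illustrative cardinality of each dimension.
dims = {"version": 20, "platform": 2, "network": 5, "country": 100, "city": 800}

# Full combinatorial explosion: every dimension crossed with every other.
full = prod(dims.values())  # 16,000,000 distinct series

# Strategy 1: dimensional reduction -- drop the "city" dimension entirely
# for metrics where city-level detail is not worth the write load.
reduced = prod(v for k, v in dims.items() if k != "city")  # 20,000 series

# Strategy 2: partial sampling -- keep only the top 20% of hot cities
# (the 80/20 rule), shrinking the city dimension rather than removing it.
sampled = prod({**dims, "city": int(dims["city"] * 0.2)}.values())  # 3,200,000 series
```

The two strategies compose: reduction for low-priority metrics, sampling for the ones that still need geographic breakdowns.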

Noise Filtering

Outlier data (e.g., HTTP request times >30 s due to client timeout) is filtered before ingestion into Elasticsearch, with configurable rules applied during Storm extraction.
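A minimal sketch of such configurable filtering, assuming a per-metric rule table. The metric names and thresholds are illustrative, with the 30-second HTTP ceiling matching the client-timeout example above:

```python
# Per-metric outlier rules; in the real pipeline these would be loaded
# from configuration rather than hard-coded.
RULES = {
    "http_request_ms": {"max": 30_000},            # drop requests at the 30 s client timeout
    "image_download_ms": {"min": 0, "max": 120_000},
}

def is_noise(metric: str, value: float, rules=RULES) -> bool:
    """Return True if the point falls outside its metric's configured bounds."""
    rule = rules.get(metric)
    if rule is None:
        return False  # no rule configured: keep the point
    return value > rule.get("max", float("inf")) or value < rule.get("min", float("-inf"))

def filter_stream(points, rules=RULES):
    """Drop outliers before they are written to Elasticsearch."""
    return [(m, v) for m, v in points if not is_noise(m, v, rules)]
```

Filtering at extraction time, rather than at query time, keeps the stored aggregates (averages in particular) from being skewed by a handful of timeout-capped values.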

Conclusion

The platform must evolve with business priorities; the authors hope the APM design and implementation can serve as a reference for others and invite further discussion.

Recommended Reading

ALLUXIO in Ctrip’s Big Data Platform

Ctrip Wireless Incremental Update Package Practice

Ctrip MTP and MCD Platforms Supporting 100k+ Wireless Integrations per Year

Ctrip Software SBC Practice

Efficient IP Geolocation Using Red‑Black Trees

