
Design and Implementation of an Application Performance Management (APM) System for Mobile Apps

This article introduces the background, core features, network and page performance modules, error monitoring, reporting, and optimization practices of a mobile App Performance Management (APM) system, highlighting its impact on monitoring, diagnostics, and performance improvement for large‑scale applications.

Ctrip Technology

1. Background Introduction

APM (Application Performance Management) is essential for monitoring performance metrics of a mature App. The previous 1.0 version suffered from weak filtering, lack of daily reports and alerts, mixed functionalities, and single‑App support. A 2.0 reconstruction was initiated, redefining APM’s positioning and required functions.

The final APM scope includes Data Reports, Performance Daily Reports, Monitoring & Alerts, and a Troubleshooting Entry.

Data Reports: precise end‑to‑end data, multi‑dimensional filtering, multi‑App support, real‑time reports.

Performance Daily Reports: core metric email reports, subscription support.

Monitoring & Alerts: core metrics alerts, Crash rate, JSError alerts.

Troubleshooting Entry: multi‑dimensional abnormal data sampling, integration with internal systems.

The functional modules cover Network Performance, Page Performance, Crash & Jank, and Specialized Performance.

2. Main Features

2.1 Network Performance

Network request performance is a core metric; the APM platform uses end‑to‑end data to reflect real user experience.

2.1.1 Network Architecture Model

Dynamic network requests in the App are sent through a self‑developed network communication framework to a backend Gateway, which forwards them to actual business servers.

Key points:

The communication framework and Gateway use TCP long connections.

Data protocol is a custom contract format.

The framework can convert HTTP to the custom protocol for proxy forwarding.

Gateway handles minimal business logic, focusing on link management and routing; it is globally deployed for better user experience.

All dynamic requests pass through the framework, enabling precise sampling for APM reports.

2.1.2 Error Monitoring Dimensions

Most network request anomalies occur during link establishment and data transmission. Custom TCP link management allows defining error codes for each failure point.

| Code | Definition | Description |
| --- | --- | --- |
| -202 | Request serialization failure | Very rare |
| -203 | No available link | Common during network instability |
| -204 | Send request failure | Socket send failure, very rare |
| -205 | Read response error | Link abnormality; the expected response header length cannot be read |
| -206 | Response deserialization failure | Very rare |
| -212 | Link abnormally disconnected | Includes client network issues and backend-initiated disconnects |
| -213 | Unable to read response | Common timeout |

Standard HTTP 4xx/5xx errors are also recorded, but the above codes focus on custom TCP link anomalies.
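On the reporting side, these codes can live in a small lookup table so that samples are bucketed consistently. A minimal sketch in Python, with the dictionary populated from the table above; the function and constant names are illustrative, not the real SDK's:

```python
# Hypothetical lookup table for the custom TCP link error codes
# listed above; names are illustrative, not the actual SDK's.
LINK_ERROR_CODES = {
    -202: "Request serialization failure",
    -203: "No available link",
    -204: "Send request failure",
    -205: "Read response error",
    -206: "Response deserialization failure",
    -212: "Link abnormally disconnected",
    -213: "Unable to read response",
}

def classify_error(code: int) -> str:
    """Map a status code to a human-readable category for APM reports."""
    if code in LINK_ERROR_CODES:
        return LINK_ERROR_CODES[code]
    if 400 <= code < 600:
        return f"HTTP {code}"  # standard HTTP errors are also recorded
    return "Unknown"
```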

2.1.3 Monitoring Dimensions & Goals

Two primary dimensions: request success rate and request latency.

Success Rate = successful requests / (successful requests + failed requests)
User-interaction target: ≥ 99%
Overall average target: ≥ 98% (including launch/background scenarios)

Latency = time from client request start until the response is received and deserialized
User-interaction target: backend processing time + 300 ms (RTT)
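The two definitions above translate directly into code. A minimal sketch, assuming the 99% and 300 ms targets from the article (helper names are illustrative):

```python
def success_rate(successes: int, failures: int) -> float:
    """Success rate = successful / (successful + failed) requests."""
    total = successes + failures
    return successes / total if total else 0.0

def meets_interaction_target(successes: int, failures: int) -> bool:
    # User-interaction target from the article: success rate >= 99%.
    return success_rate(successes, failures) >= 0.99

def latency_target_ms(server_processing_ms: float, rtt_ms: float = 300.0) -> float:
    # Target end-to-end latency: backend processing time plus ~300 ms of RTT.
    return server_processing_ms + rtt_ms
```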

2.1.4 APM Reports

Sample screenshots of network performance reports:

Version‑wise performance overview with clickable filters.

Service‑specific metrics: success rate, total latency, server processing latency, sample size, error code distribution.

Each table includes a sampling button to select error‑prone or high‑latency device IDs, linking to an internal troubleshooting system for detailed analysis.

The internal troubleshooting tool can list all requests on a given link ID in order, greatly improving the efficiency of resolving network issues.

2.1.5 Performance Optimization Practices

Custom communication protocol for full link control and easier debugging.

Intelligent ServerIP selection: same‑carrier IP for domestic users, overseas servers for overseas users, timezone‑based selection for startup.

Reasonable timeout settings: avoid overly short timeouts (e.g., 3 s) that hurt success rate.

Enable retries for idempotent services to boost success rate.
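The retry point deserves care: re-sending is only safe for idempotent services. A minimal sketch of that guard, assuming a zero-argument `send` callable that raises on failure (the function and parameters are illustrative, not the framework's real API):

```python
import time

def send_with_retry(send, max_attempts=3, backoff_s=0.0, idempotent=True):
    """Retry a request a few times, but only when it is safe to do so.

    `send` is any zero-argument callable that raises on failure; this is
    an illustrative sketch, not the communication framework's real API.
    """
    if not idempotent:
        max_attempts = 1  # never re-send non-idempotent requests
    last_error = None
    for attempt in range(max_attempts):
        try:
            return send()
        except Exception as exc:  # e.g. a -203 "no available link" failure
            last_error = exc
            time.sleep(backoff_s * (attempt + 1))  # simple linear backoff
    raise last_error
```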

2.2 Page Performance

2.2.1 Page Performance Statistics Scheme

Traditional approach records a start timestamp at page initialization and an end timestamp after all required services finish, calculating the interactive time. This method is accurate but requires per‑page instrumentation and maintenance.

To reduce developer burden, a framework‑level solution was sought to automatically measure Time To Interactive (TTI).

Initial idea: capture a screenshot, trim header/footer, divide the middle area into six blocks, randomly sample pixels, and compare similarity.

Problems discovered:

Significant performance overhead (~100 ms per screenshot).

Skeleton screens were mistakenly treated as fully rendered.

Random sampling caused variability on simple pages.

Because of these issues, the solution was not widely adopted.

Observation: page rendering completion is usually accompanied by visible text, and text scanning incurs lower overhead than image capture.

New approach replaces screenshot‑pixel detection with component traversal and text detection.

Start page initialization, traverse all elements, detect text.

Text within the top 20% or bottom 25% of the page is ignored (those regions are typically header and footer); at least two text groups must be found for detection to be considered successful.

If not successful, wait 50 ms and retry; total timeout 10 s.
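The steps above amount to a polling loop over the view tree. A minimal sketch under the article's parameters (20%/25% exclusion zones, two text groups, 50 ms polling, 10 s timeout); `scan_text_groups` is a hypothetical callable standing in for the element traversal:

```python
import time

HEADER_FRACTION = 0.20   # ignore text in the top 20% of the page
FOOTER_FRACTION = 0.25   # ignore text in the bottom 25% of the page
POLL_INTERVAL_S = 0.05   # re-check every 50 ms
TIMEOUT_S = 10.0         # give up after 10 s

def detect_tti(scan_text_groups, page_height, now=time.monotonic, sleep=time.sleep):
    """Poll the page until enough visible text appears.

    `scan_text_groups` is a hypothetical callable returning (y, text)
    pairs from a traversal of the current page's elements.
    Returns the elapsed time-to-interactive in seconds, or None on timeout.
    """
    start = now()
    top = page_height * HEADER_FRACTION
    bottom = page_height * (1.0 - FOOTER_FRACTION)
    while now() - start < TIMEOUT_S:
        groups = [(y, t) for y, t in scan_text_groups()
                  if top <= y <= bottom and t.strip()]
        if len(groups) >= 2:  # two text groups => page has visible content
            return now() - start
        sleep(POLL_INTERVAL_S)
    return None
```

Injecting `now` and `sleep` keeps the loop testable without real waiting.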

Demo video (Native/CRN/H5) shows a toast indicating detection completion when content loads.

The method cannot guarantee every element is fully rendered, but confirms that the page has visible content and is interactive, thus serving as the TTI measurement.

2.2.2 Page Performance Reports

After deployment, APM reports accurately reflect page performance.

The report supports multi‑dimensional filtering and sampling.

Each page group shows TTI distribution; 95th and 90th percentile values can be calculated for performance benchmarking.

Pages are categorized with baseline TTI targets; performance within baseline appears green, exceeding by >20% appears red, guiding optimization efforts.
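The percentile benchmarking and baseline coloring can be sketched in a few lines. This uses the nearest-rank percentile method and treats the gap between the baseline and +20% as an intermediate bucket, which is an assumption; the article only specifies green (within baseline) and red (over by more than 20%):

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of
    samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p/100 * n), 1-indexed
    return ordered[int(rank) - 1]

def baseline_status(tti_ms, baseline_ms):
    """Green within baseline, red when exceeding it by more than 20%.
    The intermediate bucket is an assumption, not from the article."""
    if tti_ms <= baseline_ms:
        return "green"
    if tti_ms > baseline_ms * 1.2:
        return "red"
    return "intermediate"
```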

2.2.3 Page TTI Optimization

Asynchronous main‑thread tasks to background threads.

Prefetch network requests to reduce TTI (e.g., 40% improvement for flight/hotel lists).

Pre‑execute tasks required by the next page while the user is still on the current page.

Download the next page's offline packages in advance.

CRN framework optimizations: upgrade React Native to ≥0.61, enable Hermes on Android; replace asynchronous APIs with synchronous ones where appropriate.

PreRender: delay page navigation to utilize idle time for loading resources, achieving smoother transitions.

2.3 Crash & Jank

The crash system is similar to common crash collection tools but cannot capture all crashes; some crashes are invisible to SDKs or OS logs.

2.3.1 User‑Behavior Crash

On App launch, the system checks whether the previous session crashed by comparing persisted page timestamps; if the interval is unusually short, a crash is inferred. A user‑behavior crash report aggregates these events.
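One common way to implement this kind of launch-time inference is to persist a page timestamp during the session plus a clean-exit flag, and inspect them on the next launch. The exact heuristic below (keys, flag names) is illustrative, not necessarily the article's precise scheme:

```python
def previous_session_crashed(session_store):
    """Infer whether the last session ended abnormally.

    `session_store` is a hypothetical persisted dict written during the
    previous run: the app refreshes `last_page_ts` as the user navigates
    and sets `clean_exit` when it is backgrounded or terminated normally.
    A session that was active but never exited cleanly is inferred to
    have crashed. Key names here are illustrative.
    """
    if not session_store:                # first launch: nothing to compare
        return False
    if session_store.get("clean_exit"):  # normal termination was recorded
        return False
    # The session wrote page timestamps but never exited cleanly.
    return session_store.get("last_page_ts") is not None
```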

2.3.2 Custom Exception Reporting

Developers often print exceptions without further handling. To improve observability, a logException API and corresponding reports were provided, organizing exceptions by Category (title) and Message (subtitle) with sampling for deeper analysis.
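A logException-style API mostly amounts to grouping by (Category, Message) with a capped number of samples kept per group. A minimal sketch; the class name, method names, and sample cap are illustrative, not the real SDK's:

```python
from collections import defaultdict

class ExceptionReporter:
    """Sketch of a logException-style API: exceptions are grouped by
    Category (report title) and Message (subtitle), with a capped set
    of samples kept per group for deeper analysis. Names here are
    illustrative, not the actual SDK's."""

    def __init__(self, samples_per_group=10):
        self.samples_per_group = samples_per_group
        self.groups = defaultdict(lambda: {"count": 0, "samples": []})

    def log_exception(self, category, message, stack=None):
        group = self.groups[(category, message)]
        group["count"] += 1
        if len(group["samples"]) < self.samples_per_group:
            group["samples"].append(stack)  # keep only the first N stacks

    def report(self):
        # Most frequent groups first, as they would appear in the report.
        return sorted(self.groups.items(), key=lambda kv: -kv[1]["count"])
```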

3. Summary

The APM system monitors three core metrics—network performance, page performance, and stability—providing clear visibility, daily reports, and data‑driven optimization guidance, thereby encouraging development teams to prioritize performance quality.

Improved insight into real‑world App performance.

Core metric monitoring and daily reports ensure stable operation.

Standardized optimization targets backed by data.

Motivated business developers to focus on performance.

Beyond the core features described, the system includes extensive custom functions such as alerts, specialized performance modules, and more.

Note: Reply with “apm” to the "Ctrip Technology" WeChat public account to download the presenter’s PPT.

Recommended Reading:

Nearly Ten‑Thousand‑Word Article on Ctrip’s Large‑Scale RN Engineering Practices

Node.js in Ctrip: Deployment and Best Practices

Ctrip Trip.com App Home Page Dynamic Exploration

15% Load Speed Boost: Ctrip’s Research on RN’s New JS Engine Hermes

Ctrip Technology 2019 Annual Collection
