Design and Implementation of an Application Performance Management (APM) System for Mobile Apps
This article introduces the background, core features, network and page performance modules, error monitoring, reporting, and optimization practices of a mobile App Performance Management (APM) system, highlighting its impact on monitoring, diagnostics, and performance improvement for large‑scale applications.
1. Background Introduction
APM (Application Performance Management) is essential for monitoring the performance metrics of a mature App. The previous 1.0 version suffered from weak filtering, a lack of daily reports and alerts, mixed functionality, and support for only a single App. A 2.0 rebuild was therefore initiated, redefining APM's positioning and required functionality.
The final APM scope includes Data Reports, Performance Daily Reports, Monitoring & Alerts, and a Troubleshooting Entry.
Data Reports: precise end‑to‑end data, multi‑dimensional filtering, multi‑App support, real‑time reports.
Performance Daily Reports: core metric email reports, subscription support.
Monitoring & Alerts: core metrics alerts, Crash rate, JSError alerts.
Troubleshooting Entry: multi‑dimensional abnormal data sampling, integration with internal systems.
The functional modules cover Network Performance, Page Performance, Crash & Jank, and Specialized Performance.
2. Main Features
2.1 Network Performance
Network request performance is a core metric; the APM platform uses end‑to‑end data to reflect real user experience.
2.1.1 Network Architecture Model
Dynamic network requests in the App are sent through a self‑developed network communication framework to a backend Gateway, which forwards them to actual business servers.
Key points:
The communication framework and Gateway use TCP long connections.
Data protocol is a custom contract format.
The framework can convert HTTP to the custom protocol for proxy forwarding.
Gateway handles minimal business logic, focusing on link management and routing; it is globally deployed for better user experience.
All dynamic requests pass through the framework, enabling precise sampling for APM reports.
2.1.2 Error Monitoring Dimensions
Most network request anomalies occur during link establishment and data transmission. Custom TCP link management allows defining error codes for each failure point.
| Code | Definition | Description |
| --- | --- | --- |
| -202 | Request serialization failure | Very rare |
| -203 | No available link | Common during network instability |
| -204 | Send request failure | Socket send failure; very rare |
| -205 | Read response error | Link abnormality; cannot read the expected response header length |
| -206 | Response deserialization failure | Very rare |
| -212 | Link abnormal disconnection | Includes client network issues and backend-initiated disconnects |
| -213 | Unable to read response | Common; typically a timeout |
Standard HTTP 4xx/5xx errors are also recorded, but the above codes focus on custom TCP link anomalies.
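The error-code table can be sketched as a simple lookup. This is an illustrative mapping based on the codes above; the constant names and helper are hypothetical, not the framework's real identifiers.

```python
# Custom TCP link error codes, as described in the table above.
# Names and structure are illustrative, not the framework's real API.
LINK_ERROR_CODES = {
    -202: "Request serialization failure",
    -203: "No available link",
    -204: "Send request failure",
    -205: "Read response error",
    -206: "Response deserialization failure",
    -212: "Link abnormal disconnection",
    -213: "Unable to read response",
}

def describe_error(code: int) -> str:
    """Map a framework error code to a human-readable description."""
    return LINK_ERROR_CODES.get(code, f"Unknown error code: {code}")
```

Keeping the codes in one table makes it easy for APM reports to aggregate failures by link stage rather than by raw socket errno.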
2.1.3 Monitoring Dimensions & Goals
Two primary dimensions: request success rate and request latency.
Success Rate = successful requests / (successful + failed requests)
User-interaction target: ≥99%; overall average target: ≥98% (including launch/background scenarios)
Latency = time from the start of a client request to receipt and deserialization of the response
User-interaction target: backend processing time + 300 ms (RTT)
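The two metrics above reduce to simple arithmetic; a minimal sketch, with the 300 ms RTT allowance taken from the stated target:

```python
def success_rate(successful: int, failed: int) -> float:
    """Success Rate = successful / (successful + failed)."""
    total = successful + failed
    return successful / total if total else 0.0

def latency_target_ms(backend_processing_ms: float, rtt_ms: float = 300.0) -> float:
    """User-interaction latency target: backend processing time + ~300 ms RTT."""
    return backend_processing_ms + rtt_ms
```

For example, a service with 9,900 successes and 100 failures sits exactly at the 99% user-interaction target, and a backend that processes a request in 120 ms should complete end-to-end within roughly 420 ms.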
2.1.4 APM Reports
Sample screenshots of network performance reports:
Version‑wise performance overview with clickable filters.
Service‑specific metrics: success rate, total latency, server processing latency, sample size, error code distribution.
Each table includes a sampling button to select error‑prone or high‑latency device IDs, linking to an internal troubleshooting system for detailed analysis.
The internal troubleshooting tool can list all requests for a given link ID in chronological order, greatly improving the efficiency of network issue resolution.
2.1.5 Performance Optimization Practices
Custom communication protocol for full link control and easier debugging.
Intelligent server-IP selection: same-carrier IPs for domestic users, overseas servers for overseas users, and timezone-based selection at startup.
Reasonable timeout settings: avoid overly short timeouts (e.g., 3 s) that hurt success rate.
Enable retries for idempotent services to boost success rate.
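The retry recommendation above can be sketched as a bounded retry loop with backoff. This is a minimal illustration, not the framework's actual implementation; `send` is a hypothetical transport callable, and retries are only safe when the backend treats the request as idempotent.

```python
import time

def send_with_retry(send, payload, retries: int = 2, backoff_s: float = 0.1):
    """Retry an idempotent request a bounded number of times.

    `send` is a hypothetical transport callable that raises ConnectionError
    on link failure. Only use this for idempotent services, where a
    duplicated request cannot cause side effects.
    """
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return send(payload)
        except ConnectionError as exc:
            last_exc = exc
            if attempt < retries:
                # Exponential backoff between attempts to avoid hammering a bad link.
                time.sleep(backoff_s * (2 ** attempt))
    raise last_exc
```

Combined with the timeout advice above, this trades a small amount of added latency on failure for a measurably higher success rate.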
2.2 Page Performance
2.2.1 Page Performance Statistics Scheme
Traditional approach records a start timestamp at page initialization and an end timestamp after all required services finish, calculating the interactive time. This method is accurate but requires per‑page instrumentation and maintenance.
To reduce developer burden, a framework‑level solution was sought to automatically measure Time To Interactive (TTI).
Initial idea: capture a screenshot, trim the header and footer, divide the middle area into six blocks, randomly sample pixels from each block, and compare similarity across successive captures.
Problems discovered:
Significant performance overhead (~100 ms per screenshot).
Skeleton screens were mistakenly treated as fully rendered.
Random sampling caused variability on simple pages.
Because of these issues, the solution was not widely adopted.
Observation: page rendering completion is usually accompanied by visible text, and text scanning incurs lower overhead than image capture.
New approach replaces screenshot‑pixel detection with component traversal and text detection.
Start page initialization, traverse all elements, detect text.
Text falling within the top 20% or bottom 25% of the page (headers and footers) is ignored; at least two text groups are required for detection to succeed.
If not successful, wait 50 ms and retry; total timeout 10 s.
Demo video (Native/CRN/H5) shows a toast indicating detection completion when content loads.
The method cannot guarantee every element is fully rendered, but confirms that the page has visible content and is interactive, thus serving as the TTI measurement.
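The detection loop described above can be sketched as follows. This is an assumption-laden illustration: `get_text_elements` stands in for the framework's real view-tree traversal and returns (y-position, text) pairs for on-screen text elements; the 20%/25% exclusion bands, 50 ms poll, and 10 s timeout come from the text.

```python
import time

def detect_tti(get_text_elements, page_height: float,
               poll_ms: int = 50, timeout_s: float = 10.0):
    """Poll the view tree for visible text to estimate Time To Interactive.

    `get_text_elements` is a hypothetical hook returning (y, text) pairs
    for all text elements currently on screen. Text in the top 20% or
    bottom 25% of the page is ignored as header/footer; detection succeeds
    once at least two text groups remain.
    """
    start = time.monotonic()
    top, bottom = 0.20 * page_height, 0.75 * page_height
    while time.monotonic() - start < timeout_s:
        groups = [t for (y, t) in get_text_elements()
                  if top <= y <= bottom and t.strip()]
        if len(groups) >= 2:
            return time.monotonic() - start  # elapsed seconds = TTI estimate
        time.sleep(poll_ms / 1000.0)
    return None  # timed out: page never showed enough content
```

Because the loop only reads text metadata rather than rendering pixels, each poll is far cheaper than the ~100 ms screenshot approach it replaced.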
2.2.2 Page Performance Reports
After deployment, APM reports accurately reflect page performance.
The report supports multi‑dimensional filtering and sampling.
Each page group shows TTI distribution; 95th and 90th percentile values can be calculated for performance benchmarking.
Pages are categorized with baseline TTI targets; performance within baseline appears green, exceeding by >20% appears red, guiding optimization efforts.
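The percentile benchmarking and baseline coloring above are straightforward to compute; a minimal sketch using the nearest-rank percentile method, with the intermediate "amber" band being my assumption (the text only specifies green within baseline and red above baseline + 20%):

```python
def percentile(samples, pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for the P95 TTI."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))  # nearest-rank method
    return ordered[rank - 1]

def baseline_status(tti_ms: float, baseline_ms: float) -> str:
    """Green when within baseline, red when exceeding it by more than 20%.

    The in-between "amber" band is an illustrative assumption, not stated
    in the report design.
    """
    if tti_ms <= baseline_ms:
        return "green"
    return "red" if tti_ms > baseline_ms * 1.2 else "amber"
```

P95/P90 values are less noisy than averages for TTI, since a handful of extreme outliers (backgrounded apps, dead links) would otherwise dominate the mean.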
2.2.3 Page TTI Optimization
Move time-consuming main-thread tasks to background threads asynchronously.
Prefetch network requests to reduce TTI (e.g., 40% improvement for flight/hotel lists).
Pre‑execute tasks required by the next page during the current page. Download next‑page offline packages in advance.
CRN framework optimizations: upgrade React Native to ≥0.61, enable Hermes on Android; replace asynchronous APIs with synchronous ones where appropriate.
PreRender: delay page navigation to utilize idle time for loading resources, achieving smoother transitions.
2.3 Crash & Jank
The crash system is similar to common crash collection tools but cannot capture all crashes; some crashes are invisible to SDKs or OS logs.
2.3.1 User‑Behavior Crash
On App launch, the system checks whether the previous session crashed by comparing persisted page timestamps; if the interval is unusually short, a crash is inferred. A user‑behavior crash report aggregates these events.
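The launch-time check described above can be sketched with two persisted timestamps. This is a hedged reconstruction of the heuristic: the field names, the treatment of a missing clean-exit record, and the dwell-time threshold are all illustrative assumptions, not the system's actual values.

```python
def previous_session_crashed(page_enter_ts: float, session_end_ts,
                             min_dwell_s: float = 2.0) -> bool:
    """Heuristic user-behavior crash check, run once at App launch.

    `page_enter_ts` is the persisted timestamp of the last page the user
    entered; `session_end_ts` is the persisted graceful-exit timestamp,
    or None if the App never recorded a clean exit. The threshold is an
    illustrative value, not the system's real one.
    """
    if session_end_ts is None:
        return True  # no clean-exit record: session likely died abruptly
    # An unusually short interval between entering a page and the session
    # ending suggests the App was killed rather than exited by the user.
    return (session_end_ts - page_enter_ts) < min_dwell_s
```

Crashes inferred this way surface terminations that never reach a crash SDK, such as out-of-memory kills and watchdog terminations.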
2.3.2 Custom Exception Reporting
Developers often print exceptions without further handling. To improve observability, a logException API and corresponding reports were provided, organizing exceptions by Category (title) and Message (subtitle) with sampling for deeper analysis.
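A report payload in the shape described above might look like the following. The function name mirrors the logException API mentioned in the text, but the signature and payload fields are illustrative assumptions.

```python
import traceback

def log_exception(category: str, exc: BaseException, extra=None) -> dict:
    """Sketch of a logException-style report payload (fields are assumed).

    Reports are grouped by Category (report title) and Message (subtitle),
    with the stack trace and optional context attached so sampled reports
    can be analyzed in depth.
    """
    return {
        "category": category,   # report title, used for grouping
        "message": str(exc),    # report subtitle
        "stack": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "extra": extra or {},
    }
```

The intent is that instead of silently printing an exception, a developer catches it and reports it with a stable category, so the APM report can rank exception hot spots across releases.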
3. Summary
The APM system monitors three core metrics—network performance, page performance, and stability—providing clear visibility, daily reports, and data‑driven optimization guidance, thereby encouraging development teams to prioritize performance quality.
Improved insight into real‑world App performance.
Core metric monitoring and daily reports ensure stable operation.
Standardized optimization targets backed by data.
Motivated business developers to focus on performance.
Beyond the core features described, the system includes extensive custom functions such as alerts, specialized performance modules, and more.
Note: Reply with “apm” to the "Ctrip Technology" WeChat public account to download the presenter’s PPT.