Android ANR Monitoring, Diagnosis and Governance Practice
The article explains how to monitor, locate, and remediate Android ANRs by installing a LooperPrinter‑based timeout detector, extending coverage to IdleHandler and touch events, dumping and aggregating main‑thread stacks via Firebase Crashlytics, and showcases real‑world fixes that cut online ANRs by 73.8%.
An ANR (Application Not Responding) occurs when an app stops responding to the user, typically because its main thread is blocked, and the system shows an ANR dialog. ANRs are harder to manage than crashes because they leave no crash log, which makes diagnosis difficult. Yet they badly degrade the user experience, so they must be fixed.
This article covers Android ANRs from three angles: ANR statistics (monitoring), locating and diagnosing ANRs, and a summary of online ANR governance.
ANR Statistics: The core principle of ANR monitoring is to watch the main thread from the application layer and detect timeouts asynchronously. Concretely, the app installs a LooperPrinter, a Printer set on the main thread's Looper. The Looper passes the Printer a line starting with ">>>>>" just before it dispatches each message and a line starting with "<<<<<" right after the message finishes, so these two callbacks mark the start and end of every message and give its execution time. When that time exceeds a custom threshold (e.g., 5 seconds), an ANR is considered to have occurred.
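A minimal sketch of the asynchronous timeout detection, written in plain Java so it runs off-device. `Printer` here is a local stand-in with the same single method as `android.util.Printer`; on a device the object would be passed to `Looper.getMainLooper().setMessageLogging(...)`. The class name `AnrWatchPrinter` and the "record the dispatch line" action are hypothetical; a real monitor would dump the main-thread stack at that point.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Local stand-in with the same single method as android.util.Printer.
interface Printer {
    void println(String x);
}

// Hypothetical monitor: arms a timeout on ">>>>>" and disarms it on "<<<<<".
class AnrWatchPrinter implements Printer {
    private final long thresholdMs;
    private final ScheduledExecutorService watchdog =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "anr-watchdog");
                t.setDaemon(true); // never keep the process alive just for monitoring
                return t;
            });
    private final List<String> timeouts = new CopyOnWriteArrayList<>();
    private ScheduledFuture<?> pending; // only read/written on the looper thread

    AnrWatchPrinter(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    @Override
    public void println(String x) {
        if (x.startsWith(">>>>>")) {
            // Dispatch is starting: fire on the watchdog thread if the message is still running after thresholdMs.
            final String dispatching = x;
            pending = watchdog.schedule(() -> {
                // A real monitor would dump the main-thread stack here, while it is still blocked.
                timeouts.add(dispatching);
            }, thresholdMs, TimeUnit.MILLISECONDS);
        } else if (x.startsWith("<<<<<") && pending != null) {
            // The message finished: disarm the timer (no effect if it already fired).
            pending.cancel(false);
            pending = null;
        }
    }

    List<String> timeouts() {
        return timeouts;
    }

    void shutdown() {
        watchdog.shutdownNow();
    }
}
```

Arming a timer at ">>>>>" rather than only measuring at "<<<<<" matters because a message that truly hangs never prints its end line. Note also that `setMessageLogging` holds a single Printer, so installing this one replaces any Printer another SDK may have set.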
The existing monitoring framework has blind spots: it cannot detect stalls in IdleHandler or in View touch-event handling, because both run inside MessageQueue.next(), outside the ">>>>>"/"<<<<<" window that the Printer measures. For IdleHandler, the solution is to use reflection to replace MessageQueue's mIdleHandlers list with a custom MyArrayList that measures how long each queueIdle() call takes. For touch events, a PLT hook on libinput.so could, in principle, detect touch-event stalls.
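A sketch of the MyArrayList idea, again with a local `IdleHandler` interface that mirrors `MessageQueue.IdleHandler` so it runs on a plain JVM. It assumes the queue registers handlers through `add()` and unregisters them through `remove()`, and that existing entries are copied into the new list when it is swapped in via reflection (not shown; access to hidden fields such as `mIdleHandlers` may be restricted on newer Android versions). The wrapper class and the `slowHandlers` report are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Local stand-in with the same single method as android.os.MessageQueue.IdleHandler.
interface IdleHandler {
    boolean queueIdle();
}

// Hypothetical replacement for MessageQueue.mIdleHandlers: wraps each handler so queueIdle() is timed.
class MyArrayList extends ArrayList<IdleHandler> {
    private final long thresholdMs;
    final List<String> slowHandlers = new CopyOnWriteArrayList<>();

    MyArrayList(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    @Override
    public boolean add(IdleHandler handler) {
        return super.add(new TimedIdleHandler(handler));
    }

    @Override
    public boolean remove(Object o) {
        // Callers removing the handler they originally added must find its wrapper.
        for (int i = 0; i < size(); i++) {
            IdleHandler h = get(i);
            if (h instanceof TimedIdleHandler && ((TimedIdleHandler) h).delegate == o) {
                super.remove(i);
                return true;
            }
        }
        // Otherwise o may be a wrapper itself (e.g. taken from a toArray() copy); equals() is identity here.
        return super.remove(o);
    }

    private final class TimedIdleHandler implements IdleHandler {
        final IdleHandler delegate;

        TimedIdleHandler(IdleHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public boolean queueIdle() {
            long start = System.nanoTime();
            try {
                return delegate.queueIdle();
            } finally {
                long costMs = (System.nanoTime() - start) / 1_000_000;
                if (costMs >= thresholdMs) {
                    slowHandlers.add(delegate.getClass().getName() + " took " + costMs + " ms");
                }
            }
        }
    }
}
```

Wrapping in `add()` keeps the queue's own iteration untouched: whatever copy of the list the queue iterates, each element is already a timed wrapper.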
ANR Stack Trace Location: When an ANR occurs, the goal is to find where the main thread is stuck. The approach is to dump the main thread's current stack from a background thread and upload it to the APM platform. The difficulty lies in aggregating and deduplicating these stacks. The solution reuses Firebase Crashlytics' stack aggregation by calling one of its private APIs through reflection to report custom stacks.
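A sketch of dumping the main thread's stack from the watchdog thread. `Thread.getStackTrace()` may be called on another thread, and packing those frames into a `Throwable` via `setStackTrace()` yields an object that crash-reporting tools can group. On Android the main thread is available as `Looper.getMainLooper().getThread()`. The article reports through a private Crashlytics API; this sketch stops before reporting. The public `FirebaseCrashlytics.getInstance().recordException(Throwable)` records such an exception as a non-fatal, though whether its grouping matches the article's setup is not established here. `AnrStackSnapshot` and `aggregationKey` are hypothetical; the key is a naive top-frames grouping, not Crashlytics' algorithm.

```java
// Hypothetical carrier type: holds the blocked thread's frames so a crash/APM reporter can group them.
class AnrStackSnapshot extends Exception {
    AnrStackSnapshot(String message, StackTraceElement[] frames) {
        super(message);
        // Replace the watchdog thread's own frames (captured by the constructor) with the blocked thread's.
        setStackTrace(frames);
    }
}

class MainThreadStackDumper {
    // Called on the watchdog thread once a main-thread message has exceeded the threshold.
    static AnrStackSnapshot capture(Thread mainThread, long blockedMs) {
        StackTraceElement[] frames = mainThread.getStackTrace();
        return new AnrStackSnapshot(
                "Main thread blocked for " + blockedMs + " ms, state " + mainThread.getState(), frames);
    }

    // Naive grouping key built from the top frames; illustrative only.
    static String aggregationKey(StackTraceElement[] frames, int topFrames) {
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < Math.min(topFrames, frames.length); i++) {
            key.append(frames[i].getClassName()).append('#').append(frames[i].getMethodName()).append('\n');
        }
        return key.toString();
    }
}
```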
Real Cases:
Case 1: String.format() performance. Under frequent calls, String.format() is 40-60 times slower than simple string concatenation.
Case 2: A network diagnostic library started a new 30-second loop every time a page was created, so loops piled up and acquiring system services became increasingly expensive.
Case 3: Calling Runtime.getRuntime().exec() to detect the Xiaomi ROM is a blocking call that easily freezes the app on certain devices.
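Sketches of the kind of fix Cases 1 and 3 call for (Case 2 depends on the diagnostic library's internals and is not shown). Case 1 swaps String.format() for concatenation on a hot path; the two give the same output only where %d formats with ASCII digits, since String.format() can localize digits in some locales. Case 3 runs the blocking ROM probe once, on a background thread, and caches the result; `romProbe` is a hypothetical stand-in for the original Runtime.getRuntime().exec() call, whose exact command the article does not give.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

class AnrFixes {
    // Case 1, before: String.format() parses the pattern on every call, which adds up on hot paths.
    static String itemKeyWithFormat(String page, int position) {
        return String.format("%s_%d", page, position);
    }

    // Case 1, after: plain concatenation, no pattern parsing.
    static String itemKeyWithConcat(String page, int position) {
        return page + "_" + position;
    }

    // Case 3: run a blocking ROM probe once, off the main thread, and cache the answer.
    static final class RomCheck {
        private final CompletableFuture<Boolean> result;

        // romProbe is a hypothetical stand-in for the original blocking exec()-based check.
        RomCheck(Supplier<Boolean> romProbe) {
            // supplyAsync runs the probe on an executor thread, never on the calling (main) thread.
            this.result = CompletableFuture.supplyAsync(romProbe);
        }

        // Never blocks: returns fallback until the probe completes.
        // If the probe itself threw, getNow rethrows it as a CompletionException.
        boolean isXiaomiRomOr(boolean fallback) {
            return result.getNow(fallback);
        }
    }
}
```

Returning a caller-supplied fallback keeps the main thread from ever waiting on the probe; the trade-off is that very early callers may see the fallback instead of the real answer.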
After implementing these fixes, online ANR occurrences decreased by 73.8%.
NetEase Yanxuan Technology Product Team
The NetEase Yanxuan Technology Product Team shares practical tech insights for the e‑commerce ecosystem. This official channel periodically publishes technical articles, team events, recruitment information, and more.