How Xianyu Tackles Android ANR: Monitoring, Diagnosis, and Optimization Strategies
This article explains how Xianyu identifies, monitors, and resolves Android ANR issues by analyzing root causes, implementing SIGQUIT‑based detection, inspecting thread stacks, and applying concrete optimizations such as SharedPreferences replacement, network broadcast caching, and delayed component registration, ultimately cutting ANR rates by more than half.
Background
During rapid iteration of the Xianyu Android app, frequent Application Not Responding (ANR) events caused dialog prompts, forced termination, and user churn. ANRs are hard to reproduce offline because they depend on device fragmentation, system state, and user behavior, so resolving them requires robust online monitoring and targeted fixes.
Root Causes of ANR
Main‑thread overload: long‑running messages, message‑queue congestion, deadlocks, or missing scheduling of critical messages.
System overload: other threads or resources (high I/O, memory churn) heavily contend for CPU, preventing the main thread from being scheduled.
Monitoring Solutions
FileObserver on /data/anr/traces.txt (deprecated)
Earlier attempts used a FileObserver to watch /data/anr/traces.txt, but Android 6.0+ restricts access to that path for ordinary apps, so ANRs on newer devices were missed.
Main‑Thread Timeout Polling
A background thread periodically posts a message to the main thread (e.g., every 5 seconds). If the message is not consumed, the main thread is considered blocked and the system service is queried for error info. This method suffers from high false‑negative rates and performance overhead.
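The polling idea can be sketched in plain Java. This is a minimal illustration, not Xianyu's implementation: `MainThreadProbe` is a hypothetical name, and an `Executor` stands in for a Handler bound to the main Looper so the sketch runs off-device.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

/** Watchdog sketch: posts a probe to the "main thread" and waits for it to run. */
final class MainThreadProbe {
    // Stand-in for a Handler bound to the main Looper (assumption for this sketch).
    private final Executor mainThread;

    MainThreadProbe(Executor mainThread) {
        this.mainThread = mainThread;
    }

    /** Returns true if the probe was not consumed within timeoutMs. */
    boolean isBlocked(long timeoutMs) {
        CountDownLatch consumed = new CountDownLatch(1);
        mainThread.execute(consumed::countDown);
        try {
            return !consumed.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false; // an interrupted watchdog reports nothing rather than guessing
        }
    }
}
```

A watchdog thread would call `isBlocked` in a loop. A stall that begins and ends between two probes is never observed, which is one source of the false negatives mentioned above.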
SIGQUIT Signal Listening (preferred)
When the system detects an ANR it sends a SIGQUIT signal to the process, causing a stack dump. By listening for SIGQUIT the app can reliably detect ANRs with minimal impact. After receiving SIGQUIT, the app queries the system service to confirm the error belongs to its own process, filtering out cross‑process false positives. This approach is widely adopted in the industry.
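Catching SIGQUIT itself needs a native signal handler, which is outside what a Java sketch can show. The confirmation step can be illustrated, though. On Android the data would come from `ActivityManager.getProcessesInErrorState()`. Here `ErrorState` and `AnrConfirmer` are hypothetical plain-Java stand-ins, and the retry loop assumes the error record may appear slightly after the signal arrives.

```java
import java.util.List;
import java.util.function.Supplier;

/** Stand-in for ActivityManager.ProcessErrorStateInfo (pid and condition only). */
final class ErrorState {
    static final int NOT_RESPONDING = 2; // mirrors ProcessErrorStateInfo.NOT_RESPONDING
    final int pid;
    final int condition;

    ErrorState(int pid, int condition) {
        this.pid = pid;
        this.condition = condition;
    }
}

final class AnrConfirmer {
    /**
     * Checks up to `attempts` times whether the error-state source lists our
     * own pid as NOT_RESPONDING. The retry count and interval are assumptions.
     */
    static boolean isOwnProcessAnr(Supplier<List<ErrorState>> source, int myPid,
                                   int attempts, long intervalMs) {
        for (int i = 0; i < attempts; i++) {
            List<ErrorState> states = source.get(); // may be null when nothing is in error
            if (states != null) {
                for (ErrorState s : states) {
                    if (s.pid == myPid && s.condition == ErrorState.NOT_RESPONDING) {
                        return true;
                    }
                }
            }
            if (i < attempts - 1) {
                try {
                    Thread.sleep(intervalMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false; // SIGQUIT was probably triggered by another process's ANR
    }
}
```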
Investigation Framework
After an ANR is captured, the Crash SDK extracts the ANR trace, which contains stack traces of all threads. Analysts can then identify main‑thread stalls, deadlocks, or sleep calls.
Photo‑gallery scenario: the main thread waited on a background thread.
WebView scenario: the main thread repeatedly called Thread.sleep() while waiting for resource initialization.
Further analysis of nativePollOnce stacks revealed three typical situations:
1. No pending messages; the thread sleeps awaiting new events.
2. The message queue is blocked by a sync barrier, causing nativePollOnce to wait.
3. Trace generation itself is time‑consuming, so the dump is captured after the offending message has already finished.
For case 2 a hook on the message queue can detect barrier leaks (none observed in production samples). For case 3 the app records recent message‑queue history and uploads it together with the ANR dump for post‑mortem analysis.
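Recording recent message history can be as simple as a bounded buffer that is snapshotted when an ANR is captured. The following is a minimal sketch under that assumption; `MessageHistory` and its fields are hypothetical names, not the Crash SDK's API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Fixed-size history of recently dispatched messages (hypothetical helper). */
final class MessageHistory {
    static final class Entry {
        final String description; // e.g. target, callback, and what of the message
        final long startMs;
        final long durationMs;

        Entry(String description, long startMs, long durationMs) {
            this.description = description;
            this.startMs = startMs;
            this.durationMs = durationMs;
        }
    }

    private final int capacity;
    private final Deque<Entry> entries = new ArrayDeque<>();

    MessageHistory(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Appends one finished message, evicting the oldest entry when full. */
    synchronized void record(String description, long startMs, long durationMs) {
        if (entries.size() == capacity) {
            entries.removeFirst();
        }
        entries.addLast(new Entry(description, startMs, durationMs));
    }

    /** Copies the history, oldest first, for attaching to an ANR report. */
    synchronized List<Entry> snapshot() {
        return new ArrayList<>(entries);
    }
}
```

Because the buffer keeps messages that finished before the dump was taken, it can still show the slow message in the third case, where the trace alone only shows an idle nativePollOnce.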
Implementation: Looper Printer
By setting a custom Printer on the main Looper, each dispatched message’s target, callback, and what value are logged with timestamps. A secondary thread samples the main‑thread stack when messages are processed, correlating stack snapshots with specific messages.
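public final class Looper {
    public static void loop() {
        ...
        for (;;) {
            Message msg = queue.next(); // might block (nativePollOnce waits here)
            final Printer logging = me.mLogging;
            if (logging != null) {
                // Printed just before the message is handled.
                logging.println(">>>>> Dispatching to " + msg.target + " " + msg.callback + ": " + msg.what);
            }
            try {
                msg.target.dispatchMessage(msg);
            } finally {
                ...
            }
            if (logging != null) {
                // Printed right after the message has been handled.
                logging.println("<<<<< Finished to " + msg.target + " " + msg.callback);
            }
        }
        ...
    }
}
Because Looper builds these log strings for every message once a Printer is set, the Printer is enabled only for a small sample group to reduce overhead.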
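A Printer that times each message can be sketched as follows. On Android it would be installed with `Looper.getMainLooper().setMessageLogging(...)`. In this sketch the `Printer` interface is a local stand-in for `android.util.Printer`, and `DispatchTimer`, `SlowListener`, and the injected clock are hypothetical names chosen for illustration.

```java
import java.util.function.LongSupplier;

/** Stand-in for android.util.Printer so the sketch compiles off-device. */
interface Printer {
    void println(String line);
}

/** Times each message using the begin/end lines that Looper.loop() prints. */
final class DispatchTimer implements Printer {
    interface SlowListener {
        void onSlow(String message, long durationMs);
    }

    private final LongSupplier clockMs; // e.g. SystemClock::uptimeMillis on a device
    private final long thresholdMs;
    private final SlowListener listener;
    private String current; // begin line of the message in flight, or null
    private long startMs;

    DispatchTimer(LongSupplier clockMs, long thresholdMs, SlowListener listener) {
        this.clockMs = clockMs;
        this.thresholdMs = thresholdMs;
        this.listener = listener;
    }

    @Override
    public void println(String line) {
        if (line.startsWith(">")) {
            // Begin marker: remember which message started and when.
            current = line;
            startMs = clockMs.getAsLong();
        } else if (line.startsWith("<") && current != null) {
            // End marker: report the message if it ran past the threshold.
            long duration = clockMs.getAsLong() - startMs;
            if (duration >= thresholdMs) {
                listener.onSlow(current, duration);
            }
            current = null;
        }
    }
}
```

Matching only on the first character avoids depending on the exact marker text, and the injected clock keeps the class testable without a device.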
Effectiveness
Monitoring revealed a 155 ms message and a 411 ms clock‑wait, both caused by heavy initialization on the main thread and cross‑process calls, which blocked subsequent Receiver/Service messages and triggered ANR warnings.
Optimization Cases
SharedPreferences Replacement
ANR traces indicated three SharedPreferences‑related patterns:
Main thread waiting for apply() persistence.
Direct commit() on the main thread.
Blocking while loading SharedPreferences data.
Replacing SharedPreferences with MMKV eliminated these stalls. Performance tests (1,000 read/write cycles) showed MMKV’s superior speed. A compile‑time aspect intercepts all getSharedPreferences calls, returning either MMKV or the original implementation based on a whitelist, without requiring code changes in business modules.
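The routing decision behind the intercepted call might look like the sketch below. It assumes that whitelisted file names are the ones moved to MMKV, which the text above does not spell out. `KeyValueStore` and `PrefsRouter` are hypothetical stand-ins, not MMKV or Android APIs.

```java
import java.util.Set;
import java.util.function.Function;

/** Minimal stand-in for the store a caller receives (hypothetical). */
interface KeyValueStore {
    String backend();
}

/** Decides per file name which store a getSharedPreferences call receives. */
final class PrefsRouter {
    private final Set<String> mmkvWhitelist;
    private final Function<String, KeyValueStore> mmkvFactory;
    private final Function<String, KeyValueStore> originalFactory;

    PrefsRouter(Set<String> mmkvWhitelist,
                Function<String, KeyValueStore> mmkvFactory,
                Function<String, KeyValueStore> originalFactory) {
        this.mmkvWhitelist = mmkvWhitelist;
        this.mmkvFactory = mmkvFactory;
        this.originalFactory = originalFactory;
    }

    /** The woven aspect would forward every intercepted call to this method. */
    KeyValueStore getSharedPreferences(String name) {
        // Assumption: names on the whitelist are migrated, all others stay untouched.
        return mmkvWhitelist.contains(name)
                ? mmkvFactory.apply(name)
                : originalFactory.apply(name);
    }
}
```

Keeping the decision in one place means a file can be moved between backends by editing the whitelist alone.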
Network Broadcast Listener Optimization
Frequent getActiveNetworkInfo IPC calls from many broadcast listeners caused cumulative latency. The solution proxies IConnectivityManager, caches network state, and updates the cache asynchronously, allowing listeners to read the cached value instead of performing repeated IPC.
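The caching half of this fix can be sketched without the IConnectivityManager proxy. In this illustration `CachedNetworkState` is a hypothetical name, the `Supplier` stands in for the expensive getActiveNetworkInfo IPC, and the `Executor` stands in for a background thread.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Caches the result of an expensive lookup and refreshes it off the caller's thread. */
final class CachedNetworkState<T> {
    private final Supplier<T> ipcLookup; // stand-in for the getActiveNetworkInfo IPC
    private final Executor background;
    private final AtomicReference<T> cached;

    CachedNetworkState(Supplier<T> ipcLookup, Executor background, T initial) {
        this.ipcLookup = ipcLookup;
        this.background = background;
        this.cached = new AtomicReference<>(initial);
    }

    /** Called by listeners: returns the cached value without any IPC. */
    T get() {
        return cached.get();
    }

    /** Called once per connectivity change: refreshes the cache asynchronously. */
    void onConnectivityChanged() {
        background.execute(() -> cached.set(ipcLookup.get()));
    }
}
```

With this shape, N listeners reacting to one broadcast cost a single lookup instead of N. The trade-off is that a listener may briefly read the previous state until the refresh completes.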
Delayed Registration of Startup Components
Serial tasks in Application.onCreate block the main thread, preventing timely handling of critical system messages. The fix delays registration of Receivers, Services, and other components until after onCreate completes, or registers them on a background handler, reducing startup‑time ANRs.
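import android.app.Application;
import android.content.BroadcastReceiver;
import android.content.Intent;
import android.content.IntentFilter;
import android.os.Handler;
import android.os.Looper;

public class MyApplication extends Application {
    // Set once the serial startup work in onCreate() has finished.
    private volatile boolean isInitDone = false;
    private final Handler mainHandler = new Handler(Looper.getMainLooper());

    @Override
    public void onCreate() {
        super.onCreate();
        // time‑consuming serial tasks...
        isInitDone = true;
    }

    @Override
    public Intent registerReceiver(final BroadcastReceiver receiver, final IntentFilter filter) {
        if (isInitDone) {
            return super.registerReceiver(receiver, filter);
        }
        // Startup is still running: queue the registration so it executes after onCreate().
        mainHandler.post(new Runnable() {
            @Override
            public void run() {
                MyApplication.super.registerReceiver(receiver, filter);
            }
        });
        // Callers that rely on the returned sticky Intent receive null on this path.
        return null;
    }
}
Summary and Outlook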
After upgrading monitoring and investigation capabilities, Xianyu reduced its ANR rate by more than 50%, delivering a smoother user experience. Future work includes further off‑loading of critical messages to background threads and strengthening automated stability testing to catch regressions early.
