Detecting Android ANR Root Causes: Thread Deadlocks, Barrier Leaks, and Attribution
This article explains how ANRCanary attributes Android ANR problems by analyzing thread deadlocks, barrier message leaks, aggregation signatures, and key‑function extraction, providing developers with concrete detection algorithms and practical examples to resolve high‑impact ANRs.
Introduction
The previous article "DingTalk ANR Governance Best Practices | Locating ANR Without Guesswork" showed that an ANR ranking dominated by nativePollOnce is misleading, because aggregating by ANR trace cannot pinpoint the root cause of an ANR. This article focuses on ANRCanary's attribution algorithm and aggregation reporting, which help engineers quickly analyze and locate the top ANR issues.
1. Glossary
2. Other ANR Causes
In an app's runtime environment, the main thread is not isolated, so ANRs can arise from more than just long‑running main‑thread tasks. Below are two additional ANR scenarios encountered in DingTalk.
2.1 Thread Deadlock Detection
Thread deadlocks can block the main thread, leading to ANR.
Background
Deadlocks occur when two or more threads enter a circular wait for locks.
The key is to obtain each thread's held locks and the lock it is waiting for, build a directed graph of these dependencies, and check that graph for cycles. A deadlock-free graph is acyclic (a DAG), so any cycle indicates a deadlock.
Obtaining Thread Lock Information
First, look at VMStack:
/**
* @hide
*/
public final class VMStack {
...
/**
* @hide
*/
@SystemApi(client = MODULE_LIBRARIES)
native public static @Nullable AnnotatedStackTraceElement[] getAnnotatedThreadStackTrace(Thread t);
...
}
The hidden method VMStack#getAnnotatedThreadStackTrace() returns an array of AnnotatedStackTraceElement, which contains lock state for each stack frame.
Next, examine AnnotatedStackTraceElement:
/*
* A class encapsulating a StackTraceElement and lock state. This adds
* critical thread state to the standard stack trace information, which
* can be used to detect deadlocks at the Java level.
*/
@SystemApi(client = MODULE_LIBRARIES)
public final class AnnotatedStackTraceElement {
private StackTraceElement stackTraceElement;
private Object[] heldLocks;
private Object blockedOn;
}
Fields:
heldLocks – array of lock objects currently held by the thread.
blockedOn – the lock object the thread is waiting for.
Using reflection, these lock details can be retrieved and fed into deadlock detection.
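The reflective lookup could be sketched roughly as follows. This is a minimal illustration, not ANRCanary's actual code: the class name LockInfoReader and the FrameLocks holder are hypothetical, and the sketch assumes the hidden API is reachable at runtime. On platforms that restrict hidden-API access, or on a non-ART JVM, the lookup fails and the method simply returns an empty list.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reads per-frame lock state through reflection. */
final class LockInfoReader {

    /** Lock state of one stack frame (illustrative holder type). */
    static final class FrameLocks {
        final Object[] heldLocks; // locks held in this frame, may be null
        final Object blockedOn;   // lock this frame is waiting for, may be null

        FrameLocks(Object[] heldLocks, Object blockedOn) {
            this.heldLocks = heldLocks;
            this.blockedOn = blockedOn;
        }
    }

    /** Returns lock state per frame, or an empty list if the hidden API is unavailable. */
    static List<FrameLocks> read(Thread thread) {
        List<FrameLocks> result = new ArrayList<>();
        try {
            Class<?> vmStack = Class.forName("dalvik.system.VMStack");
            Method method = vmStack.getDeclaredMethod("getAnnotatedThreadStackTrace", Thread.class);
            method.setAccessible(true);
            Object[] frames = (Object[]) method.invoke(null, thread);
            if (frames == null) {
                return result; // thread has no stack (for example, it already exited)
            }
            for (Object frame : frames) {
                Object[] held = (Object[]) readField(frame, "heldLocks");
                Object blocked = readField(frame, "blockedOn");
                result.add(new FrameLocks(held, blocked));
            }
        } catch (ReflectiveOperationException | RuntimeException e) {
            // Assumption: treat any failure as "no lock information available".
            result.clear();
        }
        return result;
    }

    private static Object readField(Object target, String name) throws ReflectiveOperationException {
        Field field = target.getClass().getDeclaredField(name);
        field.setAccessible(true);
        return field.get(target);
    }
}
```

Returning an empty list on failure keeps the caller simple: the deadlock check can run unconditionally and just finds nothing when lock data is missing.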
Complete Deadlock Detection Flow
The process:
The deadlock module obtains all thread objects and calls VMStack#getAnnotatedThreadStackTrace() to get AnnotatedStackTraceElement[].
It builds a collection of Nodes, each representing a lock dependency (held lock → waiting lock).
The Node collection is passed to a DAG module for cycle detection.
If a cycle is found, the module reports a deadlock.
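The cycle check in this flow can be sketched as below. The sketch uses an equivalent thread-level formulation (thread waits for lock, lock is owned by thread) rather than the article's Node structure, whose exact shape is not shown. DeadlockDetector, ThreadLocks, and the use of string lock identifiers are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical cycle detector over thread lock dependencies. */
final class DeadlockDetector {

    /** One thread's lock state: the locks it holds and the lock it is blocked on (or null). */
    static final class ThreadLocks {
        final String thread;
        final Set<String> held;
        final String blockedOn;

        ThreadLocks(String thread, Set<String> held, String blockedOn) {
            this.thread = thread;
            this.held = held;
            this.blockedOn = blockedOn;
        }
    }

    /** Returns the threads forming a circular wait, or an empty list if there is none. */
    static List<String> findCycle(List<ThreadLocks> threads) {
        Map<String, String> ownerOfLock = new HashMap<>();  // lock id -> owning thread
        Map<String, String> waitingFor = new HashMap<>();   // thread -> lock id it waits for
        for (ThreadLocks t : threads) {
            for (String lock : t.held) {
                ownerOfLock.put(lock, t.thread);
            }
            if (t.blockedOn != null) {
                waitingFor.put(t.thread, t.blockedOn);
            }
        }
        // Each thread waits for at most one lock, so following the chain
        // thread -> awaited lock -> lock owner either ends or revisits a thread.
        for (String start : waitingFor.keySet()) {
            List<String> path = new ArrayList<>();
            String current = start;
            while (current != null && !path.contains(current)) {
                path.add(current);
                String lock = waitingFor.get(current);
                current = (lock == null) ? null : ownerOfLock.get(lock);
            }
            if (current != null) {
                // 'current' was seen before: everything from its first visit is the cycle.
                return new ArrayList<>(path.subList(path.indexOf(current), path.size()));
            }
        }
        return Collections.emptyList();
    }
}
```

Fed with two threads that each hold the lock the other is waiting for, findCycle returns both thread names; if either thread is not blocked, it returns an empty list.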
Case Study: Sub‑process Thread Deadlock Causing Main‑process ANR
{
"case1":{
"threadName":"thread-1",
"threadStackList":[
"com.alibaba.dingtalk.android.o.a(Unknown Source:???)",
"- waiting on <90707987> (a com.alibaba.dingtalk.android.o)",
"com.alibaba.dingtalk.android.q.a(SourceFile:???)",
"- locked <106576464> (a com.alibaba.dingtalk.android.v)",
"com.alibaba.dingtalk.android.v.a(SourceFile:???)",
"- locked <106576464> (a com.alibaba.dingtalk.android.v)",
"com.alibaba.dingtalk.android.xxx.hta(SourceFile:???)",
"com.alibaba.dingtalk.mp.service.psc$b.run(SourceFile:???)",
"android.os.Handler.handleCallback(Handler.java:900)",
"android.os.Handler.dispatchMessage(Handler.java:103)",
"android.os.Looper.loop(Looper.java:219)",
"android.os.HandlerThread.run(HandlerThread.java:67)"
]
},
"case2":{
"name":"thread-2",
"threadStackList":[
"com.alibaba.dingtalk.android.r.a(SourceFile:???)",
"- waiting on <106576464> (a com.alibaba.dingtalk.android.v)",
"com.alibaba.dingtalk.android.r.a(SourceFile:???)",
"com.alibaba.dingtalk.android.o.a(SourceFile:???)",
"- locked <90707987> (a com.alibaba.dingtalk.android.o)",
"com.alibaba.dingtalk.android.r.b(SourceFile:???)",
"com.alibaba.dingtalk.android.o$h.b(SourceFile:???)",
"com.alibaba.dingtalk.android.r0$b.b(SourceFile:???)",
"com.alibaba.dingtalk.android.d0$d.run(SourceFile:???)",
"android.os.Handler.handleCallback(Handler.java:900)",
"android.os.Handler.dispatchMessage(Handler.java:103)",
"android.os.Looper.loop(Looper.java:219)",
"android.os.HandlerThread.run(HandlerThread.java:67)"
]
}
}
The example shows two threads each holding one lock and waiting for the other's lock, forming a circular wait. Resolving this sub‑process deadlock also eliminated the main‑process ANR.
2.2 Barrier Message Leak
Barrier message leaks are another cause of nativePollOnce ANR. When a Barrier message is not cleared, normal messages remain blocked, causing continuous ANRs.
Android Barrier Mechanism
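Barrier messages (sync barriers) are special queue entries that have no target handler and are never executed. While a Barrier sits at the head of the queue, synchronous messages behind it are held back and only asynchronous messages are dispatched, which ensures UI‑refresh messages run first.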
They act like a fence, holding back normal messages until the last async UI message finishes.
If the final async message is lost, the fence stays, blocking normal messages and causing ANR.
Barrier Leak Detection Mechanism
ANRCanary implements the following detection:
A dedicated background thread periodically checks if the first message in the main queue is a Barrier and has been blocked for over 10 seconds; if so, it triggers verification.
The verifier posts three async and three sync messages to the main thread.
Async messages increment a verification counter; sync messages reset it to zero.
If no Barrier has leaked (the Barrier was removed normally), both async and sync messages execute in posting order, and the final sync message leaves the counter at zero.
If the Barrier leaked, only async messages run, leaving the counter at three, which triggers automatic removal of the leaked Barrier.
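The counter logic of this verification can be illustrated with a toy model in plain Java. It does not use android.os.MessageQueue: ToyQueue, Msg, and BarrierLeakProbe are hypothetical stand-ins, and the interleaved posting order (async, sync, async, sync, ...) is an assumption, since the exact order is not stated above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the barrier-leak probe; not real Android message-queue code. */
final class BarrierLeakProbe {

    /** Toy message: async messages may pass a barrier, sync messages may not. */
    static final class Msg {
        final boolean async;
        final Runnable body;

        Msg(boolean async, Runnable body) {
            this.async = async;
            this.body = body;
        }
    }

    /** Toy queue: runs messages in order, but holds back sync messages while a barrier is present. */
    static final class ToyQueue {
        final List<Msg> pending = new ArrayList<>();
        boolean barrierPresent;

        void post(Msg msg) {
            pending.add(msg);
        }

        void drain() {
            List<Msg> stuck = new ArrayList<>();
            for (Msg msg : pending) {
                if (barrierPresent && !msg.async) {
                    stuck.add(msg); // blocked behind the barrier
                } else {
                    msg.body.run();
                }
            }
            pending.clear();
            pending.addAll(stuck);
        }
    }

    /** Posts three async and three sync probe messages and returns the final counter. */
    static int probe(ToyQueue queue) {
        int[] counter = {0};
        for (int i = 0; i < 3; i++) {
            queue.post(new Msg(true, () -> { counter[0]++; }));   // async: increment
            queue.post(new Msg(false, () -> { counter[0] = 0; })); // sync: reset
        }
        queue.drain();
        return counter[0];
    }

    /** A counter of three means only the async probes ran, so a barrier is still in place. */
    static boolean barrierLeaked(ToyQueue queue) {
        return probe(queue) == 3;
    }
}
```

In this model a queue with barrierPresent set to true ends with a counter of three, and a queue without a barrier ends at zero, matching the two outcomes described above.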
3. Aggregation Signature
To build a dashboard of ANR causes, each ANR instance is assigned an aggregation signature – a key string that groups similar ANRs together. Requirements:
Different ANR reasons produce different signatures.
Same reason across users, app versions, etc., yields the same signature.
The number of distinct signatures must remain bounded.
Example of a complex “Huge” Android Message task signature:
huge|Choreographer$FrameHandler|Choreographer$FrameDisplayEventReceiver|0|android.widget.ListView.makeAndAddView
The signature consists of:
Attribution type – primary cause category.
Message info – handler class, runnable class, what field.
Key function info – function that can further split the group.
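Assembling such a signature could look like the sketch below. The field order follows the example signature; SignatureBuilder and its parameter names are hypothetical, and how ANRCanary encodes a missing key function is not covered here.

```java
/** Hypothetical builder that joins signature parts with '|', following the example's field order. */
final class SignatureBuilder {

    static String build(String attributionType,
                        String handlerClass,
                        String runnableClass,
                        int what,
                        String keyFunction) {
        return String.join("|",
                attributionType,       // primary cause category, e.g. "huge"
                handlerClass,          // message handler class
                runnableClass,         // message callback (runnable) class
                String.valueOf(what),  // Message.what value
                keyFunction);          // function used to split the group further
    }
}
```

Because the signature contains only class names, the what value, and a function name, it stays identical across users and devices for the same cause, which is what keeps the number of distinct signatures bounded.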
4. ANR Attribution Calculation
When an ANR occurs, ANRCanary captures first‑hand data: historical tasks, current running task, pending message list, etc. The attribution engine analyses this data to pinpoint the root cause.
5. Key Function Extraction
Because a main‑thread task may involve many functions, the engine normalises stack samples to identify the most time‑consuming function, called the key function.
5.1 Example
Assume each stack depth is 10 and sampling intervals are equal.
Sample 1: five stacks, first two share depth 8, last three share depth 8; the deeper common function in the last three is the key.
Sample 2: four stacks, first two share depth 5, last two share depth 8; the deeper common function in the latter pair is the key.
Sample 3: four stacks, all share depth 5, middle two share depth 8; a weighted calculation decides which side to pick.
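The "shared depth" in these samples can be read as the number of leading frames, counted from the stack root, that two sampled stacks have in common. A minimal sketch under that assumption follows; StackCompare is a hypothetical name, and stacks are represented as frame lists ordered root first.

```java
import java.util.List;

/** Hypothetical helper: measures how many frames two sampled stacks share from the root. */
final class StackCompare {

    /** Both lists are ordered root first; returns the length of their common prefix. */
    static int sharedDepth(List<String> first, List<String> second) {
        int limit = Math.min(first.size(), second.size());
        int depth = 0;
        while (depth < limit && first.get(depth).equals(second.get(depth))) {
            depth++;
        }
        return depth;
    }
}
```

Consecutive samples with a large shared depth suggest the thread stayed inside the same deep call for the whole interval, which is what makes the frame at that depth a key-function candidate.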
5.2 Normalised Weight Calculation
Duration (X‑axis) and depth (Y‑axis) are normalised to [0,1]. The Euclidean distance from the origin determines weight; the function with the larger distance is chosen as the key.
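As a rough sketch of this weight, assume duration is normalised by the task's total sampled duration and depth by the maximum stack depth; both denominators are assumptions, as is the class name KeyFunctionWeight.

```java
/** Hypothetical weight calculation: Euclidean distance of (duration, depth) from the origin. */
final class KeyFunctionWeight {

    static double weight(double duration, double totalDuration, int depth, int maxDepth) {
        double x = duration / totalDuration;   // normalised duration in [0, 1]
        double y = (double) depth / maxDepth;  // normalised depth in [0, 1]
        return Math.hypot(x, y);               // sqrt(x^2 + y^2)
    }
}
```

Applied to Sample 3 with a stack depth of 10 and four equally spaced samples, the depth-5 function spanning all four samples gets sqrt(1.0² + 0.5²) ≈ 1.118, while the depth-8 function spanning two samples gets sqrt(0.5² + 0.8²) ≈ 0.943, so under these assumed normalisations the shallower but longer-running function would be picked.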
5.3 No Key Function Cases
If a Huge task has only one or no stack, no key function can be derived.
If the only candidate is the root handler (e.g., Handler#handleMessage or Runnable#run), it is treated as having no key function.
6. ANR Attribution Monitoring Platform
Aggregating signatures and counting occurrences yields a ranking of top ANR problems, which differs from the Crash SDK’s nativePollOnce ranking, surfacing previously hidden issues.
7. Next Steps
With the monitoring platform in place, DingTalk can start fixing the most frequent ANR problems. The next article will present real‑world case studies of ANRCanary in action.
References
[1] VMStack source: https://cs.android.com/android/platform/superproject/+/master:libcore/libart/src/main/java/dalvik/system/VMStack.java;l=35
[2] AnnotatedStackTraceElement source: https://cs.android.com/android/platform/superproject/+/master:libcore/libart/src/main/java/dalvik/system/AnnotatedStackTraceElement.java
