
How DingTalk’s ANRCanary Turns Android ANR Detection from Guesswork to Precision

This article introduces DingTalk’s self‑built ANRCanary, explains the complete Android ANR lifecycle, details how it captures and aggregates main‑thread tasks, filters false positives, and provides concrete code‑level insights to help mobile developers pinpoint and resolve ANR issues efficiently.

Alibaba Terminal Technology

Glossary

The series "DingTalk ANR Governance Best Practices" begins with "Locating ANR without Fog", introducing the self‑developed ANRCanary that monitors main‑thread execution to provide richer information for ANR diagnosis.

System ANR Full Process

The system ANR process consists of three parts:

Timeout detection

ANR information collection

ANR information output

For timeout detection logic, see Android source ProcessRecord.java. The collection and output flows are illustrated below:

Key insights from the system source:

ANR Trace stack capture is delayed and may not represent the root cause.

System Server sends SIGQUIT to multiple processes to request stack dumps.

The app can detect a foreground ANR via its process ANR error state.

ANR Trace: Marking the Boat to Find the Sword

When a broadcast causes an ANR, System Server detects a timeout, sends SIGQUIT, and the app dumps all thread stacks as ANR Trace.

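The dump is delayed. By the time it runs, the long-running message that actually caused the ANR may already have finished, and whatever message happens to be executing is captured as a scapegoat. Like the proverb of marking the side of a moving boat to find a sword dropped overboard, the ANR Trace looks in the wrong place and is therefore unreliable.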

ANR False‑Positive Filtering

Because receiving SIGQUIT does not guarantee that the current process has a foreground ANR, DingTalk adds a secondary confirmation:

After receiving SIGQUIT, poll the process error state for up to 20 seconds to confirm a foreground ANR.

If the ANR occurred in the background, the system kills the process; if the SIGQUIT was triggered by another process's ANR, the current process keeps running. A persisted record distinguishes the two cases.
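The polling step can be sketched in plain Java as follows. This is a minimal sketch, not DingTalk's implementation: `AnrConfirmer`, the `BooleanSupplier` probe, and the poll interval are assumptions made for illustration. On Android, the probe would typically wrap `ActivityManager.getProcessesInErrorState()` and look for an entry for the current pid whose `condition` is `ProcessErrorStateInfo.NOT_RESPONDING`.

```java
import java.util.function.BooleanSupplier;

/** Secondary ANR confirmation: poll an error-state probe after SIGQUIT arrives. */
final class AnrConfirmer {
    enum Result { FOREGROUND_ANR, NOT_CONFIRMED }

    // Hypothetical probe; on Android it would query
    // ActivityManager.getProcessesInErrorState() for NOT_RESPONDING.
    private final BooleanSupplier notRespondingProbe;
    private final long windowMs;   // total polling window, e.g. 20_000
    private final long intervalMs; // delay between polls (assumed value)

    AnrConfirmer(BooleanSupplier notRespondingProbe, long windowMs, long intervalMs) {
        this.notRespondingProbe = notRespondingProbe;
        this.windowMs = windowMs;
        this.intervalMs = intervalMs;
    }

    /** Blocks the calling (non-main) thread until confirmed or the window ends. */
    Result confirm() {
        long deadline = System.nanoTime() + windowMs * 1_000_000L;
        while (true) {
            if (notRespondingProbe.getAsBoolean()) {
                return Result.FOREGROUND_ANR;
            }
            if (System.nanoTime() - deadline >= 0) {
                return Result.NOT_CONFIRMED;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return Result.NOT_CONFIRMED;
            }
        }
    }
}
```

A caller would run `new AnrConfirmer(probe, 20_000, 500).confirm()` on a worker thread right after the SIGQUIT handler fires, and treat `NOT_CONFIRMED` as a signal that needs the persisted record to be classified.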


ANR Monitoring Tool

DingTalk’s self‑built ANRCanary continuously records the execution time of the most recent main‑thread tasks. When an ANR occurs, it locates the root cause based on the longest‑running message.

Compared with the traditional ANR Trace, ANRCanary expands a single snapshot into a timeline of main‑thread task durations, which avoids the stale‑snapshot problem described above.
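The difference between a snapshot and a timeline can be shown with a small sketch. The `RootCauseFinder` class and the task names below are hypothetical and only illustrate the idea: a snapshot blames whatever is running at dump time, while a timeline lets you pick the longest task.

```java
import java.util.List;

/** Contrasts a single stack snapshot with a timeline of main-thread tasks. */
final class RootCauseFinder {
    /** One main-thread task on the timeline; all names are hypothetical. */
    static final class Task {
        final String name;
        final long startMs;
        final long endMs;

        Task(String name, long startMs, long endMs) {
            this.name = name;
            this.startMs = startMs;
            this.endMs = endMs;
        }

        long wallMs() {
            return endMs - startMs;
        }
    }

    /** What a snapshot taken at dumpMs would blame, or null if the thread is idle. */
    static Task runningAt(List<Task> timeline, long dumpMs) {
        for (Task t : timeline) {
            if (t.startMs <= dumpMs && dumpMs < t.endMs) {
                return t;
            }
        }
        return null;
    }

    /** What the timeline points to: the longest task, or null if it is empty. */
    static Task longest(List<Task> timeline) {
        Task best = null;
        for (Task t : timeline) {
            if (best == null || t.wallMs() > best.wallMs()) {
                best = t;
            }
        }
        return best;
    }
}
```

With a 9.8-second task that ends just before the dump and a 20 ms task running during it, `runningAt` returns the short task and `longest` returns the slow one.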

Historical Task Monitoring

Android main‑thread tasks can be roughly classified as:

Handler messages: the most common main‑thread tasks.

IdleHandler: executed when the message queue becomes idle.

nativePollOnce: tasks dispatched from the native layer while the thread is inside nativePollOnce, such as touch events and sensor events.

The goal of historical task monitoring is to capture the start and end time of each main‑thread task, using a hook suited to each task type.
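For Handler messages, one widely used hook is `Looper.setMessageLogging(Printer)`, which makes the Looper print a line starting with `>>>>> Dispatching to` before each message and `<<<<< Finished to` after it. The article does not name the hook ANRCanary uses, so treat the following as a sketch under that assumption. `MessageTimer` and its injected clock are hypothetical names; the `println` method has the same shape as `android.util.Printer#println`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Times Handler messages from Looper-style log lines. */
final class MessageTimer {
    /** Start time and wall duration of one dispatched message. */
    static final class TaskRecord {
        final String message;
        final long startMs;
        final long wallDurationMs;

        TaskRecord(String message, long startMs, long wallDurationMs) {
            this.message = message;
            this.startMs = startMs;
            this.wallDurationMs = wallDurationMs;
        }
    }

    private final LongSupplier clockMs; // injected clock, e.g. an uptime source
    private final List<TaskRecord> records = new ArrayList<>();
    private String currentMessage;
    private long currentStartMs;

    MessageTimer(LongSupplier clockMs) {
        this.clockMs = clockMs;
    }

    /** Called once before and once after every dispatched message. */
    void println(String line) {
        if (line.startsWith(">>>>> Dispatching to")) {
            currentMessage = line;
            currentStartMs = clockMs.getAsLong();
        } else if (line.startsWith("<<<<< Finished to") && currentMessage != null) {
            long wall = clockMs.getAsLong() - currentStartMs;
            records.add(new TaskRecord(currentMessage, currentStartMs, wall));
            currentMessage = null; // ignore unmatched "Finished" lines
        }
    }

    List<TaskRecord> records() {
        return records;
    }
}
```

IdleHandler and native-layer tasks do not pass through this callback, which is why they need separate hooks.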

FakeIdle tasks are identified by exclusion:

Periods whose sampled stacks remain in nativePollOnce are identified as idle.

Periods that are neither a Message, nor an IdleHandler, nor idle are labeled as FakeIdle tasks.
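The exclusion rule can be sketched as a check on the stacks sampled during a gap between Messages and IdleHandlers. `GapClassifier` is a hypothetical name, and treating a gap with no samples as idle is an assumption of this sketch rather than something the article specifies.

```java
import java.util.List;

/** Classifies a gap between Handler messages as real idle time or FakeIdle. */
final class GapClassifier {
    enum GapType { IDLE, FAKE_IDLE }

    /** Top frame of a main thread that is blocked waiting for work. */
    static final String POLL_FRAME = "android.os.MessageQueue.nativePollOnce";

    /**
     * Each element is one sampled stack, top frame first. The gap counts as
     * idle only if every sample is parked in nativePollOnce. A gap with no
     * samples is treated as idle (an assumption of this sketch).
     */
    static GapType classify(List<String[]> sampledStacks) {
        for (String[] stack : sampledStacks) {
            boolean inPoll = stack.length > 0 && stack[0].startsWith(POLL_FRAME);
            if (!inPoll) {
                return GapType.FAKE_IDLE;
            }
        }
        return GapType.IDLE;
    }
}
```

When the native layer dispatches a touch event, nativePollOnce is still on the stack but no longer at the top, so the sample is classified as FakeIdle.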

Historical Task Aggregation

For ANR analysis, long‑running tasks are the focus; short tasks can be ignored. Aggregation reduces memory operations and compresses redundant data.

Aggregated historical task records are categorized as:

Aggregated: multiple consecutive short tasks whose cumulative duration exceeds a threshold.

Huge: a single task whose duration exceeds the threshold; the short tasks before it are aggregated into a separate record.

Idle: periods when the main thread is idle.

Key: messages from the four Android application components (Activity, Service, BroadcastReceiver, ContentProvider) that may trigger an ANR.

Freeze: tasks frozen when the app is backgrounded on certain devices, resuming only when the app returns to the foreground.
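The first three record types can be sketched as a small streaming aggregator. `TaskAggregator` and its method names are hypothetical, Key and Freeze records are left out for brevity, and recording every idle period as its own entry is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Folds a stream of main-thread task durations into aggregated records. */
final class TaskAggregator {
    enum Type { AGGREGATED, HUGE, IDLE }

    /** One aggregated record: its type, task count, and total wall time. */
    static final class Entry {
        final Type type;
        final int count;
        final long wallDurationMs;

        Entry(Type type, int count, long wallDurationMs) {
            this.type = type;
            this.count = count;
            this.wallDurationMs = wallDurationMs;
        }
    }

    private final long thresholdMs;
    private final List<Entry> entries = new ArrayList<>();
    private int pendingCount;  // short tasks not yet written out
    private long pendingMs;    // their cumulative wall time

    TaskAggregator(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    void onTask(long wallMs) {
        if (wallMs > thresholdMs) {
            flush(); // preceding short tasks get their own record
            entries.add(new Entry(Type.HUGE, 1, wallMs));
            return;
        }
        pendingCount++;
        pendingMs += wallMs;
        if (pendingMs > thresholdMs) {
            flush();
        }
    }

    void onIdle(long wallMs) {
        flush();
        entries.add(new Entry(Type.IDLE, 1, wallMs));
    }

    private void flush() {
        if (pendingCount > 0) {
            entries.add(new Entry(Type.AGGREGATED, pendingCount, pendingMs));
            pendingCount = 0;
            pendingMs = 0;
        }
    }

    List<Entry> entries() {
        return entries;
    }
}
```

Because short tasks only update two counters until the threshold is crossed, the number of records written stays small even when the main thread processes thousands of trivial messages.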

Current Running Task

ANRCanary also records the task that is running at the moment the ANR is reported, which helps developers quickly rule out tasks that merely happened to be on the stack.

{
    "runningTaskInfo":{
        "stackTrace":[
            "android.os.MessageQueue.nativePollOnce(Native Method)",
            "android.os.MessageQueue.next(MessageQueue.java:363)",
            "android.os.Looper.loop(Looper.java:176)",
            "android.app.ActivityThread.main(ActivityThread.java:8668)",
            "java.lang.reflect.Method.invoke(Native Method)",
            "com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:513)",
            "com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1109)"
        ],
        "type":"IDLE",
        "wallDuration":519
    }
}

In this example, the main thread had been idle for 519 ms when the ANR occurred, so the current task can be ruled out as the cause.

Pending Message List

The pending message list reveals:

Whether messages are blocked and for how long, which indicates how heavily loaded the main thread is.

Whether a Barrier message has leaked: a sync barrier that is never removed blocks the synchronous messages behind it indefinitely.

Whether the queue is filled with duplicate messages, which may point to a business‑logic error.

Together, the historical tasks, the current task, and the pending message list cover the past, present, and future of main‑thread activity.
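The three checks on the pending message list can be sketched as follows. `PendingQueueReport` and `PendingMessage` are hypothetical types that stand in for a dumped copy of the queue, and the thresholds are assumed values. The sketch relies on one Android fact: a sync barrier is a queued message whose target Handler is null.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simple checks over a dumped copy of the pending message list. */
final class PendingQueueReport {
    /** Hypothetical snapshot of one queued message. */
    static final class PendingMessage {
        final String target; // handler name; null marks a sync barrier
        final int what;
        final long whenMs;   // scheduled delivery time

        PendingMessage(String target, int what, long whenMs) {
            this.target = target;
            this.what = what;
            this.whenMs = whenMs;
        }
    }

    /** Longest time any regular message has been overdue; 0 if none is. */
    static long maxBlockedMs(List<PendingMessage> queue, long nowMs) {
        long max = 0;
        for (PendingMessage m : queue) {
            if (m.target != null) {
                max = Math.max(max, nowMs - m.whenMs);
            }
        }
        return max;
    }

    /** True if a barrier has been overdue for longer than staleAfterMs. */
    static boolean hasStaleBarrier(List<PendingMessage> queue, long nowMs, long staleAfterMs) {
        for (PendingMessage m : queue) {
            if (m.target == null && nowMs - m.whenMs > staleAfterMs) {
                return true;
            }
        }
        return false;
    }

    /** Counts messages per "target#what" key, keeping keys seen minCount+ times. */
    static Map<String, Integer> duplicates(List<PendingMessage> queue, int minCount) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (PendingMessage m : queue) {
            if (m.target != null) {
                counts.merge(m.target + "#" + m.what, 1, Integer::sum);
            }
        }
        counts.values().removeIf(c -> c < minCount);
        return counts;
    }
}
```

A stale barrier is only a hint of a leak, since a barrier posted for a pending frame is normal; the threshold decides how long is too long.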

Main Thread Stack Sampling

Since the internal logic of each main‑thread task is a black box, stack sampling helps pinpoint the exact code causing delays. To keep the overhead low, sampling follows three principles:

Avoid frequent addition/removal of timeout tasks.

Only long‑running tasks trigger stack capture.

Minimize the number of stack captures.

Implementation details:

A dedicated sampling thread performs stack dumps.

The main‑thread task listener notifies the sampling thread of task start and end.

Stack capture is triggered only when a task exceeds a minimum timeout.

After a capture, the timeout is gradually increased until the task finishes.

When a later task starts after the timeout has been escalated, the sampling queue is cleared and the timeout is reset to its minimum.
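The timeout escalation can be sketched as a small state machine that a sampling thread polls. `EscalatingSampler` is a hypothetical name, and doubling the timeout after each capture is an assumption; the article only says the timeout is gradually increased. Keeping a single deadline field, instead of scheduling and cancelling a timer per task, is one way to avoid frequent addition and removal of timeout tasks.

```java
import java.util.ArrayList;
import java.util.List;

/** Decides when a sampling thread should capture the target thread's stack. */
final class EscalatingSampler {
    private final Thread target;
    private final long minTimeoutMs;
    private final List<StackTraceElement[]> samples = new ArrayList<>();
    private long timeoutMs;
    private long deadlineMs = Long.MAX_VALUE; // MAX_VALUE means no task is running

    EscalatingSampler(Thread target, long minTimeoutMs) {
        this.target = target;
        this.minTimeoutMs = minTimeoutMs;
    }

    /** Called by the task listener when a main-thread task starts. */
    synchronized void onTaskStart(long nowMs) {
        samples.clear();
        timeoutMs = minTimeoutMs; // reset any escalation left by the last task
        deadlineMs = nowMs + timeoutMs;
    }

    /** Called by the task listener when the task ends; returns its samples. */
    synchronized List<StackTraceElement[]> onTaskEnd() {
        deadlineMs = Long.MAX_VALUE;
        return new ArrayList<>(samples);
    }

    /** Called periodically by the sampling thread; true if a stack was captured. */
    synchronized boolean sampleIfDue(long nowMs) {
        if (nowMs < deadlineMs) {
            return false; // task is still short, or no task is running
        }
        samples.add(target.getStackTrace());
        timeoutMs *= 2; // assumed growth factor
        deadlineMs = nowMs + timeoutMs;
        return true;
    }
}
```

With a 100 ms minimum, a task is sampled at 100 ms, then 200 ms later, then 400 ms later, so even a task that runs for a minute produces only a handful of stacks.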

Case Sharing

A tester reported frequent ANRs during long‑duration stress testing, which blocked the test flow.

BugReport ANR Trace pointed to a sensor‑event handling delay:

"main" prio=5 tid=1 Runnable
  ...
  at xxx.handleSensorEvent(SourceFile:???)
  ...

CrashSDK ANR Trace indicated a hardware rendering issue:

"main" prio=10 tid=1 Native
  ...
  at android.view.ThreadedRenderer.nSyncAndDrawFrame(Native Method)
  ...

ANRCanary provided the decisive information:

{
  "cpuDuration":9,
  "messageStr":">>>>> Dispatching to Handler(android.view.Choreographer$FrameHandler){3b01fdc} android.view.Choreographer$FrameDisplayEventReceiver@bdac8e5: 0",
  "threadStackList":[
    {
      "stackTrace":[
        "android.view.ThreadedRenderer.nSyncAndDrawFrame(Native Method)",
        "android.view.ThreadedRenderer.draw(ThreadedRenderer.java:823)",
        "android.view.ViewRootImpl.draw(ViewRootImpl.java:3321)",
        ...
      ],
      "state":"RUNNABLE",
      "wallTime":65347
    }
  ],
  "type":"HUGE",
  "wallDuration":68497
}

ANRCanary showed that the hardware rendering task took about 68 seconds (wallDuration 68497 ms), while the sensor event task took only 12 ms.

Ultimately, developers traced the blockage to a lock wait in the device’s hardware rendering layer.
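One further reading of the ANRCanary record above, which is an observation rather than something the article states: the task has a cpuDuration of 9 against a wallDuration of 68497, so it spent almost all of its time waiting rather than computing, which is consistent with a lock wait. The sketch below assumes both fields are in milliseconds; `WaitHeuristic` and the cut-off value are hypothetical.

```java
/** Compares CPU time with wall time to tell waiting apart from computing. */
final class WaitHeuristic {
    /** Fraction of wall-clock time the task spent on the CPU, capped at 1. */
    static double cpuShare(long cpuMs, long wallMs) {
        if (wallMs <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) cpuMs / wallMs);
    }

    /** True when the task mostly waited; maxCpuShare is an assumed cut-off. */
    static boolean mostlyWaiting(long cpuMs, long wallMs, double maxCpuShare) {
        return wallMs > 0 && cpuShare(cpuMs, wallMs) < maxCpuShare;
    }
}
```

For the record above, `cpuShare(9, 68497)` is roughly 0.00013, far below any reasonable cut-off, whereas a CPU-bound task would have a share close to 1.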

Future

This article introduced ANRCanary’s rich monitoring data for ANR diagnosis. Because the logs are extensive, extracting the root cause can be challenging.

The next article will discuss DingTalk’s analysis algorithm that attributes ANRs and reports them to a monitoring platform, helping developers resolve ANRs faster and more accurately.

Reference

ProcessRecord.java: http://aospxref.com/android-10.0.0_r47/xref/frameworks/base/services/core/java/com/android/server/am/ProcessRecord.java#1424
