How to Detect and Analyze Android Thread Deadlocks with Automated Monitoring
This article describes the background of Android thread freeze issues caused by deadlocks, presents a client‑server monitoring architecture that captures thread and lock information via system traces, details automated analysis methods for deadlock detection and non‑deadlock causes, and shares the observed performance improvements and future plans.
Problem Background
After each release of the mobile app, developers receive numerous user‑reported issues such as "unread messages not disappearing", "images not showing", and "spinning indicator never stops". Many of these stem from thread deadlocks that render features unusable, forcing users to kill the process and restart. With over 250 modules and 4 million lines of code, preventing deadlocks through coding standards or static analysis tools proved insufficient.
Solution Details
Overall Scheme Overview
The monitoring system consists of a client side and a backend side.
The client side runs a WatchThread that observes a target thread. If a message taken from the target thread's Looper queue does not finish within a configurable timeout (default 3 minutes), the client records the locks the thread holds and the locks it is waiting for, and reports this data to the backend.
The backend side runs an automated analysis tool that processes the reported data and identifies the cause of the freeze, so that tickets can be filed for follow-up.
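The timeout check on the client side can be sketched as follows. This is a minimal plain-Java illustration, not the article's WatchThread: the class FreezeWatcher and its method names are hypothetical, and it assumes the target thread can signal when it starts and finishes handling a message (on Android, one possible source of those signals is the Printer passed to Looper.setMessageLogging).

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical bookkeeping for a watcher thread; uses no Android APIs. */
class FreezeWatcher {
    private final long timeoutNanos;
    // Start time of the message currently being handled; null means the target thread is idle.
    private volatile Long currentStartNanos = null;

    FreezeWatcher(long timeout, TimeUnit unit) {
        this.timeoutNanos = unit.toNanos(timeout);
    }

    /** Called by the target thread just before it handles a message. */
    void onMessageStart(long nowNanos) {
        currentStartNanos = nowNanos;
    }

    /** Called by the target thread once the message has been handled. */
    void onMessageEnd() {
        currentStartNanos = null;
    }

    /** Polled by the watcher thread; true if the in-flight message has exceeded the timeout. */
    boolean isFrozen(long nowNanos) {
        Long start = currentStartNanos;
        return start != null && nowNanos - start > timeoutNanos;
    }
}
```

A watcher thread would call isFrozen(System.nanoTime()) periodically and trigger the report when it returns true.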
Client Reporting
Freeze Information
The key information to report is the thread's details and the locks it holds or is waiting for. In Java, only blocking locks can cause a freeze; these are synchronized locks (monitors), LockSupport locks (threads parked via LockSupport.park), and Object locks (threads blocked in Object.wait).
For each lock type the required data are:
synchronized lock – holder thread and waiting thread
LockSupport lock – holder thread and waiting thread
Object lock – waiting thread only
Reporting Scheme 1: Capture Java Stack – Not Feasible
Extracting lock information from a Java stack trace did not work: the stack lists only call frames and contains no information about which locks a thread holds or is waiting for, so this approach is unusable.
Reporting Scheme 2: Capture System traces.txt – Feasible
When an ANR occurs, Android sends a SIGQUIT signal to the process, which triggers the generation of /data/anr/traces.txt. This file contains thread stacks and lock information, so the client can force the file to be generated when it detects a freeze and then report the freeze data from it.
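Pulling lock relationships out of such a trace might look like the sketch below. It assumes ART-style lines of the form `"name" prio=5 tid=12 Blocked`, `- locked <0x...> (a Type)`, and `- waiting to lock <0x...> (a Type) held by thread 15`; the exact layout varies across Android versions, and the class TraceLockParser is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the lock lines of a thread dump. */
class TraceLockParser {
    static final class ThreadLocks {
        final String name;
        final int tid;
        final List<String> held = new ArrayList<>(); // addresses of locks this thread owns
        String waitingFor;                           // address it is blocked on, or null
        Integer heldByTid;                           // owner tid if the trace names it, else null

        ThreadLocks(String name, int tid) {
            this.name = name;
            this.tid = tid;
        }
    }

    private static final Pattern HEADER = Pattern.compile("^\"([^\"]+)\".*\\btid=(\\d+)");
    private static final Pattern LOCKED = Pattern.compile("- locked <(0x[0-9a-fA-F]+)>");
    private static final Pattern WAITING = Pattern.compile(
            "- waiting (?:to lock|on) <(0x[0-9a-fA-F]+)>(?:.*held by thread (\\d+))?");

    static List<ThreadLocks> parse(String trace) {
        List<ThreadLocks> threads = new ArrayList<>();
        ThreadLocks current = null;
        for (String line : trace.split("\\R")) {
            Matcher header = HEADER.matcher(line);
            if (header.find()) {
                // A quoted name followed by tid=N starts a new thread section.
                current = new ThreadLocks(header.group(1), Integer.parseInt(header.group(2)));
                threads.add(current);
                continue;
            }
            if (current == null) continue;
            Matcher locked = LOCKED.matcher(line);
            Matcher waiting = WAITING.matcher(line);
            if (locked.find()) {
                current.held.add(locked.group(1));
            } else if (waiting.find()) {
                current.waitingFor = waiting.group(1);
                if (waiting.group(2) != null) {
                    current.heldByTid = Integer.valueOf(waiting.group(2));
                }
            }
        }
        return threads;
    }
}
```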
Reporting Difficulty: Traces Lack LockSupport Holder Info
Analysis showed that while synchronized and Object lock information appears in the trace, LockSupport lock holder threads are missing.
Solution: Actively Record LockSupport Thread Info
By adding instrumentation in the database‑related code, the system records when a thread acquires or releases a LockSupport lock, storing the thread ID and name. This extra information is appended to the end of the trace file before reporting.
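One way to keep such a record is a small registry that the instrumented code calls on acquire and release. The sketch below is an assumption about how this could be done, not the article's code: ParkLockRegistry, its method names, and the dump format are all hypothetical, and the lock is identified by a caller-chosen string.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry of current LockSupport-based lock holders. */
class ParkLockRegistry {
    // lock identifier -> "threadId:threadName" of the current holder
    private static final Map<String, String> HOLDERS = new ConcurrentHashMap<>();

    /** Call right after the current thread has acquired the lock. */
    static void onAcquired(String lockId) {
        Thread t = Thread.currentThread();
        HOLDERS.put(lockId, t.getId() + ":" + t.getName());
    }

    /** Call right before (or after) the current thread releases the lock. */
    static void onReleased(String lockId) {
        HOLDERS.remove(lockId);
    }

    /** Text block that could be appended to the end of the trace before reporting. */
    static String dump() {
        StringBuilder sb = new StringBuilder("----- LockSupport holders -----\n");
        HOLDERS.forEach((lock, holder) ->
                sb.append(lock).append(" held by ").append(holder).append('\n'));
        return sb.toString();
    }
}
```

Note that Thread.getId() returns the Java-level thread ID, which is not necessarily the tid printed in the trace, so matching by thread name may be the more dependable join key.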
Server‑Side Identification
Identification Scheme: Key‑Info Reporting + Automated Analysis
The backend receives three essential pieces of data: the full traces.txt, the manually recorded LockSupport info, and the identifier of the frozen thread. The analysis pipeline extracts lock ownership and waiting relationships, reconstructs lock graphs, and determines whether a deadlock exists.
The algorithm walks the lock graph, detects cycles (deadlocks), and otherwise classifies the freeze into categories such as network, file I/O, HashMap, IPC, GC, database, etc.
Deadlock Example
Two threads, MSF‑Receiver and QQ_DB, each hold one lock while waiting for the other, forming a lock‑list cycle that is identified as a deadlock.
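The cycle check can be sketched as a walk over the wait-for relation, starting from the frozen thread. This is a minimal illustration that assumes each thread waits for at most one lock and each lock has at most one holder; DeadlockFinder and the lock names in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical cycle finder over thread -> lock -> holder edges. */
class DeadlockFinder {
    /**
     * @param start    the frozen thread to start from
     * @param waitsFor thread -> lock it is blocked on
     * @param heldBy   lock -> thread that currently owns it
     * @return the threads forming a cycle reachable from start, or an empty list
     */
    static List<String> findCycle(String start,
                                  Map<String, String> waitsFor,
                                  Map<String, String> heldBy) {
        List<String> path = new ArrayList<>();
        String current = start;
        while (current != null) {
            int seenAt = path.indexOf(current);
            if (seenAt >= 0) {
                // Revisiting a thread closes a cycle; drop any lead-in before it.
                return new ArrayList<>(path.subList(seenAt, path.size()));
            }
            path.add(current);
            String lock = waitsFor.get(current);
            current = (lock == null) ? null : heldBy.get(lock);
        }
        return Collections.emptyList(); // chain ended at a thread that is not blocked
    }
}
```

With MSF-Receiver waiting for a lock held by QQ_DB and QQ_DB waiting for a lock held by MSF-Receiver, findCycle("MSF-Receiver", ...) returns both threads. An empty result means the freeze has to be classified as a non-deadlock cause.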
Identification Difficulty 1: Different Addresses for the Same LockSupport Lock
Although the same logical LockSupport lock is used, different threads show different object addresses in the dump, preventing straightforward matching.
Solution: Extract Common Feature and Treat as Same Lock
By recognizing a common stack string such as "SQLiteConnectionPool.waitForConnection", the analysis injects a synthetic lock with a unified identifier, allowing the two addresses to be considered the same lock.
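A sketch of this normalization step, under the assumption that each thread's stack is available as a list of frame strings: any thread whose stack contains the feature string has its lock address replaced by one shared identifier. LockIdNormalizer and the synthetic identifier are hypothetical names; only the feature string comes from the text above.

```java
import java.util.List;

/** Hypothetical mapping of per-thread lock addresses to one logical lock. */
class LockIdNormalizer {
    static final String POOL_FEATURE = "SQLiteConnectionPool.waitForConnection";
    static final String SYNTHETIC_ID = "synthetic:SQLiteConnectionPool"; // assumed identifier

    /** Returns the shared identifier if the stack shows the feature, else the raw address. */
    static String normalize(String rawAddress, List<String> stackFrames) {
        for (String frame : stackFrames) {
            if (frame.contains(POOL_FEATURE)) {
                return SYNTHETIC_ID;
            }
        }
        return rawAddress;
    }
}
```

After normalization, two threads that report different addresses for the connection pool lock resolve to the same node in the lock graph.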
Identification Difficulty 2: Non‑Deadlock Issues
Non‑deadlock freezes are categorized by matching stack‑trace keywords to problem types such as network, file I/O, HashMap, IPC, GC, database, ProcessManager, and PB. The keyword‑to‑category map is continuously refined.
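Keyword matching of this kind can be sketched as an ordered map scanned against the stack from the top frame down. The keywords below are illustrative assumptions, not the production map, and FreezeClassifier is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical keyword-based classifier for non-deadlock freezes. */
class FreezeClassifier {
    // Insertion order decides which keyword wins when several match one frame.
    private static final Map<String, String> KEYWORDS = new LinkedHashMap<>();
    static {
        KEYWORDS.put("java.net.", "network");
        KEYWORDS.put("java.io.File", "file I/O");
        KEYWORDS.put("java.util.HashMap", "HashMap");
        KEYWORDS.put("android.os.BinderProxy", "IPC");
        KEYWORDS.put("android.database.sqlite", "database");
    }

    /** Frames are expected top-of-stack first; the first frame with a known keyword wins. */
    static String classify(List<String> stackFrames) {
        for (String frame : stackFrames) {
            for (Map.Entry<String, String> entry : KEYWORDS.entrySet()) {
                if (frame.contains(entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        return "unknown";
    }
}
```

Stacks that come back as "unknown" are the natural input for refining the map.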
Freeze Monitoring and Automation Effect
Automated analysis on a sample day (Nov 7) produced an overview chart showing the distribution of freeze causes. Deadlocks accounted for 35.6% and have been fully resolved; other issues such as IO, HashMap, and network have also been addressed, while some categories (IPC, ProcessManager, PB, GC, etc.) remain pending.
Overall thread‑freeze rates have decreased across versions (e.g., MSF thread freeze from 0.3% to 0.1%).
Future Plans
Remaining work includes automating ticket creation after analysis and extending LockSupport instrumentation to cover all usages beyond the database, thereby improving deadlock detection coverage.
