Understanding and Optimizing iOS App Freeze and Lag Monitoring with Heimdallr
This article explains the definitions, causes, and monitoring strategies for iOS app freezes and lags, describes the Heimdallr solution based on Runloop event callbacks, discusses ANR and WatchDog handling, and presents practical optimizations and lessons learned for reliable performance monitoring.
In iOS apps, freezes ("卡死") and lags ("卡顿") are critical performance metrics that affect user experience, retention, and DAU. This article introduces the monitoring principles of Heimdallr for these issues and shares iterative optimizations derived from extensive production experience.
What are freezes and lags? A lag is a short‑term UI blockage where the screen does not update, typically lasting from a few hundred milliseconds to a few seconds. A freeze is a longer blockage (≥5 seconds) that may lead to the system killing the app, comparable to a crash. Both are categorized into three severity levels based on blockage duration.
Root causes: UIKit is not thread‑safe, so all UI work must run on the main thread. At 60 fps the main thread has roughly 16.7 ms to produce each frame, so any time‑consuming operation on the main thread blocks the Runloop and prevents the UI from refreshing. The Runloop reports its progress through six observable stages: RunloopEntry, RunloopBeforeTimers, RunloopBeforeSources, RunloopBeforeWaiting, RunloopAfterWaiting, and RunloopExit. A blockage in any of these stages stalls both UI updates and user interaction.
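To make the frame budget concrete, here is a minimal C++ sketch. The enum and function names (RunloopStage, stageName, frameBudgetMs, missedFrames) are illustrative assumptions and are not Heimdallr or CoreFoundation identifiers. It only shows how a blockage duration translates into missed frames at a given refresh rate.

```cpp
#include <cassert>
#include <string>

// Illustrative mirror of the six Runloop stages named in the text
// (hypothetical enum, not a Heimdallr or CoreFoundation type).
enum class RunloopStage {
    Entry, BeforeTimers, BeforeSources, BeforeWaiting, AfterWaiting, Exit
};

// Name used when logging which stage the main thread last reached.
inline std::string stageName(RunloopStage s) {
    switch (s) {
        case RunloopStage::Entry:         return "RunloopEntry";
        case RunloopStage::BeforeTimers:  return "RunloopBeforeTimers";
        case RunloopStage::BeforeSources: return "RunloopBeforeSources";
        case RunloopStage::BeforeWaiting: return "RunloopBeforeWaiting";
        case RunloopStage::AfterWaiting:  return "RunloopAfterWaiting";
        case RunloopStage::Exit:          return "RunloopExit";
    }
    return "Unknown";
}

// Time available to produce one frame at the given refresh rate.
inline double frameBudgetMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Whole frames missed while the main thread was blocked for blockedMs.
inline int missedFrames(double blockedMs, double fps) {
    assert(blockedMs >= 0.0 && fps > 0.0);
    return static_cast<int>(blockedMs * fps / 1000.0);
}
```

Under these assumptions, a 500 ms blockage at 60 fps gives missedFrames(500.0, 60.0) == 30, which is why even a sub‑second stall is visible to the user.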
Monitoring solution – To detect main‑thread blockage, Heimdallr registers callbacks for the Runloop events and uses a signal mechanism to forward each Runloop state change to a dedicated monitoring thread. If the monitoring thread waits longer than a configurable threshold for the next signal, it records the duration, captures stack traces, and reports the anomaly. This approach shows which Runloop phase is delayed and for how long.
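The decision the monitoring thread makes can be sketched in portable C++. This is a simplified model and not Heimdallr's implementation: it uses explicit timestamps instead of a real thread and signal, the class and member names (RunloopMonitor, onPhase, check) are hypothetical, and it assumes that time spent after RunloopBeforeWaiting is idle sleep and not a blockage.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Result of one check by the (modelled) monitoring thread.
struct BlockCheck {
    bool blocked;              // true if the threshold was exceeded
    std::string stalledPhase;  // last phase the main thread signalled
    std::int64_t blockedMs;    // time since that signal, 0 if not blocked
};

// Single-threaded model; a real monitor would share this state between
// threads and would need atomics or a lock.
class RunloopMonitor {
public:
    explicit RunloopMonitor(std::int64_t thresholdMs) : thresholdMs_(thresholdMs) {
        assert(thresholdMs > 0);
    }

    // Called from the Runloop callback: remember the phase and when it started.
    void onPhase(const std::string& phase, std::int64_t nowMs) {
        lastPhase_ = phase;
        lastSignalMs_ = nowMs;
    }

    // Called by the monitoring side: has the main thread been silent too long?
    BlockCheck check(std::int64_t nowMs) const {
        // Assumption: no signal yet, or the loop is about to sleep, means idle.
        const bool idle = lastPhase_.empty() || lastPhase_ == "RunloopBeforeWaiting";
        const std::int64_t elapsed = nowMs - lastSignalMs_;
        if (idle || elapsed <= thresholdMs_) {
            return BlockCheck{false, lastPhase_, 0};
        }
        return BlockCheck{true, lastPhase_, elapsed};
    }

private:
    std::int64_t thresholdMs_;
    std::string lastPhase_;
    std::int64_t lastSignalMs_ = 0;
};
```

With a 300 ms threshold, a main thread that signalled RunloopBeforeSources at t = 1000 ms is reported as blocked when checked at t = 1400 ms, and the report names the phase in which it stalled.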
ANR (lag) handling – When the main thread is blocked beyond a predefined lag threshold T, Heimdallr dumps the stacks of all threads, measures the lag duration, and reports the information. If the main thread never recovers, the event is escalated to the freeze (WatchDog) module.
WatchDog (freeze) handling – A freeze is a longer blockage from which the main thread does not recover. When the blockage exceeds the freeze limit (default ~8 s), Heimdallr captures the stacks of all threads, saves them locally, and keeps sampling periodically (default every 1 s) to update the estimated freeze duration. On the next app launch, the saved data is used to reconstruct the freeze timeline and report the event.
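The bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions and not Heimdallr's code: FreezeRecord and FreezeTracker are hypothetical names, the "saved locally" record is kept in memory here instead of on disk, and discarding the record on recovery is an assumption the text does not spell out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the locally saved freeze data. A real
// implementation would write this to disk together with the captured stacks.
struct FreezeRecord {
    bool active = false;           // true while a freeze is in progress
    std::int64_t estimatedMs = 0;  // best known freeze duration so far
};

class FreezeTracker {
public:
    FreezeTracker(std::int64_t limitMs, std::int64_t sampleIntervalMs)
        : limitMs_(limitMs), sampleIntervalMs_(sampleIntervalMs) {
        assert(limitMs > 0 && sampleIntervalMs > 0);
    }

    // The blockage has just crossed the freeze limit: save the first record.
    void onLimitExceeded() {
        record_.active = true;
        record_.estimatedMs = limitMs_;
    }

    // Each periodic sample while still blocked extends the estimate.
    void onSample() {
        if (record_.active) record_.estimatedMs += sampleIntervalMs_;
    }

    // Assumption: if the main thread recovers, the event was a lag, so the
    // pending freeze record is discarded.
    void onRecovered() { record_ = FreezeRecord{}; }

    const FreezeRecord& saved() const { return record_; }

private:
    std::int64_t limitMs_;
    std::int64_t sampleIntervalMs_;
    FreezeRecord record_;
};

// On the next launch, a record that is still active means the process ended
// while frozen, so it should be reported.
inline bool shouldReportFreeze(const FreezeRecord& r) { return r.active; }
```

With the defaults from the text (8 s limit, 1 s sampling), three samples after the limit give an estimated freeze duration of 11 s. Because the record is updated on every sample, the estimate stays accurate to within one sampling interval even if the system kills the app.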
Optimization challenges – Real‑world deployment revealed false positives in lag detection, excessive overhead from dumping all threads, and misreports caused by background task suspension. To improve accuracy, a sampling strategy was introduced: instead of relying on a single timeout, the lag period is divided into smaller intervals, and the main‑thread stack is sampled at each interval. When the same stack recurs more often than a sampling threshold allows, a full dump is triggered; occasional spikes are ignored.
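A minimal sketch of this sampling rule is shown below. It is a simplification and not Heimdallr's code: each sampled main‑thread stack is represented by a string key, the function name shouldCaptureFullDump is hypothetical, and "repeated" is assumed to mean the same stack in consecutive samples.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns true once the same main-thread stack has been seen in at least
// repeatThreshold consecutive samples. Each sample is a key (for example a
// hash or a joined symbol list) identifying one sampled stack.
inline bool shouldCaptureFullDump(const std::vector<std::string>& samples,
                                  std::size_t repeatThreshold) {
    assert(repeatThreshold > 0);
    std::size_t run = 0;  // length of the current run of identical stacks
    for (std::size_t i = 0; i < samples.size(); ++i) {
        run = (i > 0 && samples[i] == samples[i - 1]) ? run + 1 : 1;
        if (run >= repeatThreshold) return true;
    }
    return false;
}
```

With a threshold of 3, the samples {"A", "B", "B", "B"} trigger a full dump, while {"A", "B", "A", "C"} do not: the main thread is busy but keeps making progress, which is the kind of occasional spike the strategy is meant to ignore.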
Additional refinements include: reducing Runloop callback overhead by disabling RunloopBeforeTimers monitoring, filtering out background freezes using sample_flag, handling deadlocks induced by Objective‑C runtime locks by re‑implementing critical paths in C/C++, and providing a timeline view of stack changes (supported from Heimdallr 0.7.15).
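One way to picture a timeline of stack changes is to collapse consecutive identical samples into segments. The sketch below is only an illustration: the source does not describe Heimdallr's actual timeline format, so TimelineSegment and buildTimeline are hypothetical names, and samples are assumed to be taken at a fixed interval.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One stretch of time during which the sampled main-thread stack did not change.
struct TimelineSegment {
    std::string stack;        // key identifying the sampled stack
    std::int64_t startMs;     // offset of the first sample in the segment
    std::int64_t durationMs;  // number of samples times the sampling interval
};

// Collapses consecutive identical samples into timeline segments.
inline std::vector<TimelineSegment> buildTimeline(
        const std::vector<std::string>& samples, std::int64_t intervalMs) {
    assert(intervalMs > 0);
    std::vector<TimelineSegment> timeline;
    for (std::size_t i = 0; i < samples.size(); ++i) {
        if (!timeline.empty() && timeline.back().stack == samples[i]) {
            timeline.back().durationMs += intervalMs;  // same stack: extend
        } else {
            timeline.push_back(TimelineSegment{
                samples[i], static_cast<std::int64_t>(i) * intervalMs, intervalMs});
        }
    }
    return timeline;
}
```

For samples {"A", "A", "B", "A"} taken every 1000 ms, this yields three segments: A for 2000 ms, B for 1000 ms starting at 2000 ms, then A again for 1000 ms starting at 3000 ms. A view like this shows whether the main thread was stuck in one place or moving through different slow calls.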
Conclusion – After iterative improvements, the ANR and WatchDog modules of Heimdallr have become comprehensive, stable, and reliable. The solution incorporates ideas from open‑source APM frameworks and continuous feedback from users. Future work will focus on further reducing false positives and adding non‑intrusive freeze‑prevention features.
Finally, the article includes a promotional note about Volcano Engine APMPlus, offering a free 60‑day performance‑monitoring package for small‑to‑medium businesses, with up to 60 million events per month.
ByteDance Terminal Technology
Official account of ByteDance Terminal Technology, sharing technical insights and team updates.