Systematic iOS Stability Management: From Crash Classification to Advanced Attribution
This article presents a comprehensive framework for identifying, classifying, and resolving iOS stability issues—covering crash types, governance methodology, deep-dive attribution techniques, real-world case studies, and practical tools such as Zombie monitoring, Coredump, MemoryGraph, and MetricKit—to dramatically improve app reliability.
1. Stability Issue Classification
For mobile apps, crashes are the most severe class of bug because they prevent users from continuing to use the product, directly hurting retention and commercial value. Survey data indicates that 20% of users rate crashes as their most intolerable issue, a share second only to intrusive ads, and that one‑third of users who leave over experience problems switch to competitors.
ByteDance, with massive apps like Douyin and Toutiao, has heavily invested in stability. Over the past two years, crash rates for Douyin, Toutiao, and Feishu have been reduced by over 30%, with some metrics improving by more than 80%.
Based on iOS crash data, stability problems fall into five categories, ordered by share of occurrences: OOM (out‑of‑memory, over 50% of cases), Watchdog (app freeze), ordinary crashes, disk I/O exceptions, and CPU exceptions.
2. Stability Issue Governance Methodology
The governance approach must cover the entire lifecycle from monitoring to remediation. From the monitoring platform side, the system should be able to detect all types of stability problems promptly and accurately.
From the developer side, stability governance should be integrated into every stage of software development—requirements, testing, integration, gray release, and production—so developers consistently prioritize detection and resolution.
Two key governance principles are:
Control new issues, remediate the backlog – New problems tend to be easy to trigger and severe in impact; existing (legacy) problems are often more complex and take longer to fix.
Urgent before non‑urgent, easy before hard – Prioritize fixing quickly triggered and easily solved issues first.
The governance workflow includes:
Problem discovery – The monitoring platform must capture any crash, OOM, Watchdog, etc., and notify developers instantly.
Attribution – Developers investigate the root cause, which can be single‑point, common, or burst issues.
Problem remediation – For online issues, quick protection (e.g., runtime crash‑auto‑fix, service rollback) can be applied; otherwise, developers fix the native code and release a new version.
Degradation prevention – Automated unit/UI tests, Xcode Instruments, and third‑party tools (e.g., MLeaksFinder) help catch stability problems before release.
The most critical stage, according to ByteDance’s experience, is online attribution, because many unresolved issues stem from developers failing to locate the root cause.
3. Difficult Problem Attribution
Four major difficult problem categories are discussed: Crash, Watchdog, OOM, and CPU/Disk I/O exceptions. Each section provides background, challenges, and concrete attribution tools.
3.1 Crash
Crashes are split into four sub‑categories: Mach exceptions, Unix Signal exceptions, Objective‑C exceptions, and C++ exceptions. Mach exceptions dominate (>80% of long‑standing crashes), mainly due to illegal address accesses (EXC_BAD_ACCESS).
Attribution challenges include:
Pure system call stacks that provide little context.
Intermittent crashes that are hard to reproduce.
Memory‑corruption scenarios where the crash appears in a different module than the actual cause.
Two effective attribution tools are presented:
Zombie detection – Enables Zombie Objects instrumentation to pinpoint the exact deallocated object that caused a crash, improving reproducibility for Objective‑C memory bugs.
Coredump – Uses lldb‑generated core files to capture the full memory state at crash time, allowing offline debugging of Mach and Signal crashes without reproducing the issue.
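The mechanism behind zombie detection can be sketched in a platform‑neutral way: instead of actually freeing an object, replace it with a proxy that remembers the original class and fails loudly on any further use. A minimal Python analogue (the `ZombieProxy` class, `zombify` helper, and `ImageCache` example are illustrative names, not part of any Apple or ByteDance API):

```python
class ZombieProxy:
    """Stands in for a deallocated object; any attribute access fails
    with the original class name, mimicking NSZombie's behaviour."""
    def __init__(self, original_cls_name):
        object.__setattr__(self, "_cls_name", original_cls_name)

    def __getattr__(self, name):
        raise RuntimeError(
            f"message '{name}' sent to deallocated instance of "
            f"{object.__getattribute__(self, '_cls_name')}")

def zombify(obj):
    """'Deallocate' obj but leave a zombie in its place (illustrative)."""
    return ZombieProxy(type(obj).__name__)

class ImageCache:
    def flush(self):
        return "flushed"

cache = ImageCache()
cache = zombify(cache)          # simulate an over-release
try:
    cache.flush()               # use-after-free is now caught loudly
except RuntimeError as e:
    print(e)                    # names the dead class and the message sent
```

The real mechanism swaps an object's class pointer to a zombie class at dealloc time, so the crash report names exactly which freed object received which message, instead of an opaque EXC_BAD_ACCESS.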
Real‑world cases from Feishu illustrate how Zombie detection revealed a stray retain on MainTabbarController, and how Coredump helped locate a GCD queue over‑release that caused a system‑assert crash.
3.2 Watchdog (App Freeze)
Watchdog issues often occur during cold start, causing users to wait 10 seconds with no response. Their volume can be 2–3 times that of ordinary crashes, and they can lead to false OOM detections.
Attribution difficulties stem from:
Traditional freeze detection (main thread unresponsive for more than 3–5 s) produces many false positives.
Multiple root causes: deadlocks, lock contention, main‑thread I/O, cross‑process communication.
Two solutions are offered:
Thread‑state monitoring – Captures multiple main‑thread stack traces over time, recording CPU usage, run state, and thread flags, so that deadlocks (CPU near 0, thread waiting) can be distinguished from CPU‑intensive busy loops.
Deadlock thread analysis – Identifies waiting threads, extracts the waiting method, and builds a lock‑wait graph to automatically detect circular waits.
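The deadlock analysis described above reduces to cycle detection in a directed wait‑for graph: each blocked thread points at the thread currently holding the lock it waits on. A compact sketch under that assumption (the thread IDs and the dict‑based graph format are invented for illustration):

```python
def find_deadlock(waits_for):
    """waits_for: dict mapping a thread to the thread it is blocked on.
    Returns one cycle of mutually waiting threads, or None if no cycle."""
    for start in waits_for:
        seen, t = [], start
        while t in waits_for:
            if t in seen:                     # revisited: a circular wait
                return seen[seen.index(t):]
            seen.append(t)
            t = waits_for[t]                  # follow the wait edge
    return None

# main thread waits on a mutex held by thread t3, which in turn waits
# on work that only the main thread can drain -> circular wait
graph = {"main": "t3", "t3": "main", "t7": "main"}
print(find_deadlock(graph))   # -> ['main', 't3']
```

A production implementation extracts the owner of each lock from the captured thread states (e.g., the mutex owner's thread ID) to build this graph automatically; `t7` above shows a thread that merely waits without being part of the cycle.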
Case studies show how deadlock detection pinpointed a lock‑competition between a main‑thread mutex and a GCD lock, and how thread‑state analysis highlighted a high‑CPU animation that needed pausing.
3.3 OOM (Out‑of‑Memory)
OOM crashes occur when an app’s memory usage exceeds system limits, leading to forced termination. They disproportionately affect heavy users and can be 3–5 times more frequent than ordinary crashes.
Attribution challenges include the lack of a clear crash stack and difficulty reproducing OOM scenarios offline.
The primary online attribution tool is MemoryGraph, which periodically dumps the app’s memory graph when usage exceeds a threshold, records symbolized objects, and captures strong/weak reference relationships.
A Feishu case revealed that concurrent downloading and decoding of 47 images caused >500 MB of ImageIO objects, traced back to an unbounded NSOperationQueue. Limiting concurrency resolved the issue and reduced crash rate by 8%.
3.4 CPU and Disk I/O Exceptions
These resource anomalies do not cause immediate crashes, but they lead to performance degradation and overheating and can evolve into crashes if left unchecked.
Attribution is hard because the problems persist over long periods and are hard to reproduce.
Apple’s MetricKit (iOS 14+) provides low‑overhead diagnostics for CPU and disk I/O anomalies. The collected call‑stack data can be visualized as flame graphs, where the widest frame marks the hottest call path. A Feishu mini‑program animation that kept running while hidden was identified as the culprit and paused to fix the issue.
4. Summary
Effective stability governance must permeate every stage of the software lifecycle—discovery, attribution, remediation, and degradation prevention. Online attribution is the most crucial step. For each difficult problem type, ByteDance offers specialized tools: Zombie detection and Coredump for crashes, thread‑state and deadlock analysis for Watchdog, MemoryGraph for OOM, and MetricKit for CPU/disk I/O. These solutions have been validated across ByteDance’s product portfolio and are available through the Volcano Engine MARS‑APMPlus platform.
ByteDance SE Lab
Official account of ByteDance SE Lab, sharing research and practical experience in software engineering. Our lab unites researchers and engineers from various domains to accelerate the fusion of software engineering and AI, driving technological progress in every phase of software development.
