
Systematic iOS Stability Issue Management: Classification, Methodology, and Root‑Cause Attribution

This article presents a comprehensive guide on systematically managing iOS stability problems, covering issue classification, a governance methodology, detailed root‑cause analysis for crashes, watchdogs, OOM, CPU and disk I/O anomalies, and practical tools and case studies from ByteDance’s APM platform.

ByteDance Terminal Technology

Introduction – The speaker, Feng Yadong, a senior engineer at ByteDance, shares his experience building the company's APM platform and overseeing iOS performance and stability monitoring.

1. Stability Issue Classification

Crashes, in which the app terminates abruptly (sometimes translated literally as "flash-outs"), are among the most severe mobile bugs because they interrupt the user outright and hurt retention. According to statistics cited in the talk, 20% of users rank crashes as the issue they tolerate least, second only to intrusive ads. ByteDance groups iOS stability problems into five major types by root cause: OOM (the app is killed for excessive memory use), Watchdog (the app is killed after the main thread hangs), generic Crash, disk I/O exceptions, and CPU exceptions.

2. Governance Methodology

The methodology calls for full-coverage monitoring on the platform side and for building stability work into every stage of the software development lifecycle (requirements, testing, integration, staged "gray" release, production). Two core principles are highlighted: "Control new issues, govern existing ones" and "Address urgent and easy problems first, then the harder ones." The workflow runs through four steps: problem discovery, attribution, remediation, and prevention of regressions.
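The two principles can be illustrated with a small sketch. It assumes each issue carries a stable fingerprint (for example a hash of its top application frames), which the talk does not specify; the `Issue` type, its fields, and both function names are hypothetical and only show one way the ordering could be expressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    fingerprint: str      # hypothetical stable ID, e.g. a hash of the top app frames
    affected_users: int   # users hit by this issue in the release being examined
    fix_effort_days: int  # rough engineering estimate, used only for ordering


def split_new_and_existing(baseline, candidate):
    """Separate issues first seen in the candidate release from carried-over ones."""
    known = {issue.fingerprint for issue in baseline}
    new = [i for i in candidate if i.fingerprint not in known]
    existing = [i for i in candidate if i.fingerprint in known]
    return new, existing


def triage_order(issues):
    """Urgent and easy first: most affected users, then least estimated effort."""
    return sorted(issues, key=lambda i: (-i.affected_users, i.fix_effort_days))


baseline = [Issue("a1", 500, 3), Issue("b2", 40, 1)]
candidate = [Issue("a1", 480, 3), Issue("c3", 900, 2), Issue("d4", 900, 1)]

new, existing = split_new_and_existing(baseline, candidate)
print([i.fingerprint for i in triage_order(new)])   # ['d4', 'c3']
print([i.fingerprint for i in existing])            # ['a1']
```

In this sketch, issues that appear only in the candidate release would block it ("control new issues"), while carried-over issues go into a backlog that is worked through in triage order ("govern existing ones").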

3. Difficult Issue Attribution

3.1 Crash

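Crashes fall into three families: Mach exceptions, Unix signals, and Objective-C/C++ language exceptions, with Mach exceptions accounting for the majority. Attribution is hard when the stack trace contains only system frames, when the crash is intermittent, or when memory is corrupted by one module and the crash surfaces in another. ByteDance uses two tools for these cases: Zombie detection, which applies the idea behind Xcode's Zombie Objects diagnostic to catch messages sent to deallocated objects, and Coredump analysis, in which a memory image captured at crash time is inspected offline with LLDB.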
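The three crash families overlap in practice, because the kernel usually delivers a Mach exception to the process as a Unix signal. The sketch below shows that relationship using the default translation commonly documented for XNU; architecture-specific kernel paths may differ. The report schema (a dict with `language_exception`, `mach_exception`, and `signal` keys) is a hypothetical illustration, not the format used by ByteDance's platform.

```python
# Default Mach-exception-to-signal translation as commonly documented for XNU.
# Architecture-specific paths in the kernel may override these defaults.
MACH_TO_SIGNAL = {
    "EXC_BAD_INSTRUCTION": "SIGILL",
    "EXC_ARITHMETIC": "SIGFPE",
    "EXC_BREAKPOINT": "SIGTRAP",
}


def signal_for(exception, code=None):
    """Return the signal a Mach exception is usually delivered as, or None if unknown."""
    if exception == "EXC_BAD_ACCESS":
        # An unmapped address is reported as SIGSEGV; other access faults as SIGBUS.
        return "SIGSEGV" if code == "KERN_INVALID_ADDRESS" else "SIGBUS"
    return MACH_TO_SIGNAL.get(exception)


def crash_category(report):
    """Bucket a crash report (hypothetical dict schema) into the three families."""
    if report.get("language_exception"):   # e.g. an uncaught NSException or C++ exception
        return "objc_or_cpp_exception"
    if report.get("mach_exception"):
        return "mach_exception"
    return "unix_signal"


print(signal_for("EXC_BAD_ACCESS", "KERN_INVALID_ADDRESS"))   # SIGSEGV
print(crash_category({"language_exception": "NSRangeException", "signal": "SIGABRT"}))
```

The language-exception check comes first because an uncaught exception typically ends in `abort()`, so such a report also carries a signal and would otherwise be misfiled.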

3.2 Watchdog (Hang)

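Watchdog terminations, in which the system kills an app whose main thread has stopped responding, often occur during cold start and are reported as 2-3× more frequent than ordinary crashes. Attribution is difficult because the delay may be spread over several stages, so the stack captured at termination need not be the culprit, and because lock contention and deadlocks involve more than one thread. Solutions include collecting thread-state snapshots at multiple points in time, automatic deadlock detection, and visualisation of lock-waiting graphs.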
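Automatic deadlock detection over a lock-waiting graph amounts to finding a cycle among "thread waits for lock held by thread" edges. The following is a minimal sketch of that idea, assuming a snapshot has already been reduced to two maps; the map layout and the `find_deadlock` name are hypothetical, and the talk does not describe how ByteDance's tool extracts this information.

```python
def find_deadlock(holder_of, waiting_for):
    """Return the threads forming a wait cycle, or [] if there is none.

    holder_of:   lock name -> thread currently holding it
    waiting_for: thread -> lock it is blocked on (at most one per thread)
    """
    for start in waiting_for:
        path, seen = [], set()
        thread = start
        while thread in waiting_for and thread not in seen:
            seen.add(thread)
            path.append(thread)
            lock = waiting_for[thread]
            thread = holder_of.get(lock)   # follow the edge: waiter -> lock holder
        if thread in seen:
            # We came back to a thread already on the path: everything from
            # its first occurrence onward is the cycle.
            return path[path.index(thread):]
    return []


# The main thread holds lock A and waits for B; a worker holds B and waits for A.
holder_of = {"A": "main", "B": "worker"}
waiting_for = {"main": "B", "worker": "A"}
print(find_deadlock(holder_of, waiting_for))   # ['main', 'worker']
```

A cycle that includes the main thread is a strong candidate for the cause of a watchdog kill, whereas a long wait chain without a cycle points to contention instead.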

3.3 OOM (Out‑of‑Memory)

OOM terminations are especially harmful to heavy users and can be 3-5× more common than ordinary crashes. Attribution is hard because an OOM kill leaves no stack trace pointing at the code responsible. ByteDance employs an online MemoryGraph tool that periodically dumps memory, records the reference relationships between objects, and helps pinpoint leaks, excessive allocations, or misuse of resources.
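Once a memory dump has been turned into a graph of objects and references, two simple queries already answer much of the attribution question: which classes hold the most bytes, and which chain of references keeps a suspicious object alive. The sketch below assumes a toy representation (object IDs mapped to class name and size, plus an adjacency list); the data layout and function names are hypothetical and are not the MemoryGraph file format.

```python
from collections import Counter, deque


def bytes_by_class(nodes):
    """nodes: object id -> (class name, size in bytes). Returns total bytes per class."""
    totals = Counter()
    for class_name, size in nodes.values():
        totals[class_name] += size
    return totals


def reference_path(edges, root, target):
    """Shortest chain of references from root to target via BFS, or None if unreachable."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        obj = queue.popleft()
        if obj == target:
            path = []
            while obj is not None:
                path.append(obj)
                obj = parent[obj]
            return path[::-1]
        for child in edges.get(obj, ()):
            if child not in parent:
                parent[child] = obj
                queue.append(child)
    return None


nodes = {1: ("AppDelegate", 64), 2: ("ImageCache", 128),
         3: ("UIImage", 4_000_000), 4: ("UIImage", 6_000_000)}
edges = {1: [2], 2: [3, 4]}

print(bytes_by_class(nodes).most_common(1))      # [('UIImage', 10000000)]
print(reference_path(edges, root=1, target=4))   # [1, 2, 4]
```

Here the per-class totals show that images dominate memory, and the reference path shows they are retained through the cache, which is the kind of evidence a missing stack trace cannot provide.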

3.4 CPU & Disk I/O Anomalies

These resource anomalies may not cause an immediate crash, but they lead to severe performance degradation. Attribution requires sampling call stacks over a long period, which is costly to do in-app. Apple's MetricKit offers low-overhead diagnostics (diagnostic payloads are available from iOS 14); ByteDance integrates MetricKit to collect CPU and disk-I/O data, visualise it as flame graphs, and identify the offending code paths.
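Turning sampled call stacks into a flame graph starts with folding identical stacks and counting them. The sketch below shows that step on hand-written samples, assuming each sample is a list of function names with the outermost frame first; it does not parse MetricKit payloads, and the function names in the samples are invented for illustration. The semicolon-joined output matches the folded-stack text format that common flame graph tools accept.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled call stacks (outermost frame first) into folded-stack counts."""
    folded = Counter()
    for frames in samples:
        folded[";".join(frames)] += 1
    return folded


def inclusive_counts(samples):
    """Count, per function, the samples in which it appears anywhere on the stack."""
    counts = Counter()
    for frames in samples:
        for name in set(frames):   # set(): count a recursive function once per sample
            counts[name] += 1
    return counts


samples = [
    ["main", "render", "decodeImage"],
    ["main", "render", "decodeImage"],
    ["main", "render", "layout"],
    ["main", "syncToDisk"],
]

for line, count in sorted(fold_stacks(samples).items()):
    print(line, count)
# main;render;decodeImage 2
# main;render;layout 1
# main;syncToDisk 1
```

The inclusive counts give the width of each function's bar in the flame graph; in this toy data `render` appears in three of four samples, so it would be the first code path to inspect.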

4. Summary

The speaker reiterates that effective stability management must permeate every stage of the development lifecycle, and that online attribution is the most critical part, especially for hard-to-reproduce issues. The tools presented (Zombie detection, Coredump, thread-state analysis, deadlock detection, MemoryGraph, and the MetricKit integration) are, apart from Apple's MetricKit itself, largely ByteDance's own and will be offered through the Volcano Engine MARS-APMPlus platform.

ByteDance also announces a 30‑day free trial of the MARS‑APMPlus service, covering App, Web, Server, and Mini‑Program monitoring.
