Crash Governance and Stability Practices for Mobile Applications
This article describes a comprehensive crash governance framework for a fast‑growing mobile app, covering monitoring, root‑cause attribution, systematic remediation steps, and detailed case studies—including MIUI system bugs, 32‑bit address‑space limits, memory corruption, and WebView crashes—while outlining future challenges and automation strategies.
Background: The rapid development of the Soul app led to a soaring crash rate, causing negative user feedback, churn, and financial loss. By establishing a dedicated crash‑governance team and improving infrastructure, the crash rate was reduced from roughly 5 per thousand to 8 per ten thousand, reaching industry‑leading levels.
Basic Concepts: The governance model rests on three pillars: foundation, lifecycle coverage, and governance principles. The foundation includes development standards, an on‑call rotation, dynamic gray releases, and a high‑sensitivity alert system. Lifecycle coverage embeds crash‑handling capabilities at every stage of the software lifecycle. The principles stress controlling new crashes while fixing existing ones, and tackling easy, high‑impact issues before harder ones.
Governance Measures: Crash detection relies on third‑party Bugly and an in‑house SCrash monitor, daily automated Monkey testing of stable builds, and real‑time alert synchronization to on‑call groups. A gray‑release mechanism allows staged roll‑outs with continuous monitoring. Attribution uses Bugly’s version distribution, trend analysis, and detailed logs (e.g., thread, device, OS) to pinpoint root causes.
Problem‑Solving Steps: 1) Discover the issue via monitoring or automated tests. 2) Collect data (version, thread, device, logs). 3) Form hypotheses based on patterns. 4) Prioritize hypotheses by likelihood and verify them against online data. 5) Apply fixes (hotfix, configuration change, code patch) and monitor impact.
Case Study – MIUI System Bug: On certain MIUI devices, eglSetDamageRegionKHR returning false caused crashes in libhwui.so. The team hooked the function via PLT to force a true return, eliminating the crash. The relevant code snippet is shown below:
// Locate the PLT relocation table (.rel(a).plt) inside the loaded library.
Elf_Rela *rel_table = (Elf_Rela *) (jmpRelOff + base_addr);
// 4. Walk the relocation entries that back the GOT.
for (size_t i = 0; i < ptlRelSz; i++) {
    size_t ndx = ELF_R_SYM(rel_table[i].r_info);
    Elf_Sym *symTable = (Elf_Sym *) (base_addr + symTabOff + ndx * sizeof(Elf_Sym));
    char *funcName = (char *) (symTable->st_name + base_addr + strTabOff);
    if (strcmp(hookFuncName, funcName) == 0) {
        // Make the page holding the GOT entry writable before patching it.
        size_t page_size = getpagesize();
        uintptr_t addr = (uintptr_t) (base_addr + rel_table[i].r_offset);
        uintptr_t mem_page_start = (uintptr_t) PAGE_START(addr);
        int result = mprotect((void *) mem_page_start, page_size, PROT_READ | PROT_WRITE);
        if (result == -1) { return ERROR; }
        // Save the original function pointer, then redirect the GOT slot
        // to the replacement that always reports success.
        old_eglSetDamageRegionKHR = *(void **) addr;
        *((uintptr_t *) addr) = (uintptr_t) new_eglSetDamageRegionKHR;
        __builtin___clear_cache((void *) PAGE_START(addr), (void *) PAGE_END(addr));
        return CORRECT;
    }
}

Case Study – 32‑bit Address‑Space Exhaustion: Crashes were traced to 32‑bit processes exhausting their virtual address space, typically surfacing as OOM errors. The remediation combined memory optimizations with a push toward 64‑bit builds, coupled with enhanced leak detection.
Case Study – Memory Corruption: Increasing use of native .so libraries introduced hard‑to‑detect memory‑corruption crashes. The team adopted the open‑source Memguard tool for runtime detection and plans to integrate AddressSanitizer (ASAN) during offline development.
Case Study – WebView Crashes: When WebViewClient.onRenderProcessGone returned false, unhandled render‑process failures caused app crashes. By Java‑hooking org.chromium.android_webview.AwContents#onRenderProcessGone and adding instrumentation, the offending ad‑SDK code was identified and fixed.
Challenges & Outlook: Stability governance is an ongoing effort requiring multi‑dimensional monitoring, refined alert thresholds, and disaster‑recovery mechanisms such as hot‑fixes, downgrade signaling, dynamic configuration, and automated gray‑release control. Future work includes deeper memory monitoring, richer alert attribution, and self‑healing capabilities.
Soul Technical Team
Technical practice sharing from Soul