How We Reduced Android OOM Crashes by 99%: Mobile Memory Optimization Secrets
Over the past six months we tackled severe Android memory issues (high OOM crash rates, memory leaks, and large-object usage) through systematic profiling, targeted page optimizations, Java and native leak detection tools, and robust monitoring, ultimately reducing the OOM crash rate from 0.8‱ to 0.01‱ and improving app stability.
1. Introduction
Memory management is central to Android app performance. Improper memory usage leads to OOM crashes, low app survival rates, and UI jank. Optimizing memory can significantly improve responsiveness, stability, and user experience. We have spent considerable effort addressing memory issues in the driver-side app, and this article shares our findings.
2. Project Status and Results
2.1 Project Status
High OOM‑related crash rate
The highest online OOM crash rate reached 0.8‱, accounting for 20% of total crashes.
High‑frequency OOM pages are core pages and core business flows
The home page and the vehicle-sticker capture page are the top two OOM pages. Frequent crashes on the home page severely affect user experience, while crashes on the sticker page block drivers from accepting orders.
Lack of defensive mechanisms and memory‑issue checkpoints
Memory leaks relied solely on developers' code quality; no offline monitoring existed, and leaks worsened over multiple releases.
2.2 Governance Results
After prolonged memory governance, the OOM‑induced crash rate dropped from a peak of 0.8‱ to 0.01‱; memory‑hit rate fell from 0.64% to 0.01%; core pages and flows now have zero OOM crashes. Effective offline defensive mechanisms intercepted many leak issues before they reached production.
3. Governance Strategy
Before performance and technical optimization, a clear strategy is essential. Based on the current problems, we defined the following optimization strategies:
3.1 Governance Phases
High-frequency OOM page governance: Prioritize pages with the highest user impact for maximum ROI.
Java memory leak governance: Address Java-level leaks and large object allocations that increase OOM probability over time.
Native memory leak governance: Handle native leaks last due to higher cost and lower ROI.
3.2 Defensive Phase
After the above phases, long‑term defensive mechanisms are required to prevent regression. We built a multi‑dimensional monitoring and defense system.
4. Governance Practices
4.1 High‑Frequency OOM Page Special Governance
Typical high‑frequency issues share common traits. Our approach: find common characteristics → reproduce offline → locate and fix.
4.1.1 Home Page OOM Investigation
Finding common traits
Log analysis showed that OOM cases often displayed a large number of new‑order push dialogs, suggesting a link.
Offline reproduction
We kept the home page idle and triggered a new-order dialog every 2 seconds for about 8 minutes.
Profiler showed memory increased by ~50 MB after 8 minutes with no decline, matching the online OOM pattern where the dialog was triggered >2000 times.
Root cause and fix
Analysis revealed that the dialog registered a Lifecycle observer but never deregistered it on dismiss, causing the dialog instances to remain in memory. Adding a single line of code to remove the observer on dismiss resolved the issue.
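The leak pattern described above can be sketched in plain Java. This is a minimal model, not the app's code: `FakeLifecycle` and `OrderDialog` are hypothetical stand-ins for the Android lifecycle owner and the new-order dialog, and the `removeOnDismiss` flag exists only to compare the behavior before and after the fix.

```java
import java.util.ArrayList;
import java.util.List;

final class ObserverLeakDemo {
    interface Observer {
        void onStateChanged(String state);
    }

    // Hypothetical stand-in for a lifecycle owner: it holds strong references to observers.
    static final class FakeLifecycle {
        private final List<Observer> observers = new ArrayList<>();

        void addObserver(Observer o) { observers.add(o); }
        void removeObserver(Observer o) { observers.remove(o); }
        int observerCount() { return observers.size(); }
    }

    // Hypothetical dialog that registers itself as an observer when it is created.
    static final class OrderDialog implements Observer {
        private final FakeLifecycle lifecycle;
        private final boolean removeOnDismiss;

        OrderDialog(FakeLifecycle lifecycle, boolean removeOnDismiss) {
            this.lifecycle = lifecycle;
            this.removeOnDismiss = removeOnDismiss;
            lifecycle.addObserver(this); // the lifecycle now keeps this dialog reachable
        }

        @Override
        public void onStateChanged(String state) {
            // React to lifecycle changes while the dialog is visible.
        }

        void dismiss() {
            if (removeOnDismiss) {
                // The one-line fix: drop the registration so the dialog can be collected.
                lifecycle.removeObserver(this);
            }
        }
    }

    // Shows and dismisses `count` dialogs and returns how many are still registered.
    static int showAndDismiss(int count, boolean removeOnDismiss) {
        FakeLifecycle lifecycle = new FakeLifecycle();
        for (int i = 0; i < count; i++) {
            new OrderDialog(lifecycle, removeOnDismiss).dismiss();
        }
        return lifecycle.observerCount();
    }

    public static void main(String[] args) {
        System.out.println("retained without fix: " + showAndDismiss(240, false));
        System.out.println("retained with fix:    " + showAndDismiss(240, true));
    }
}
```

The count of 240 corresponds to one dialog every 2 seconds for 8 minutes. Without the removal, every dismissed dialog stays reachable through the observer list, which is consistent with memory that grows and never declines.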
4.1.2 Vehicle‑Sticker OOM Investigation
Unlike the home page, the sticker capture case lacked obvious business clues. We added a reporting strategy combined with a “memory sponge” that dumps heap snapshots on OOM.
ByteDance memory‑sponge solution: https://juejin.cn/post/7052574440734851085
Finding common traits
Heap snapshots showed byte[] arrays occupying >90% of memory, mainly created by image‑recording classes, indicating a link to camera recording logic.
Offline verification
Simulated the capture process; although OOM was not reproduced, frequent memory churn and GC were observed.
Fix
We introduced object‑pool reuse for recording objects, greatly reducing allocations and GC frequency. After a gray‑release, the OOM issue was resolved.
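Object-pool reuse can be illustrated with a small buffer pool. The article does not show the pooled type, so the sketch below assumes fixed-size `byte[]` frame buffers; `FrameBufferPool` and its parameters are hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;

final class FrameBufferPool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    private final int bufferSize;
    private final int maxPooled;
    private int allocations = 0;

    FrameBufferPool(int bufferSize, int maxPooled) {
        this.bufferSize = bufferSize;
        this.maxPooled = maxPooled;
    }

    // Returns a pooled buffer when one is available, otherwise allocates a new one.
    synchronized byte[] acquire() {
        byte[] buffer = free.pollFirst();
        if (buffer == null) {
            allocations++;
            buffer = new byte[bufferSize];
        }
        return buffer;
    }

    // Returns a buffer to the pool; surplus or wrongly sized buffers are left to the GC.
    synchronized void release(byte[] buffer) {
        if (buffer != null && buffer.length == bufferSize && free.size() < maxPooled) {
            free.addFirst(buffer);
        }
    }

    synchronized int allocationCount() {
        return allocations;
    }

    public static void main(String[] args) {
        FrameBufferPool pool = new FrameBufferPool(1024 * 1024, 4);
        for (int frame = 0; frame < 1000; frame++) {
            byte[] buffer = pool.acquire();
            buffer[0] = (byte) frame; // stands in for filling the buffer with frame data
            pool.release(buffer);
        }
        System.out.println("allocations for 1000 frames: " + pool.allocationCount());
    }
}
```

Because each frame releases its buffer before the next one is acquired, the loop reuses a single 1 MB array instead of allocating 1000 of them, which is the kind of reduction in churn and GC pressure the fix aims for.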
4.2 Java Memory Leak Governance
Even with garbage collection, an object leaks when it stays reachable from a GC root after it is no longer needed. We focus on Activity/Fragment leaks and unreasonable large objects.
Activity/Fragment leaks
Unreleased large objects
4.2.1 Tools
We evaluated several mature tools and selected appropriate ones for online and offline use.
| Name | Company | Principle | Features | GitHub |
| --- | --- | --- | --- | --- |
| LeakCanary | Square (open source) | WeakReference + GC + analysis | Low integration cost, suitable for offline use. | https://github.com/square/leakcanary |
| KOOM | Kwai | Periodic polling + threshold + sub-process dump + Shark analysis + XHook trimming | Comprehensive online monitoring. | https://github.com/KwaiAppTeam/KOOM/blob/master/koom-java-leak/README.zh-CN.md |
| Matrix | Tencent | ActivityLifecycleCallbacks + weak references for leak detection | Suitable for component and image leaks. | https://github.com/Tencent/matrix/wiki/Matrix-Android-ResourceCanary |
| Tailor | ByteDance | Heap snapshot trimming via XHook | Lightweight dump library for OOM/ANR. | https://github.com/bytedance/tailor/blob/master/README_cn.md |
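The "WeakReference + GC" principle shared by several of these tools can be sketched in a few lines. This is a simplified illustration, not LeakCanary's implementation: `LeakWatcherSketch` and the `RETAINED` list are hypothetical, and `System.gc()` is only a hint, so a real tool needs more robust triggering and a heap dump to find the retaining path.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

final class LeakWatcherSketch {
    // Simulates a GC root (for example a static field) that wrongly keeps objects alive.
    static final List<Object> RETAINED = new ArrayList<>();

    private final WeakReference<Object> watched;

    LeakWatcherSketch(Object destroyedObject) {
        // A weak reference does not keep the object alive on its own.
        this.watched = new WeakReference<>(destroyedObject);
    }

    // Returns true if the watched object is still reachable after several GC requests.
    boolean isRetained() {
        for (int attempt = 0; attempt < 5 && watched.get() != null; attempt++) {
            System.gc(); // a request only; the runtime may ignore it
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return watched.get() != null;
    }

    public static void main(String[] args) {
        Object screen = new Object(); // stands in for a destroyed Activity
        RETAINED.add(screen);         // the bug: something still references it
        LeakWatcherSketch watcher = new LeakWatcherSketch(screen);
        screen = null;
        System.out.println("suspected leak: " + watcher.isRetained());
    }
}
```

If the weak reference is still populated after collection has been requested, something else holds a strong reference to the object, and that is the signal to dump and analyze the heap.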
4.2.2 Practical Summary
Initially we integrated KOOM's Java heap monitoring online, then replaced it with an in-house tool offering better performance and collection strategies. We adopted a "collect on OOM only" policy to minimize impact on normal users while increasing detection probability.
Polling + threshold‑based dumps affect performance; OOM‑triggered dumps are less intrusive.
Memory exceeding a threshold does not always indicate a problem; OOM‑driven dumps have higher relevance.
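A "collect on OOM only" policy can be sketched with an uncaught-exception handler that triggers a dump only when an `OutOfMemoryError` appears in the cause chain. The in-house tool is not described in the article, so `OomDumpHandler` and its `dumpHeap` callback are hypothetical; on Android the callback would presumably wrap something like `Debug.dumpHprofData(path)`, which is an assumption here.

```java
final class OomDumpHandler implements Thread.UncaughtExceptionHandler {
    private final Runnable dumpHeap;                    // hypothetical dump action
    private final Thread.UncaughtExceptionHandler next; // previous handler, may be null

    OomDumpHandler(Runnable dumpHeap, Thread.UncaughtExceptionHandler next) {
        this.dumpHeap = dumpHeap;
        this.next = next;
    }

    // Walks the cause chain (bounded, in case of a cyclic chain) looking for an OOM.
    static boolean causedByOom(Throwable error) {
        Throwable current = error;
        for (int depth = 0; current != null && depth < 32; depth++) {
            if (current instanceof OutOfMemoryError) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        try {
            if (causedByOom(error)) {
                dumpHeap.run(); // collect a snapshot only for OOM crashes
            }
        } finally {
            if (next != null) {
                next.uncaughtException(thread, error); // keep normal crash reporting
            }
        }
    }

    public static void main(String[] args) {
        OomDumpHandler handler = new OomDumpHandler(
                () -> System.out.println("dumping heap snapshot"),
                Thread.getDefaultUncaughtExceptionHandler());
        Thread.setDefaultUncaughtExceptionHandler(handler);
        System.out.println("handler installed");
    }
}
```

Nothing runs during normal operation, so users who never hit an OOM pay no polling or dump cost.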
Data analysis revealed multiple modules with Java leaks and large object usage, many of which are core business modules with high leak frequency.
Typical Java leak scenarios include:
Handler or Thread inner classes holding outer class references.
Singletons holding interface‑type members that retain outer references.
Static variables retaining Activities.
Unregistered broadcast receivers or system services.
Third‑party SDKs receiving Activity/Fragment context.
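The first scenario in the list, an inner class holding its outer instance, has a common remedy: a static nested class with a `WeakReference`. The sketch below uses a hypothetical `ScreenSketch` class in place of an Activity and is only one possible fix, not necessarily the one applied in the app.

```java
import java.lang.ref.WeakReference;

final class ScreenSketch {
    String title = "idle";

    // Leaky shape: the anonymous class captures ScreenSketch.this implicitly, so a
    // long-lived queue or thread holding this Runnable also keeps the screen alive.
    Runnable leakyTask() {
        return new Runnable() {
            @Override
            public void run() {
                title = "updated";
            }
        };
    }

    // Safer shape: a static nested class has no implicit outer reference and
    // reaches the screen only through a WeakReference.
    static final class UpdateTask implements Runnable {
        private final WeakReference<ScreenSketch> screenRef;

        UpdateTask(ScreenSketch screen) {
            this.screenRef = new WeakReference<>(screen);
        }

        @Override
        public void run() {
            ScreenSketch screen = screenRef.get();
            if (screen != null) { // skip the update if the screen is already gone
                screen.title = "updated";
            }
        }
    }

    public static void main(String[] args) {
        ScreenSketch screen = new ScreenSketch();
        new UpdateTask(screen).run();
        System.out.println(screen.title);
    }
}
```

With this shape, a task that outlives the screen no longer prevents the screen from being collected; it simply becomes a no-op.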
Typical unreasonable large‑object scenarios include:
Unreleased Bitmaps after use, especially static references.
Large arrays such as Glide cache pools.
4.3 Native Memory Leak Governance
Native code requires manual allocation/release (malloc/free or new/delete). Missing a single delete can cause leaks, and third‑party .so libraries often introduce uncontrolled leaks.
```cpp
#include <iostream>

void leakFunc() {
    int* p = new int(3);
    // delete p; // If omitted, memory leak occurs
}

int main() {
    leakFunc();
}
```

We use hook-based tools to monitor native allocations and deallocations.
4.3.1 Tools
Common native leak detection tools:
| Name | Company | Principle | GitHub |
| --- | --- | --- | --- |
| malloc debug | Android OS | Replaces libc malloc/free internally | |
| perfetto | Android OS | Based on ftrace, atrace, heapprofd | |
| koom | Kwai | Hook malloc/free + mark-and-sweep analysis | https://github.com/KwaiAppTeam/KOOM/blob/master/koom-native-leak/README.zh-CN.md |
| raphael | ByteDance | Uses bytehook to hook multiple allocation/free methods | https://github.com/bytedance/memory-leak-detector |
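The bookkeeping behind hook-based detection can be modeled as a ledger: record each allocation when the hooked `malloc` fires, remove it when the hooked `free` fires, and treat what remains as leak candidates. The Java sketch below is a toy model with a hypothetical `AllocationLedger` class; real tools do this in native code, also record stack traces, and (in KOOM's case) apply reachability analysis before reporting.

```java
import java.util.HashMap;
import java.util.Map;

final class AllocationLedger {
    // Maps a simulated native address to the size of the live allocation at that address.
    private final Map<Long, Long> live = new HashMap<>();

    // Called from the (simulated) malloc hook.
    void onMalloc(long address, long size) {
        live.put(address, size);
    }

    // Called from the (simulated) free hook.
    void onFree(long address) {
        live.remove(address);
    }

    int outstandingBlocks() {
        return live.size();
    }

    long outstandingBytes() {
        long total = 0;
        for (long size : live.values()) {
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        AllocationLedger ledger = new AllocationLedger();
        ledger.onMalloc(0x1000L, 64);
        ledger.onMalloc(0x2000L, 128);
        ledger.onFree(0x1000L); // the block at 0x2000 is never freed
        System.out.println(ledger.outstandingBlocks() + " block(s), "
                + ledger.outstandingBytes() + " bytes still allocated");
    }
}
```

An outstanding entry is only a candidate: long-lived caches are also never freed, which is why the tools in the table add further analysis on top of the raw ledger.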
4.3.2 Practical Summary
Native OOMs are fewer than Java OOMs but still occur. We monitor native leaks online with KOOM‑Native and have identified several .so libraries with leaks.
We also track high‑frequency native OOM scenarios that are not leaks but stem from unreasonable allocations, such as large Bitmap loads after image capture.
Heap dumps revealed that loading a large Bitmap caused memory spikes; since Android 8.0, Bitmap pixel data is allocated on the native heap.
We fixed the issue by reusing image objects and avoiding unnecessary rotation of the original bitmap.
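A rough calculation shows why an extra rotated copy matters. The sketch assumes a hypothetical 4000×3000 capture decoded as ARGB_8888 (4 bytes per pixel); the actual capture resolution is not given in the article.

```java
final class BitmapMemoryEstimate {
    // Size of the decoded pixel data for one bitmap.
    static long bytes(int widthPx, int heightPx, int bytesPerPixel) {
        return (long) widthPx * heightPx * bytesPerPixel;
    }

    // Peak usage while a rotated copy exists alongside the original bitmap.
    static long peakWithRotatedCopy(int widthPx, int heightPx, int bytesPerPixel) {
        return 2 * bytes(widthPx, heightPx, bytesPerPixel);
    }

    public static void main(String[] args) {
        int width = 4000;      // assumed capture width
        int height = 3000;     // assumed capture height
        int argb8888 = 4;      // bytes per pixel for ARGB_8888
        System.out.println("single bitmap: " + bytes(width, height, argb8888) + " bytes");
        System.out.println("with rotated copy: "
                + peakWithRotatedCopy(width, height, argb8888) + " bytes");
    }
}
```

Under these assumptions one decoded image occupies 48,000,000 bytes (about 46 MiB), and rotating it into a new bitmap briefly doubles that to 96,000,000 bytes, so skipping the rotation or reusing an existing bitmap removes the largest part of the spike.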
4.4 Memory‑Issue Defensive Mechanism
Even after successful governance, without a solid defensive monitoring system, memory problems can deteriorate over time.
4.4.1 Existing Defensive Measures
1. MTC (Automated Test Platform) performance gate
QA performs basic performance tests on each build, but coverage is limited.
2. Online APM monitoring
APM provides Java leak and large‑object monitoring, but thresholds limit coverage and native monitoring ROI is low.
4.4.2 Offline Defensive Mechanism
We built a comprehensive offline memory monitoring system consisting of three layers:
Java, native, and thread‑leak monitoring using mature open‑source tools.
Memory churn and frequent GC monitoring via JVMTI events (GarbageCollectionStart/Finish, ObjectFree, VMObjectAlloc).
Page‑level memory rise detection using ActivityLifecycle + Debug.getPss() and slope analysis.
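The slope analysis in the last layer can be sketched as a least-squares fit over periodic memory samples. The sampling interval, the threshold, and the `MemorySlope` class are assumptions for illustration; on Android the samples would come from `Debug.getPss()` while a page is in the foreground.

```java
final class MemorySlope {
    // Least-squares slope of memory (kB) over time (seconds), in kB per second.
    static double slope(double[] seconds, double[] pssKb) {
        int n = seconds.length;
        if (n < 2 || pssKb.length != n) {
            throw new IllegalArgumentException("need at least two paired samples");
        }
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += seconds[i];
            meanY += pssKb[i];
        }
        meanX /= n;
        meanY /= n;

        double numerator = 0;
        double denominator = 0;
        for (int i = 0; i < n; i++) {
            double dx = seconds[i] - meanX;
            numerator += dx * (pssKb[i] - meanY);
            denominator += dx * dx;
        }
        if (denominator == 0) {
            throw new IllegalArgumentException("sample times must not all be equal");
        }
        return numerator / denominator;
    }

    // Flags a page whose memory grows faster than the given threshold.
    static boolean isRising(double[] seconds, double[] pssKb, double thresholdKbPerSecond) {
        return slope(seconds, pssKb) > thresholdKbPerSecond;
    }

    public static void main(String[] args) {
        double[] seconds = {0, 60, 120, 180};
        double[] pssKb = {100_000, 106_000, 112_000, 118_000};
        System.out.println("slope kB/s: " + slope(seconds, pssKb));
        System.out.println("rising: " + isRising(seconds, pssKb, 50.0));
    }
}
```

Fitting a line instead of comparing the first and last sample makes the check less sensitive to a single GC dip or allocation spike. For scale, growth of about 50 MB over 8 minutes, as in the home page case, is roughly 100 kB per second.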
4.4.2.1 Reporting Layer
1. Increase problem awareness
When a memory issue is detected, the app shows a toast with the problem type and logs detailed info for developers.
2. Closed‑loop issue assignment
Combined with real‑time logs and a custom exception platform, memory issues are automatically assigned to responsible developers.
3. SDK memory‑issue gate
We added a memory‑issue check to the MTC performance report for SDK changes, creating a gate for SDK memory regressions.
5. Summary
Over the past half-year we carried out extensive memory governance and drew the following lessons:
Define a clear strategy before specialized governance to prioritize based on user impact and cost.
Avoid reinventing the wheel; leverage existing mature tools and focus on problem diagnosis.
Long‑term defense is essential; robust offline monitoring prevents issues from resurfacing in production.
6. Future Work
Java OOMs are stable; continue deeper research on remaining native OOMs.
Integrate more memory‑related checks into the MTC platform to expand performance reporting dimensions.
References
KOOM – High‑Performance Online Memory Monitoring Solution
Raphael Principles and Practice (by ByteDance)
ART TI
