
How We Reduced Android OOM Crashes by 99%: Mobile Memory Optimization Secrets

Over the past six months we tackled severe Android memory issues (high OOM crash rates, memory leaks, and large-object usage) through systematic profiling, targeted page optimizations, Java and native leak detection tools, and robust monitoring, ultimately reducing the OOM crash rate from 0.8‱ to 0.01‱ and improving app stability.

Huolala Tech

May 14, 2024


1. Introduction

Memory management is central to Android app performance. Improper memory usage leads to out-of-memory (OOM) crashes, a low app survival rate, and UI jank, while optimizing memory can significantly improve responsiveness, stability, and user experience. We have spent considerable effort addressing memory issues in the driver-side app, and this article shares what we found.

2. Project Status and Results

2.1 Project Status

High OOM‑related crash rate

The highest online OOM crash rate reached 0.8‱, accounting for 20% of total crashes.

High‑frequency OOM pages are core pages and core business flows

Home page and the vehicle‑sticker capture page are the top two OOM pages. Frequent crashes on the home page severely affect user experience, while crashes on the sticker page block drivers from accepting orders.

Lack of defensive mechanisms and memory‑issue checkpoints

Memory leaks relied solely on developers' code quality; no offline monitoring existed, and leaks worsened over multiple releases.

2.2 Governance Results

After prolonged memory governance, the OOM‑induced crash rate dropped from a peak of 0.8‱ to 0.01‱; memory‑hit rate fell from 0.64% to 0.01%; core pages and flows now have zero OOM crashes. Effective offline defensive mechanisms intercepted many leak issues before they reached production.

3. Governance Strategy

Before performance and technical optimization, a clear strategy is essential. Based on the current problems, we defined the following optimization strategies:

3.1 Governance Phases

High-frequency OOM page governance: Prioritize pages with the highest user impact for maximum ROI.

Java memory leak governance: Address Java-level leaks and large object allocations that increase OOM probability over time.

Native memory leak governance: Handle native leaks last due to higher cost and lower ROI.

3.2 Defensive Phase

After the above phases, long‑term defensive mechanisms are required to prevent regression. We built a multi‑dimensional monitoring and defense system.

4. Governance Practices

4.1 High‑Frequency OOM Page Special Governance

Typical high‑frequency issues share common traits. Our approach: find common characteristics → reproduce offline → locate and fix.

4.1.1 Home Page OOM Investigation

Finding common traits

Log analysis showed that OOM cases often displayed a large number of new‑order push dialogs, suggesting a link.

Offline reproduction

We kept the home page static and triggered a new-order dialog every 2 seconds for about 8 minutes.

The profiler showed memory growing by ~50 MB over the 8 minutes with no decline, consistent with the online OOM cases, in which the dialog had been triggered more than 2,000 times.

Root cause and fix

Analysis revealed that the dialog registered a Lifecycle observer but never deregistered it on dismiss, causing the dialog instances to remain in memory. Adding a single line of code to remove the observer on dismiss resolved the issue.
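The leak pattern and the fix can be sketched in plain Java. This is a minimal model, not the app's code and not the androidx Lifecycle API: `PageLifecycle`, `OrderDialog`, and `ObserverLeakDemo` are hypothetical names, and the registry simply stands in for a long-lived lifecycle owner that holds strong references to its observers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a long-lived lifecycle owner (e.g. the home page).
// It only models "the registry holds strong references to its observers".
class PageLifecycle {
    private final List<Runnable> observers = new ArrayList<>();

    void addObserver(Runnable observer) { observers.add(observer); }
    void removeObserver(Runnable observer) { observers.remove(observer); }
    int observerCount() { return observers.size(); }
}

// Hypothetical new-order dialog that observes the page lifecycle.
class OrderDialog implements Runnable {
    private final PageLifecycle lifecycle;
    private final boolean removeOnDismiss;

    OrderDialog(PageLifecycle lifecycle, boolean removeOnDismiss) {
        this.lifecycle = lifecycle;
        this.removeOnDismiss = removeOnDismiss;
        lifecycle.addObserver(this); // the page now strongly references this dialog
    }

    @Override
    public void run() { /* react to a lifecycle event */ }

    void dismiss() {
        if (removeOnDismiss) {
            lifecycle.removeObserver(this); // the one-line fix
        }
    }
}

class ObserverLeakDemo {
    // Shows and dismisses `count` dialogs, then reports how many the page still holds.
    static int retainedAfter(int count, boolean removeOnDismiss) {
        PageLifecycle lifecycle = new PageLifecycle();
        for (int i = 0; i < count; i++) {
            new OrderDialog(lifecycle, removeOnDismiss).dismiss();
        }
        return lifecycle.observerCount();
    }

    public static void main(String[] args) {
        // 240 dialogs = one every 2 seconds for 8 minutes, as in the offline test.
        System.out.println("leaky: " + retainedAfter(240, false)); // prints "leaky: 240"
        System.out.println("fixed: " + retainedAfter(240, true));  // prints "fixed: 0"
    }
}
```

Without the removal call, every dismissed dialog stays reachable through the registry, so retained memory grows with the number of dialogs shown.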

4.1.2 Vehicle‑Sticker OOM Investigation

Unlike the home page, the sticker capture case lacked obvious business clues. We added a reporting strategy combined with a “memory sponge” that dumps heap snapshots on OOM.

ByteDance memory‑sponge solution: https://juejin.cn/post/7052574440734851085

Finding common traits

Heap snapshots showed byte[] arrays occupying >90% of memory, mainly created by image‑recording classes, indicating a link to camera recording logic.

Offline verification

Simulated the capture process; although OOM was not reproduced, frequent memory churn and GC were observed.

Fix

We introduced object‑pool reuse for recording objects, greatly reducing allocations and GC frequency. After a gray‑release, the OOM issue was resolved.
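Object-pool reuse for byte buffers can be sketched as follows. This is an illustration under assumptions, not the recording code itself: `FrameBufferPool`, the 64 KB buffer size, and the pool cap of 4 are all hypothetical.

```java
import java.util.ArrayDeque;

// Hypothetical pool of fixed-size frame buffers; names and sizes are illustrative.
class FrameBufferPool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    private final int bufferSize;
    private final int maxPooled;
    private int allocations = 0;

    FrameBufferPool(int bufferSize, int maxPooled) {
        this.bufferSize = bufferSize;
        this.maxPooled = maxPooled;
    }

    // Returns a pooled buffer if one is available, otherwise allocates a new one.
    synchronized byte[] acquire() {
        byte[] buffer = free.poll();
        if (buffer == null) {
            buffer = new byte[bufferSize];
            allocations++;
        }
        return buffer;
    }

    // Keeps only buffers of the expected size, and caps the pool so it cannot grow unbounded.
    synchronized void release(byte[] buffer) {
        if (buffer != null && buffer.length == bufferSize && free.size() < maxPooled) {
            free.push(buffer);
        }
    }

    synchronized int allocations() { return allocations; }
}

class FrameBufferPoolDemo {
    // Processes `frames` frames sequentially and reports how many buffers were allocated.
    static int allocationsFor(int frames) {
        FrameBufferPool pool = new FrameBufferPool(64 * 1024, 4);
        for (int i = 0; i < frames; i++) {
            byte[] buffer = pool.acquire();
            buffer[0] = (byte) i; // stand-in for writing frame data
            pool.release(buffer);
        }
        return pool.allocations();
    }

    public static void main(String[] args) {
        System.out.println("allocations for 1000 frames: " + allocationsFor(1000)); // prints 1
    }
}
```

Callers must not touch a buffer after releasing it, and stale contents from the previous frame remain in the array unless they are overwritten.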

4.2 Java Memory Leak Governance

Even with garbage collection, leaks occur when objects that are no longer needed remain reachable from GC roots. We focus on Activity/Fragment leaks and unreasonably large objects.

Activity/Fragment leaks

Unreleased large objects

4.2.1 Tools

We evaluated several mature tools and selected appropriate ones for online and offline use.

| Name | Company | Principle | Features | GitHub |
| --- | --- | --- | --- | --- |
| LeakCanary | Open-source | WeakReference + GC + analysis | Low integration cost, suitable for offline use. | https://github.com/square/leakcanary |
| KOOM | Kwai | Periodic + threshold + sub-process dump + Shark analysis + XHook trimming | Comprehensive online monitoring. | https://github.com/KwaiAppTeam/KOOM/blob/master/koom-java-leak/README.zh-CN.md |
| Matrix | Tencent | ActivityLifecycleCallbacks + weak references for leak detection | Suitable for component and image leaks. | https://github.com/Tencent/matrix/wiki/Matrix-Android-ResourceCanary |
| Tailor | ByteDance | Heap snapshot trimming via XHook | Lightweight dump library for OOM/ANR. | https://github.com/bytedance/tailor/blob/master/README_cn.md |
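The "WeakReference + GC" principle listed for LeakCanary and Matrix can be illustrated with a small watcher. This is a simplified sketch of the general idea only, not either library's implementation; `RetainedObjectWatcher` and its methods are hypothetical names.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical watcher: track objects that should become garbage, and report
// the ones that are still alive after a GC request.
class RetainedObjectWatcher {
    private static final class KeyedRef extends WeakReference<Object> {
        final String key;

        KeyedRef(Object referent, String key, ReferenceQueue<Object> queue) {
            super(referent, queue);
            this.key = key;
        }
    }

    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    private final Map<String, KeyedRef> watched = new LinkedHashMap<>();

    // Call when an object is expected to become garbage, e.g. after a page is destroyed.
    void watch(Object expectedGarbage, String key) {
        watched.put(key, new KeyedRef(expectedGarbage, key, queue));
    }

    // Requests a GC, forgets every key whose object was collected, and returns the rest.
    List<String> retainedKeys() {
        System.gc(); // only a hint; a real tool waits and retries before reporting
        KeyedRef cleared;
        while ((cleared = (KeyedRef) queue.poll()) != null) {
            watched.remove(cleared.key);
        }
        return new ArrayList<>(watched.keySet());
    }
}
```

A key that stays in the retained list across repeated checks is a leak candidate; the "analysis" step then dumps the heap and searches for the reference chain that keeps the object alive.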

4.2.2 Practical Summary

We initially integrated KOOM's Java heap monitoring online, then replaced it with an in-house tool that offers better performance and collection strategies. We adopted a "collect on OOM only" policy to minimize the impact on normal users while raising the probability that each dump captures a real problem.

Polling + threshold‑based dumps affect performance; OOM‑triggered dumps are less intrusive.

Memory exceeding a threshold does not always indicate a problem; OOM‑driven dumps have higher relevance.
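A "collect on OOM only" policy can be sketched with an uncaught-exception handler. The in-house tool is not described in detail, so this is a guess at the general shape: `HeapDumper` and `OomOnlyDumpHandler` are hypothetical, and the actual dump call (on Android, something like `android.os.Debug.dumpHprofData`) is left behind an interface and never invoked here.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical abstraction over "write a heap snapshot somewhere".
interface HeapDumper {
    void dump(String threadName);
}

class OomOnlyDumpHandler implements Thread.UncaughtExceptionHandler {
    private final HeapDumper dumper;
    private final Thread.UncaughtExceptionHandler next; // previous handler, may be null
    private final AtomicBoolean dumped = new AtomicBoolean(false);

    OomOnlyDumpHandler(HeapDumper dumper, Thread.UncaughtExceptionHandler next) {
        this.dumper = dumper;
        this.next = next;
    }

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        try {
            // Dump at most once per process, and only for OOM crashes.
            if (isOom(error) && dumped.compareAndSet(false, true)) {
                dumper.dump(thread.getName());
            }
        } catch (Throwable ignored) {
            // Dumping is best effort; it must never mask the original crash.
        } finally {
            if (next != null) {
                next.uncaughtException(thread, error);
            }
        }
    }

    // Walks the cause chain (bounded), because an OOM may be wrapped in another exception.
    static boolean isOom(Throwable error) {
        int depth = 0;
        for (Throwable t = error; t != null && depth < 16; t = t.getCause(), depth++) {
            if (t instanceof OutOfMemoryError) {
                return true;
            }
        }
        return false;
    }
}
```

It would be installed once at startup with `Thread.setDefaultUncaughtExceptionHandler(new OomOnlyDumpHandler(dumper, Thread.getDefaultUncaughtExceptionHandler()))`, so normal sessions pay no polling or dump cost.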

Data analysis revealed multiple modules with Java leaks and large object usage, many of which are core business modules with high leak frequency.

Typical Java leak scenarios include:

Handler or Thread inner classes holding outer class references.

Singletons holding interface‑type members that retain outer references.

Static variables retaining Activities.

Unregistered broadcast receivers or system services.

Third‑party SDKs receiving Activity/Fragment context.
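The first scenario in the list above, and one common way to avoid it, can be shown in plain Java. `Screen` is a hypothetical stand-in for an Activity, and `simulateCollected` exists only so the example can be exercised without depending on GC timing.

```java
import java.lang.ref.WeakReference;

// Hypothetical screen object standing in for an Activity.
class Screen {
    int refreshCount = 0;

    void refresh() { refreshCount++; }

    // Leak-prone: a non-static inner class carries a hidden Screen.this reference,
    // so a queued or long-running task keeps the whole screen reachable.
    class LeakyTask implements Runnable {
        @Override
        public void run() { refresh(); }
    }
}

// Safer: a top-level (or static nested) task holds the screen weakly and
// does nothing once the screen is gone.
class SafeTask implements Runnable {
    private final WeakReference<Screen> screenRef;

    SafeTask(Screen screen) {
        this.screenRef = new WeakReference<>(screen);
    }

    @Override
    public void run() {
        Screen screen = screenRef.get();
        if (screen != null) {
            screen.refresh();
        }
    }

    // Illustration hook: mimics the GC clearing the weak reference.
    void simulateCollected() { screenRef.clear(); }
}
```

The weak reference only stops the task from pinning the screen; pending work should still be cancelled when the screen is destroyed.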

Typical unreasonable large‑object scenarios include:

Unreleased Bitmaps after use, especially static references.

Large arrays such as Glide cache pools.

4.3 Native Memory Leak Governance

Native code requires manual allocation/release (malloc/free or new/delete). Missing a single delete can cause leaks, and third‑party .so libraries often introduce uncontrolled leaks.

```cpp
#include <iostream>

// Allocates an int on the heap and never frees it.
void leakFunc() {
    int* p = new int(3);
    // delete p; // If omitted, the allocation is leaked on every call
}

int main() {
    leakFunc();
    return 0;
}
```

We use hook‑based tools to monitor native allocations and deallocations.

4.3.1 Tools

Common native leak detection tools:

| Name | Company | Principle | GitHub |
| --- | --- | --- | --- |
| malloc debug | Android OS | Replaces libc malloc/free internally | - |
| Perfetto | Android OS | Based on ftrace, atrace, heapprofd | - |
| KOOM | Kwai | Hook malloc/free + mark-and-sweep analysis | https://github.com/KwaiAppTeam/KOOM/blob/master/koom-native-leak/README.zh-CN.md |
| Raphael | ByteDance | Uses bytehook to hook multiple allocation/free methods | https://github.com/bytedance/memory-leak-detector |

4.3.2 Practical Summary

Native OOMs are fewer than Java OOMs but still occur. We monitor native leaks online with KOOM‑Native and have identified several .so libraries with leaks.

We also track high‑frequency native OOM scenarios that are not leaks but stem from unreasonable allocations, such as large Bitmap loads after image capture.

Heap dumps revealed that loading a large Bitmap caused memory spikes; since Android 8.0, Bitmap pixel data is allocated on the native heap.

We fixed the issue by reusing image objects and avoiding unnecessary rotation of the original bitmap.
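A rough calculation shows why rotating the original bitmap is expensive. The sketch assumes the ARGB_8888 format (4 bytes per pixel) and a hypothetical 4000 × 3000 capture; the real image sizes are not stated in this article.

```java
// Back-of-the-envelope bitmap memory estimates; all dimensions are illustrative.
class BitmapMemoryEstimate {
    static final int BYTES_PER_PIXEL_ARGB_8888 = 4;

    // Pixel memory for one decoded bitmap.
    static long bytes(int width, int height) {
        return (long) width * height * BYTES_PER_PIXEL_ARGB_8888;
    }

    // Rotating by creating a new bitmap keeps the source and the result alive together.
    static long peakBytesWhenRotatingCopy(int width, int height) {
        return 2 * bytes(width, height);
    }

    public static void main(String[] args) {
        System.out.println("single bitmap bytes: " + bytes(4000, 3000));
        System.out.println("peak bytes with rotated copy: " + peakBytesWhenRotatingCopy(4000, 3000));
    }
}
```

Under these assumptions a single capture takes 48,000,000 bytes (about 45.8 MiB), and a rotated copy briefly doubles that to 96,000,000 bytes, which is why skipping the rotation and reusing image objects lowers the peak.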

4.4 Memory‑Issue Defensive Mechanism

Even after successful governance, without a solid defensive monitoring system, memory problems can deteriorate over time.

4.4.1 Existing Defensive Measures

1. MTC (Automated Test Platform) performance gate

QA performs basic performance tests on each build, but coverage is limited.

2. Online APM monitoring

APM provides Java leak and large‑object monitoring, but thresholds limit coverage and native monitoring ROI is low.

4.4.2 Offline Defensive Mechanism

We built a comprehensive offline memory monitoring system consisting of three layers:

Java, native, and thread‑leak monitoring using mature open‑source tools.

Memory churn and frequent GC monitoring via JVMTI events (GarbageCollectionStart/Finish, ObjectFree, VMObjectAlloc).

Page‑level memory rise detection using ActivityLifecycle + Debug.getPss() and slope analysis.
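The slope analysis in the third layer can be sketched as a least-squares fit over periodic memory samples. This is an assumed formulation, since the exact method is not given: `MemorySlope` and the threshold are hypothetical, and the samples are taken to be PSS values in kB of the kind `Debug.getPss()` would supply on a device.

```java
// Least-squares trend over (time, PSS) samples collected while a page is visible.
class MemorySlope {
    // Returns the fitted slope of y over x, or 0 when there are fewer than
    // two samples or all x values are identical.
    static double slope(double[] x, double[] y) {
        int n = Math.min(x.length, y.length);
        if (n < 2) {
            return 0.0;
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            numerator += dx * (y[i] - meanY);
            denominator += dx * dx;
        }
        return denominator == 0.0 ? 0.0 : numerator / denominator;
    }

    // Flags a page whose memory grows faster than the given threshold (kB per second).
    static boolean isRising(double[] seconds, double[] pssKb, double thresholdKbPerSecond) {
        return slope(seconds, pssKb) > thresholdKbPerSecond;
    }
}
```

For example, samples of 100000, 101000, 102000, and 103000 kB taken at 0, 10, 20, and 30 seconds give a slope of 100 kB/s. Fitting a line rather than comparing the first and last sample makes the check less sensitive to a single GC dip or spike.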

4.4.2.1 Reporting Layer

1. Increase problem awareness

When a memory issue is detected, the app shows a toast with the problem type and logs detailed info for developers.

2. Closed‑loop issue assignment

Combined with real‑time logs and a custom exception platform, memory issues are automatically assigned to responsible developers.

3. SDK memory‑issue gate

We added a memory‑issue check to the MTC performance report for SDK changes, creating a gate for SDK memory regressions.

5. Summary

Over the past half-year of memory governance work, we drew the following lessons:

Define a clear strategy before specialized governance to prioritize based on user impact and cost.

Avoid reinventing the wheel; leverage existing mature tools and focus on problem diagnosis.

Long‑term defense is essential; robust offline monitoring prevents issues from resurfacing in production.

6. Future Work

Java OOMs are stable; continue deeper research on remaining native OOMs.

Integrate more memory‑related checks into the MTC platform to expand performance reporting dimensions.

References

KOOM – High‑Performance Online Memory Monitoring Solution

Raphael Principles and Practice (by ByteDance)

ART TI
