Cutting Android Crash Rate from 5% to 0.02%: HuoLaLa’s Stability Playbook

This article details how HuoLaLa reduced its Android app crash rate from over 5% to just 0.02% through industry‑standard metrics, systematic crash analysis, code refactoring, third‑party SDK management, memory leak and OOM mitigation, and a suite of preventive tools such as gray releases, configuration systems, hot‑fixes and robust logging.

Huolala Tech

Jun 16, 2022

App crashes are among the worst user experiences: they interrupt the user’s flow, damage the app’s reputation, and lead to uninstalls and lost orders. When the Android crash rate exceeds 0.4%, active users decline noticeably. In early 2021 HuoLaLa’s crash rate was 5% and consumed massive R&D effort; after a year of systematic governance it fell to a stable 0.02%.

Industry Standards

| Performance Metric | Excellent | Pass | Poor | Industry Reference |
|--------------------|-----------|------|------|--------------------|
| Crash Rate (%)     | <=0.1     | 0.6  | >=1  | 0.5                |

The target crash rate was set to 0.03%.

Pre‑Governance Situation

The crash rate was above 5%. Native crashes (from Flutter, third-party maps, and other SDKs) accounted for 72% of the total and were hard to resolve; many Java crashes had also gone unresolved because of legacy debt.

Crash Governance Methods

Common Crash Handling

Locate and resolve issues using stack traces, user logs, and operation paths.

Identify common patterns (device model, brand, OS version, page, user actions).

Reproduce scenarios (offline or cloud devices) to simplify fixing.
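
The pattern-finding step above can be sketched as a simple aggregation over crash records. This is only an illustration: the CrashReport class, its fields, and the sample values are hypothetical and do not describe HuoLaLa's internal platform.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical crash record; the field names are illustrative only. */
final class CrashReport {
    final String brand;
    final String osVersion;
    final String page;

    CrashReport(String brand, String osVersion, String page) {
        this.brand = brand;
        this.osVersion = osVersion;
        this.page = page;
    }
}

final class CrashClusters {
    /** Counts reports per value of one dimension, largest cluster first. */
    static Map<String, Integer> countBy(List<CrashReport> reports,
                                        Function<CrashReport, String> dimension) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (CrashReport r : reports) {
            counts.merge(dimension.apply(r), 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue() - a.getValue());
        Map<String, Integer> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            sorted.put(e.getKey(), e.getValue());
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<CrashReport> reports = Arrays.asList(
            new CrashReport("vivo", "10", "home"),
            new CrashReport("vivo", "11", "order_confirm"),
            new CrashReport("other", "12", "home"));
        // Group by a single dimension, then by a combination of two.
        System.out.println(countBy(reports, r -> r.brand));
        System.out.println(countBy(reports, r -> r.brand + "/" + r.osVersion));
    }
}
```

Grouping by combined keys (brand plus OS version, or page plus brand) is often what surfaces a device-specific cluster that a single dimension hides.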

Other Handling Methods

Business‑level review, refactoring of high‑crash modules.

Communicate with third‑party SDK providers for upgrades or usage changes.

Crash Governance Practice

Because the internal statistics platform was built late, only recent crash‑rate trends are visible, showing a clear convergence.

1. Code Refactoring

High-traffic pages (home, order confirmation) had been built with Flutter and mini-programs, and their frequent production crashes were hard to reproduce. Refactoring the most critical code back to native brought the crash rate down from 5% to 0.5%.

2. Third-Party SDK and Native Crash Handling

Java crashes were resolved quickly, but third-party SDK and native crashes persisted. Upgrading an SDK sometimes increased crashes, so a downgrade was needed. With the work described in this section, the native crash rate fell from the per-thousand level to 0.05%. Two main solutions were applied:

Check API usage (timing, order, parameters, thread/process).

Communicate with SDK vendors, sending stack traces for resolution.

2.1 VMP Thread Monitoring Causing App Exit

Many native crashes (SIGSEGV, SIGABRT) originated from system libraries (libgui.so, libGLES_mali.so, libc.so). On Vivo devices (Android 10/11) crash rates were extremely high. Attempts included disabling hardware acceleration, extensive memory‑leak and thread‑leak mitigation, lowering frame rate, and contacting the device manufacturer.

After analyzing user logs, a bug was found where VMP monitoring accessed /proc/pid/mem files, causing the app to exit and triggering a secondary crash in the map SDK. Fixing this reduced the crash rate to around 0.1%.

2.2 Map Usage

Maps are essential throughout the order flow. Most map crashes are native, some device‑specific. After working with the SDK vendor and correcting improper destroy timing, crash rate fell to 0.05%.
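
One way to enforce call timing, thread, and destroy order around an SDK object is a thin guard wrapper. The sketch below is an assumption about how such a guard could look, not the fix the team shipped: MapLike is a hypothetical stand-in interface rather than a real map SDK type, and GuardedMap is an illustrative name.

```java
/** Hypothetical stand-in for a third-party map view; not a real SDK type. */
interface MapLike {
    void moveCamera(double lat, double lng);
    void destroy();
}

/** Forwards calls only between creation and destroy, on the creating thread. */
final class GuardedMap {
    private final MapLike delegate;
    private final Thread ownerThread = Thread.currentThread();
    private boolean destroyed;

    GuardedMap(MapLike delegate) {
        this.delegate = delegate;
    }

    /** Returns false instead of touching the SDK once it has been destroyed. */
    boolean moveCamera(double lat, double lng) {
        checkThread();
        if (destroyed) {
            return false;
        }
        delegate.moveCamera(lat, lng);
        return true;
    }

    /** Idempotent: a second destroy() is ignored rather than forwarded. */
    void destroy() {
        checkThread();
        if (destroyed) {
            return;
        }
        destroyed = true;
        delegate.destroy();
    }

    private void checkThread() {
        if (Thread.currentThread() != ownerThread) {
            throw new IllegalStateException("map must be used on its owning thread");
        }
    }
}
```

Funnelling every SDK call through one wrapper also gives a single place to log late or off-thread calls, which helps when sending evidence to the vendor.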

3. Out-of-Memory (OOM) Governance

OOM errors stem from memory leaks, large object references, memory jitter, and improper thread usage.

3.1 Memory Leaks

LeakCanary was used to detect leaks, especially in Activities. Common causes include:

Anonymous inner classes (Handler, Thread) holding outer class references.

Un‑registered listeners (EventBus, broadcast).

Singletons retaining Activity instances.

Poorly scoped network requests and async tasks.

Using Activity context where Application context suffices.
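
Several of the causes above come down to a long-lived object holding a strong reference to a short-lived screen. A minimal sketch of the usual remedy, a callback that is not an inner class and holds its owner weakly, is shown below in plain Java. WeakCallback and Screen are hypothetical names; in an app the owner would typically be an Activity or Fragment.

```java
import java.lang.ref.WeakReference;
import java.util.function.BiConsumer;

/**
 * Callback that holds its owner weakly. It is a top-level class, so it has
 * no implicit reference to an enclosing instance. The action must not
 * capture the owner itself (an unbound method reference works well).
 */
final class WeakCallback<O, V> {
    private final WeakReference<O> ownerRef;
    private final BiConsumer<O, V> action;

    WeakCallback(O owner, BiConsumer<O, V> action) {
        this.ownerRef = new WeakReference<>(owner);
        this.action = action;
    }

    /** Delivers the value if the owner is still reachable; returns false otherwise. */
    boolean deliver(V value) {
        O owner = ownerRef.get();
        if (owner == null) {
            return false;
        }
        action.accept(owner, value);
        return true;
    }

    /** Call from the owner's teardown path to drop the link explicitly. */
    void release() {
        ownerRef.clear();
    }
}

/** Hypothetical stand-in for a screen that receives async results. */
final class Screen {
    String lastMessage;

    void show(String message) {
        lastMessage = message;
    }
}
```

A usage such as `new WeakCallback<>(screen, Screen::show)` lets a network layer or event bus keep the callback for as long as it likes without keeping the screen alive.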

3.2 Thread Governance

Exceeding system thread limits leads to OOM. Solutions:

Unify internal thread pools; replace ad‑hoc async tasks with thread pools or RxJava.

Audit Timer and HandlerThread usage.

Coordinate with other teams to route SDK threads through the app’s pool.

Instrument replaceable threads for safer alternatives.
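
A unified pool can be as small as one shared, bounded executor with named threads. The following is a sketch under assumptions: AppExecutors, the "app-io-" prefix, and the cap of 4 threads are illustrative choices, not values from the article.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class AppExecutors {
    // Assumed cap; a real value would be tuned per device class.
    static final int MAX_THREADS = 4;

    private static final ThreadPoolExecutor IO = create("app-io-", MAX_THREADS);

    private AppExecutors() {}

    /** Shared pool that app code and cooperating SDKs submit work to. */
    static ExecutorService io() {
        return IO;
    }

    static ThreadPoolExecutor create(final String prefix, int threads) {
        final AtomicInteger counter = new AtomicInteger();
        // Named threads make thread dumps and crash reports easier to read.
        ThreadFactory factory = r -> {
            Thread t = new Thread(r, prefix + counter.incrementAndGet());
            t.setDaemon(true);
            return t;
        };
        // core == max with an unbounded queue: extra work waits in the queue
        // instead of spawning more threads.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            threads, threads, 30L, TimeUnit.SECONDS,
            new LinkedBlockingQueue<Runnable>(), factory);
        // Idle threads are released so the pool shrinks when the app is quiet.
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }
}
```

Because the thread count is fixed, a burst of tasks increases queue length rather than thread count, which is the property that matters for thread-related OOM.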

3.3 Large Object Handling

Images dominate memory usage. Glide was adopted for unified image loading, replacing native loaders.

Using MAT, large objects were identified and optimized, e.g., splitting large SharedPreferences data and moving seldom‑used large data to database or files.
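
To see why images dominate memory, it helps to work through the arithmetic. The sketch below assumes 4 bytes per pixel (the ARGB_8888 format) and power-of-two downsampling; image-loading libraries such as Glide typically handle this sizing internally, so BitmapMath is only an illustration of the numbers involved, not code the app needs.

```java
final class BitmapMath {
    /** Bytes for an uncompressed bitmap at 4 bytes per pixel (ARGB_8888). */
    static long argb8888Bytes(int width, int height) {
        return 4L * width * height;
    }

    /**
     * Largest power-of-two factor that keeps the decoded image at least as
     * large as the requested size in both dimensions.
     */
    static int sampleSize(int srcW, int srcH, int reqW, int reqH) {
        if (reqW <= 0 || reqH <= 0) {
            throw new IllegalArgumentException("requested size must be positive");
        }
        int sample = 1;
        while (srcW / (sample * 2) >= reqW && srcH / (sample * 2) >= reqH) {
            sample *= 2;
        }
        return sample;
    }

    public static void main(String[] args) {
        // A 4000x3000 photo shown in a 400x300 view.
        int sample = sampleSize(4000, 3000, 400, 300);
        System.out.println("full size bytes: " + argb8888Bytes(4000, 3000));
        System.out.println("sample size: " + sample);
        System.out.println("downsampled bytes: "
            + argb8888Bytes(4000 / sample, 3000 / sample));
    }
}
```

In this example the full-size decode needs 48,000,000 bytes, while decoding at a sample size of 8 needs 750,000 bytes for the same on-screen result.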

3.4 Memory Jitter

Memory jitter showed up as sawtooth-shaped memory growth and a high GC frequency. It was reduced by eliminating frequent object allocations in custom View onDraw methods and in layout-change listeners such as the following:

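// The listener runs on every global layout pass, so each pass
// allocates two new int[] arrays and adds to GC pressure.
view.getViewTreeObserver().addOnGlobalLayoutListener(() -> {
    int[] view1Location = new int[2];
    view1.getLocationOnScreen(view1Location);
    int[] view2Location = new int[2];
    view2.getLocationOnScreen(view2Location);
    // ...
});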
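
A common remedy for this kind of listener is to allocate the buffers once and reuse them on every callback. The sketch below shows the idea in plain Java so that it runs outside Android: Locatable is a hypothetical stand-in for the single View method used here, and OverlapChecker is an illustrative name.

```java
/** Hypothetical stand-in for the part of android.view.View used here. */
interface Locatable {
    void getLocationOnScreen(int[] outLocation);
}

final class OverlapChecker {
    // Allocated once and reused on every layout callback.
    private final int[] firstLocation = new int[2];
    private final int[] secondLocation = new int[2];

    /** Intended to be called from the layout listener; allocates nothing. */
    int horizontalGap(Locatable first, Locatable second) {
        first.getLocationOnScreen(firstLocation);
        second.getLocationOnScreen(secondLocation);
        return secondLocation[0] - firstLocation[0];
    }
}
```

The same approach applies to onDraw: Paint, Rect, and Path objects can live in fields and be reset per frame instead of being created per frame.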

4. Common Crash Types

4.1 NullPointerException

Root causes include uninitialized objects, missing data from previous pages, asynchronous callbacks after component destruction, and static variable reclamation. Solutions focus on proper null checks, using Kotlin’s null‑safety, @NonNull/@Nullable annotations, timely cancellation of async tasks, and managing static variables.
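
The "asynchronous callback after destruction" case can be handled by tying each request to the lifetime of the screen that started it. The following is a simplified sketch in plain Java: RequestScope is a hypothetical helper, and for brevity the result is delivered on the worker thread, whereas an Android app would normally post it back to the main thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Tracks work started by one screen so it can be cancelled when the screen goes away. */
final class RequestScope {
    private final ExecutorService executor;
    private final List<Future<?>> pending = new ArrayList<>();
    private volatile boolean destroyed;

    RequestScope(ExecutorService executor) {
        this.executor = executor;
    }

    /** Runs work in the background and delivers the result only while the scope is alive. */
    <T> void load(Supplier<T> work, Consumer<T> onResult) {
        if (destroyed) {
            return;
        }
        Future<?> future = executor.submit(() -> {
            T result = work.get();
            // Re-check after the work finishes: the screen may be gone by now.
            if (!destroyed) {
                onResult.accept(result);
            }
        });
        synchronized (pending) {
            pending.add(future);
        }
    }

    /** Call from the screen's teardown path (for example onDestroy). */
    void destroy() {
        destroyed = true;
        synchronized (pending) {
            for (Future<?> future : pending) {
                future.cancel(true);
            }
            pending.clear();
        }
    }
}
```

With this shape, a callback can assume its views and fields still exist, because it is never invoked once destroy() has run.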

4.2 IndexOutOfBoundsException

Caused by incorrect string slicing, unsynchronized list updates, and unsafe collections in multithreaded contexts. Fixes involve validating string lengths, notifying adapters of data changes, and using thread‑safe collections.
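
The length-validation and thread-safe-collection fixes can be sketched with the standard library alone. SafeAccess and its method names are hypothetical helpers written for this illustration.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class SafeAccess {
    /** Clamps both indices into [0, length] so substring cannot throw. */
    static String safeSubstring(String s, int begin, int end) {
        if (s == null) {
            return "";
        }
        int b = Math.max(0, Math.min(begin, s.length()));
        int e = Math.max(b, Math.min(end, s.length()));
        return s.substring(b, e);
    }

    /** Returns the element at index, or fallback when the index is out of range. */
    static <T> T getOrDefault(List<T> list, int index, T fallback) {
        if (list == null || index < 0 || index >= list.size()) {
            return fallback;
        }
        return list.get(index);
    }

    public static void main(String[] args) {
        System.out.println(safeSubstring("abc", 1, 10));

        // CopyOnWriteArrayList iterates over a snapshot, so a concurrent add
        // does not throw ConcurrentModificationException.
        List<Integer> ids = new CopyOnWriteArrayList<>();
        ids.add(1);
        ids.add(2);
        for (Integer id : ids) {
            ids.add(id + 10);
        }
        System.out.println(ids.size());
    }
}
```

Clamping hides the symptom, so it is worth logging whenever an index had to be adjusted; otherwise the upstream data problem stays invisible.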

4.3 System‑Level Bugs

Example: Android 10 autofill feature triggered RemoteException in ActivityTaskManagerService. The stack trace was traced to handleRequestAssistContextExtras. The fix was to disable autofill on EditText controls.

4.4 Other Frequent Crashes

RemoteServiceException: Service.startForeground not called.

ActivityNotFoundException: No activity to handle Intent.

JavaScriptInterface callbacks causing UI operations on background threads.

MalformedJsonException from backend or local DB data.

Crash Prevention and Assist Tools

Gray Release: Flexible targeting by device ID, percentage, brand, OS version, city, etc.

App Configuration System: Remote config with gray rollout for feature toggles and A/B experiments.

Code Quality Measures: Code review, module ownership, knowledge sharing, and static code scanning.

Hot-Fix: Immediate patching without full app release.

Logging System: Real-time and offline logs for reproducing crash scenarios.
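
Percentage-based gray release is commonly implemented by hashing a stable identifier into a bucket. The sketch below shows one such scheme; it is an assumption about how targeting could work rather than a description of HuoLaLa's system, and GrayRollout, the feature key, and the use of CRC32 are all illustrative choices.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class GrayRollout {
    /** Maps a device ID to a stable bucket in [0, 100) for a given feature. */
    static int bucket(String deviceId, String featureKey) {
        CRC32 crc = new CRC32();
        // Mixing in the feature key gives each feature its own cohort.
        crc.update((featureKey + ":" + deviceId).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100);
    }

    /** True when the device falls inside the first `percent` buckets. */
    static boolean isEnabled(String deviceId, String featureKey, int percent) {
        return bucket(deviceId, featureKey) < percent;
    }

    public static void main(String[] args) {
        String deviceId = "device-123";
        System.out.println(isEnabled(deviceId, "new_order_page", 10));
        System.out.println(isEnabled(deviceId, "new_order_page", 100));
    }
}
```

Because the bucket is deterministic, a device that was included at 10% stays included when the rollout widens to 50%, which keeps crash comparisons between cohorts meaningful.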

Summary

Address issues systematically, expanding from single problems to broader categories.

Identify root causes instead of merely adding try‑catch blocks.

Resolve crashes early in development and testing phases.

Emphasize prevention through code review, module boundaries, and technical sharing.

Control admission of new modules, SDKs, and technologies to reduce risk.

Long‑term crash governance requires continuous effort and a robust mechanism, while still encouraging the adoption of new technologies under proper safeguards.
