Cutting Android Crash Rate from 5% to 0.02%: HuoLaLa’s Stability Playbook
This article details how HuoLaLa reduced its Android app crash rate from over 5% to just 0.02% through industry‑standard metrics, systematic crash analysis, code refactoring, third‑party SDK management, memory leak and OOM mitigation, and a suite of preventive tools such as gray releases, configuration systems, hot‑fixes and robust logging.
Crashes are the worst experience an app can deliver: they interrupt the user's flow, damage the product's reputation, drive uninstalls, and cost orders. When the Android crash rate exceeds 0.4%, active users decline noticeably. In early 2021 HuoLaLa’s crash rate was above 5%, consuming massive R&D effort; after a year of systematic governance it fell to a stable 0.02%.
Industry Standards
Performance Metric | Excellent | Pass | Poor | Industry Reference
Crash Rate (%) | <=0.1 | 0.6 | >=1 | 0.5
The target crash‑rate was set to 0.03%.
Pre‑Governance Situation
The crash rate was above 5%. Native crashes (from Flutter, third‑party maps, and other SDKs) accounted for 72% of crashes and were hard to solve; Java crashes often went unresolved because of legacy technical debt.
Crash Governance Methods
Common Crash Handling
Locate and resolve issues using stack traces, user logs, and operation paths.
Identify common patterns (device model, brand, OS version, page, user actions).
Reproduce scenarios (offline or cloud devices) to simplify fixing.
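Looking for common patterns usually starts with counting crashes per dimension. The following is a minimal sketch, assuming crash reports are already available as in-memory records; `CrashReport`, `CrashPatterns`, and their fields are hypothetical names chosen for illustration, not part of any platform described in this article.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical crash record; the fields mirror the dimensions listed above. */
final class CrashReport {
    final String deviceModel;
    final String osVersion;
    final String page;

    CrashReport(String deviceModel, String osVersion, String page) {
        this.deviceModel = deviceModel;
        this.osVersion = osVersion;
        this.page = page;
    }
}

final class CrashPatterns {
    private CrashPatterns() {}

    /** Counts reports per value of one dimension (values must be non-null). */
    static Map<String, Long> countBy(List<CrashReport> reports,
                                     Function<CrashReport, String> dimension) {
        return reports.stream()
                .collect(Collectors.groupingBy(dimension, Collectors.counting()));
    }

    /** Returns the most frequent value of a dimension, or null for an empty list. */
    static String topValue(List<CrashReport> reports,
                           Function<CrashReport, String> dimension) {
        return countBy(reports, dimension).entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}
```

Calling `CrashPatterns.countBy(reports, r -> r.osVersion)` gives a per-version histogram; a dimension whose top value dominates the counts is a good candidate for a targeted reproduction.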
Other Handling Methods
Business‑level review, refactoring of high‑crash modules.
Communicate with third‑party SDK providers for upgrades or usage changes.
Crash Governance Practice
Because the internal statistics platform was built late, only recent crash‑rate trends are visible, showing a clear convergence.
1. Code Refactoring
High‑traffic pages (home, order confirmation) were built with Flutter and mini‑programs, and their frequent production crashes were hard to reproduce. Refactoring the most critical code back to native brought the crash rate down from 5% to 0.5%.
2. Third‑Party SDK and Native Crash Handling
The native crash rate fell from the per‑thousand level to 0.05%. Java crashes were resolved quickly, but third‑party SDK and native crashes persisted. Upgrading an SDK sometimes increased crashes, in which case the SDK had to be downgraded again. Two main solutions were applied:
Check API usage (timing, order, parameters, thread/process).
Communicate with SDK vendors, sending stack traces for resolution.
2.1 VMP Thread Monitoring Causing App Exit
Many native crashes (SIGSEGV, SIGABRT) originated from system libraries (libgui.so, libGLES_mali.so, libc.so). On Vivo devices (Android 10/11) crash rates were extremely high. Attempts included disabling hardware acceleration, extensive memory‑leak and thread‑leak mitigation, lowering frame rate, and contacting the device manufacturer.
After analyzing user logs, a bug was found where VMP monitoring accessed /proc/pid/mem files, causing the app to exit and triggering a secondary crash in the map SDK. Fixing this reduced the crash rate to around 0.1%.
2.2 Map Usage
Maps are essential throughout the order flow. Most map crashes are native, and some are device‑specific. After working with the SDK vendor and correcting improper destroy timing, the crash rate fell to 0.05%.
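One way to make destroy timing harder to get wrong is to put the SDK object behind a small guard that drops calls arriving after destruction. This is a sketch under stated assumptions: `MapEngine` is a hypothetical stand-in for a vendor map object (not a real SDK interface), and `GuardedMap` is an illustrative wrapper, not the fix HuoLaLa shipped.

```java
/** Hypothetical stand-in for a vendor map object; not a real SDK interface. */
interface MapEngine {
    void moveCamera(double lat, double lng);
    void destroy();
}

/** Serializes access to the engine and drops calls that arrive after destroy(). */
final class GuardedMap {
    private final MapEngine engine;
    private boolean destroyed;

    GuardedMap(MapEngine engine) {
        this.engine = engine;
    }

    /** Returns true if the call reached the engine, false if it was dropped. */
    synchronized boolean moveCamera(double lat, double lng) {
        if (destroyed) {
            return false; // a late callback must not touch a released engine
        }
        engine.moveCamera(lat, lng);
        return true;
    }

    /** Idempotent: the underlying destroy() runs at most once. */
    synchronized void destroy() {
        if (destroyed) {
            return;
        }
        destroyed = true;
        engine.destroy();
    }
}
```

The `synchronized` methods are a simplification for a plain-Java sketch; many map SDKs expect all calls on the main thread, in which case the same flag check can be done there without locking.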
3. Out‑of‑Memory (OOM) Governance
OOM errors stem from memory leaks, large object references, memory jitter, and improper thread usage.
3.1 Memory Leaks
LeakCanary was used to detect leaks, especially in Activities. Common causes include:
Anonymous inner classes (Handler, Thread) holding outer class references.
Un‑registered listeners (EventBus, broadcast).
Singletons retaining Activity instances.
Poorly scoped network requests and async tasks.
Using Activity context where Application context suffices.
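Several of these causes share one shape: a long-lived object holds a strong reference to a short-lived one. A common mitigation is to hold the short-lived target weakly. The sketch below is plain Java and assumes nothing about HuoLaLa's code; `WeakCallback` is a hypothetical helper name.

```java
import java.lang.ref.WeakReference;
import java.util.function.BiConsumer;

/**
 * A callback that holds its target weakly, so a long-lived queue or singleton
 * that keeps the callback does not keep the target (e.g. a screen) alive.
 */
final class WeakCallback<T, R> {
    private final WeakReference<T> targetRef;
    private final BiConsumer<T, R> action;

    WeakCallback(T target, BiConsumer<T, R> action) {
        this.targetRef = new WeakReference<>(target);
        // The action must not capture the target itself, or the leak returns.
        this.action = action;
    }

    /** Delivers the result if the target is still reachable; returns whether it ran. */
    boolean deliver(R result) {
        T target = targetRef.get();
        if (target == null) {
            return false; // target was already garbage-collected
        }
        action.accept(target, result);
        return true;
    }
}
```

The action receives the target as a parameter precisely so that it can be a non-capturing lambda or a static method reference. Weak references complement, rather than replace, unregistering listeners and cancelling work when a component is destroyed.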
3.2 Thread Governance
Exceeding system thread limits leads to OOM. Solutions:
Unify internal thread pools; replace ad‑hoc async tasks with thread pools or RxJava.
Audit Timer and HandlerThread usage.
Coordinate with other teams to route SDK threads through the app’s pool.
Use instrumentation to swap replaceable thread creations for safer alternatives.
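A unified pool can be as simple as one bounded, named executor that every module shares. The following is a minimal sketch, not HuoLaLa's implementation: the class name `AppExecutors`, the `app-io-` thread prefix, the 2 to 4 thread bound, and the 30-second keep-alive are all illustrative assumptions.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class AppExecutors {
    private AppExecutors() {}

    // Illustrative bound: between 2 and 4 threads, depending on core count.
    private static final int POOL_SIZE =
            Math.max(2, Math.min(Runtime.getRuntime().availableProcessors(), 4));

    // Named threads make thread dumps and leak reports easier to attribute.
    private static final ThreadFactory NAMED = new ThreadFactory() {
        private final AtomicInteger seq = new AtomicInteger();

        @Override
        public Thread newThread(Runnable r) {
            Thread t = new Thread(r, "app-io-" + seq.incrementAndGet());
            t.setDaemon(true);
            return t;
        }
    };

    private static final ThreadPoolExecutor IO = create();

    private static ThreadPoolExecutor create() {
        // Fixed upper bound; extra tasks wait in the queue instead of spawning threads.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                POOL_SIZE, POOL_SIZE, 30L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>(), NAMED);
        pool.allowCoreThreadTimeOut(true); // idle threads are released after 30 s
        return pool;
    }

    /** The single shared executor for background work. */
    static ExecutorService io() {
        return IO;
    }

    static int maxThreads() {
        return IO.getMaximumPoolSize();
    }
}
```

Call sites then use `AppExecutors.io().submit(task)` instead of `new Thread(task).start()`, so the total thread count stays bounded no matter how many modules schedule work.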
3.3 Large Object Handling
Images dominate memory usage. Glide was adopted for unified image loading, replacing native loaders.
Using MAT, large objects were identified and optimized, e.g., splitting large SharedPreferences data and moving seldom‑used large data to database or files.
3.4 Memory Jitter
Memory jitter showed up as sawtooth‑shaped memory growth and a high GC frequency; it was reduced by eliminating frequent object allocations in custom View onDraw and layout‑change listeners.
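The usual remedy for allocations in a hot callback is to allocate buffers once and overwrite them in place. Below is a plain-Java sketch of that pattern; `Locatable` is a hypothetical stand-in for the one `View` method involved (`getLocationOnScreen(int[])`), `Runnable.run()` stands in for the layout callback, and `OverlapCheck` is an illustrative name.

```java
/** Hypothetical stand-in for the part of android.view.View used here. */
interface Locatable {
    void getLocationOnScreen(int[] outLocation);
}

/** Layout callback that reuses its coordinate buffers across invocations. */
final class OverlapCheck implements Runnable {
    private final Locatable view1;
    private final Locatable view2;
    // Allocated once; every callback overwrites them in place.
    private final int[] loc1 = new int[2];
    private final int[] loc2 = new int[2];
    private int verticalGap;

    OverlapCheck(Locatable view1, Locatable view2) {
        this.view1 = view1;
        this.view2 = view2;
    }

    /** Stands in for the layout callback; performs no allocation per call. */
    @Override
    public void run() {
        view1.getLocationOnScreen(loc1);
        view2.getLocationOnScreen(loc2);
        verticalGap = loc2[1] - loc1[1];
    }

    int verticalGap() {
        return verticalGap;
    }
}
```

Reusing buffers this way is safe only when the callback always runs on one thread, which is the case for layout callbacks delivered on the main thread.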
view.getViewTreeObserver().addOnGlobalLayoutListener(() -> {
    // Runs on every layout pass, so these two arrays are re-allocated each time.
    int[] view1Location = new int[2];
    view1.getLocationOnScreen(view1Location);
    int[] view2Location = new int[2];
    view2.getLocationOnScreen(view2Location);
    // ...
});
4. Common Crash Types
4.1 NullPointerException
Root causes include uninitialized objects, missing data from previous pages, asynchronous callbacks after component destruction, and static variable reclamation. Solutions focus on proper null checks, using Kotlin’s null‑safety, @NonNull/@Nullable annotations, timely cancellation of async tasks, and managing static variables.
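The "callback after destruction" case can be illustrated without any Android classes. In this sketch, `OrderScreen` is a hypothetical screen-like object and `title` stands in for a view it would update; the pattern is to cancel pending work on destroy and to re-check state inside the callback.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical screen that ignores async results once it has been destroyed. */
final class OrderScreen {
    private volatile boolean destroyed;
    private volatile String title = "";
    private CompletableFuture<Void> pending;

    /** Subscribes to an in-flight request and renders its result when it arrives. */
    void load(CompletableFuture<String> request) {
        pending = request.thenAccept(this::render);
    }

    private void render(String value) {
        // Guard both failure modes: a destroyed screen and missing data.
        if (destroyed || value == null) {
            return;
        }
        title = value;
    }

    /** Mirrors a lifecycle teardown: mark as destroyed and drop the pending callback. */
    void onDestroy() {
        destroyed = true;
        if (pending != null) {
            pending.cancel(false);
        }
    }

    String title() {
        return title;
    }
}
```

Cancelling the dependent future only detaches the callback; it does not stop the underlying request, which a real client would cancel through its own API.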
4.2 IndexOutOfBoundsException
Caused by incorrect string slicing, unsynchronized list updates, and unsafe collections in multithreaded contexts. Fixes involve validating string lengths, notifying adapters of data changes, and using thread‑safe collections.
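These fixes can be sketched as small helpers. `SafeIndexing` and its method names are hypothetical; the only library type used is `java.util.concurrent.CopyOnWriteArrayList`, whose iterators work on a snapshot and therefore tolerate concurrent modification.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class SafeIndexing {
    private SafeIndexing() {}

    /** Like String.substring, but clamps both indices into [0, length]. */
    static String safeSubstring(String s, int begin, int end) {
        if (s == null) {
            return "";
        }
        int from = Math.max(0, Math.min(begin, s.length()));
        int to = Math.max(from, Math.min(end, s.length()));
        return s.substring(from, to);
    }

    /** Returns the element at index, or fallback when the index is out of range. */
    static <T> T getOrDefault(List<T> list, int index, T fallback) {
        if (list == null || index < 0 || index >= list.size()) {
            return fallback;
        }
        return list.get(index);
    }

    /** A list that is safe to share between threads; iteration sees a snapshot. */
    static <T> List<T> newSharedList() {
        return new CopyOnWriteArrayList<>();
    }
}
```

The size check in `getOrDefault` is not atomic with the read, so it protects single-threaded callers only; for a list mutated from several threads, iterate over the copy-on-write list (or a copied snapshot) instead of indexing into it. Copy-on-write suits lists that are read far more often than written, since every write copies the backing array.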
4.3 System‑Level Bugs
Example: Android 10 autofill feature triggered RemoteException in ActivityTaskManagerService. The stack trace was traced to handleRequestAssistContextExtras. The fix was to disable autofill on EditText controls.
4.4 Other Frequent Crashes
RemoteServiceException: Service.startForeground not called.
ActivityNotFoundException: No activity to handle Intent.
JavaScriptInterface callbacks causing UI operations on background threads.
MalformedJsonException from backend or local DB data.
Crash Prevention and Assist Tools
Gray Release: Flexible targeting by device ID, percentage, brand, OS version, city, etc.
App Configuration System: Remote config with gray rollout for feature toggles and A/B experiments.
Code Quality Measures: Code review, module ownership, knowledge sharing, and static code scanning.
Hot‑Fix: Immediate patching without full app release.
Logging System: Real‑time and offline logs for reproducing crash scenarios.
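Percentage targeting in a gray release is commonly done by hashing a stable identifier into a fixed number of buckets. The sketch below illustrates that idea only; `GrayRelease`, the 100-bucket scheme, and the use of `String.hashCode()` are assumptions, not a description of HuoLaLa's release system.

```java
final class GrayRelease {
    private GrayRelease() {}

    /** Maps a device ID to a stable bucket in [0, 100). */
    static int bucketOf(String deviceId) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(deviceId.hashCode(), 100);
    }

    /** True if the device falls inside the first `percent` buckets. */
    static boolean isEnabled(String deviceId, int percent) {
        if (percent <= 0) {
            return false;
        }
        if (percent >= 100) {
            return true;
        }
        return bucketOf(deviceId) < percent;
    }
}
```

Because the bucket depends only on the device ID, raising the percentage from 10 to 30 keeps every device that was already enabled and adds new ones, which makes a staged rollout monotonic. A production system would likely mix a per-feature salt into the hash so that different features do not always hit the same devices first.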
Summary
Address issues systematically, expanding from single problems to broader categories.
Identify root causes instead of merely adding try‑catch blocks.
Resolve crashes early in development and testing phases.
Emphasize prevention through code review, module boundaries, and technical sharing.
Control admission of new modules, SDKs, and technologies to reduce risk.
Long‑term crash governance requires continuous effort and a robust mechanism, while still encouraging the adoption of new technologies under proper safeguards.