Long‑Term Client Crash Governance Mechanism at Qunar: Architecture, Detection, and Resolution Strategies

This article describes Qunar's systematic client crash governance framework, covering background challenges, APM‑based fast problem discovery, multi‑level alerting, common‑issue remediation, code‑level fixes for URL and Bundle size crashes, detection tools, code checks, automated testing, and the measurable improvements achieved in Android and iOS stability.

Qunar Tech Salon

Author Introduction

Jiang Baogui, an architect on Qunar's large‑frontend team since 2011, focuses on mobile quality, monitoring system design, and framework upgrades.

Preface

Client crash governance is a systematic solution built on the observability metrics of the underlying infrastructure. It aims to quickly detect, alert on, and locate client crashes, freezes, and errors in order to improve user experience and service availability.

Background

Qunar is an online travel platform with dozens of business lines and frequent resource‑package releases (70‑80 per day). Rapid feature iteration, mixed tech stacks, and diverse user network conditions all make crash reduction critical.

Solution Overview

During the pandemic, the framework team built a long‑term client quality assurance mechanism with three pillars: rapid problem discovery, precise alerting, and systematic remediation.

1. Fast Problem Discovery

APM collects millions of logs daily; the system extracts key stack information, aggregates by BugId, and classifies new vs. known issues.

Different exception types (Android native, SO libraries, iOS, ReactNative) are de‑obfuscated and symbolized for clear localization.

APM integrates with the MPortal build‑pack platform to map libraries, SO resources, iOS pages, and owners, enabling accurate business‑line alarm contacts.

Noise is reduced by filtering out crashes whose stacks contain dynamic‑hook or Xposed keywords.
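The aggregation step can be illustrated with a minimal sketch. It assumes a BugId is a hash over the library name, the exception type, and the first few stack frames; the class name BugIdAggregator, the choice of SHA‑256, and the three‑frame cutoff are illustrative assumptions, not details taken from Qunar's APM.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: group crash logs under a stable BugId. */
final class BugIdAggregator {
    /** Number of leading frames treated as the "key stack" (assumed value). */
    static final int KEY_FRAMES = 3;

    private final Map<String, Integer> countsByBugId = new HashMap<>();

    /** Hashes lib + exception type + top frames so identical crashes share one id. */
    static String bugId(String lib, String exceptionType, List<String> frames) {
        StringBuilder key = new StringBuilder(lib).append('|').append(exceptionType);
        for (String frame : frames.subList(0, Math.min(KEY_FRAMES, frames.size()))) {
            key.append('|').append(frame);
        }
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(key.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 8; i++) { // first 8 bytes give a 16-character id
                hex.append(String.format("%02x", digest[i]));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    /** Records one crash log; returns true when its BugId has not been seen before. */
    boolean record(String lib, String exceptionType, List<String> frames) {
        String id = bugId(lib, exceptionType, frames);
        return countsByBugId.merge(id, 1, Integer::sum) == 1;
    }

    /** Number of logs aggregated under the given BugId so far. */
    int count(String bugId) {
        return countsByBugId.getOrDefault(bugId, 0);
    }
}
```

Because only the leading frames enter the hash, two logs that differ deeper in the stack still land on the same BugId, which is what lets the system separate new issues from known ones.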

APM Architecture

Fine‑grained monitoring extracts lib, exception type, and key stack to generate BugId; if the BugId is new and exceeds a user‑impact threshold, alerts are sent via QTalk or phone.

Coarse‑grained monitoring (Watcher) tracks total crash volume; if today's impact exceeds 150% of the 7‑day average, a warning is issued.

Daily, bi‑weekly, and dashboard reports surface top new crashes for each business line.
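The two alerting rules above can be sketched as simple predicates. The 150% ratio comes from the text; the class name CrashAlertRules, the method names, and the user‑impact threshold parameter are assumptions made for illustration.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the fine-grained and coarse-grained alert rules. */
final class CrashAlertRules {
    /** Coarse rule from the text: warn above 150% of the 7-day average. */
    static final double VOLUME_RATIO = 1.5;

    /** Fine-grained rule: alert only for a new BugId whose user impact exceeds the threshold. */
    static boolean shouldAlertOnBug(String bugId, Set<String> knownBugIds,
                                    int affectedUsers, int userThreshold) {
        return !knownBugIds.contains(bugId) && affectedUsers > userThreshold;
    }

    /** Coarse-grained rule: compare today's crash volume with the recent daily average. */
    static boolean shouldWarnOnVolume(long today, List<Long> previousDays) {
        if (previousDays.isEmpty()) {
            return false; // no baseline yet, so no warning
        }
        double average = previousDays.stream()
                .mapToLong(Long::longValue)
                .average()
                .orElse(0);
        return today > average * VOLUME_RATIO;
    }
}
```

Keeping the two rules separate mirrors the split in the text: the per‑BugId rule catches new problems early, while the volume rule catches a broad regression that is spread across many BugIds.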

2. General Issue Resolution

Patch delivery for hot‑reload capable frameworks (e.g., ReactNative).

Service‑interface compatibility for API changes (high maintenance cost).

Forced upgrades for critical bugs.

Version‑upgrade policy: force an upgrade for users on versions older than two years, offer an optional upgrade for versions between one and two years old, and show minimal prompts for recent releases.
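The version‑upgrade policy can be expressed as a small decision function. This is a sketch under assumptions: version age is measured in whole years since the release date, and the boundary cases (exactly one or two years) are resolved toward the stricter action, which the text does not specify. The names UpgradePolicy and Action are hypothetical.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/** Hypothetical sketch of the age-based upgrade policy. */
final class UpgradePolicy {
    enum Action { FORCE_UPGRADE, OPTIONAL_UPGRADE, MINIMAL_PROMPT }

    /** Chooses an action from the number of complete years since the version was released. */
    static Action decide(LocalDate releaseDate, LocalDate today) {
        long years = ChronoUnit.YEARS.between(releaseDate, today);
        if (years >= 2) {
            return Action.FORCE_UPGRADE;    // two years old or more
        }
        if (years >= 1) {
            return Action.OPTIONAL_UPGRADE; // between one and two years old
        }
        return Action.MINIMAL_PROMPT;       // recent release
    }
}
```

For example, under these assumptions a version released on 2022‑03‑01 and checked on 2023‑06‑01 has one complete year of age and gets an optional upgrade.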

3. Specific Technical Fixes

URL format exception

Problem: malformed URLs (e.g., missing scheme) cause IllegalArgumentException in OkHttp.

Fatal Exception: java.lang.IllegalArgumentException: Expected URL scheme 'http' or 'https' but no colon was found
 okhttp3.HttpUrl$Builder.parse$okhttp (HttpUrl.kt:1260)
 ...

Solution: use Android Transform + ASM to inject URL validation into OkHttp's Builder.url() methods.

// Builder.url(String) as it looks after the Transform + ASM injection
public Builder url(String str) {
    String url = HttpUtils.checkNullUrl(str); // injected: null check
    if (url == null) {
        throw new NullPointerException("url == null");
    }
    // existing scheme handling
    return url(HttpUrl.get(HttpUtils.checkUrl(url))); // injected: format check
}

The HttpUtils utility class returns a placeholder 404 URL when validation fails, so an invalid URL produces a failed request instead of a crash.
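The article does not show the utility class itself, so the following is only a sketch of what such validation might look like. The class name HttpUtilsSketch, the placeholder address, and the use of java.net.URI for parsing are all assumptions; java.net.URI is stricter than OkHttp's own parser in some cases (for example unencoded spaces, or host names containing underscores), so a production version would need to match OkHttp's rules more closely.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the URL checks injected into Builder.url(). */
final class HttpUtilsSketch {
    /** Assumed placeholder; ".invalid" is a reserved TLD that never resolves. */
    static final String PLACEHOLDER_404_URL = "https://placeholder.invalid/404";

    /** Schemes accepted here; ws/wss are included because OkHttp maps them to http/https. */
    private static final Set<String> ALLOWED_SCHEMES = Set.of("http", "https", "ws", "wss");

    /** Replaces a null or blank URL with the placeholder so the caller never sees null. */
    static String checkNullUrl(String url) {
        return (url == null || url.trim().isEmpty()) ? PLACEHOLDER_404_URL : url;
    }

    /** Returns the URL unchanged when it parses with an allowed scheme and a host. */
    static String checkUrl(String url) {
        // Expects a non-null argument; call checkNullUrl first.
        try {
            URI uri = new URI(url.trim());
            String scheme = uri.getScheme();
            if (scheme != null
                    && ALLOWED_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT))
                    && uri.getHost() != null) {
                return url;
            }
        } catch (URISyntaxException ignored) {
            // Fall through to the placeholder for anything unparsable.
        }
        return PLACEHOLDER_404_URL;
    }
}
```

With this sketch, a scheme‑less string such as "www.qunar.com/path" is replaced by the placeholder, while "https://www.qunar.com/path" passes through untouched.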

Bundle size crash (TransactionTooLargeException)

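Problem: a Bundle whose serialized data is too large (over 100 KB in the check used here) during onSaveInstanceState leads to crashes. To find the oversized entries, the following helper logs the total Parcel size of the Bundle and, above the threshold, the size of each item.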

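// Logs the serialized size of a Bundle and, above 100 KB, the size of each entry.
void recodeBundleSize(String activityName, Bundle bundle) {
    Bundle copyBundle = bundle.deepCopy(); // Bundle.deepCopy() requires API 26+
    int totalSize = getParcelSize(copyBundle);
    Log.d("BundleSize", activityName + " totalSize:" + totalSize);
    if (totalSize > 100 * 1024) {
        // Log each entry together with its key so the oversized item can be identified.
        for (String itemKey : copyBundle.keySet().toArray(new String[0])) {
            int itemSize = getParcelSize(copyBundle.get(itemKey));
            Log.d("BundleSize", activityName + " " + itemKey + " itemSize:" + itemSize);
        }
    }
}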

int getParcelSize(Object data) {
    Parcel deepData = Parcel.obtain();
    try {
        deepData.writeValue(data);
        return deepData.dataPosition();
    } finally {
        deepData.recycle();
    }
}

Analysis showed that deep View hierarchy state caused the crash; targeted pages were optimized, reducing this crash to single‑digit occurrences.

4. Detection Tools

Integrated LeakCanary for JS memory leaks, runtime warnings for URL misconfiguration, and custom alerts for missing listeners.

5. Code Checks

Adopted SwiftLint, Sonar, ESLint, etc., to block commits with high‑severity issues; enforced pre‑release gating and gray‑release validation.

6. Automated Testing

Connected build pipelines to the TARS‑UI automated test system for end‑to‑end verification of main flows.

Results

The Android crash rate dropped from 0.15% to ~0.02% and the iOS crash rate from 0.1% to <0.02%, outperforming industry averages. The long‑term mechanism now provides rapid detection, standardized remediation, and continuous quality assurance.

Future Outlook

The team plans to keep enhancing observability, pursue AI‑driven root‑cause analysis, and build a knowledge base that auto‑suggests solutions for recurring issues, further improving development efficiency and user experience.
