
How We Built a Real‑Time Crash Feedback Platform for Mobile Apps

This article details the design and implementation of a comprehensive crash feedback platform for mobile applications, covering the motivation behind replacing third‑party services, the system architecture using Flink, Kafka and HBase, crash interception on Android, automated grouping and assignment, version filtering, daily reporting, and future enhancements.

Youzan Coder

Background & Pain Points

Stability is a key metric for any successful company, especially on mobile. The team initially relied on third-party platforms such as Tencent Bugly, but long-term maintenance issues and functional gaps (missed crashes, no assignment rules, no way to halt a gray release) led it to build its own crash feedback system.

Crash Platform Roadmap

Build the basic infrastructure for crash collection, reporting, classification, viewing, and handling.

Add features missing from third‑party solutions such as real‑time monitoring, alerts, and daily reports.

Complete missing functionalities like crash trend statistics and symbolization.

Core Feature Set

Crash Collection: From occurrence on the device to local storage and eventual upload.

Crash Viewing: Stack trace, version distribution, page path, and operation flow to help assignees locate issues quickly.

Assignee & Status Management: Assign a crash to a specific person, send notifications, and update status after fixing.

Crash Classification: Group crashes by device, version, and error type to identify common code problems.

Alert Detection: Notify when new crashes appear in the latest version or when backend changes cause regressions.

Daily Report: Summarize daily crash counts, trends, and top-N problematic crashes.

Overall Design

Leveraging the company’s data‑tracking infrastructure, crash data is sent through an event‑tracking channel, processed by a Flink real‑time job, grouped, monitored, and finally persisted in the business database. The platform provides browsing, assignment, and resolution workflows.

To avoid storing massive stack traces in MySQL, long fields like the crash stack are stored in HBase; MySQL keeps only the first 128 preview characters and the HBase row key.
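The split between the two stores can be sketched as follows. This is a minimal illustration, assuming a hypothetical `CrashStorageSplit` helper and `MysqlRow` type; the class and field names are not from the platform's schema, and only the 128-character preview length and the stored row key come from the description above.

```java
// Hypothetical helper: class and field names are illustrative, not the platform's schema.
final class CrashStorageSplit {
    // Preview length stated in the article.
    static final int PREVIEW_LENGTH = 128;

    /** The slim record kept in MySQL: a short preview plus a pointer into HBase. */
    static final class MysqlRow {
        final String stackPreview;
        final String hbaseRowKey;

        MysqlRow(String stackPreview, String hbaseRowKey) {
            this.stackPreview = stackPreview;
            this.hbaseRowKey = hbaseRowKey;
        }
    }

    /** Truncates the full stack to a preview; the full text is assumed to live in HBase under hbaseRowKey. */
    static MysqlRow toMysqlRow(String hbaseRowKey, String fullStack) {
        String preview = fullStack.length() <= PREVIEW_LENGTH
                ? fullStack
                : fullStack.substring(0, PREVIEW_LENGTH);
        return new MysqlRow(preview, hbaseRowKey);
    }
}
```

With this shape, list pages can render from MySQL alone, and the full stack is fetched from HBase by row key only when a crash detail page is opened.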

Implementation Details

1. Crash Interception & Reporting

On Android, we replace the default Thread.UncaughtExceptionHandler with a custom handler during initialization, ensuring the original handler is still invoked.
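A minimal sketch of this pattern is shown here, using only `java.lang.Thread` APIs. The `CrashInterceptor` class name and the in-memory `recorded` list are assumptions for illustration; a real SDK would write the crash to local storage for later upload instead.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical handler: the class name and the in-memory list are illustrative only.
final class CrashInterceptor implements Thread.UncaughtExceptionHandler {
    private final Thread.UncaughtExceptionHandler previous;
    // Stand-in for local crash storage.
    final List<String> recorded = new CopyOnWriteArrayList<>();

    private CrashInterceptor(Thread.UncaughtExceptionHandler previous) {
        this.previous = previous;
    }

    /** Installs the interceptor as the default handler and remembers the one it replaces. */
    static CrashInterceptor install() {
        CrashInterceptor handler =
                new CrashInterceptor(Thread.getDefaultUncaughtExceptionHandler());
        Thread.setDefaultUncaughtExceptionHandler(handler);
        return handler;
    }

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        try {
            // Record the crash first, while the process is still alive.
            recorded.add(thread.getName() + ": " + error);
        } finally {
            // Always delegate so the original handler still runs.
            if (previous != null) {
                previous.uncaughtException(thread, error);
            }
        }
    }
}
```

Calling `CrashInterceptor.install()` once during application startup is enough. The `finally` block ensures the original handler is invoked even if recording itself throws.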

2. Crash Collection via Data Pipeline

The data‑tracking platform’s real‑time Flink job ingests raw logs from Kafka, filters crash events, and forwards them to the crash-collection-task for further processing.

// Example Kafka consumer config (sensitive data omitted)
{
  "topic": "topic.log",
  "servers": "****",
  "type": "kafka010",
  "consumerGroup": "crash_collection_task"
}

Only messages where event_type equals "crash" are retained:

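// Assumes fastjson-style JSON/JSONObject and Guava's Objects.equal are on the classpath.
@Override
public boolean filter(String line) {
    try {
        // Parse the raw log line and keep only crash events.
        JSONObject data = JSON.parseObject(line);
        String type = data.getString("event_type");
        return Objects.equal(type, "crash");
    } catch (Exception e) {
        // Malformed lines are logged and dropped instead of failing the job.
        System.out.println(String.format("line:[%s]: parse error: %s", line, e.toString()));
    }
    return false;
}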

2.1 Grouping & Automatic Assignee Allocation

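Crashes are grouped along five dimensions: app identifier, system, crash type, crash reason, and crash page. An MD5 hash of these fields yields the groupId. In the snippet below, four of the fields are hashed and the system appears to be carried by the "Android" prefix: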

private String generateGroupId() {
    String groupKey = MD5Utils.crypt(bundleId + crashType + crashReason + pageType);
    return "Android-v4-" + groupKey;
}
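Since `MD5Utils` is an internal utility, a self-contained version of the same idea can be sketched with the JDK's `MessageDigest`. The `GroupIds` class is hypothetical, and the field delimiter is an assumption added here (it is not in the original code) so that adjacent fields such as ("ab", "c") and ("a", "bc") do not collapse into the same key.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the internal MD5Utils helper.
final class GroupIds {
    /** Returns the lowercase hex MD5 digest of the UTF-8 bytes of input. */
    static String md5Hex(String input) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Builds a group ID from the grouping fields, using the same prefix as the article. */
    static String groupId(String bundleId, String crashType, String crashReason, String pageType) {
        // Assumption: a control-character delimiter keeps field boundaries unambiguous.
        String key = String.join("\u0001", bundleId, crashType, crashReason, pageType);
        return "Android-v4-" + md5Hex(key);
    }
}
```

Keeping a version marker in the prefix means a change to the grouping rule can be rolled out under a new prefix without colliding with IDs generated by the old rule.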

Two automatic assignment strategies are employed:

Configuration‑based assignment: a JSON list maps module identifiers and key stack signatures to a responsible cas_id. When a crash stack contains a configured signature, it is routed to the corresponding owner.

Historical page assignment: if configuration matching fails, the platform falls back to assigning the crash to the person who handled the same page previously.

{
  "modules": [
    {
      "name": "xxxxSDK",
      "key_stacks": ["com.youzan.mobile.xxxx"],
      "cas_id": 10086
    }
  ]
}
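The two strategies can be combined in a single lookup, sketched below. The `AssigneeResolver` and `ModuleRule` names, and the map from page to previous handler, are assumptions for illustration; only the `name`, `key_stacks`, and `cas_id` fields come from the configuration shown above.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical resolver: class names and the history map are illustrative only.
final class AssigneeResolver {
    /** One entry of the "modules" list in the JSON configuration. */
    static final class ModuleRule {
        final String name;
        final List<String> keyStacks;
        final long casId;

        ModuleRule(String name, List<String> keyStacks, long casId) {
            this.name = name;
            this.keyStacks = keyStacks;
            this.casId = casId;
        }
    }

    private final List<ModuleRule> rules;
    // Assumption: page identifier mapped to the cas_id of whoever handled it last.
    private final Map<String, Long> lastHandlerByPage;

    AssigneeResolver(List<ModuleRule> rules, Map<String, Long> lastHandlerByPage) {
        this.rules = rules;
        this.lastHandlerByPage = lastHandlerByPage;
    }

    /** Tries configuration matching first, then falls back to the page's previous handler. */
    OptionalLong resolve(String stackTrace, String page) {
        for (ModuleRule rule : rules) {
            for (String signature : rule.keyStacks) {
                if (stackTrace.contains(signature)) {
                    return OptionalLong.of(rule.casId);
                }
            }
        }
        Long previous = lastHandlerByPage.get(page);
        return previous == null ? OptionalLong.empty() : OptionalLong.of(previous);
    }
}
```

An empty result means neither strategy matched, in which case the crash would presumably be left for manual assignment.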

2.2 Version Filtering & Daily Reporting

Only crashes from the latest full version (or a specified gray release) are forwarded to real‑time notification groups, reducing noise from older stable versions.
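The filter itself is a simple membership check, as this sketch shows. The `NotificationVersionFilter` class is hypothetical, and it assumes the latest full version and the set of gray-release versions are supplied from configuration.

```java
import java.util.Set;

// Hypothetical filter: how the versions are configured is an assumption.
final class NotificationVersionFilter {
    private final String latestFullVersion;
    private final Set<String> grayVersions;

    NotificationVersionFilter(String latestFullVersion, Set<String> grayVersions) {
        this.latestFullVersion = latestFullVersion;
        this.grayVersions = grayVersions;
    }

    /** True when a crash from this app version should reach the real-time notification group. */
    boolean shouldNotify(String crashAppVersion) {
        return latestFullVersion.equals(crashAppVersion)
                || grayVersions.contains(crashAppVersion);
    }
}
```

Crashes from other versions are still collected and grouped; they are only kept out of the real-time notification channel.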

Daily reports contain two main sections:

Crash Trend: compares yesterday’s and today’s crash counts; a sharp increase flags potential regressions.

Top-N Crashes: ranks crashes by impact. After experimentation, the team settled on showing the top 3 to avoid focus dilution.
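Both report sections reduce to small computations over per-group counts, sketched below. The article does not define "impact" or "sharp increase", so this sketch assumes impact means crash count and takes the increase threshold as a parameter. The `DailyReport` class and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical report helper: the impact metric and threshold are assumptions.
final class DailyReport {
    /** True when today's count grew by at least the given ratio over yesterday's. */
    static boolean isSharpIncrease(long yesterday, long today, double thresholdRatio) {
        if (yesterday == 0) {
            // No baseline: treat any crash today as an increase.
            return today > 0;
        }
        return (double) (today - yesterday) / yesterday >= thresholdRatio;
    }

    /** Returns up to n group IDs ordered by crash count, highest first. */
    static List<String> topGroups(Map<String, Long> countByGroupId, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(countByGroupId.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            // Break ties by group ID so the report order is stable.
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (Map.Entry<String, Long> entry : entries.subList(0, Math.min(n, entries.size()))) {
            top.add(entry.getKey());
        }
        return top;
    }
}
```

Calling `topGroups(counts, 3)` corresponds to the top-3 cutoff the team settled on.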

Report generation originally ran on Spring Boot's built-in scheduler and was later moved to the internal TSP scheduling system, which is easier to debug and supports flexible intervals.

3. Management Backend Features

The backend UI supports quick issue location by displaying recent report info, affected system versions, and affected app versions directly in the crash list.

Example: after a code change in version 4.47.0, a hidden crash resurfaced; the “affected app version” field immediately identified the problematic version.

Additional dimensions such as crash page path, full stack trace, and symbolization are provided to aid debugging.

4. Offline Log Integration

Future work includes binding device logs captured at crash time to the crash record, allowing developers to view the relevant log alongside the stack trace.

Conclusion

The crash feedback platform currently lacks a closed-loop status workflow, and improvements are planned for more accurate grouping, Android stack symbolization, and persistent reminders for stubborn crashes. The system spans multiple technology stacks: big data (Flink, Kafka), backend services, frontend UI, and mobile SDKs. Building such a platform therefore requires holistic thinking that goes beyond any single technical solution.
