
How the Haishen Platform Detects and Resolves iOS Crashes in Real Time

This article explains the design and implementation of the Haishen crash monitoring platform for iOS, covering its system architecture, data collection, parsing, aggregation, routing, SDK features, exception handling, stack capture, startup crash detection, and upload mechanisms to quickly expose, locate, and fix crashes.

Beike Product & Technology

Introduction

Crash rate is a core metric of client quality and directly affects user experience and retention. Exposing, locating and fixing crashes quickly is therefore a top priority for mobile development teams.

Project Background

Before Haishen, Beike's product lines relied on third-party services such as Google Fabric and Tencent Bugly. With those services, dSYM uploads were cumbersome, alerts arrived late, bug ownership was unclear and historical issues piled up, which hurt both team efficiency and product quality. Relying on external platforms also exposed internal information and could not accommodate custom requirements.

To solve these problems the team defined the required capabilities for the Haishen crash monitoring module:

Route crashes to the responsible business line

Automatic alerting

Collect auxiliary data such as custom events, device info and system logs

Report custom exceptions and errors that are not crashes

Defect management

System Design

The overall architecture is divided into several functional groups.

System architecture diagram

Key components include:

Client side: crash capture, stack collection, system logs, custom events, device info, storage and upload

CI platform: dSYM files, component metadata, routing tables and business‑line information

Backend file system: business‑line data, whitelist aggregation, crash file storage and core libraries

Backend data processing: high‑performance APIs, queues, resource abstraction, processing, alerting, aggregation, routing, assignment and notification

Shared services: keOnce platform, big‑data platform and employee management platform

Data Collection and Configuration

The CI platform supplies dSYM files, symbols, libraries and routing tables linking binaries to business lines.

Haishen’s configuration module provides per‑business‑line crash aggregation strategies and whitelists.

The client embeds LJBaseCrashReporter, which gathers crash stacks, custom events, context and device info, then uploads them to the backend file system.

Parsing and Aggregation

For each binary, the CI platform records its UUID and generates offset and symbol tables. When the Crash SDK reports a crash, each stack frame carries the UUID of its binary, that binary's base (load) address and the frame's target address. Haishen matches these against the symbol tables for that UUID, applies whitelist rules and aggregation policies, and groups similar crashes together.
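As a rough illustration of the grouping step, the C++ sketch below builds a fingerprint from the top symbolicated frames of a stack. It is an assumption, not Haishen's actual aggregation policy: the `Frame` type, the `crash_fingerprint` and `crash_group_id` helpers, the default depth of three frames and the module whitelist are all hypothetical. What it shows is that raw addresses are left out of the key, so the same bug reported from different devices and launches lands in the same group.

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <vector>

// A symbolicated frame: the binary (module) it belongs to and the resolved symbol.
struct Frame {
    std::string module;  // e.g. "HomeModule" (illustrative)
    std::string symbol;  // e.g. "-[HomeViewController reload]"
};

// Hypothetical policy: skip frames from whitelisted (ignored) modules, then
// build a key from the first `depth` remaining frames. Addresses are excluded
// on purpose, because they differ between devices and launches.
std::string crash_fingerprint(const std::vector<Frame>& stack,
                              const std::set<std::string>& ignored_modules,
                              std::size_t depth = 3) {
    std::string key;
    std::size_t used = 0;
    for (const Frame& f : stack) {
        if (used == depth) break;
        if (ignored_modules.count(f.module) != 0) continue;
        key += f.module + "!" + f.symbol + "\n";
        ++used;
    }
    return key;
}

// Crashes with equal fingerprints share a group id.
std::size_t crash_group_id(const std::vector<Frame>& stack,
                           const std::set<std::string>& ignored_modules) {
    return std::hash<std::string>{}(crash_fingerprint(stack, ignored_modules));
}
```

Crashes whose stacks differ only in leading system frames end up in the same group. Raising `depth` makes groups more precise but risks splitting one bug into several groups when inlining or small code changes alter deeper frames.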

Leveraging Google’s “Error‑Prone” (EP) concept, the system can generate regression tests and automated test cases from raw crash logs combined with contextual and event data.

Routing

The routing subsystem automatically assigns a crash to the appropriate business line based on the aggregated stack and notifies the responsible owners via the alert system, enabling rapid triage.
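A minimal routing rule can be sketched under the assumption that the CI routing table maps each binary or module name to a business line, and that the first non-system frame identifies the owner. `RoutingTable`, `route_crash`, the team names and the "unassigned" fallback are illustrative, not Haishen's API.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical routing table from the CI platform: module name -> business line.
using RoutingTable = std::map<std::string, std::string>;

// Route a crash to the owner of the first frame that belongs to a known,
// non-ignored module. System frames usually sit at the top of a crash stack,
// so they are skipped rather than blamed.
std::string route_crash(const std::vector<std::string>& frame_modules,
                        const RoutingTable& table,
                        const std::set<std::string>& ignored_modules) {
    for (const std::string& module : frame_modules) {
        if (ignored_modules.count(module) != 0) continue;
        auto it = table.find(module);
        if (it != table.end()) return it->second;
    }
    return "unassigned";  // no known owner: fall back to manual triage
}
```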

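Because iOS uses ASLR, each binary is loaded at a randomized base address on every launch, so address mapping relies on the binary's UUID and the offset from its load address rather than on absolute addresses.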
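The UUID-plus-offset idea can be sketched as follows. The UUID selects which binary's symbol table to use (this sketch receives that table directly). The offset, computed as the runtime address minus the binary's load address, is identical on every launch regardless of the ASLR slide. `SymbolTable` and `symbolicate` are hypothetical names, and the toy table stands in for data extracted from a dSYM.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Symbol table for one binary (one UUID): function start offset -> name.
using SymbolTable = std::map<std::uint64_t, std::string>;

// Convert a runtime address to an offset inside its image, then pick the
// function with the greatest start offset that does not exceed it.
std::string symbolicate(std::uint64_t address, std::uint64_t load_address,
                        const SymbolTable& symbols) {
    if (address < load_address) return "???";
    const std::uint64_t offset = address - load_address;  // independent of the ASLR slide
    auto it = symbols.upper_bound(offset);  // first symbol that starts after the offset
    if (it == symbols.begin()) return "???";
    --it;                                   // the symbol that contains the offset
    return it->second + " + " + std::to_string(offset - it->first);
}
```

Two crashes of the same build that were loaded at different base addresses resolve to the same symbol, which is what makes offsets usable across devices. Production symbolicators typically also subtract 1 from the return addresses of non-top frames so that the lookup lands inside the call instruction.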

Web Dashboard

The Haishen web UI offers multi‑dimensional queries by version, time, business line, crash type and custom exceptions. Detailed views include business line, crash count, device and user info, event data, stack traces and system logs.

Dashboard overview
Dashboard detail

Client SDK Design

Overview

The SDK provides the following capabilities:

Debug panel with toggles for exception capture, local crash query and manual log upload.

Based on the open‑source KSCrash library, it reports C++ exceptions, zombies, Mach exceptions, NSException, custom user exceptions and signals.

Collects auxiliary data such as device info, custom event queries, system logs and network logs.

Allows custom agents to inject additional information or modify upload behavior.

Supports custom exception and error reporting.

Detects crashes that occur during app startup and uploads crash data synchronously.

Subspec Design for Core Components

The SDK is split into CocoaPods subspecs, which isolate the core functionality and let each business line integrate only what it needs at low cost. The Test subspec is used only in automated tests, to verify stability under multithreaded stress.

Subspec diagram

Core Architecture

During registration, the SDK registers the crash types to monitor and any external delegates. When a crash occurs, the SDK writes the report to a file and invokes each registered delegate's entry point. In the upload phase, delegates can add, review or transform information before it is transmitted.

Key protocols expose data to external callers, enabling plug‑in style insertion of auxiliary information.

Core architecture diagram

Important extension points:

Add a ConfigSetting that implements ExtroInfo to attach custom data at crash time (heavy runtime work in this callback is discouraged, because the crashed process is in a suspended state).

Add an UploadSetting that implements UploadEmbarkation to inspect and modify a report before it is uploaded.

Custom exception reporting triggers automatically when an NSException is created.

Custom error reporting supports cross‑platform language exceptions and manual event reporting by business lines.
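The registration, crash and upload flow above can be approximated with a small plug-in registry. This is a C++ analogue written for illustration only: the real SDK is Objective-C and exposes ExtroInfo and UploadEmbarkation through protocols, whereas `CrashPluginRegistry`, `Report` and the method names here are hypothetical. A real crash-time path would also write into preallocated memory instead of a `std::map`, which allocates; the map only keeps the sketch short.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// A crash report as a flat key/value map (a simplification for this sketch).
using Report = std::map<std::string, std::string>;

// Hypothetical registry mirroring the two extension points described above.
class CrashPluginRegistry {
public:
    // Crash-time provider (ExtroInfo analogue): must be quick, since it runs
    // inside a crashed process.
    using ExtraInfoProvider = std::function<void(Report&)>;
    // Pre-upload hook (UploadEmbarkation analogue): may edit the report or
    // veto the upload by returning false.
    using UploadHook = std::function<bool(Report&)>;

    void add_extra_info_provider(ExtraInfoProvider p) { providers_.push_back(std::move(p)); }
    void add_upload_hook(UploadHook h) { hooks_.push_back(std::move(h)); }

    // Called by the crash handler before the report is written to disk.
    void collect(Report& report) const {
        for (const auto& p : providers_) p(report);
    }

    // Called before transmission; returns false if any hook vetoes the upload.
    bool prepare_upload(Report& report) const {
        for (const auto& h : hooks_) {
            if (!h(report)) return false;
        }
        return true;
    }

private:
    std::vector<ExtraInfoProvider> providers_;
    std::vector<UploadHook> hooks_;
};
```

Keeping crash-time providers and pre-upload hooks separate matches the split described above: crash time only records data, while review and redaction happen later in a healthy process.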

Exception Capture

NSException

Save the previously installed handler with NSGetUncaughtExceptionHandler(), then install a custom handler function with NSSetUncaughtExceptionHandler(). The custom handler must forward the exception to the saved handler so that other listeners keep working.

C++ Exceptions

Install a custom handler with std::set_terminate(), keeping the previous std::terminate_handler and forwarding to it. When no custom handlers have been installed with set_terminate or set_unexpected, an uncaught exception reaches std::terminate(), whose default handler calls abort().
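A minimal sketch of the chaining pattern for C++ exceptions, assuming a C++11-or-later toolchain. `install_terminate_handler`, `crash_terminate_handler` and `describe_exception` are hypothetical names, and a real SDK would write the description into its crash report file rather than to stderr. In mainstream implementations (libstdc++ and libc++abi), a terminate handler entered because of an uncaught throw can still reach the in-flight exception through `std::current_exception()`, so rethrowing it is a portable way to recover its type and message.

```cpp
#include <cstdio>
#include <cstdlib>
#include <exception>
#include <string>

namespace {
std::terminate_handler g_previous_terminate = nullptr;  // handler installed before ours
}  // namespace

// Best-effort description of an exception_ptr via rethrow-and-catch.
std::string describe_exception(std::exception_ptr ep) {
    if (!ep) return "no active exception";
    try {
        std::rethrow_exception(ep);
    } catch (const std::exception& e) {
        return std::string("std::exception: ") + e.what();
    } catch (...) {
        return "non-std exception";
    }
}

// Record what we can, then chain to the previous handler so other reporters
// (or the default handler, which calls abort()) still run.
[[noreturn]] void crash_terminate_handler() {
    const std::string what = describe_exception(std::current_exception());
    std::fprintf(stderr, "uncaught C++ exception: %s\n", what.c_str());
    if (g_previous_terminate != nullptr) g_previous_terminate();
    std::abort();  // a terminate handler must not return
}

// Install once, remembering the previous handler.
void install_terminate_handler() {
    std::terminate_handler previous = std::set_terminate(crash_terminate_handler);
    if (previous != crash_terminate_handler) g_previous_terminate = previous;
}
```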

Reference: Itanium C++ ABI – Exception Handling (https://refspecs.linuxfoundation.org/abi-eh-1.22.html)

Mach Exceptions

Mach provides low-level exception ports in the kernel. The SDK saves the task's existing exception ports with task_get_exception_ports(), allocates a new port and registers it with task_set_exception_ports(). A dedicated thread then waits for exception messages on that port; when one arrives, it captures the crashed thread's state, collects process information and then exits gracefully.

Signal Exceptions

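Mach exceptions that are not handled are translated into UNIX signals in the BSD layer (for example, EXC_BAD_ACCESS becomes SIGSEGV or SIGBUS). Some signals, such as SIGKILL and SIGSTOP, cannot be caught. For catchable signals, install a handler with sigaction(), setting sa_sigaction together with the SA_SIGINFO flag. Handle the signal much like a Mach exception, then restore the previous handler and call raise() to re-deliver the signal so the process terminates cleanly.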
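The signal path can be sketched with POSIX `sigaction()`. The signal list, the `install_signal_handlers` and `fatal_signal_handler` names and the one-line report are assumptions for illustration. A real handler would write a full report, but it is likewise limited to async-signal-safe calls such as `write()`. The handler restores the previous handlers before calling `raise()`: because the signal is blocked while its handler runs, the re-raised signal is delivered as soon as the handler returns, now to the previous disposition. The process therefore terminates with the original signal, and any earlier handler still runs.

```cpp
#include <cstring>
#include <signal.h>
#include <unistd.h>

namespace {
// Catchable fatal signals; SIGKILL and SIGSTOP cannot be handled at all.
const int kFatalSignals[] = {SIGABRT, SIGBUS, SIGFPE, SIGILL, SIGSEGV, SIGTRAP};
constexpr int kSignalCount = sizeof(kFatalSignals) / sizeof(kFatalSignals[0]);
struct sigaction g_previous[kSignalCount];  // handlers that were installed before ours
int g_report_fd = STDERR_FILENO;            // hypothetical report destination
}  // namespace

// Runs in signal context: only async-signal-safe calls (write, sigaction, raise).
void fatal_signal_handler(int sig, siginfo_t* info, void* context) {
    (void)info;     // info->si_addr holds the faulting address for SIGSEGV/SIGBUS
    (void)context;  // the ucontext holds the crashed thread's registers
    static const char kMessage[] = "fatal signal captured\n";
    ssize_t written = write(g_report_fd, kMessage, sizeof(kMessage) - 1);
    (void)written;
    for (int i = 0; i < kSignalCount; ++i) {
        sigaction(kFatalSignals[i], &g_previous[i], nullptr);  // restore previous handlers
    }
    raise(sig);  // delivered after we return, to the restored disposition
}

// Call once at startup; installing twice would record our own handler as "previous".
bool install_signal_handlers(int report_fd) {
    g_report_fd = report_fd;
    struct sigaction action;
    std::memset(&action, 0, sizeof(action));
    sigemptyset(&action.sa_mask);
    action.sa_sigaction = fatal_signal_handler;
    // SA_ONSTACK uses an alternate stack if one was set up with sigaltstack().
    action.sa_flags = SA_SIGINFO | SA_ONSTACK;
    for (int i = 0; i < kSignalCount; ++i) {
        if (sigaction(kFatalSignals[i], &action, &g_previous[i]) != 0) return false;
    }
    return true;
}
```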

Exception handling flow

Stack Capture

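Each active function occupies a contiguous region of stack memory, its stack frame. On ARM64, the frame pointer (fp, register x29) points to the current frame record, which stores the caller's frame pointer and the return address, while the stack pointer (sp) points to the top of the stack. By following the chain of saved frame pointers, the SDK reconstructs the full call stack and attaches the UUID and offset of the library that each address belongs to.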
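The frame-chain walk can be sketched without platform APIs by treating the stack as a linked list of frame records. Here the starting `pc` and `fp` and the stack bounds are passed in; on iOS they would come from the crashed thread's saved register state (for example via `thread_get_state()`), which is an assumption about the SDK's internals. `FrameRecord` and `walk_frames` are hypothetical names. On arm64e devices, return addresses additionally carry pointer-authentication bits that must be stripped before symbolication.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// ARM64 frame record: fp (x29) points at {caller's fp, return address}.
struct FrameRecord {
    const FrameRecord* previous;    // caller's frame pointer
    std::uintptr_t return_address;  // saved lr: where the caller resumes
};

// Walk the frame-pointer chain from a captured pc/fp. [stack_low, stack_high)
// are the thread's stack bounds; each pointer is validated before it is
// dereferenced, since a crashed thread's stack may be corrupt.
std::vector<std::uintptr_t> walk_frames(std::uintptr_t pc, std::uintptr_t fp,
                                        std::uintptr_t stack_low, std::uintptr_t stack_high,
                                        std::size_t max_frames = 128) {
    std::vector<std::uintptr_t> frames{pc};  // frame 0: the crashing instruction
    while (frames.size() < max_frames) {
        if (fp < stack_low || fp + sizeof(FrameRecord) > stack_high ||
            fp % alignof(FrameRecord) != 0) {
            break;  // pointer left the stack or is misaligned
        }
        const FrameRecord* record = reinterpret_cast<const FrameRecord*>(fp);
        if (record->return_address == 0) break;  // outermost frame reached
        frames.push_back(record->return_address);
        const std::uintptr_t next = reinterpret_cast<std::uintptr_t>(record->previous);
        if (next <= fp) break;  // stacks grow down: callers live at higher addresses
        fp = next;
    }
    return frames;
}
```

The bounds and monotonicity checks matter because a walker that follows a garbage pointer would crash inside the crash handler itself.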

Stack frame diagram

Startup Crash Detection

To catch crashes that occur during app launch, the monitoring SDK is shipped as an embedded framework (a dynamic library). Since iOS 8, apps can embed their own dynamic frameworks, which can also be shared between an app and its extensions. The static and dynamic loading flows are illustrated below.

Static library loading
Dynamic library loading

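In Beike's case, a dynamic library named LJShellLaunch that contains the crash SDK is added as a launch dependency of the app. At startup, dyld loads LJShellLaunch and runs its initializers, including Objective-C runtime class setup, before those of the main executable, and only then transfers control to main(). The crash handlers are therefore already installed when the app's own initialization code runs.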
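The effect of loading the SDK from a launch-dependency dylib can be sketched with a load-time constructor. `__attribute__((constructor))` is a GCC/Clang extension that runs a function when its image is initialized; placed in a dylib such as LJShellLaunch, it would run while dyld initializes that dependency, before the app's own initializers and `main()`. The installer and flag below are hypothetical stand-ins for the SDK's real setup, and the source does not say whether LJShellLaunch uses a C constructor or an Objective-C `+load` method.

```cpp
#include <atomic>

// Hypothetical flag standing in for "crash handlers are installed".
// Constant-initialized, so it is set up before any constructor runs.
std::atomic<bool> g_crash_handlers_installed{false};

// Stand-in for installing the signal, Mach, NSException and C++ handlers.
void install_all_crash_handlers() {
    g_crash_handlers_installed.store(true);
}

// Runs when the image containing it is initialized, i.e. before main().
__attribute__((constructor)) static void crash_sdk_early_init() {
    install_all_crash_handlers();
}
```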

Launch dependency diagram

Synchronous Upload and Retry

On iOS 12 and later, crash reports are uploaded immediately after a crash through an NSURLSession background session. On older iOS versions, the open-source cURL library handles the network transfer.

Retry logic runs on every app launch and on each foreground/background transition, calling the upload API so that any reports missed earlier are eventually sent.
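A minimal sketch of such a retry pass, assuming each stored report has an id and that a failed upload leaves the report in place for the next trigger. `PendingReports`, `retry_all` and the attempt limit are hypothetical; the source does not say whether Haishen caps retries, and the real SDK keeps reports as files rather than in memory.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Hypothetical pending-report store: report id -> failed attempt count.
class PendingReports {
public:
    explicit PendingReports(int max_attempts) : max_attempts_(max_attempts) {}

    void add(const std::string& report_id) { attempts_.emplace(report_id, 0); }

    // Call on launch and on foreground/background transitions. `upload`
    // returns true once the server has acknowledged the report.
    // Returns the number of reports delivered in this pass.
    int retry_all(const std::function<bool(const std::string&)>& upload) {
        int delivered = 0;
        for (auto it = attempts_.begin(); it != attempts_.end();) {
            if (upload(it->first)) {
                it = attempts_.erase(it);  // delivered: drop the local copy
                ++delivered;
            } else if (++it->second >= max_attempts_) {
                it = attempts_.erase(it);  // assumed cap so a bad report cannot retry forever
            } else {
                ++it;                      // keep it for the next trigger
            }
        }
        return delivered;
    }

    std::size_t pending() const { return attempts_.size(); }

private:
    int max_attempts_;
    std::map<std::string, int> attempts_;
};
```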

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
