How the Haishen Platform Detects and Resolves iOS Crashes in Real Time
This article explains the design and implementation of the Haishen crash monitoring platform for iOS, covering its system architecture, data collection, parsing, aggregation, routing, SDK features, exception handling, stack capture, startup crash detection, and upload mechanisms to quickly expose, locate, and fix crashes.
Introduction
Crash rate is a core metric for client quality, directly affecting user experience and retention. Detecting, locating and fixing crashes quickly is therefore a top priority for mobile development teams.
Project Background
Before Haishen, the Beike product line used third‑party services such as Google Fabric and Tencent Bugly. Those services suffered from difficult dSYM uploads, delayed alerts, unclear bug assignment and accumulation of historical issues, which reduced efficiency and hampered product quality. Using external platforms also exposed internal information and could not satisfy custom requirements.
To solve these problems the team defined the required capabilities for the Haishen crash monitoring module:
Route crashes to the responsible business line
Automatic alerting
Collect auxiliary data such as custom events, device info and system logs
Report custom exceptions and errors that are not crashes
Defect management
System Design
The overall architecture is divided into several functional groups.
Key components include:
Client side: crash capture, stack collection, system logs, custom events, device info, storage and upload
CI platform: dSYM files, component metadata, routing tables and business‑line information
Backend file system: business‑line data, whitelist aggregation, crash file storage and core libraries
Backend data processing: high‑performance APIs, queues, resource abstraction, processing, alerting, aggregation, routing, assignment and notification
Shared services: keOnce platform, big‑data platform and employee management platform
Data Collection and Configuration
The CI platform supplies dSYM files, symbols, libraries and routing tables linking binaries to business lines.
Haishen’s configuration module provides per‑business‑line crash aggregation strategies and whitelists.
The client embeds LJBaseCrashReporter, which gathers crash stacks, custom events, context and device info, then uploads them to the file system.
Parsing and Aggregation
The CI platform generates offset tables, symbol tables and UUIDs for each binary. When the Crash SDK reports a crash, the stack contains UUIDs, base addresses and target addresses. Haishen matches these against the symbol tables, applies whitelist rules and aggregation policies, and groups similar crashes together.
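The grouping step can be sketched as follows. This is a minimal illustration, not Haishen's actual policy: it assumes the whitelist names binaries whose frames are skipped when building a grouping key, and that the key is formed from the first few remaining frames. The names SymbolicatedFrame, AggregationPolicy, crash_signature and group_crashes are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// One symbolicated frame: the binary it belongs to and the resolved symbol.
struct SymbolicatedFrame {
    std::string binary;
    std::string symbol;
};

// Hypothetical per-business-line policy (an assumption for this sketch).
struct AggregationPolicy {
    std::set<std::string> ignored_binaries;  // assumed meaning of "whitelist"
    std::size_t key_depth = 3;               // frames used for the grouping key
};

// Build a grouping key from the first `key_depth` frames that are not ignored.
std::string crash_signature(const std::vector<SymbolicatedFrame>& stack,
                            const AggregationPolicy& policy) {
    assert(policy.key_depth > 0);
    std::string key;
    std::size_t used = 0;
    for (const SymbolicatedFrame& frame : stack) {
        if (policy.ignored_binaries.count(frame.binary) != 0) continue;
        if (used > 0) key += '|';
        key += frame.binary + "!" + frame.symbol;
        if (++used == policy.key_depth) break;
    }
    return key;
}

// Group crash indices by signature so similar crashes share one bucket.
std::map<std::string, std::vector<std::size_t>> group_crashes(
    const std::vector<std::vector<SymbolicatedFrame>>& crashes,
    const AggregationPolicy& policy) {
    std::map<std::string, std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < crashes.size(); ++i) {
        groups[crash_signature(crashes[i], policy)].push_back(i);
    }
    return groups;
}
```

With this shape, two crashes that differ only in skipped system frames fall into the same group, which is the effect the whitelist and aggregation policy are described as having.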
Leveraging Google’s “Error‑Prone” (EP) concept, the system can generate regression tests and automated test cases from raw crash logs combined with contextual and event data.
Routing
The routing subsystem automatically assigns a crash to the appropriate business line based on the aggregated stack and notifies the responsible owners via the alert system, enabling rapid triage.
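As a rough sketch of the routing lookup, assuming the routing table maps a binary (component) name to its owning business line and that the first matching frame decides ownership. RoutingTable, route_crash and the fallback owner are hypothetical names, not part of Haishen.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical routing table: binary (component) name -> business line.
using RoutingTable = std::map<std::string, std::string>;

// Return the business line of the first frame whose binary appears in the
// table, or `fallback` when no frame matches.
std::string route_crash(const std::vector<std::string>& frame_binaries,
                        const RoutingTable& table,
                        const std::string& fallback) {
    assert(!fallback.empty());
    for (const std::string& binary : frame_binaries) {
        RoutingTable::const_iterator it = table.find(binary);
        if (it != table.end()) return it->second;
    }
    return fallback;
}
```

Walking from the top of the stack means system frames that have no table entry are passed over, and the crash goes to whoever owns the first application component on the stack.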
Because iOS uses ASLR, address mapping relies on UUIDs and offsets rather than absolute addresses.
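The offset-based lookup can be illustrated with a short sketch. It assumes each symbol table is keyed by the binary's UUID and stores symbol start offsets relative to the image base, sorted ascending. Symbol sizes are not modeled, so an offset past the last symbol resolves to that last symbol. The types and the symbolicate function are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <vector>

struct Symbol {
    std::uint64_t start_offset;  // offset of the symbol from the image base
    std::string name;
};

// Symbol tables keyed by binary UUID; each table is sorted by start_offset.
using SymbolTables = std::map<std::string, std::vector<Symbol>>;

// Resolve a runtime address to a symbol, or nullptr when nothing matches.
const Symbol* symbolicate(const SymbolTables& tables,
                          const std::string& uuid,
                          std::uint64_t load_address,
                          std::uint64_t target_address) {
    SymbolTables::const_iterator table = tables.find(uuid);
    if (table == tables.end() || target_address < load_address) return nullptr;

    // Subtracting the reported base cancels the random ASLR slide.
    const std::uint64_t offset = target_address - load_address;
    const std::vector<Symbol>& symbols = table->second;
    assert(std::is_sorted(symbols.begin(), symbols.end(),
                          [](const Symbol& a, const Symbol& b) {
                              return a.start_offset < b.start_offset;
                          }));

    // Find the first symbol that starts beyond the offset; the symbol just
    // before it is the one containing the offset.
    std::vector<Symbol>::const_iterator it = std::upper_bound(
        symbols.begin(), symbols.end(), offset,
        [](std::uint64_t off, const Symbol& s) { return off < s.start_offset; });
    if (it == symbols.begin()) return nullptr;
    return &*std::prev(it);
}
```

Two launches of the same build load the binary at different base addresses, yet the same crash site yields the same offset and therefore the same symbol.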
Web Dashboard
The Haishen web UI offers multi‑dimensional queries by version, time, business line, crash type and custom exceptions. Detailed views include business line, crash count, device and user info, event data, stack traces and system logs.
Client SDK Design
Overview
The SDK provides the following capabilities:
Debug panel with toggles for exception capture, local crash query and manual log upload.
Based on the open‑source KSCrash library, it reports C++ exceptions, zombies, Mach exceptions, NSException, custom user exceptions and signals.
Collects auxiliary data such as device info, custom event queries, system logs and network logs.
Allows custom agents to inject additional information or modify upload behavior.
Supports custom exception and error reporting.
Detects startup crashes and uploads crash data synchronously.
Subspec Design for Core Components
A subspec layout isolates core functionality while keeping integration cost low for specific business lines. The Test subspec is used only in automated tests to verify stability under multithreaded stress.
Core Architecture
During registration the SDK registers crash types and external delegates. When a crash occurs the SDK writes a file and invokes the registered delegate’s entry point. In the upload phase delegates can add, review or transform information before transmission.
Key protocols expose data to external callers, enabling plug‑in style insertion of auxiliary information.
Important extension points:
Add ConfigSetting to implement ExtroInfo, which attaches custom data at crash time (doing runtime work at that point is discouraged because the crashed process is suspended).
Add UploadSetting to implement UploadEmbarkation for pre‑upload inspection and modification.
Custom exception reporting triggers automatically when an NSException is created.
Custom error reporting supports cross‑platform language exceptions and manual event reporting by business lines.
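The plug-in style of these extension points can be sketched in C++ as two kinds of hooks. This is only an illustration of the pattern: ExtraInfoProvider and UploadInterceptor are hypothetical stand-ins for ExtroInfo-style and UploadEmbarkation-style delegates, and the assumption that an upload hook can veto a report is ours, not the SDK's documented behavior.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Report = std::map<std::string, std::string>;

// Stand-in for an ExtroInfo-style hook: contributes extra key/value pairs.
using ExtraInfoProvider = std::function<void(Report&)>;
// Stand-in for an UploadEmbarkation-style hook: may edit the report and
// returns false to veto the upload (assumed behavior).
using UploadInterceptor = std::function<bool(Report&)>;

class ReportPipeline {
public:
    void add_provider(ExtraInfoProvider provider) {
        assert(provider != nullptr);  // reject empty hooks
        providers_.push_back(std::move(provider));
    }
    void add_interceptor(UploadInterceptor interceptor) {
        assert(interceptor != nullptr);
        interceptors_.push_back(std::move(interceptor));
    }

    // Runs every hook in registration order; returns true when the report
    // should still be uploaded.
    bool prepare(Report& report) const {
        for (const ExtraInfoProvider& provider : providers_) provider(report);
        for (const UploadInterceptor& interceptor : interceptors_) {
            if (!interceptor(report)) return false;
        }
        return true;
    }

private:
    std::vector<ExtraInfoProvider> providers_;
    std::vector<UploadInterceptor> interceptors_;
};
```

Keeping the hooks as plain callables means a business line can attach its own fields or scrub sensitive ones without the core SDK knowing anything about them.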
Exception Capture
NSException
Save the existing handler returned by NSGetUncaughtExceptionHandler(), then install a custom handler function via NSSetUncaughtExceptionHandler(). The custom handler must forward the exception to the saved handler so that other listeners keep working.
C++ Exceptions
Install a custom handler with std::set_terminate() and forward to the previous std::terminate_handler that the call returns. When no custom handler has been installed through set_terminate() or set_unexpected(), an uncaught exception ends in terminate(), whose default handler calls abort().
Reference: Itanium C++ ABI – Exception Handling (https://refspecs.linuxfoundation.org/abi-eh-1.22.html)
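A minimal sketch of chaining a terminate handler with the standard library calls named above. The function record_cpp_crash is a hypothetical placeholder for whatever the SDK does to persist the report, and the guard against double installation is an addition for this example.

```cpp
#include <cassert>
#include <cstdlib>
#include <exception>

namespace {

// Handler that was active before ours; forwarded to after recording.
std::terminate_handler g_previous_handler = nullptr;

// Hypothetical hook where a crash SDK would persist its report.
void record_cpp_crash() noexcept {}

void crash_terminate_handler() {
    record_cpp_crash();
    if (g_previous_handler != nullptr) {
        g_previous_handler();  // forward so earlier listeners still run
    }
    std::abort();  // a terminate handler must not return
}

}  // namespace

void install_terminate_handler() {
    // Installing twice would make the handler forward to itself.
    if (std::get_terminate() == crash_terminate_handler) return;
    g_previous_handler = std::set_terminate(crash_terminate_handler);
    assert(g_previous_handler != crash_terminate_handler);
}
```

std::set_terminate returns the handler it replaces, which is what makes the forwarding step possible without any extra bookkeeping.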
Mach Exceptions
Mach provides low‑level kernel exception ports. By reading a task's current exception ports with task_get_exception_ports(), allocating a new port and installing it with task_set_exception_ports(), the SDK can run a dedicated thread that waits for exception messages, captures thread state and process information, and then lets the process exit.
Signal Exceptions
In the BSD layer, Mach exceptions are translated into UNIX signals. Some signals (e.g., SIGKILL, SIGSTOP) cannot be caught. For catchable signals, install a handler through sigaction() using the sa_sigaction field, process the signal in the same way as a Mach exception, and finally restore the previous handler and call raise() so the process terminates with the original signal.
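The install, record and re-raise sequence might look like the sketch below, written against POSIX sigaction(). It tracks a single signal for brevity and only stores the signal number where a real SDK would write its report. The names install_signal_handler, on_fatal_signal and g_last_signal are hypothetical.

```cpp
#include <cassert>
#include <csignal>
#include <signal.h>

namespace {

struct sigaction g_previous_action;       // disposition before ours was installed
volatile sig_atomic_t g_last_signal = 0;  // async-signal-safe record of the signal

void on_fatal_signal(int signo, siginfo_t* /*info*/, void* /*ucontext*/) {
    g_last_signal = signo;  // a real SDK would write its crash report here
    // Put the previous disposition back and re-raise, so the process ends
    // (or the earlier handler runs) exactly as it would have without us.
    sigaction(signo, &g_previous_action, nullptr);
    raise(signo);
}

}  // namespace

// Installs the handler for one signal; returns false if sigaction() fails.
bool install_signal_handler(int signo) {
    assert(signo > 0);
    struct sigaction action = {};
    action.sa_sigaction = on_fatal_signal;
    action.sa_flags = SA_SIGINFO;  // use the three-argument handler form
    sigemptyset(&action.sa_mask);
    return sigaction(signo, &action, &g_previous_action) == 0;
}
```

Restoring the previous disposition before raise() matters: without it the re-raised signal would be delivered to the same handler again instead of terminating the process.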
Stack Capture
Each active function occupies a contiguous memory region (stack frame). On ARM64 the frame pointer (fp) points to the frame base and the stack pointer (sp) points to the top of the stack. Each frame stores its caller's frame pointer together with the return address, so by walking this chain the SDK reconstructs the full call stack, attaching UUIDs and offsets for each library.
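The frame-chain walk can be sketched as follows. The FrameRecord layout mirrors the arm64 convention of storing the caller's frame pointer next to the return address. The sketch walks whatever chain it is handed, so it can be exercised on synthetic data. A production unwinder would start from the crashed thread's fp register, verify that each address is readable, and avoid heap allocation inside a crash handler. The names walk_frames and FrameRecord are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A frame record as laid out on arm64: fp points at the pair
// {caller's frame pointer, return address}.
struct FrameRecord {
    const FrameRecord* previous;
    std::uintptr_t return_address;
};

// Collect return addresses by following the frame-pointer chain.
std::vector<std::uintptr_t> walk_frames(const FrameRecord* fp,
                                        std::size_t max_depth) {
    assert(max_depth > 0);
    std::vector<std::uintptr_t> addresses;
    while (fp != nullptr && addresses.size() < max_depth) {
        if (fp->return_address == 0) break;  // reached the bottom of the stack
        addresses.push_back(fp->return_address);

        const FrameRecord* next = fp->previous;
        // The stack grows downward, so a caller's record must sit at a higher
        // address. Anything else indicates corruption or a loop.
        if (next != nullptr &&
            reinterpret_cast<std::uintptr_t>(next) <=
                reinterpret_cast<std::uintptr_t>(fp)) {
            break;
        }
        fp = next;
    }
    return addresses;
}
```

The depth limit and the address-ordering check are cheap safeguards against a corrupted stack, which is a common state for a process that has just crashed.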
Startup Crash Detection
To catch crashes that occur during app launch, the monitoring SDK is loaded as an embedded framework (dynamic library). iOS 8+ supports embedded frameworks, which an app and its extensions can share. The static and dynamic loading flows are illustrated below.
In Beike’s case a dynamic library named LJShellLaunch containing the crash SDK is added as a launch dependency. When the app starts, dyld loads LJShellLaunch, initializes runtime classes and then transfers control to main().
Synchronous Upload and Retry
Using NSURLSession background sessions (iOS 12+) crash reports are uploaded immediately after a crash. For older iOS versions the open‑source cURL library handles network transfers.
Retry logic runs on each app launch and foreground/background transition, invoking the upload API to ensure any missed reports are sent.
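The retry pass can be sketched as a small queue that is flushed on every trigger. Persisted report files are modeled here as in-memory paths and the network call as a callback, both simplifications for illustration. PendingReports and Uploader are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Returns true when the server accepted the report.
using Uploader = std::function<bool(const std::string& report_path)>;

class PendingReports {
public:
    void add(const std::string& report_path) { pending_.push_back(report_path); }
    std::size_t size() const { return pending_.size(); }

    // Intended to run on launch and on foreground/background transitions.
    // Tries every pending report once, keeps the failures for the next
    // trigger, and returns how many were uploaded.
    std::size_t flush(const Uploader& upload) {
        assert(upload != nullptr);
        std::vector<std::string> still_pending;
        std::size_t sent = 0;
        for (const std::string& path : pending_) {
            if (upload(path)) {
                ++sent;
            } else {
                still_pending.push_back(path);
            }
        }
        pending_.swap(still_pending);
        return sent;
    }

private:
    std::vector<std::string> pending_;
};
```

Because failed reports simply stay in the queue, no separate timer is needed: the next launch or app-state transition retries them.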
Beike Product & Technology