How Hades Platform Tackles iOS Abort Crashes: A Deep Dive into MetricKit Monitoring
This article explores Hades, the in‑house mobile monitoring platform at Huolala, detailing the challenges of capturing iOS Abort exceptions, evaluating four industry solutions, and describing the final MetricKit‑based implementation, data processing pipeline, observed benefits, and future monitoring strategy.
Introduction
Hades is Huolala's self‑developed mobile monitoring platform that, together with log monitoring and DevOps systems, supports the daily development work of all mobile engineers in the group. After nearly two years of construction, Hades processes billions of data points daily and safeguards continuous iteration across Huolala's businesses. This article focuses on the exploration and practice of Abort exception monitoring on the mobile side.
Key definitions:
Abort exception: an exception reported by HadesCrash SDK (stability monitoring) based on the system MetricKit capability.
Normal exception: a regular exception reported by HadesCrash SDK based on native crash capture.
Background
Mobile stability monitoring has always been a priority for Huolala, yet crashes that Hades cannot capture, such as those triggered by the jetsam or watchdog mechanisms, remain a blind spot. These exceptions terminate the app and severely harm user experience. Jetsam terminates a process by sending it SIGKILL, a signal the receiving process can neither ignore nor catch, so traditional crash capture based on in-process handlers never gets a chance to run.
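The claim that SIGKILL cannot be caught can be checked directly. The following is a minimal sketch, not part of the Hades SDK; `can_catch_signal` and `probe_handler` are hypothetical names. It asks the kernel to install a handler through the POSIX `sigaction` call and reports whether the request was accepted.

```c
#define _POSIX_C_SOURCE 200809L /* expose sigaction under strict compiler modes */
#include <signal.h>
#include <string.h>

/* No-op handler, used only to probe whether a signal can be caught. */
static void probe_handler(int signo) { (void)signo; }

/* Returns 1 if a handler can be installed for signo, 0 if the kernel refuses.
 * The previous disposition is restored before returning. */
int can_catch_signal(int signo) {
    struct sigaction action, previous;
    memset(&action, 0, sizeof(action));
    action.sa_handler = probe_handler;
    sigemptyset(&action.sa_mask);

    if (sigaction(signo, &action, &previous) != 0) {
        return 0; /* SIGKILL and SIGSTOP are rejected with errno == EINVAL */
    }
    sigaction(signo, &previous, NULL);
    return 1;
}
```

`can_catch_signal(SIGSEGV)` and `can_catch_signal(SIGABRT)` return 1, which is what native crash capture relies on, while `can_catch_signal(SIGKILL)` returns 0. A process killed by jetsam therefore has no opportunity to record anything at the moment it dies.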
Existing third‑party solutions like Firebase and Bugly also lack Abort exception capture. Companies such as Alibaba and ByteDance have made progress, and given that Abort exceptions can be three times more frequent than Normal exceptions, Huolala decided to develop its own Abort monitoring to close this blind spot.
Exploration
After researching multiple industry solutions, four approaches were identified:
Taobao Branch Method
Taobao records extensive metrics and exception events, uses mmap for log write performance and data consistency, and designs a binary encoding protocol for high compression before uploading logs to the cloud.
Core logged information includes performance data (CPU, memory), large memory allocations, retain cycles (for Jetsam events), stalls (for watchdog kills), and the number of active view controller instances.
[Figure: analysis workflow of the Taobao approach]
Summary: This solution requires logging many Abort events, generates large log volumes, demands high write performance, and involves complex cloud clustering analysis, incurring cost and potential mis‑classification.
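To make the mmap-based log writing mentioned for the Taobao approach more concrete, here is a minimal sketch. It is not Taobao's implementation: the `mmap_log` type, its functions, and the fixed-capacity layout are assumptions made for illustration, and the binary encoding protocol is left out. The idea is that an append becomes a plain memory copy into a file-backed mapping, and the kernel still writes the dirty pages back to the file if the process is killed afterwards.

```c
#define _POSIX_C_SOURCE 200809L /* expose ftruncate under strict compiler modes */
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* A fixed-size log file mapped into memory (hypothetical layout). */
typedef struct {
    char  *base;     /* start of the mapping */
    size_t capacity; /* size of the file and of the mapping, in bytes */
    size_t used;     /* bytes written so far */
} mmap_log;

/* Creates (or truncates) the file at path and maps it read/write.
 * Returns 0 on success, -1 on failure. */
int mmap_log_open(mmap_log *log, const char *path, size_t capacity) {
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, (off_t)capacity) != 0) { close(fd); return -1; }

    void *base = mmap(NULL, capacity, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (base == MAP_FAILED) return -1;

    log->base = base;
    log->capacity = capacity;
    log->used = 0;
    return 0;
}

/* Appends one record with a memory copy; no write() call is involved.
 * Returns 0 on success, -1 if the record does not fit. */
int mmap_log_append(mmap_log *log, const void *record, size_t length) {
    if (length > log->capacity - log->used) return -1;
    memcpy(log->base + log->used, record, length);
    log->used += length;
    return 0;
}

/* Unmaps the file. */
void mmap_log_close(mmap_log *log) {
    munmap(log->base, log->capacity);
    log->base = NULL;
}
```

A production logger would also need to persist the write position inside the file, rotate files, and compress records, which is where much of the cost described in the summary comes from.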
ByteDance Targeted Method
ByteDance provides customized solutions for different event types, including:
Crash: Zombie monitoring and Coredump.
Watchdog: Thread state and dead‑lock analysis.
OOM: Self‑developed online MemoryGraph.
CPU & I/O anomalies: MetricKit.
Brief introductions to Coredump and MemoryGraph are provided for interested readers.
Summary: The approach addresses specific problems well, but both Coredump and the online MemoryGraph are proprietary and costly to reproduce.
Flag Marking Method
Some industry veterans use flag‑based recording of Abort events, typically only counting occurrences without detailed stack traces; this article does not elaborate further.
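Since the flag approach is only described in one sentence, a minimal sketch may help. It shows one common way to realize it, under the assumption that a marker file is used as the flag; `session_begin` and `session_end_accounted` are hypothetical names. A flag is written at launch and removed on every ending the app can account for, so a flag that survives until the next launch indicates a termination nobody observed.

```c
#define _POSIX_C_SOURCE 200809L /* expose access and unlink under strict compiler modes */
#include <stdio.h>
#include <unistd.h>

/* Call at launch. Returns 1 if the previous session left its flag behind
 * (it neither exited cleanly nor went through the crash handler), else 0.
 * Either way, a fresh flag is written for the current session. */
int session_begin(const char *flag_path) {
    int previous_aborted = (access(flag_path, F_OK) == 0);
    FILE *flag = fopen(flag_path, "w");
    if (flag != NULL) {
        fputs("running\n", flag);
        fclose(flag);
    }
    return previous_aborted;
}

/* Call on every accounted-for ending: normal termination, a crash that the
 * regular crash handler captured, and so on. */
void session_end_accounted(const char *flag_path) {
    unlink(flag_path);
}
```

This yields a count of unexplained terminations but, as noted above, no stack trace. Real implementations also have to rule out benign causes such as app upgrades, OS upgrades, and device reboots before counting a session as an Abort.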
Apple Official MetricKit
MetricKit is Apple’s framework that offers APIs and tools to collect various performance metrics and crash data, such as CPU usage, memory consumption, and network quality.
Summary: Integration is simple via delegate registration, but it suffers from missing key information (e.g., exact crash time), data loss, unpredictable callback timing, and system compatibility issues.
Final Chosen Solution
Considering manpower, technical difficulty, ROI, and business characteristics, the author selected the system MetricKit capability to implement Abort exception monitoring.
Practice
On‑Device Flow
Subscription & Reception
Subscribe to Metric data
// Subscribe
- (void)subscribeMetricData {
    if (@available(iOS 14.0, *)) {
        __weak typeof(self) weakSelf = self;
        self.subscribeID = [WPFMetricKitManager addSubscriber:^(NSArray * _Nonnull payload, WPFMetricPayloadType type) {
            __strong typeof(weakSelf) strongSelf = weakSelf;
            // Only diagnostic payloads are handled for Abort monitoring
            if (type == WPFMetricPayloadTypeDiagnostic) {
                [strongSelf handleDiagnosticPayloads:payload];
            }
        }];
    }
}
// Unsubscribe
- (void)unsubscribeMetricData {
    if (@available(iOS 14.0, *)) {
        [WPFMetricKitManager removeSubscriber:self.subscribeID];
    }
}
Receive Metric data
- (void)handleDiagnosticPayloads:(NSArray<WPFDiagnosticPayload *> *)payloads API_AVAILABLE(ios(14.0)) {
    // Move processing off the callback thread
    dispatch_async(self.taskQueue, ^{
        [self handleMetricPayloads:payloads]; // Process the data
    });
}
The author wrapped MetricKit in a custom manager (WPFMetricKitManager) to achieve two goals:
Work around an iOS 16.0.1/16.0.2 system bug that causes crashes when multiple modules subscribe simultaneously.
Provide a reusable library for fast business integration without compatibility‑induced crashes.
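The article does not show how WPFMetricKitManager works internally. One plausible design, sketched below purely as an assumption, is to register with the system exactly once and fan each payload out to any number of in-app subscribers, so that multiple modules never subscribe to the system directly. All names here (`fanout_manager`, `fanout_add`, `count_payload`, and so on) are hypothetical, and the real registration call is replaced by a counter.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SUBSCRIBERS 8

/* Receives an opaque payload plus the context given at registration. */
typedef void (*payload_callback)(const void *payload, void *context);

typedef struct {
    int id; /* 0 means the slot is free */
    payload_callback callback;
    void *context;
} subscriber_slot;

typedef struct {
    subscriber_slot slots[MAX_SUBSCRIBERS];
    int next_id;
    int system_registrations; /* how many times we registered with the system */
} fanout_manager;

void fanout_init(fanout_manager *m) { memset(m, 0, sizeof(*m)); }

/* Adds a subscriber and returns its id (> 0), or 0 if the table is full.
 * Only the first subscriber triggers a registration with the system. */
int fanout_add(fanout_manager *m, payload_callback callback, void *context) {
    if (callback == NULL) return 0;
    for (size_t i = 0; i < MAX_SUBSCRIBERS; i++) {
        if (m->slots[i].id == 0) {
            if (m->system_registrations == 0) {
                m->system_registrations = 1; /* placeholder for the single real registration */
            }
            m->slots[i].id = ++m->next_id;
            m->slots[i].callback = callback;
            m->slots[i].context = context;
            return m->slots[i].id;
        }
    }
    return 0;
}

/* Removes the subscriber with the given id; unknown ids are ignored. */
void fanout_remove(fanout_manager *m, int id) {
    if (id <= 0) return;
    for (size_t i = 0; i < MAX_SUBSCRIBERS; i++) {
        if (m->slots[i].id == id) memset(&m->slots[i], 0, sizeof(m->slots[i]));
    }
}

/* Called from the single system callback; forwards the payload to everyone. */
void fanout_dispatch(const fanout_manager *m, const void *payload) {
    for (size_t i = 0; i < MAX_SUBSCRIBERS; i++) {
        if (m->slots[i].id != 0) m->slots[i].callback(payload, m->slots[i].context);
    }
}

/* Example subscriber: counts deliveries in the int that context points to. */
void count_payload(const void *payload, void *context) {
    (void)payload;
    ++*(int *)context;
}
```

With this shape, the `addSubscriber:` and `removeSubscriber:` calls seen earlier would map to `fanout_add` and `fanout_remove`, and business modules never touch the system API themselves. A real manager would additionally need locking, since subscriptions and payload delivery can happen on different threads.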
Basic Data Processing
char eventID[37];
ksid_generate(eventID); // reuse KSCrash identifier logic
NSString *reason = nil;
if ([diagnostic isKindOfClass:[MXCrashDiagnostic class]]) {
    MXCrashDiagnostic *crash = (MXCrashDiagnostic *)diagnostic;
    reason = crash.terminationReason ?: crash.virtualMemoryRegionInfo;
}
// MetricKit reports no exact crash time; use the payload's start timestamp (in milliseconds)
NSTimeInterval crashTime = self.payload ? self.payload.timeStampBegin.timeIntervalSince1970 * 1000 : [NSDate date].timeIntervalSince1970 * 1000;
return @{
    @"appId": @"XXX",
    @"appType": @(XX),
    @"clientCrashId": [NSString stringWithUTF8String:eventID],
    @"crashReason": reason ?: @"",
    @"crashTime": @(crashTime),
    @"crashType": @(crashType),
    @"sdkVersion": @"1.0.0",
    @"app": @{ @"channel": @"appstore", @"version": @"3.2.96" },
    @"device": @{
        @"deviceId": @"xxxx-xxx-xxx",
        @"systemVersion": @"",
        @"kernelVersion": @"",
        @"manufacturer": @"Apple",
        @"model": diagnostic.metaData.deviceType ?: @"" // a nil value would make the literal throw
    },
    @"cpu": diagnostic.metaData.platformArchitecture ?: @"",
    @"user": @{ @"userId": @"" },
    @"run": @{}
};
MetricKit data lacks some crucial fields, such as the exact crash time, main-process UUID, SDK version, app version, and device identifier. The SDK fills these in itself when building the dictionary above (for example crashTime, sdkVersion, app.version, and device.deviceId).
Stack Processing
#if defined(__LP64__)
#define TRACE_FMT "%-4d%-31s 0x%016lx 0x%lx + %lu\n"
#else
#define TRACE_FMT "%-4d%-31s 0x%08lx 0x%lx + %lu\n"
#endif
@interface HadesAbortMetricRootFrame : NSObject
@property (nonatomic, copy) NSString *binaryName;
@property (nonatomic, copy) NSString *binaryUUID;
@property (nonatomic, strong) NSNumber *offsetIntoBinaryTextSegment;
@property (nonatomic, strong) NSNumber *address;
@property (nonatomic, strong) NSArray<HadesAbortMetricRootFrame *> *subFrames;
- (instancetype)initWithDictionary:(NSDictionary *)dictionary;
- (NSString *)uploadFormatString;
@end
@implementation HadesAbortMetricRootFrame
- (instancetype)initWithDictionary:(NSDictionary *)dictionary {
    if (self = [super init]) {
        // Copy every declared property from the call stack dictionary by key
        for (NSString *property in [[self class] wpf_propertyNames]) {
            id value = dictionary[property];
            [self setValue:value forKey:property];
        }
        // subFrames arrives as an array of dictionaries; convert it recursively
        if (_subFrames) {
            NSMutableArray *subFrames = [NSMutableArray array];
            for (NSDictionary *dic in _subFrames) {
                HadesAbortMetricRootFrame *frame = [[HadesAbortMetricRootFrame alloc] initWithDictionary:dic];
                [subFrames addObject:frame];
            }
            _subFrames = subFrames;
        }
    }
    return self;
}
// Appends one formatted line for the frame, then recurses into its subFrames
- (void)uploadFormat:(NSMutableString *)uploadFormat fromFrame:(HadesAbortMetricRootFrame *)frame index:(NSInteger)index {
    int num = (int)index;
    uintptr_t address = frame.address.unsignedLongValue;
    uintptr_t loadAddress;
    uintptr_t offset;
    if (@available(iOS 16.0, *)) {
        // The field is an offset: derive the load address from it
        offset = frame.offsetIntoBinaryTextSegment.unsignedLongValue;
        loadAddress = address - offset;
    } else {
        // The field holds the load address: derive the offset from it
        loadAddress = frame.offsetIntoBinaryTextSegment.unsignedLongValue;
        offset = address - loadAddress;
    }
    [uploadFormat appendFormat:@TRACE_FMT, num, frame.binaryName.UTF8String, address, loadAddress, offset];
    for (HadesAbortMetricRootFrame *subFrame in frame.subFrames) {
        [self uploadFormat:uploadFormat fromFrame:subFrame index:index+1];
    }
}
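@end
Note: the meaning of offsetIntoBinaryTextSegment differs between system versions, which is why the code above branches on iOS 16. From iOS 16 the field is used as the frame's offset within the binary, while on earlier systems it is used as the binary's load address. The author regards this discrepancy as a temporary system bug.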
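The address arithmetic in the listing can be isolated in plain C. The sketch below mirrors the version branch described above; `resolve_frame`, `format_frame`, and the `resolved_frame` type are hypothetical names, and fixed-width integer types replace `uintptr_t` so the result does not depend on the platform's word size.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t load_address; /* where the binary image was loaded */
    uint64_t offset;       /* distance of the frame from the load address */
} resolved_frame;

/* Mirrors the branch in the listing: on iOS 16 and later the reported field
 * is treated as an offset, on earlier systems as the load address. */
resolved_frame resolve_frame(uint64_t address, uint64_t reported_field, int os_major_version) {
    resolved_frame frame;
    if (os_major_version >= 16) {
        frame.offset = reported_field;
        frame.load_address = address - reported_field;
    } else {
        frame.load_address = reported_field;
        frame.offset = address - reported_field;
    }
    return frame;
}

/* Writes one stack line: index, binary name, address, load address + offset.
 * Returns the number of characters snprintf wanted to write. */
int format_frame(char *out, size_t out_size, int index, const char *binary_name,
                 uint64_t address, resolved_frame frame) {
    return snprintf(out, out_size,
                    "%-4d%-31s 0x%016" PRIx64 " 0x%" PRIx64 " + %" PRIu64 "\n",
                    index, binary_name, address, frame.load_address, frame.offset);
}
```

For a frame at address 0x100005234 in an image loaded at 0x100000000, `resolve_frame(0x100005234, 0x5234, 16)` and `resolve_frame(0x100005234, 0x100000000, 15)` produce the same pair: load address 0x100000000 and offset 0x5234 (21044 in decimal). The upload format therefore stays identical regardless of which meaning the system gave the field.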
Other Considerations
Additional points worth attention:
If the app uses Mach-O segment migration, the address offsets must be corrected: use the LC_MAIN load command to obtain the address of the main function and adjust the offsets against it.
During testing, data may not be captured promptly; persisting collected data to files is recommended to avoid loss over long windows (e.g., 24‑48 hours).
Benefits
After deploying Abort exception monitoring, Huolala's enterprise iOS app captured many previously invisible crash logs, such as 0x8BADF00D watchdog events.
Metrics analysis from the enterprise version includes:
General Metric Analysis
Single‑Day Overlap Comparison
Overlap refers to Normal exceptions that are also captured as Abort exceptions on the same day. For example, on 2023‑01‑12, 38 Normal exceptions were collected, of which 33 were also captured as Abort exceptions.
Reasons Abort is not a superset of Normal:
System coverage: Abort only monitors iOS 14+ devices, while Normal covers a broader range.
Statistical method: Overlap is determined by matching user‑id or device‑id, not full stack comparison.
MetricKit data loss: Some Abort events are missed due to MetricKit’s occasional under‑reporting.
Thus, 100 % overlap is unattainable.
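The overlap statistic described above can be sketched as a simple identifier match. This is an illustration of the stated method, not the platform's actual query; `count_overlap` is a hypothetical name, and each string stands for the user-id or device-id attached to one exception collected on the same day.

```c
#include <stddef.h>
#include <string.h>

/* Counts how many Normal exceptions have an identifier that also appears
 * among the Abort exceptions. Stacks are not compared. */
size_t count_overlap(const char *const *normal_ids, size_t normal_count,
                     const char *const *abort_ids, size_t abort_count) {
    size_t overlap = 0;
    for (size_t i = 0; i < normal_count; i++) {
        for (size_t j = 0; j < abort_count; j++) {
            if (strcmp(normal_ids[i], abort_ids[j]) == 0) {
                overlap++;
                break; /* count each Normal exception at most once */
            }
        }
    }
    return overlap;
}
```

Applied to the 2023-01-12 figures, 33 matches out of 38 Normal exceptions give an overlap of roughly 87%. Because only identifiers are compared, two unrelated crashes on the same device on the same day would also count as a match, which is one reason the number is an approximation.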
Abort Exception Category Distribution
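Abort exceptions are divided into four major types: crash, hang, CPU, and disk I/O, matching MetricKit's MXCrashDiagnostic, MXHangDiagnostic, MXCPUExceptionDiagnostic, and MXDiskWriteExceptionDiagnostic. In the enterprise app, only crash events appear, with Abort‑crash dominating the distribution.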
Abort‑Crash Sub‑Category (Average)
Abort‑crash further splits into SIGKILL (watchdog, OOM), SIGSEGV, SIGABRT, SIGBUS, etc. SIGKILL accounts for the majority, representing the blind spot of Normal monitoring.
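The sub-categories above amount to grouping by signal, with SIGKILL split further by cause. The sketch below shows one way such bucketing could look; the enum, the function name, and the idea of passing a numeric termination code are assumptions, and extracting that code from a MetricKit diagnostic is not shown. The only cause singled out is the watchdog code 0x8BADF00D mentioned in the Benefits section.

```c
#include <signal.h>
#include <stdint.h>

#define WATCHDOG_TERMINATION_CODE 0x8BADF00DULL

typedef enum {
    ABORT_SIGKILL_WATCHDOG, /* SIGKILL carrying the watchdog code */
    ABORT_SIGKILL_OTHER,    /* SIGKILL for other reasons, for example OOM */
    ABORT_SIGSEGV,
    ABORT_SIGABRT,
    ABORT_SIGBUS,
    ABORT_OTHER
} abort_crash_kind;

/* Buckets one Abort-crash by its signal and, for SIGKILL, its termination code. */
abort_crash_kind classify_abort_crash(int signal_number, uint64_t termination_code) {
    switch (signal_number) {
    case SIGKILL:
        return termination_code == WATCHDOG_TERMINATION_CODE ? ABORT_SIGKILL_WATCHDOG
                                                             : ABORT_SIGKILL_OTHER;
    case SIGSEGV: return ABORT_SIGSEGV;
    case SIGABRT: return ABORT_SIGABRT;
    case SIGBUS:  return ABORT_SIGBUS;
    default:      return ABORT_OTHER;
    }
}
```

Only the two SIGKILL buckets are invisible to Normal monitoring; the other signals can also be caught by an in-process handler, which is why they show up in the overlap between Normal and Abort exceptions.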
Benefit Metric Analysis
Single‑Day Total Comparison
The total number of user‑side stability events far exceeds business‑side events, indicating a substantial gap in user experience monitoring.
Single‑Day Increment Comparison
Increment is defined as Abort count minus Normal count. On 2023‑01‑11, 287 Abort events occurred without corresponding business impact, highlighting hidden user‑side issues.
Planning
Redefining Stability Metrics
Currently, Huolala uses Normal exceptions as the primary stability metric. Future Hades releases will provide both Normal and Abort metrics because:
Normal metrics remain for benchmarking against competitors.
Abort metrics give a more realistic view of user‑side stability.
Dual‑Sword Strategy
Abort exceptions fill the major blind spot of Normal monitoring, while Normal still captures events that Abort cannot. Combining both is the optimal short‑term solution, with the possibility of dropping Normal as system support improves.
Focus will be on SIGKILL within Abort, resulting in a final product of Normal + SIGKILL monitoring.
References
https://developer.aliyun.com/article/770060
https://mp.weixin.qq.com/s/4-4M9E8NziAgshlwB7Sc6g
https://xie.infoq.cn/article/fc1ebf4518facd24f0df61f83
https://developer.apple.com/documentation/metrickit/mxcallstacktree/3552293-jsonrepresentation?language=objc
