How Kuaishou's Kuaiying Video Editing App Cut Its iOS OOM Rate by 80%: A Deep Dive
This article details the systematic approach Kuaishou’s video editing app took to define, reproduce, analyze, and resolve iOS out‑of‑memory (OOM) crashes, leveraging edit‑context capture, action tracking, memory profiling, and targeted optimizations that reduced OOM incidents by over 80%.
Background
Kuaiying (快影), Kuaishou's video editing app, is a professional audio-video creation tool that can consume large amounts of memory. When its memory use exceeds the system limit, an OOM (Out Of Memory) event occurs and the app is killed. With rapid product iteration, online OOM became a top-priority issue: the OOM rate was 5 to 6 times the crash rate.
Define Problem
iOS OOM is triggered by the Jetsam mechanism, which forcibly kills processes based on priority and records the event in Jetsam logs. Online OOMs cannot be captured directly: the app cannot read Jetsam logs in production, and SIGKILL cannot be caught by a signal handler. The industry-standard workaround is Facebook's exclusion method, which identifies OOMs indirectly by filtering out known termination causes (crash, user exit, watchdog, etc.) and treating the remaining terminations as OOMs.
The exclusion method therefore defines an OOM as an uncaught SIGKILL in the foreground. Such kills fall into four cases: Jetsam OOM, EXC_RESOURCE, EXC_CRASH (e.g., thermal protection), and other causes such as kill -9.
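The exclusion logic described above can be sketched as a small decision function that runs at the next launch. This is a minimal illustration, not the team's implementation: the `LastSession` fields, the `REASON_*` names, and the upgrade/reboot check (standing in for the "etc." in the list of known causes) are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical facts persisted by the previous launch and read at startup. */
typedef struct {
    bool crash_marked;          /* a crash reporter recorded a crash        */
    bool user_exit;             /* the app went through a normal exit path  */
    bool upgraded_or_rebooted;  /* app/OS version or boot time changed      */
    bool watchdog_suspected;    /* a main-thread hang marker was left       */
    bool was_foreground;        /* the app was in the foreground when it died */
} LastSession;

typedef enum {
    REASON_CRASH,
    REASON_USER_EXIT,
    REASON_UPGRADE_OR_REBOOT,
    REASON_WATCHDOG,
    REASON_BACKGROUND_KILL,
    REASON_SUSPECTED_FOREGROUND_OOM
} ExitReason;

/* Rules out every known cause in turn; whatever remains is treated as OOM. */
ExitReason classify_last_exit(const LastSession *s) {
    assert(s != NULL);
    if (s->crash_marked)         return REASON_CRASH;
    if (s->user_exit)            return REASON_USER_EXIT;
    if (s->upgraded_or_rebooted) return REASON_UPGRADE_OR_REBOOT;
    if (s->watchdog_suspected)   return REASON_WATCHDOG;
    if (!s->was_foreground)      return REASON_BACKGROUND_KILL;
    return REASON_SUSPECTED_FOREGROUND_OOM;
}
```

Because the result is only "nothing else explains this exit", any known cause that fails to leave its marker ends up counted as an OOM, which is the source of the false positives discussed later.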
Analyze Problem
Two main analysis approaches exist: memory‑allocation‑stack clustering and vm_region reference‑graph clustering. Both rely on collecting live memory information but cannot analyze system‑available memory or EXC_RESOURCE/EXC_CRASH cases.
Kuaishou’s editor provides two key features: a recoverable edit context (State, Draft, Asset) and traceable user actions. By capturing the edit context (Context) and the last action (Action), the team can reproduce OOMs offline (Context(n‑1) ⊕ Action = OOM trigger).
Using this data, they aggregate OOMs by Action, Draft, and QoS dimensions. For example, a “Footprint‑increasing” OOM class was identified by clustering memory curves.
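Aggregating by the Action dimension might look roughly like the following. The `OomReport` layout and the action strings are hypothetical; the sketch only shows the grouping step, assuming each report already carries the last user action and a draft identifier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical OOM report: the last action plus the draft it ran against. */
typedef struct {
    const char *action;
    const char *draft_id;
} OomReport;

typedef struct {
    const char *action;
    size_t count;
} ActionBucket;

/* Groups reports by last action. Returns the number of buckets used.
   Reports that would need more than `cap` buckets are dropped. */
size_t aggregate_by_action(const OomReport *reports, size_t n,
                           ActionBucket *buckets, size_t cap) {
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        size_t j = 0;
        while (j < used && strcmp(buckets[j].action, reports[i].action) != 0) {
            j++;
        }
        if (j == used) {            /* first time this action is seen */
            if (used == cap) {
                continue;
            }
            buckets[used].action = reports[i].action;
            buckets[used].count = 0;
            used++;
        }
        buckets[j].count++;
    }
    return used;
}

/* Returns the bucket with the most reports, or NULL when there are none. */
const ActionBucket *top_bucket(const ActionBucket *buckets, size_t used) {
    const ActionBucket *best = NULL;
    for (size_t i = 0; i < used; i++) {
        if (best == NULL || buckets[i].count > best->count) {
            best = &buckets[i];
        }
    }
    return best;
}
```

The largest bucket points at the action to replay first against the captured edit context.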
Solution Optimization
After two months of rollout, more than 20 OOM categories were identified and reproduced offline. The headline issues fell into four groups: app memory limit, system-available memory shortage, crash-process capture anomalies, and export-marking errors.
1. App Memory Limit
Four sub‑issues were found: memory leaks, memory accumulation, oversized resources, and high memory usage. Examples include sticker‑popup memory leaks, thumbnail generation bursts, large bitmap allocations from cropping, and high‑complexity drafts causing excessive audio decoding memory.
2. System‑Available Memory Shortage
Action‑stack aggregation revealed that scrolling the effects panel on low‑memory devices caused system memory to drop, leading to Jetsam OOM despite stable app Footprint.
Investigation showed that the mediaserverd daemon allocated large buffers for pre‑decoding multiple video tracks, causing system memory exhaustion.
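The distinction between a footprint-driven OOM and a system-memory shortage can be illustrated with a simple rule over the sampled memory curve. This is a sketch under stated assumptions: the `MemSample` fields, the two thresholds, and the class names are invented for the example, and the real clustering is likely more elaborate.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sample taken periodically before the process was killed. */
typedef struct {
    double footprint_mb;    /* the app's own memory footprint     */
    double system_free_mb;  /* memory still available system-wide */
} MemSample;

typedef enum {
    OOM_UNKNOWN,
    OOM_FOOTPRINT_GROWTH,   /* the app itself kept growing                 */
    OOM_SYSTEM_SHORTAGE     /* app stable, but the system ran out of memory */
} OomClass;

/* growth_mb and low_free_mb are tunable thresholds, assumed for this sketch. */
OomClass classify_memory_curve(const MemSample *s, size_t n,
                               double growth_mb, double low_free_mb) {
    if (s == NULL || n < 2) {
        return OOM_UNKNOWN;
    }
    double delta = s[n - 1].footprint_mb - s[0].footprint_mb;
    if (delta >= growth_mb) {
        return OOM_FOOTPRINT_GROWTH;
    }
    if (s[n - 1].system_free_mb <= low_free_mb) {
        return OOM_SYSTEM_SHORTAGE;
    }
    return OOM_UNKNOWN;
}
```

A session whose footprint barely moves while system-wide free memory collapses would land in the second class, matching the effects-panel case where another process held the memory.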
3. Crash‑Process Capture Anomalies
Approximately 14% of OOMs followed a subtitle‑drag action. The root cause was a floating‑point precision bug leading to a null‑pointer EXC_BAD_ACCESS, which in non‑debug builds manifested as a deadlock and was later classified as OOM by the exclusion method.
Fixes included ensuring crash monitors register first and adding early crash‑mark callbacks, which increased crash detection by over 50%.
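Why an early crash mark helps can be shown with a toy handler chain. Everything here is hypothetical (`CrashChain`, the handler type, and the simulated stuck handler); a real implementation would persist the mark from the signal handler using only async-signal-safe calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_HANDLERS 8

/* A handler returns false to simulate never finishing (e.g. a deadlock). */
typedef bool (*CrashHandler)(int signo);

typedef struct {
    CrashHandler handlers[MAX_HANDLERS];
    size_t count;
    bool crash_marked;  /* stands in for a marker persisted to disk */
} CrashChain;

bool chain_register(CrashChain *c, CrashHandler h) {
    if (c == NULL || h == NULL || c->count == MAX_HANDLERS) {
        return false;
    }
    c->handlers[c->count++] = h;
    return true;
}

/* Sets the crash mark before any handler runs, so the mark survives even
   if a later handler never returns. Returns true if all handlers finished. */
bool chain_dispatch(CrashChain *c, int signo) {
    assert(c != NULL);
    c->crash_marked = true;
    for (size_t i = 0; i < c->count; i++) {
        if (!c->handlers[i](signo)) {
            return false;
        }
    }
    return true;
}

/* Sample handlers used to exercise the chain. */
bool sample_stuck_handler(int signo) { (void)signo; return false; }
bool sample_ok_handler(int signo)    { (void)signo; return true; }
```

With the mark written first, a crash whose handler later hangs is still excluded at the next launch instead of being counted as an OOM.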
4. Export‑Marking Errors
OOMs occurred when exporting a draft and then switching the app to background, due to insufficient storage space. A file‑handle leak in the FFmpegMuxer prevented APFS from reclaiming space, leading to repeated OOM mis‑classification.
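One plausible mechanism behind the unreclaimed space is standard POSIX file semantics: a deleted file keeps its storage until the last open descriptor is closed. The sketch below demonstrates only that behavior with a temporary file; the path template and function name are assumptions, and it does not reproduce the muxer itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates a temp file, deletes it while a descriptor is still open, and
   returns the link count seen through that descriptor (-1 on error).
   A result of 0 means the name is gone but the file, and the space it
   occupies, stays alive until close(fd). */
int link_count_after_unlink(void) {
    char path[] = "/tmp/muxer_leak_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    if (write(fd, "data", 4) != 4) {
        close(fd);
        unlink(path);
        return -1;
    }
    if (unlink(path) != 0) {
        close(fd);
        return -1;
    }
    struct stat st;
    int links = (fstat(fd, &st) == 0) ? (int)st.st_nlink : -1;
    close(fd);  /* only now can the filesystem reclaim the blocks */
    return links;
}
```

Closing the descriptor on every exit path, including error paths, is what lets the filesystem release the space.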
Acceptance
A/B testing over 12 weeks showed user-reported OOM incidents dropping dramatically: the overall OOM rate fell by more than 80% after the optimizations, and by a further 60% after the exclusion-method misclassifications were fixed.
Some Thoughts
The general workflow for tackling online issues is: define, analyze, solve, accept, and prevent degradation. For iOS OOM, the definition evolved from "uncaught SIGKILL in foreground" to also cover exclusion-method false positives. By leveraging edit-context capture and action tracking, the team built a toolchain to reproduce and classify OOMs, enabling targeted fixes and robust validation through extensive A/B experiments.
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

/* Mapping blank space is trivial. Use positive fds as the alias value for memory tracking. */
if (fd != -1) {
    /* Use "fd" to pass (some) Mach VM allocation flags, (see the VM_FLAGS_* definitions). */
    alloc_flags = fd & (VM_FLAGS_ALIAS_MASK | VM_FLAGS_SUPERPAGE_MASK | VM_FLAGS_PURGABLE);
    if (alloc_flags != fd) {
        /* reject if there are any extra flags */
        return EINVAL;
    }
}

/* vm_statistics.h */
#define VM_SET_FLAGS_ALIAS(flags, alias) \
(flags) = (((flags) & ~VM_FLAGS_ALIAS_MASK) | \
(((alias) & ~VM_FLAGS_ALIAS_MASK) << 24))
/* private raster data (i.e. layers, some images, QGL allocator) */
#define VM_MEMORY_COREGRAPHICS_DATA 54
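The excerpts above show that, for anonymous mappings, the `fd` argument of `mmap` can carry a VM tag in its top byte. The sketch below illustrates that bit layout with local stand-ins so it compiles off Darwin; the `SKETCH_*` and `sketch_*` names are hypothetical, and on macOS/iOS the real definitions (including `VM_MAKE_TAG`) come from `<mach/vm_statistics.h>`.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the Darwin definitions, for illustration only. */
#define SKETCH_VM_FLAGS_ALIAS_MASK          0xFF000000u
#define SKETCH_VM_MEMORY_COREGRAPHICS_DATA  54u

/* Packs a tag into the top byte, the same layout VM_MAKE_TAG produces. */
uint32_t sketch_make_tag(uint32_t tag) {
    assert(tag <= 0xFFu);
    return tag << 24;
}

/* Recovers the tag from a flags word. */
uint32_t sketch_get_tag(uint32_t flags) {
    return (flags & SKETCH_VM_FLAGS_ALIAS_MASK) >> 24;
}

/* Simplified version of the kernel check: this sketch accepts only alias
   bits, whereas the excerpt also allows superpage and purgeable flags. */
int sketch_fd_is_valid_tag(uint32_t fd_value) {
    return (fd_value & SKETCH_VM_FLAGS_ALIAS_MASK) == fd_value;
}
```

On Darwin, a caller would pass `VM_MAKE_TAG(tag)` as the `fd` of a `MAP_ANON` mapping, which lets memory tools attribute the region to that tag.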
Kuaishou Frontend Engineering
