Douyin’s Deep Dive: Expanding Android ART Heap, FD Limits & M:N Threading on Legacy Devices
This article details how Douyin engineers tackled Android’s limited heap, file‑descriptor, and thread constraints on older phones by expanding ART malloc and region spaces, enlarging FD/FD_SET limits, and implementing a transparent M:N user‑level threading model, achieving significant stability and performance gains.
Background
As Android apps evolve into “super‑apps”, older devices face severe constraints: ART heap size is often only 256 MB even with largeHeap, Android 9 and below limit a process to 1024 file descriptors, and many OEMs cap the total number of threads+processes at 500. These limits cause high OOM rates, crashes, and poor user experience.
1. Expanding ART malloc space (Android 5‑7)
1.1 Basics
The ART heap consists of several spaces; most allocation happens in the malloc space. Different Android versions use different collectors and space types:
Android 5‑7: cms + copy gc → malloc space
Android 8‑14: cc → region space
Android 15+: cmc → bump pointer space
Douyin focused on expanding the malloc space on Android 5‑7 because it is the most common on legacy devices.
1.2 Technical solution
Restrict copy‑GC so the VM works on a single space.
Release the unused backup space and allocate a larger one.
Trigger copy‑GC to switch the main space to the new backup.
Repeat steps 1‑3 for the second space.
Modify the heap’s capacity limit.
1.2.1 Locking a space
When native code holds a Java object pointer (e.g., via GetPrimitiveArrayCritical), moving GCs must be disabled to keep the address valid. This provides a natural point to lock the current space.
art::Heap::PerformHomogeneousSpaceCompact
art::Heap::CollectGarbageInternal
1.2.2 Finding expansion memory
ART uses 32‑bit compressed pointers, limiting the addressable heap to the low 4 GB of the process. Expansion must satisfy address range, card‑table mapping, contiguity, equal size for both spaces, and page‑size alignment.
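Some of these constraints can be checked mechanically. The sketch below covers only the address-range and page-alignment conditions; it assumes 4 KB pages, and the helper name is_candidate_range_usable is hypothetical. Card-table mapping and contiguity with existing mappings are not checked here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Upper bound of the range reachable with 32-bit compressed pointers. */
#define LOW_4GB_LIMIT 0x100000000ULL
/* Assumption: 4 KB pages. */
#define PAGE_SIZE_BYTES 4096ULL

/* Returns true if [begin, begin + size) is page-aligned, non-empty,
 * and lies entirely inside the low 4 GB of the address space. */
bool is_candidate_range_usable(uint64_t begin, uint64_t size) {
    if (size == 0) {
        return false;
    }
    if (begin % PAGE_SIZE_BYTES != 0 || size % PAGE_SIZE_BYTES != 0) {
        return false;
    }
    if (begin >= LOW_4GB_LIMIT) {
        return false;
    }
    /* Written this way to avoid overflow in begin + size. */
    if (size > LOW_4GB_LIMIT - begin) {
        return false;
    }
    return true;
}
```

Because both spaces must have the same size, a caller would run this check on each candidate range and then use the smaller of the two usable sizes for both.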
1.2.3 Creating a new space
Using MapAnonymous and CreateMallocSpaceFromMemMap (found via dlsym in libart.so), a larger malloc space is created at the chosen address.
1.2.4 Replacing heap references
Heap pointers are located by scanning the runtime’s memory layout (double‑loop search) or by hooking art::Heap::ClearGrowthLimit to capture the this pointer.
1.2.5 Triggering space switch
Copy‑GC is forced via PerformHomogeneousSpaceCompact(), which swaps the main and backup spaces and updates limits.
1.3 Results
On Android 5‑6 devices the heap grew to 740‑750 MB; on Android 7 to 960‑980 MB, reducing OOM rates by 60.77 %.
2. Expanding ART region space (Android 8‑9)
2.1 Basics
Android 8+ introduced concurrent copying (CC) with a region space, dividing the heap into equal 256 KB regions (free, allocated, from‑space, large‑object). CC copies live objects from from‑space to to‑space.
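Because every region has the same size, mapping an address to its region is a single division. A minimal sketch, in which the function name region_index_for and the example addresses are illustrative only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Region size stated in the article: 256 KB. */
#define REGION_SIZE (256u * 1024u)

/* Index of the region containing addr, for a region space
 * that starts at space_begin. */
size_t region_index_for(uintptr_t space_begin, uintptr_t addr) {
    assert(addr >= space_begin);
    return (size_t)((addr - space_begin) / REGION_SIZE);
}
```

For a space starting at 0x12c00000, the address 0x12c40000 is exactly one region size past the start and so falls in region 1.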
2.2 Technical solution
Find free contiguous memory after the existing region space within the low‑4 GB area.
Block GC/heap‑trim calls during expansion.
Inline‑hook Heap::StartGC() to create a safe window.
Perform expansion steps: stop‑the‑world, enlarge the regions array, grow the live‑bitmap, update MemMap, re‑add the region space, resume.
Trigger Heap::FinishGC().
Update heap capacity limits.
Unblock GC/trim.
2.2.1 Searching expansion memory
Two gaps are identified in the low‑4 GB area: a 299 MB “backward” gap (0x00010000‑0x12c00000) and a 476 MB “forward” gap (0x52c00000‑0x7088d000). For stability, only the forward gap is used, yielding a final heap size of ~740 MB (+45 %).
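The quoted gap sizes follow directly from the address ranges. The short sketch below reproduces the arithmetic; the AddressGap type and the gap_size_mb helper are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t begin; /* first address of the gap */
    uint64_t end;   /* one past the last address of the gap */
} AddressGap;

/* Gap ranges as quoted in the article. */
static const AddressGap kBackwardGap = {0x00010000ULL, 0x12c00000ULL};
static const AddressGap kForwardGap = {0x52c00000ULL, 0x7088d000ULL};

/* Size of a gap in whole MB (1 MB = 1024 * 1024 bytes), rounded down. */
uint64_t gap_size_mb(AddressGap gap) {
    assert(gap.end >= gap.begin);
    return (gap.end - gap.begin) / (1024ULL * 1024ULL);
}
```

With these inputs, gap_size_mb(kBackwardGap) evaluates to 299 and gap_size_mb(kForwardGap) to 476, matching the figures above.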
2.2.2 Expanding the regions array
The regions_ array (element type Region) must be resized. Each Region entry is 0x50 bytes on Android 8‑9; a new array is allocated and initialized, and the old entries are copied over with memcpy.
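The size of the new array depends only on the size of the expanded region space. A small sketch of that calculation, using the 256 KB region size and 0x50‑byte entry size given in the text (the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define REGION_SIZE_BYTES (256ULL * 1024ULL) /* one region covers 256 KB */
#define REGION_ENTRY_BYTES 0x50ULL           /* per-entry size on Android 8-9 */

/* Bytes needed for a regions array that describes a region space
 * of region_space_size bytes. */
uint64_t expanded_regions_array_bytes(uint64_t region_space_size) {
    /* The space is expected to be a whole number of regions. */
    assert(region_space_size % REGION_SIZE_BYTES == 0);
    uint64_t num_regions = region_space_size / REGION_SIZE_BYTES;
    return num_regions * REGION_ENTRY_BYTES;
}
```

For example, a 1 GB region space has 4096 regions, so its array needs 4096 × 0x50 = 327,680 bytes.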
expand_regions_size = (region_space_size / 256KB) * 0x50
2.2.3 Expanding the live‑bitmap
A new live bitmap with 8‑byte object alignment is created using SpaceBitmap<8byte>::Create, then the original bitmap data is copied into it (aligned to 512‑byte boundaries).
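The bitmap size and the 512‑byte figure can be related with a little arithmetic. The sketch below assumes one bit per 8‑byte‑aligned heap slot and 64‑bit bitmap words; neither assumption is stated in the article, and the function names are hypothetical. Under those assumptions one bitmap word covers 8 × 64 = 512 bytes of heap.

```c
#include <assert.h>
#include <stdint.h>

#define OBJECT_ALIGNMENT 8ULL /* assumption: one bit per 8-byte slot */
#define BITS_PER_WORD 64ULL   /* assumption: 64-bit bitmap words */

/* Bytes of bitmap storage needed to cover heap_capacity bytes of heap. */
uint64_t live_bitmap_bytes(uint64_t heap_capacity) {
    assert(heap_capacity % OBJECT_ALIGNMENT == 0);
    uint64_t bits = heap_capacity / OBJECT_ALIGNMENT;
    uint64_t words = (bits + BITS_PER_WORD - 1) / BITS_PER_WORD; /* round up */
    return words * (BITS_PER_WORD / 8);
}

/* Heap bytes described by a single bitmap word. */
uint64_t heap_bytes_per_bitmap_word(void) {
    return OBJECT_ALIGNMENT * BITS_PER_WORD;
}
```

On these assumptions a 1 GB space needs a 16 MB bitmap, and copying on 512‑byte heap boundaries amounts to copying whole bitmap words.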
2.2.4 Updating MemMap addresses
Offsets for MemMap fields are hard‑coded after runtime disassembly; they are applied with region‑size alignment.
2.2.5 Re‑adding region space to the heap
After expansion, the old space is removed and the new one added via RemoveSpace and AddSpace symbols in libart.so.
2.2.6 Restoring state
Resume the VM.
Trigger FinishGC.
Update heap capacity/growth limits.
Unblock GC/trim.
2.3 Key offset anchors
Important offsets are obtained from:
RegionSpace::FromSpaceSize() – provides the Region entry size (0x50), num_regions_offset (0xb0), and regions_offset (0xc0).
Heap constructor – gives the MemMap offsets.
DlMallocSpace::Clear – provides the begin/size/base_begin/base_size offsets.
2.4 Results
On Android 8‑9 devices, FD‑related crashes dropped by 8.8 %, freezes by 4.8 %, and OOMs by 6.93 %; cases where heap usage remained above 90 % after GC fell by 73.34 %.
3. Expanding FD/FD_SET limits
3.1 Technical solution
Increase the kernel‑level FD limit via setrlimit(RLIMIT_NOFILE, …).
Override the libc fd_set size by hooking select / pselect (which ultimately call __pselect6) and providing larger buffers.
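The first step can be sketched with the standard getrlimit/setrlimit calls. This is a minimal example rather than Douyin's code: the helper name raise_fd_soft_limit is hypothetical, and the target value is up to the caller. An unprivileged process can raise its soft limit only as far as the hard limit, so the request is clamped to it.

```c
#include <sys/resource.h>

/* Tries to raise the soft RLIMIT_NOFILE limit to `wanted`, clamped to the
 * hard limit. Returns 0 on success (including "already high enough"),
 * -1 on failure. */
int raise_fd_soft_limit(rlim_t wanted) {
    struct rlimit limit;
    if (getrlimit(RLIMIT_NOFILE, &limit) != 0) {
        return -1;
    }
    /* Never request more than the hard limit. */
    rlim_t target = wanted < limit.rlim_max ? wanted : limit.rlim_max;
    if (limit.rlim_cur >= target) {
        return 0; /* nothing to do */
    }
    limit.rlim_cur = target;
    return setrlimit(RLIMIT_NOFILE, &limit);
}
```

Raising the limit alone does not make select() safe for descriptors at or above FD_SETSIZE, which is why the second step is still needed.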
3.1.1 Expanding FD_SET in user space
All FD_SET, FD_CLR, FD_ISSET macros are redirected to checked versions that accept a size argument. Inline hooks create a peer expanded fd_set on the heap and map the original stack‑allocated set to it.
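The idea of a size‑checked, heap‑allocated set can be illustrated with a small bitset. Everything here is a hypothetical sketch: the ExpandedFdSet type and its functions are not the article's API, and the word‑array layout is only assumed to resemble the usual Linux fd_set.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

typedef struct {
    size_t capacity;     /* number of descriptors the set can represent */
    unsigned long *bits; /* one bit per descriptor */
} ExpandedFdSet;

/* Allocates an empty set able to hold descriptors 0..capacity-1. */
ExpandedFdSet *expanded_fd_set_create(size_t capacity) {
    ExpandedFdSet *set = malloc(sizeof(*set));
    if (set == NULL) {
        return NULL;
    }
    size_t words = (capacity + WORD_BITS - 1) / WORD_BITS;
    set->bits = calloc(words, sizeof(unsigned long));
    if (set->bits == NULL) {
        free(set);
        return NULL;
    }
    set->capacity = capacity;
    return set;
}

/* Checked counterpart of FD_SET: refuses out-of-range descriptors. */
bool expanded_fd_set_add(ExpandedFdSet *set, int fd) {
    assert(set != NULL);
    if (fd < 0 || (size_t)fd >= set->capacity) {
        return false;
    }
    set->bits[(size_t)fd / WORD_BITS] |= 1UL << ((size_t)fd % WORD_BITS);
    return true;
}

/* Checked counterpart of FD_CLR. */
void expanded_fd_set_remove(ExpandedFdSet *set, int fd) {
    assert(set != NULL);
    if (fd >= 0 && (size_t)fd < set->capacity) {
        set->bits[(size_t)fd / WORD_BITS] &= ~(1UL << ((size_t)fd % WORD_BITS));
    }
}

/* Checked counterpart of FD_ISSET. */
bool expanded_fd_set_contains(const ExpandedFdSet *set, int fd) {
    assert(set != NULL);
    if (fd < 0 || (size_t)fd >= set->capacity) {
        return false;
    }
    return (set->bits[(size_t)fd / WORD_BITS] >> ((size_t)fd % WORD_BITS)) & 1UL;
}

void expanded_fd_set_destroy(ExpandedFdSet *set) {
    if (set != NULL) {
        free(set->bits);
        free(set);
    }
}
```

The key difference from the stock macros is the bounds check: an out‑of‑range descriptor is rejected instead of silently writing past the end of a fixed 1024‑bit buffer.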
fd_set *get_expanded_fd_set(fd_set *origin_fd_set)
fd_set *add_expanded_fd_set(fd_set *origin_fd_set)
void release_expanded_fd_set(int fd)
3.2 Results
FD/FD_SET overflow issues on Android 9 and below were virtually eliminated, reducing crashes by 7.23 %.
4. M:N Transparent User‑Level Threading
4.1 Basics
On many devices running Android 8 and below, OEMs limit an app to 500 threads+processes. Douyin implemented a transparent M:N scheduler that multiplexes many pthreads onto a smaller number of Linux lightweight processes (LWPs).
4.2 Technical solution
Intercept the clone syscall to create a transparent proxy for thread creation.
Hook pthread_exit to prevent the underlying LWP from terminating.
Use a periodic POSIX timer that sends a real‑time signal to preempt the currently running thread.
In the signal handler, capture the full thread context (general registers, pstate, tpidr_el0, floating‑point/vector registers) via ucontext_t and a custom extra_context field.
Store each thread’s context in a scheduler queue; on each timer tick, save the current context and restore the next thread’s context.
Handle non‑restartable syscalls (e.g., select, poll, nanosleep, signal‑wait syscalls) by delegating them to a dedicated daemon VCPU thread or by implementing custom restart logic.
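The save‑and‑restore loop described in these steps can be illustrated with a cooperative round‑robin scheduler built on the ucontext API. This is a simplified stand‑in, not Douyin's implementation: tasks yield voluntarily instead of being preempted by a timer signal, no tpidr_el0 or vector state is handled, and names such as run_round_robin and task_body are hypothetical. It also assumes makecontext/swapcontext are available, as they are on glibc; Android's Bionic may not provide them.

```c
#include <assert.h>
#include <ucontext.h>

#define NUM_TASKS 3
#define STEPS_PER_TASK 2
#define STACK_SIZE (64 * 1024)
#define TRACE_MAX 16

static ucontext_t scheduler_ctx;        /* context of the scheduling loop */
static ucontext_t task_ctx[NUM_TASKS];  /* saved context of each task */
static char task_stack[NUM_TASKS][STACK_SIZE];
static int task_done[NUM_TASKS];
static int current_task;
static int trace[TRACE_MAX];            /* order in which tasks ran */
static int trace_len;

/* Each task records that it ran, then yields back to the scheduler. */
static void task_body(void) {
    for (int step = 0; step < STEPS_PER_TASK; step++) {
        assert(trace_len < TRACE_MAX);
        trace[trace_len++] = current_task;
        /* Save this task's context and restore the scheduler's. */
        swapcontext(&task_ctx[current_task], &scheduler_ctx);
    }
    task_done[current_task] = 1;
    /* Returning resumes scheduler_ctx through uc_link. */
}

/* Runs all tasks to completion on the calling thread; returns the number
 * of recorded steps. Intended to be called once. */
int run_round_robin(void) {
    for (int i = 0; i < NUM_TASKS; i++) {
        getcontext(&task_ctx[i]);
        task_ctx[i].uc_stack.ss_sp = task_stack[i];
        task_ctx[i].uc_stack.ss_size = STACK_SIZE;
        task_ctx[i].uc_link = &scheduler_ctx;
        makecontext(&task_ctx[i], task_body, 0);
    }
    int remaining = NUM_TASKS;
    while (remaining > 0) {
        for (current_task = 0; current_task < NUM_TASKS; current_task++) {
            if (task_done[current_task]) {
                continue;
            }
            /* Save the scheduler's context and restore the next task's. */
            swapcontext(&scheduler_ctx, &task_ctx[current_task]);
            if (task_done[current_task]) {
                remaining--;
            }
        }
    }
    return trace_len;
}
```

In a preemptive design like the one described above, the swap would be driven from the timer signal handler rather than by an explicit yield, which is why the full register state has to be captured from the signal's ucontext_t.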
4.3 Effects
The prototype runs up to 15 Java threads and 3 native pthreads on a single LWP, effectively bypassing the 500‑thread cap. Although timer‑based preemption adds overhead compared to native threads, it dramatically improves stability for legacy devices.
ByteDance SE Lab
Official account of ByteDance SE Lab, sharing research and practical experience in software engineering. Our lab unites researchers and engineers from various domains to accelerate the fusion of software engineering and AI, driving technological progress in every phase of software development.
