How ByteDance Tackled C++ Compilation Bottlenecks and Massive Binary Bloat
ByteDance's STE team dissected the severe compile‑time delays and oversized binary artifacts in their data‑center C++ applications, presenting root‑cause analyses, LLVM bug fixes, and a suite of optimization techniques that together cut build times by up to 50% and reduced binary size by over 80%.
Background
ByteDance's data‑center C++ applications have grown in complexity, exposing two critical problems for the C++ compilation toolchain: long compilation times and excessively large binary artifacts. The STE team investigated the root causes and implemented solutions that dramatically reduced build time and binary size.
C++ Compilation Toolchain Overview
The typical toolchain consists of a compiler (clang, gcc), a linker (ld, gold, lld), a debugger (gdb, lldb), binutils, high‑performance libraries, the STL, libc, and runtime libraries such as ASAN, unwind, and coverage instrumentation.
Challenges Faced
Large data‑center services experience compilation tail‑latency up to 60 minutes and binary sizes up to 10 GB. Specific challenges include:
Performance pressure on every toolchain component when combined with FDO/ThinLTO, ASAN, and coverage.
Upgrade difficulties due to differing compiler handling of undefined behavior, ABI incompatibilities, and multi‑architecture support.
Complexity of advanced optimizations such as ThinLTO, AutoFDO, and Propeller.
Impact of instrumentation tools (coverage, sanitizers, PGO, XRay) on binary size.
Case Study 1: GVN Bottleneck in clang 11 / gcc 8
Using -ftime-trace, the team found that the Global Value Numbering (GVN) pass consumed 1637 s in the clang 11/gcc 8 configuration, while the same pass took only 177 s with clang 16/gcc 13. The root cause was heavy use of std::char_traits::length() on constant strings, which pulled in an inlined __constant_string_p loop at each call site.
Back‑porting a patch from GCC 9.1 eliminated the loop and reduced compilation time by roughly 50%.
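The pattern behind this hot spot is easy to reproduce in miniature. The sketch below is only an illustration, not code from the ByteDance services: kServiceName, kLiteralLength, and is_known_tag are hypothetical names, and C++17 is assumed. Every std::string_view built from a string literal calls std::char_traits<char>::length() to find the terminator, so a code base full of such constants asks the optimizer to fold that call at a very large number of sites.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical constant: constructing a string_view from a literal goes
// through std::char_traits<char>::length() to locate the terminating '\0'.
constexpr std::string_view kServiceName = "recommendation-service";

// The same function called directly on a constant string.
constexpr std::size_t kLiteralLength = std::char_traits<char>::length("ThinLTO");

// Hypothetical helper: each comparison converts a literal to a string_view,
// which adds three more implicit length() calls on constants.
constexpr bool is_known_tag(std::string_view tag) {
    return tag == "asan" || tag == "coverage" || tag == "xray";
}

// Compile-time checks: the compiler has to evaluate every length() call.
static_assert(kServiceName.size() == 22);
static_assert(kLiteralLength == 7);
static_assert(is_known_tag("coverage"));
static_assert(!is_known_tag("gvn"));
```

Compiling a file like this with -ftime-trace under different compiler and standard library versions is one way to see how much time the optimizer spends on such calls, although a toy example will not reproduce the scale reported above.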
Case Study 2: AutoFDO + ThinLTO
Enabling AutoFDO together with ThinLTO caused BranchProbabilityAnalysis() to dominate compile time (≈ 2/3 of total, > 100 min) due to a massive increase in basic blocks. The team contributed a cache for loop‑exit blocks to LLVM (PR 93451), cutting compile time to 17.4% of the original.
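The general idea of caching loop‑exit blocks can be sketched with a toy control‑flow graph. This is not the LLVM patch and does not use LLVM's data structures: Loop, ExitBlockCache, and exit_cache_demo are hypothetical names, and blocks are plain integers. It only shows why memoizing the exit set per loop helps when the same loop is queried many times.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical toy loop: an id plus the CFG blocks that belong to it.
struct Loop {
    int id;
    std::vector<int> blocks;
};

// Memoizes the exit blocks of each loop so repeated queries do not
// rescan every block and successor edge of the loop.
class ExitBlockCache {
public:
    // succs[b] lists the successor blocks of block b.
    explicit ExitBlockCache(std::vector<std::vector<int>> succs)
        : succs_(std::move(succs)) {}

    const std::vector<int>& exits(const Loop& loop) {
        auto it = cache_.find(loop.id);
        if (it != cache_.end()) return it->second;  // cache hit

        ++computations_;  // cache miss: scan the loop once
        std::vector<int> result;
        for (int b : loop.blocks)
            for (int s : succs_[b])
                if (!contains(loop.blocks, s) && !contains(result, s))
                    result.push_back(s);
        return cache_.emplace(loop.id, std::move(result)).first->second;
    }

    std::size_t computations() const { return computations_; }

private:
    static bool contains(const std::vector<int>& v, int x) {
        for (int e : v)
            if (e == x) return true;
        return false;
    }

    std::vector<std::vector<int>> succs_;
    std::unordered_map<int, std::vector<int>> cache_;
    std::size_t computations_ = 0;
};

// Small self-check: blocks 1 and 2 form a loop whose only exit is block 3.
inline bool exit_cache_demo() {
    ExitBlockCache cache({{1}, {2}, {1, 3}, {}});
    const Loop loop{0, {1, 2}};
    const std::vector<int>& first = cache.exits(loop);
    const bool exits_ok = first.size() == 1 && first[0] == 3;
    for (int i = 0; i < 1000; ++i) cache.exits(loop);  // all cache hits
    return exits_ok && cache.computations() == 1;
}
```

Without the cache, each of the thousand repeated queries would walk the loop again. With a large increase in basic blocks, that kind of repeated scan is the sort of cost a per‑loop cache avoids.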
Large Binary Challenges
Binary sizes of several gigabytes lead to relocation overflow errors at link time, including in the DWARF debug sections. The article explains the mechanics of relocation overflow and shows how excessive .text, .rodata, .bss, and .debug sections trigger out‑of‑range symbol references.
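The arithmetic behind an out‑of‑range reference can be shown with one common case. The sketch below assumes x86‑64 and the PC‑relative relocation R_X86_64_PC32, which the psABI defines as S + A - P stored in a signed 32‑bit field; the source does not say which relocation types were involved, and fits_pc32, kGiB, and kTextStart are hypothetical names chosen for illustration.

```cpp
#include <cstdint>
#include <limits>

// R_X86_64_PC32 stores S + A - P in a signed 32-bit field, where
// S = symbol address, A = addend, P = address of the field being patched.
// The linker reports a relocation overflow when the value does not fit.
constexpr bool fits_pc32(std::uint64_t S, std::int64_t A, std::uint64_t P) {
    const std::int64_t value =
        static_cast<std::int64_t>(S) + A - static_cast<std::int64_t>(P);
    return value >= std::numeric_limits<std::int32_t>::min() &&
           value <= std::numeric_limits<std::int32_t>::max();
}

constexpr std::uint64_t kGiB = 1ull << 30;
constexpr std::uint64_t kTextStart = 0x400000;  // hypothetical load address

// A symbol 1 GiB past the reference site is within the +/-2 GiB reach.
static_assert(fits_pc32(kTextStart + kGiB, -4, kTextStart));

// A symbol 3 GiB away, e.g. data placed after multi-gigabyte sections,
// no longer fits in 32 bits, so the link fails.
static_assert(!fits_pc32(kTextStart + 3 * kGiB, -4, kTextStart));
```

The ±2 GiB reach is why growth in any of the large sections can push an otherwise unchanged reference out of range.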
Optimization Strategies
Adjust compiler flags to shrink debug sections and enable size‑optimizing options (e.g., -Os, -Oz, -flto, -fdata-sections, -ffunction-sections, and the linker option --gc-sections).
Adopt Split DWARF to separate debug information into DWO files.
Patch llvm‑dwp to handle large .debug_info sections correctly.
Disable whole‑archive linking for rarely used libraries.
Use outline instrumentation for sanitizers (-fsanitize-address-outline-instrumentation) and disable global variable instrumentation.
Perform code‑base cleanup guided by coverage tools.
Community Contributions
The STE team contributed several LLVM bug fixes and performance patches, including PR 88477, PR 93451, PR 96188, and PR 95771, making the fixes available to the open‑source community.
Future Directions
Ongoing work includes exploring DWARF5/64, extending BOLT/Propeller/AutoFDO integration, improving sanitizer reporting with LLMs, and leveraging ClangIR for deeper static analysis.
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.