How ByteDance Scaled Profile‑Guided Optimization to Boost CPU Efficiency
This article explains ByteDance's large‑scale adoption of profile‑guided optimization (PGO), covering its principles, instrumentation and sampling methods, the automated platform built for data collection and compilation, and the resulting performance gains across dozens of critical services.
Background
As ByteDance's business expands rapidly, optimizing microservice performance becomes essential; even a few percentage points of improvement can save massive server‑resource costs.
Compiler optimization, especially profile‑guided optimization (PGO), offers broad applicability and significant performance gains, reducing overall costs. The STE team has continuously explored PGO techniques and successfully deployed them at scale within ByteDance.
PGO Overview
PGO (also known as FDO, feedback‑directed optimization) feeds runtime profiling information back into the compiler so that optimization decisions reflect how the program actually executes. It typically delivers 10%‑15% performance improvements, and up to 30% for specific workloads such as the Clang compiler itself.
Typical PGO techniques include:
Inlining: Inline small, frequently called functions, using profile‑derived call counts to adjust inlining thresholds and costs.
ICP (Indirect Call Promotion): For a hot indirect call, insert a comparison against its most frequent target and call that target directly, which improves branch prediction and exposes the target to inlining.
Register allocation: Use execution frequencies to keep values on hot paths in registers and push spills onto cold paths.
Basic block optimization: Reorder basic blocks so that frequently executed ones are contiguous, improving instruction locality.
Size/speed optimization: Choose performance‑oriented strategies for hot functions and size‑oriented strategies for cold ones.
Function layout: Place functions that often execute together close to each other in the binary.
Conditional branch optimization: Lay out the likely successor directly after the comparison as the fall‑through path, increasing icache hits and reducing taken branches.
Memory intrinsics: Expand or optimize intrinsics such as memcpy based on call frequency.
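To make ICP more concrete, the sketch below shows the kind of decision a compiler could make from a value profile of one indirect call site: promote only the targets that account for a large enough share of calls. This is an illustration, not LLVM's actual heuristic; the function name, the 30% threshold, and the target names are all assumptions.

```python
from collections import Counter

def pick_promotion_targets(target_counts, min_share=0.3, max_targets=2):
    """Return the call targets hot enough to guard with a direct-call check.

    target_counts maps a callee name to how often one indirect call site
    reached it. min_share and max_targets are illustrative knobs, not
    values taken from any real compiler.
    """
    total = sum(target_counts.values())
    if total == 0:
        return []
    promoted = []
    # Consider only the most frequent targets, hottest first.
    for target, count in Counter(target_counts).most_common(max_targets):
        if count / total >= min_share:
            promoted.append(target)
    return promoted

# Hypothetical value profile for a single indirect call site.
site_profile = {"render_video": 900, "render_image": 80, "render_text": 20}
print(pick_promotion_targets(site_profile))  # -> ['render_video']
```

Only render_video is promoted here, so the rewritten call site would test for that target first and fall back to the original indirect call for everything else.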
Profiling information can be obtained via two main approaches: instrumentation‑based and sample‑based.
Instrumentation‑Based PGO
Process: (1) Compile source with instrumentation to produce an instrumented binary; (2) Run the instrumented binary to generate a profile file; (3) Recompile the original source using the profile.
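The three steps above map onto a short sequence of Clang and llvm-profdata invocations. The sketch below only builds and prints those command lines; the file names (main.c, app, pgo-raw, app.profdata) are placeholders, and it is not a description of ByteDance's build pipeline.

```python
import shlex

def instrumentation_pgo_commands(source, binary="app",
                                 raw_dir="pgo-raw", profdata="app.profdata"):
    """Return the argv lists for an instrumentation-based PGO build.

    All paths are illustrative defaults. The flags are the ones described
    in the Clang user manual for instrumentation-based PGO.
    """
    instrumented = f"{binary}.instr"
    return [
        # (1) Build an instrumented binary; raw profiles are written to raw_dir.
        ["clang", "-O2", f"-fprofile-generate={raw_dir}", source, "-o", instrumented],
        # (2) Run it on a representative workload to emit .profraw files.
        [f"./{instrumented}"],
        # Merge the raw profiles into the indexed format the compiler reads.
        ["llvm-profdata", "merge", f"-output={profdata}", raw_dir],
        # (3) Rebuild the original source, guided by the merged profile.
        ["clang", "-O2", f"-fprofile-use={profdata}", source, "-o", binary],
    ]

for argv in instrumentation_pgo_commands("main.c"):
    print(shlex.join(argv))
```

Listing the commands makes the cost visible: every profile refresh needs an extra build plus a run of the slower instrumented binary.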
Instrumentation inserts counters and probes to collect execution counts and indirect call addresses, but the workflow is cumbersome for large‑scale deployment due to longer compile times, frequent code changes, and added runtime overhead that can distort the true performance characteristics.
Sampling‑Based PGO
Process: (1) Compile source with debug information; (2) Run the program with a profiler (e.g., Linux perf) to collect binary‑level samples; (3) Translate samples to source‑line profiles and recompile using the profile.
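For comparison, the sampling workflow can be sketched as the following command sequence, based on the sample-profile flow in the Clang user manual. The file names are placeholders, llvm-profgen is one possible converter (AutoFDO's create_llvm_prof is another), and perf's -b option assumes hardware with branch-record support. The article does not say which exact tools or flags ByteDance uses.

```python
import shlex

def sampling_pgo_commands(source, binary="app",
                          perf_data="perf.data", profile="app.prof"):
    """Return the argv lists for a sampling-based PGO build.

    Paths are illustrative defaults; the tool choice is an assumption.
    """
    return [
        # (1) Build with line-table debug info so samples map back to source.
        ["clang", "-O2", "-gline-tables-only", source, "-o", binary],
        # (2) Sample the ordinary optimized binary with branch records.
        ["perf", "record", "-b", "-o", perf_data, f"./{binary}"],
        # (3a) Convert binary-level samples into an LLVM sample profile.
        ["llvm-profgen", f"--binary=./{binary}",
         f"--perfdata={perf_data}", f"--output={profile}"],
        # (3b) Recompile using the sample profile.
        ["clang", "-O2", "-gline-tables-only",
         f"-fprofile-sample-use={profile}", source, "-o", binary],
    ]

for argv in sampling_pgo_commands("main.c"):
    print(shlex.join(argv))
```

Unlike the instrumented flow, step (2) runs the same binary that ships, so no separate instrumented build is needed.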
Sampling incurs minimal overhead, tolerates moderately stale profiles, and needs no separate instrumented build, so profiles can be collected from the binaries that are already running. This makes it more suitable for ByteDance's fast‑moving services.
System Design and Implementation
The team built an end‑to‑end platform that automates data collection, binary and symbol management, profile generation, and integration into the build pipeline. Key tasks include:
Cluster‑level business performance data collection, maintenance, and processing.
Management of production binaries and associated debug information.
Storage and querying of sampled data.
Generation of LLVM‑compatible PGO profiles.
Profile updates, performance testing, and release automation.
Binary repositories store all online binaries and their metadata; a service extracts symbol information for PGO use. The PGO platform allows users to define strategies specifying target binaries, required sample volume, and time windows, and automatically creates scheduled jobs to generate and upload profiles.
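A strategy of this kind could be modeled roughly as follows. This is a hypothetical sketch: the class names, fields, and readiness check are assumptions chosen to mirror the three settings mentioned above (target binary, required sample volume, time window), not the platform's real schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class ProfileStrategy:
    """Hypothetical user-defined strategy for one service."""
    target_binary: str   # which production binary to profile
    min_samples: int     # required sample volume
    window: timedelta    # how far back samples are still usable

@dataclass(frozen=True)
class SampleBatch:
    """Hypothetical record of samples collected from the cluster."""
    binary: str
    collected_at: datetime
    sample_count: int

def ready_to_generate(strategy, batches, now):
    """True once enough in-window samples exist for the target binary."""
    cutoff = now - strategy.window
    usable = sum(
        b.sample_count
        for b in batches
        if b.binary == strategy.target_binary and b.collected_at >= cutoff
    )
    return usable >= strategy.min_samples

now = datetime(2024, 1, 8)
strategy = ProfileStrategy("feed-service", min_samples=1_000_000, window=timedelta(days=3))
batches = [
    SampleBatch("feed-service", datetime(2024, 1, 7), 700_000),
    SampleBatch("feed-service", datetime(2024, 1, 6), 400_000),
    SampleBatch("feed-service", datetime(2024, 1, 1), 900_000),  # outside the window
    SampleBatch("other-service", datetime(2024, 1, 7), 5_000_000),
]
print(ready_to_generate(strategy, batches, now))  # -> True
```

A scheduled job would evaluate a check like this before generating and uploading a new profile, so that profiles built from too few or too old samples are never published.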
Results
PGO has been deployed to more than 30 leading ByteDance applications, achieving clear CPU and latency improvements, with an average CPU reduction exceeding 5%.
Conclusion
Large‑scale PGO adoption yields significant benefits; the STE team will continue to enhance the system, explore combined PGO/LTO optimizations, introduce post‑link optimizers, and investigate software prefetching based on sampling data to further increase ByteDance's performance gains.
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.