How Kuaishou Boosted Build Performance with AutoFDO, ThinLTO, BOLT, and Propeller

This article details Kuaishou's systematic compiler and build-system optimizations, including AutoFDO, ThinLTO, BOLT, and an improved Propeller. It describes how the team reduced compilation time from hours to seconds, cut CPU usage by 10%, and achieved performance gains of up to 15%, while solving profile-staleness and integration challenges.

Kuaishou Tech

Overview

Kuaishou's compiler team built a high‑performance, stable, and easy‑to‑use compilation ecosystem called KBuild, which supports the majority of internal C++ services. By combining distributed caching, incremental compilation, and aggressive optimizations, they reduced full‑project build times from over an hour to under five minutes.

Key Optimizations

AutoFDO

Profile-guided optimization (PGO) uses runtime execution profiles to guide the compiler. AutoFDO is the sampling-based form of PGO: Kuaishou collects sample-based profiles with perf, converts them into AutoFDO data, and feeds that data back to the compiler, which can then, for example, raise inline thresholds for hot functions. This yields an average 8% speed-up.
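As a rough illustration of the feedback loop described above, the sketch below aggregates sampled function names, picks the functions that cover most of the samples, and assigns them a larger inline budget. It is a toy model, not the AutoFDO implementation: the threshold values, the 90% coverage cutoff, and all function names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical values, chosen only for illustration.
BASE_INLINE_THRESHOLD = 225
HOT_INLINE_THRESHOLD = 3000
HOT_FRACTION = 0.90  # functions covering 90% of samples count as hot


def aggregate_samples(samples):
    """Count how many samples landed in each function."""
    return Counter(samples)


def hot_functions(counts, hot_fraction=HOT_FRACTION):
    """Return the smallest prefix of functions (by count) covering hot_fraction."""
    total = sum(counts.values())
    hot, covered = set(), 0
    for name, count in counts.most_common():
        if covered >= hot_fraction * total:
            break
        hot.add(name)
        covered += count
    return hot


def inline_threshold(callee, hot):
    """Give hot callees a larger inline budget than everything else."""
    return HOT_INLINE_THRESHOLD if callee in hot else BASE_INLINE_THRESHOLD


# Toy sample stream: each entry is the function a perf sample fell into.
samples = ["decode"] * 70 + ["rank"] * 25 + ["log"] * 4 + ["init"] * 1
counts = aggregate_samples(samples)
hot = hot_functions(counts)
```

Here `decode` and `rank` together cover 95% of the samples, so they receive the larger budget while `log` and `init` keep the base threshold.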

ThinLTO

ThinLTO creates per‑module summary files during compilation, allowing parallel link‑time optimization across modules. Compared with Full LTO (which can take an hour), ThinLTO reduces link time to about ten minutes while still providing cross‑module inlining and dead‑code elimination.
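The idea of deciding cross-module imports from summaries alone, then running each module's backend in parallel, can be sketched as follows. This is a simplified model rather than LLVM's actual data structures: `FunctionSummary`, the size cutoff, and the module names are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class FunctionSummary:
    module: str   # object file that defines the function
    size: int     # instruction count (hypothetical metric)
    callees: list = field(default_factory=list)


IMPORT_SIZE_LIMIT = 100  # hypothetical cutoff for importing a callee


def imports_for(module, index):
    """Pick small cross-module callees using summaries only (no code loaded)."""
    wanted = set()
    for fn in index.values():
        if fn.module != module:
            continue
        for callee in fn.callees:
            target = index.get(callee)
            if target and target.module != module and target.size <= IMPORT_SIZE_LIMIT:
                wanted.add(callee)
    return module, sorted(wanted)


def plan_backends(index):
    """Compute the import list of every module; modules are independent."""
    modules = sorted({fn.module for fn in index.values()})
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(lambda m: imports_for(m, index), modules))


# The combined index is just the union of all per-module summaries.
index = {
    "main": FunctionSummary("app.o", 40, ["parse", "render"]),
    "parse": FunctionSummary("parser.o", 30, ["lex"]),
    "lex": FunctionSummary("parser.o", 20),
    "render": FunctionSummary("ui.o", 500),
}
plan = plan_backends(index)
```

Because each module's import list depends only on the shared index, the per-module work has no ordering constraints, which is what makes the parallel backends possible.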

BOLT

BOLT (Binary Optimization and Layout Tool) performs post-link binary re-layout, improving branch prediction and cache locality. By placing hot basic blocks next to each other, BOLT reduces instruction-cache (i-cache) and iTLB misses, delivering up to a 12% runtime improvement.
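To make "moving hot basic blocks together" concrete, here is a minimal greedy layout sketch: it merges blocks into chains along the heaviest edges so that hot successors become fall-throughs, then moves never-executed blocks to a separate cold list. This is a simplified stand-in for illustration, not BOLT's actual reordering algorithm, and the block names and counts are invented.

```python
def layout_blocks(entry, block_counts, edge_counts):
    """Return (hot_order, cold_blocks) using greedy chain merging."""
    chain_of = {b: [b] for b in block_counts}  # every block starts alone
    heaviest_first = sorted(edge_counts.items(), key=lambda kv: -kv[1])
    for (src, dst), weight in heaviest_first:
        if weight == 0:
            continue  # never-taken edges do not influence the layout
        a, b = chain_of[src], chain_of[dst]
        # Merge only tail-to-head, and never place anything before the entry.
        if a is b or a[-1] != src or b[0] != dst or dst == entry:
            continue
        a.extend(b)
        for blk in b:
            chain_of[blk] = a

    # Collect the distinct chains in source order.
    chains, seen = [], set()
    for blk in block_counts:
        chain = chain_of[blk]
        if id(chain) not in seen:
            seen.add(id(chain))
            chains.append(chain)

    # Entry chain first, then the remaining chains from hottest to coldest.
    chains.sort(key=lambda c: (c[0] != entry, -max(block_counts[b] for b in c)))
    order = [b for c in chains for b in c]
    hot = [b for b in order if block_counts[b] > 0]
    cold = [b for b in order if block_counts[b] == 0]
    return hot, cold


# Source order puts the rarely used error path in the middle of the hot path.
block_counts = {"entry": 100, "check": 100, "error": 0, "work": 100, "exit": 100}
edge_counts = {
    ("entry", "check"): 100,
    ("check", "error"): 0,
    ("check", "work"): 100,
    ("work", "exit"): 100,
    ("error", "exit"): 0,
}
hot, cold = layout_blocks("entry", block_counts, edge_counts)
```

After layout, the hot path is contiguous and `error` no longer sits between `check` and `work`, so the hot code occupies fewer cache lines and pages.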

Propeller

Propeller is similar to BOLT but operates during the link phase. It clusters basic blocks based on profile hotness and reorders them to improve branch prediction and cache behavior. The original implementation was fragile: any source‑code or compiler‑flag change broke the profile mapping.
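The fragility is easy to see in a toy model where the profile is keyed by a block's position in the function. The block names and counts below are invented, and real tooling may reject a mismatched profile outright instead of silently misattributing counts.

```python
def attach_profile_by_id(blocks, profile_by_id):
    """Attach counts to blocks by position, as an ID-keyed profile would."""
    return {name: profile_by_id.get(i, 0) for i, name in enumerate(blocks)}


# Profile collected on the old binary: block ID -> execution count.
old_blocks = ["entry", "loop", "exit"]
profile_by_id = {0: 10, 1: 1000, 2: 10}

# A small source change adds one block, which shifts every later ID.
new_blocks = ["entry", "null_check", "loop", "exit"]

old_view = attach_profile_by_id(old_blocks, profile_by_id)
new_view = attach_profile_by_id(new_blocks, profile_by_id)
```

In `new_view` the hot count of 1000 lands on `null_check` rather than `loop`, so a layout based on it would optimize the wrong block.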

Challenges and Solutions

Two major pain points emerged:

Stale profiles caused optimization loss, especially for Propeller, which required exact basic-block IDs.

Propeller and AutoFDO could not be applied together because AutoFDO changed the generated binary, breaking Propeller's profile matching.

To address these, Kuaishou adopted the match‑and‑infer technique from Meta's BOLT solution. The workflow consists of:

Using a stable BasicBlockHash (derived from instruction content and CFG structure) instead of volatile IDs.

Matching basic blocks between the old and new binaries with three levels of strictness (loose, strict, full).

Applying a network‑flow inference algorithm to estimate weights for unmatched blocks, ensuring accurate hot‑cold classification.
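The first two steps can be sketched as below. The article does not define the three hash levels, so the definitions here are assumptions: "loose" covers opcodes only, "strict" adds operands, and "full" adds the loose hashes of the successor blocks. The `Block` type and the sample CFGs are likewise hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    instrs: tuple      # (opcode, operand) pairs
    succs: tuple = ()  # names of successor blocks


def _digest(*parts):
    """Stable across runs, unlike Python's built-in hash() for strings."""
    return hashlib.sha1("|".join(parts).encode()).hexdigest()[:12]


def block_hashes(block, cfg):
    loose = _digest(*(op for op, _ in block.instrs))
    strict = _digest(*(f"{op} {arg}" for op, arg in block.instrs))
    succ_loose = sorted(_digest(*(op for op, _ in cfg[s].instrs)) for s in block.succs)
    full = _digest(strict, *succ_loose)
    return {"full": full, "strict": strict, "loose": loose}


def match_blocks(old_cfg, new_cfg):
    """Map new block -> (old block, level), trying the strictest level first."""
    old_h = {n: block_hashes(b, old_cfg) for n, b in old_cfg.items()}
    new_h = {n: block_hashes(b, new_cfg) for n, b in new_cfg.items()}
    matches, used = {}, set()
    for level in ("full", "strict", "loose"):
        for new_name, nh in new_h.items():
            if new_name in matches:
                continue
            for old_name, oh in old_h.items():
                if old_name not in used and oh[level] == nh[level]:
                    matches[new_name] = (old_name, level)
                    used.add(old_name)
                    break
    return matches


old_cfg = {
    "entry": Block((("cmp", "r1"), ("jne", "loop")), ("loop", "exit")),
    "loop": Block((("add", "r2"), ("jmp", "entry")), ("entry",)),
    "exit": Block((("ret", ""),)),
}
new_cfg = {
    "entry": Block((("cmp", "r1"), ("jne", "loop")), ("guard", "exit")),
    "guard": Block((("test", "r4"), ("je", "exit")), ("loop", "exit")),  # new block
    "loop": Block((("add", "r3"), ("jmp", "entry")), ("entry",)),        # new register
    "exit": Block((("ret", ""),)),
}
matches = match_blocks(old_cfg, new_cfg)
```

In this example `exit` matches at the full level, `entry` only at the strict level because its successors changed, and `loop` only at the loose level because an operand changed. `guard` has no counterpart in the old binary and stays unmatched, which is where the inference step takes over.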

This approach allows Propeller to work with updated binaries and to be combined with AutoFDO without losing the benefits of either technique.
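To give a feel for the inference step, the sketch below fills in counts for unmatched blocks using flow conservation: a block's count is estimated as the flow its predecessors did not send to their other successors. This is a deliberately simplified stand-in for the network-flow algorithm mentioned above, and it is only exact when those sibling successors have a single predecessor. The CFG, counts, and hot threshold are invented.

```python
def infer_counts(succs, known):
    """Estimate counts for blocks missing from `known` via flow conservation."""
    preds = {b: [] for b in succs}
    for src, dsts in succs.items():
        for dst in dsts:
            preds[dst].append(src)

    counts = dict(known)
    changed = True
    while changed:  # repeat until no new block can be resolved
        changed = False
        for block in succs:
            if block in counts or not preds[block]:
                continue
            if any(p not in counts for p in preds[block]):
                continue
            inflow = 0
            for p in preds[block]:
                siblings = [s for s in succs[p] if s != block]
                if any(s not in counts for s in siblings):
                    break  # not enough information yet
                inflow += max(0, counts[p] - sum(counts[s] for s in siblings))
            else:
                counts[block] = inflow
                changed = True
    return counts


def hot_blocks(counts, threshold):
    return {b for b, c in counts.items() if c >= threshold}


# "check" and "slow" were not matched, so the stale profile has no count for them.
succs = {
    "entry": ["check"],
    "check": ["fast", "slow"],
    "fast": ["exit"],
    "slow": ["exit"],
    "exit": [],
}
known = {"entry": 100, "fast": 90, "exit": 100}
counts = infer_counts(succs, known)
hot = hot_blocks(counts, threshold=50)  # hypothetical threshold
```

Without inference, `check` would carry a count of zero and be laid out as cold even though every execution passes through it. With the inferred counts it is classified as hot, and `slow` is correctly left cold.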

Results

After integrating the improved Propeller with match‑and‑infer, Kuaishou achieved:

Stable performance gains even with stale profiles: a gain of roughly 8% is retained with a profile that is one year old.

Combined AutoFDO + Propeller improvements of up to 15.6% versus 13.2% for AutoFDO alone.

Link-stage work reduced to about 30 seconds, while maintaining a 10% reduction in CPU usage and a 4-8% latency improvement across services.

These optimizations have been open-sourced and merged into the LLVM project, providing a reusable best practice for large-scale C++ build pipelines.

Tags: Compiler, Build System, BOLT, AutoFDO, Propeller, ThinLTO