Large-Scale C/C++ Service Compilation Performance Optimization and Platformization (OMAX)
The article details OMAX’s end‑to‑end platform for large‑scale C/C++ service compilation, covering optimization flags, profile‑guided and link‑time techniques, Facebook BOLT post‑link tuning, and real‑world results that cut CPU use, latency and deployment time while shrinking binary size.
This article presents a comprehensive study and engineering practice of compilation performance optimization for large‑scale C/C++ services. It first outlines the background: as backend C++ modules grow in size and complexity, performance degradation and high optimization cost become critical challenges.
Compile‑time optimizations are introduced first, including the classic -O0, -O1, -O2, -O3, -Ofast, -Og, -Os, and -Oz flags, as well as CPU‑specific instruction‑set extensions (MMX, SSE, AVX, AVX512, AMX). The article discusses when to use each flag, the trade‑offs involved, and how to apply a global parameter configuration (e.g., -O3 for production and -Og for development). It also covers the code‑generation options -fpic and -fno-pic, used for shared and static libraries respectively.
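The global parameter configuration described above can be pictured as a small lookup from build profile to flags. The sketch below is only an illustration: the helper name compiler_flags, the profile names, and the choice to add -g to development builds are assumptions, not part of OMAX.

```python
# Hypothetical helper (not from OMAX): map a build profile to compiler flags.
# The -O3 / -Og split and the -fpic / -fno-pic split follow the article.
OPT_LEVEL = {"production": "-O3", "development": "-Og"}


def compiler_flags(profile, shared_library=False, extra=()):
    """Return the flag list for one translation unit."""
    if profile not in OPT_LEVEL:
        raise ValueError(f"unknown build profile: {profile!r}")
    flags = [OPT_LEVEL[profile]]
    if profile == "development":
        flags.append("-g")  # assumption: debug info accompanies -Og
    # Position-independent code for shared libraries, plain code otherwise.
    flags.append("-fpic" if shared_library else "-fno-pic")
    flags.extend(extra)  # e.g. an ISA flag such as -mavx2
    return flags


if __name__ == "__main__":
    print(compiler_flags("production"))
    print(compiler_flags("development", shared_library=True))
```

Keeping the mapping in one place means a whole code base can switch optimization level by changing a single table entry rather than editing each module's build file.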
Profile‑guided optimizations (PGO) are detailed with two techniques:
FDO – compile an instrumented binary (gcc test.c -o test_instrumented -fprofile-generate), run it to produce test.gcda, then recompile with -fprofile-use=test.gcda.
AutoFDO – collect runtime data with perf record -b -e br_inst_retired.near_taken:pp -- ./test, convert it using create_gcov --binary=./test --profile=perf.data --gcov=test.gcov -gcov_version=1, and compile with -fauto-profile=test.gcov.
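The two PGO workflows above can be written out as ordered command lists, which is roughly what an automation layer has to drive. This is a minimal sketch: the function names and the output names test_fdo and test_autofdo are hypothetical, the commands themselves are the ones quoted in the article, and nothing is executed here. The exact .gcda file name can vary with the GCC version and how the build is invoked.

```python
import shlex


def fdo_commands(source="test.c", profile="test.gcda"):
    """Instrument, run to collect a profile, then rebuild with it."""
    return [
        ["gcc", source, "-o", "test_instrumented", "-fprofile-generate"],
        ["./test_instrumented"],  # running the binary writes the profile
        # "test_fdo" is a hypothetical output name
        ["gcc", source, "-o", "test_fdo", f"-fprofile-use={profile}"],
    ]


def autofdo_commands(source="test.c", binary="./test"):
    """Sample an already built binary with perf, convert, then rebuild."""
    return [
        ["perf", "record", "-b", "-e", "br_inst_retired.near_taken:pp",
         "--", binary],
        ["create_gcov", f"--binary={binary}", "--profile=perf.data",
         "--gcov=test.gcov", "-gcov_version=1"],
        # "test_autofdo" is a hypothetical output name
        ["gcc", source, "-o", "test_autofdo", "-fauto-profile=test.gcov"],
    ]


def render(commands):
    """Format the steps as shell lines, for review or logging."""
    return "\n".join(shlex.join(cmd) for cmd in commands)


if __name__ == "__main__":
    print(render(fdo_commands()))
    print(render(autofdo_commands()))
```

The practical difference shows up in the first step: FDO needs a special instrumented build, while AutoFDO samples a binary that is assumed to exist already, which is why it suits profiling in production.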
Experimental results on a sorting benchmark show that -O3 reduces execution time by 49%, and that FDO and AutoFDO provide a further ~11% improvement on top of that.
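The summary does not say whether the ~11% is measured relative to the -O3 result or to the unoptimized baseline. The short calculation below works through both readings; the two interpretations are assumptions made for illustration, not figures from the article.

```python
def remaining_time(*reductions):
    """Fraction of baseline time left after successive relative reductions."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r
    return remaining


# -O3 alone: 49% less time, so 51% of the baseline remains.
o3_only = remaining_time(0.49)

# Reading 1 (assumption): the 11% is relative to the -O3 time.
stacked = remaining_time(0.49, 0.11)  # 0.51 * 0.89

# Reading 2 (assumption): the 11% is in percentage points of the baseline.
additive = 1.0 - (0.49 + 0.11)

if __name__ == "__main__":
    print(f"-O3 only:  {o3_only:.4f} of baseline time")
    print(f"reading 1: {stacked:.4f} of baseline time")
    print(f"reading 2: {additive:.4f} of baseline time")
```

Under the first reading the combined reduction is about 55%, under the second it is 60%, so the distinction matters when comparing against other published numbers.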
Link‑time optimization (LTO) is described next, with the -flto (full LTO) and -flto=thin (ThinLTO) options, along with the benefits (a whole‑program view and larger gains) and drawbacks (longer build time).
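One detail that is easy to miss is that the LTO flag has to appear on both the compile steps and the final link. The sketch below builds such a command sequence without running it. The helper name lto_commands, the clang default, and the -O3 level are assumptions; -flto=thin is the Clang spelling of ThinLTO.

```python
def lto_commands(sources, output, mode="full", compiler="clang"):
    """Return compile and link steps for an LTO build (nothing is executed)."""
    flags = {"full": "-flto", "thin": "-flto=thin"}
    if mode not in flags:
        raise ValueError(f"unknown LTO mode: {mode!r}")
    flag = flags[mode]
    objects = [src.rsplit(".", 1)[0] + ".o" for src in sources]
    # Each object file carries compiler IR, so the flag is needed here...
    steps = [
        [compiler, "-O3", flag, "-c", src, "-o", obj]
        for src, obj in zip(sources, objects)
    ]
    # ...and again at link time, where cross-module optimization happens.
    steps.append([compiler, "-O3", flag, *objects, "-o", output])
    return steps


if __name__ == "__main__":
    for step in lto_commands(["a.c", "b.c"], "app", mode="thin"):
        print(" ".join(step))
```

Full LTO merges everything into one optimization unit at link time, while ThinLTO keeps modules separate and shares summaries between them, which is where its shorter build times come from.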
Post‑link binary optimization with Facebook BOLT is introduced. BOLT uses perf‑collected profiles to reorder basic blocks, eliminate redundant code, and apply many peephole transformations. The article lists key BOLT passes (e.g., strip-rep-ret, icf, icp, peepholes, inline-small, reorder-bbs, reorder-functions) and reports up to 20% speedup on top of LTO/FDO, with 7% improvement on Facebook data‑center services.
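To give a feel for what profile-driven basic-block reordering achieves, here is a toy greedy layout that always follows the hottest outgoing edge, so rarely executed blocks end up at the end. It is a deliberately simplified illustration and not BOLT's actual algorithm; the function name, the block names, and the edge counts are all made up.

```python
def hot_path_layout(entry, edge_counts):
    """Order blocks by repeatedly following the hottest untaken edge.

    edge_counts maps (source_block, target_block) -> execution count,
    as a profile might report it.
    """
    # Remember blocks in first-seen order as a fallback ordering.
    order = [entry]
    for src, dst in edge_counts:
        for block in (src, dst):
            if block not in order:
                order.append(block)

    layout, placed, current = [entry], {entry}, entry
    while len(layout) < len(order):
        successors = [
            (count, dst)
            for (src, dst), count in edge_counts.items()
            if src == current and dst not in placed
        ]
        if successors:
            nxt = max(successors, key=lambda s: s[0])[1]  # hottest edge
        else:
            nxt = next(b for b in order if b not in placed)  # leftovers
        layout.append(nxt)
        placed.add(nxt)
        current = nxt
    return layout


# Made-up profile: the error path is taken once in a hundred runs.
profile = {
    ("entry", "check"): 100,
    ("check", "error"): 1,
    ("check", "hot"): 99,
    ("error", "exit"): 1,
    ("hot", "exit"): 99,
}

if __name__ == "__main__":
    print(hot_path_layout("entry", profile))
```

With this profile the cold error block moves behind exit, so the common path runs through consecutive blocks, which is the kind of instruction-cache and branch benefit the article attributes to BOLT.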
The article then describes the OMAX platform architecture, which automates the entire workflow through three parts: user entry (pipeline plugins or an API), the optimization service (web server, task manager, workers), and the data system (sampling, conversion, versioned storage). It emphasizes low integration cost, dynamic scaling, and fast release cycles (40% faster deployment, binary size reduction >10×).
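The division of labour between the task manager, the workers, and the versioned profile storage might look roughly like the following in-memory sketch. Every class, method, and module name here is hypothetical; the article does not describe OMAX's real interfaces, and a production system would use persistent storage and distributed workers.

```python
from collections import defaultdict, deque


class ProfileStore:
    """Hypothetical versioned storage: each put creates a new version."""

    def __init__(self):
        self._versions = defaultdict(list)

    def put(self, module, profile):
        self._versions[module].append(profile)
        return len(self._versions[module])  # 1-based version number

    def latest(self, module):
        versions = self._versions.get(module)
        if not versions:
            return None
        return len(versions), versions[-1]


class TaskManager:
    """Hypothetical queue of optimization tasks drained by a worker."""

    def __init__(self, store):
        self.store = store
        self.queue = deque()

    def submit(self, module):
        self.queue.append(module)

    def run_worker(self, optimize):
        results = []
        while self.queue:
            module = self.queue.popleft()
            latest = self.store.latest(module)
            if latest is None:
                results.append((module, None, "skipped: no profile"))
                continue
            version, profile = latest
            results.append((module, version, optimize(module, profile)))
        return results


if __name__ == "__main__":
    store = ProfileStore()
    store.put("ranker", "profile-a")
    manager = TaskManager(store)
    manager.submit("ranker")
    print(manager.run_worker(lambda m, p: f"{m} rebuilt with {p}"))
```

Recording the profile version next to each result makes it possible to trace a released binary back to the sampling data that shaped it.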
Finally, the article reports real‑world impact: on Baidu’s recommendation system, OMAX achieved >10% CPU reduction, >5% latency reduction, and >40% faster rollout, while maintaining stable service performance.
