
Large-Scale C/C++ Service Compilation Performance Optimization and Platformization (OMAX)

The article details OMAX's end‑to‑end platform for large‑scale C/C++ service compilation optimization, covering optimization flags, profile‑guided and link‑time techniques, Facebook BOLT post‑link tuning, and real‑world results that cut CPU use, latency, and deployment time while shrinking binary size.

Baidu Geek Talk

This article presents a comprehensive study and engineering practice of compilation performance optimization for large‑scale C/C++ services. It first outlines the background: as backend C++ modules grow in size and complexity, performance degradation and high optimization cost become critical challenges.

Compilation‑time optimizations are introduced, including the classic -O0, -O1, -O2, -O3, -Ofast, -Og, -Os, and -Oz flags, as well as CPU‑specific instruction-set extensions (MMX, SSE, AVX, AVX-512, AMX). The article discusses when to use each flag, their trade‑offs, and how to apply global parameter configuration (e.g., -O3 for production and -Og for development). It also covers the code-generation options -fpic vs. -fno-pic for shared vs. static libraries.

Profile‑guided optimizations (PGO) are detailed with two techniques:

FDO – compile an instrumented binary (gcc test.c -o test_instrumented -fprofile-generate), run it to produce test.gcda, then recompile with -fprofile-use=test.gcda.

AutoFDO – collect runtime data with perf record -b -e br_inst_retired.near_taken:pp -- ./test, convert it using create_gcov --binary=./test --profile=perf.data --gcov=test.gcov -gcov_version=1, and compile with -fauto-profile=test.gcov.

Experimental results on a sorting benchmark show that -O3 reduces execution time by 49%, while FDO and AutoFDO add an additional ~11% improvement.

Link‑time optimizations (LTO) are described, with -flto (full LTO) and -flto=thin (ThinLTO) options, their benefits (a global view across translation units, larger gains) and drawbacks (longer compile and link times).

Post‑link binary optimization with Facebook BOLT is introduced. BOLT uses perf‑collected profiles to reorder basic blocks, eliminate redundant code, and apply many peephole transformations. The article lists key BOLT passes (e.g., strip-rep-ret, icf, icp, peepholes, inline-small, reorder-bbs, reorder-functions) and reports up to 20% speedup on top of LTO/FDO, with 7% improvement on Facebook data‑center services.
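
A typical BOLT pipeline built from the passes above might look like the following sketch; ./app and its --benchmark workload are hypothetical stand-ins, and the commands are skipped gracefully when perf or llvm-bolt is not installed:

```shell
# Post-link optimization with BOLT. The input binary should be linked with
# -Wl,--emit-relocs so BOLT can rearrange it safely.
if command -v perf >/dev/null 2>&1 && command -v llvm-bolt >/dev/null 2>&1; then
    # 1. Sample the running service with LBR so BOLT sees taken branches.
    perf record -e cycles:u -j any,u -o perf.data -- ./app --benchmark

    # 2. Convert the perf profile into BOLT's fdata format.
    perf2bolt ./app -p perf.data -o app.fdata

    # 3. Rewrite the binary: reorder basic blocks and functions for locality,
    #    fold identical code (icf), and split hot/cold parts of functions.
    llvm-bolt ./app -o app.bolt \
        -data=app.fdata \
        -reorder-blocks=ext-tsp \
        -reorder-functions=hfsort \
        -icf=1 \
        -split-functions
else
    echo "perf and/or llvm-bolt not installed; skipping BOLT sketch"
fi
```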

The article then describes the OMAX platform architecture that automates the entire workflow: user entry (pipeline plugins or API), an optimization service (web server, task manager, workers), and a data system (sampling, conversion, versioned storage). It emphasizes low integration cost, dynamic scaling, and fast release cycles (40% faster deployment, >10× binary size reduction).

Finally, the article reports real‑world impact: on Baidu’s recommendation system, OMAX achieved >10% CPU reduction, >5% latency reduction, and >40% faster rollout, while maintaining stable service performance.

Tags: Cloud Services, C++, Performance Engineering, PGO, Compilation Optimization, LTO, BOLT