How Profile‑Guided Optimization Supercharged WeChat’s Backend Services
This article details the year‑long exploration of Profile‑Guided Optimization (PGO) for WeChat’s backend, covering its theory, compiler implementations, practical experiments with Propeller and BOLT, transparent eBPF sampling, engineering challenges, and the measurable CPU and memory savings achieved across production services.
Introduction
The authors describe how rising compute costs in WeChat’s backend prompted a systematic study of performance optimization, focusing on Profile‑Guided Optimization (PGO) to reduce CPU and memory usage.
PGO Overview
Static Optimization Challenges
Traditional compilers (GCC, Clang, MSVC) rely on static analysis, which cannot predict runtime behavior such as branch probabilities, leading to suboptimal code layout.
PGO Principles
PGO collects a profile of actual program execution (execution counts, branch probabilities, value distributions) and feeds it back to the compiler for data‑driven optimizations.
PGO Workflow
Instrumented compilation: compile the program with lightweight probes inserted by the compiler.
Profiling run: execute the instrumented binary on representative input data to generate a profile file.
Optimized recompilation: recompile using the profile to guide optimizations.
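The three steps can be sketched with GCC (Clang follows the same flow with -fprofile-instr-generate/-fprofile-instr-use). The file hot.c and its biased branch are a hypothetical stand-in for a real service and representative traffic:

```shell
# hot.c stands in for a real service source file
cat > hot.c <<'EOF'
#include <stdio.h>
int main(void) {
    long sum = 0;
    for (int i = 0; i < 1000000; i++)
        if (i % 3 != 0)          /* biased branch the profile will capture */
            sum += i;
    printf("%ld\n", sum);
    return 0;
}
EOF

# Step 1: instrumented compilation (lightweight counters inserted)
gcc -O2 -fprofile-generate -o service hot.c

# Step 2: profiling run on representative input; writes hot.gcda
./service

# Step 3: optimized recompilation guided by the collected profile
gcc -O2 -fprofile-use -o service.pgo hot.c
```

In a real deployment the profiling run must cover workloads representative of production, since the compiler will pessimize paths the profile marks as cold.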
Optimization Techniques
Basic‑Block Reordering
Rearranges hot basic blocks to improve I‑Cache hit rate and branch prediction accuracy.
Cold/Hot Code Splitting
Separates rarely executed code into distinct sections or functions to avoid cache pollution.
Function Reordering
Places frequently called functions close together in the binary to reduce instruction cache misses and TLB pressure.
Compiler Implementations
LLVM uses -fprofile-instr-generate and -fprofile-instr-use (raw profiles are merged into a .profdata file with llvm-profdata). GCC uses -fprofile-generate and -fprofile-use, producing .gcda/.gcno files. MSVC employs /GL with /LTCG:PGInstrument and /LTCG:PGOptimize (managed via pgomgr.exe), and also supports ETW-based sampling.
Application to WeChat Backend
Tool Comparison
Propeller (link-time layout optimization built on basic-block sections), BOLT (post-link binary rewriting), and source-level PGO were evaluated; Propeller and BOLT were selected for the experiments.
Propeller Experiments
Optimizing a service module reduced CPU usage by ~6.5% and memory usage by ~11% under a 5k req/s load.
BOLT Experiments
Using perf-based sampling, BOLT achieved a ~18% CPU reduction; eBPF-based sampling offered comparable results with lower overhead.
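A typical perf-to-BOLT pipeline, as documented in BOLT's own usage guide, looks roughly like the following. The service name and sampling duration are placeholders, and the -j any,u option requires LBR-capable hardware (e.g. Intel's Last Branch Record facility):

```shell
# 1. Sample a running service, recording Last Branch Records (LBR)
perf record -e cycles:u -j any,u -p "$(pidof service)" -- sleep 60

# 2. Convert raw perf samples into BOLT's profile format
perf2bolt -p perf.data -o perf.fdata ./service

# 3. Rewrite the linked binary: reorder blocks and functions, split cold code
llvm-bolt ./service -o ./service.bolt -data=perf.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions -split-all-cold -icf=1 -use-gnu-stack
```

Because BOLT operates on the final linked binary, it can be applied without changing the build system, which is what makes it attractive for a large, heterogeneous backend like the one described here.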
Transparent Sampling Challenge
perf-based collection incurs high overhead (up to a 62× slowdown in the worst case). An eBPF program that samples Last Branch Records (LBR) was developed to collect profiles with minimal impact, enabling production-grade optimization.
Engineering Challenges
Issues included handling GCC’s split cold/hot functions, LSDA exception tables, compressed debug sections, and new PLT formats from the mold linker. Solutions involved extending BOLT’s parser, decompressing binaries before optimization, and adding symbol‑level profiling (perf2bolt) to reuse historic samples.
Optimization Process Refinement
Historical sample reuse was attempted but proved ineffective due to address mismatches; symbol‑level profiles improved compatibility. Compatibility with diverse toolchains and preserving debug info were also addressed.
Results and Impact
Across many modules, CPU utilization dropped by 5–25%, saving more than 100,000 CPU cores in production. Some modules showed no gain, prompting further analysis.
Conclusion
The study demonstrates that PGO, especially when combined with binary‑level tools like BOLT and transparent eBPF sampling, can deliver substantial performance gains in large‑scale backend services, though careful engineering is required to handle toolchain diversity and sampling accuracy.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.