
Performance Optimization in Baidu's C++ Backend: Memory Allocation and Access Techniques

Baidu engineers cut C++ backend latency and cost by eliminating unnecessary string zero‑initialization, using zero‑copy split with SIMD, replacing deep protobuf merges with repeated string fields, employing job‑scoped arenas and custom memory resources for allocation, and applying prefetching, cache‑line awareness, and tuned memory‑order semantics, achieving multiplicative to order‑of‑magnitude speedups.

Baidu Geek Talk

Baidu's extensive C++ backend services face constant performance pressure, and mastering the language's low‑level features is essential for reducing latency and cost. This article shares a collection of practical optimization cases accumulated by Baidu engineers, covering string handling, protobuf merging, memory allocation, and memory access patterns.

In string handling, the typical pattern of resizing a std::string before calling a C‑style API causes unnecessary zero‑initialization; a custom resize_uninitialized avoids this overhead. For split‑string scenarios, moving from boost::split or absl::StrSplit to a zero‑copy absl::StrSplit that works on std::string_view, combined with SIMD‑based delimiter detection, yields order‑of‑magnitude speedups. Protobuf optimization leverages the wire format equivalence of message fields and strings, allowing repeated string fields to replace deep protobuf objects and eliminate costly parse‑merge‑serialize cycles in micro‑service proxies.

Memory allocation is examined through the lenses of tcmalloc and jemalloc, highlighting thread‑cache trade‑offs. To further improve allocation contention and locality, Baidu introduces two job‑scoped strategies: a job arena that binds allocations to a short‑lived job’s lifetime, enabling bulk contiguous allocation and bulk release; and a job reserve that reuses allocated memory without tearing down structures, supplemented by periodic compaction and rebuilding. These ideas are realized via STL’s PMR (polymorphic memory resources) and a custom SwissMemoryResource that adapts to both STL and protobuf allocator interfaces, integrated into brpc for request‑level memory reuse.
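A job arena built on standard PMR can be sketched as follows: every allocation a job makes lands in one monotonic buffer, and the whole buffer is released in bulk when the job ends. This is a minimal illustration of the pattern, not Baidu's SwissMemoryResource.

```cpp
#include <memory_resource>
#include <string>
#include <vector>

// One arena per short-lived job: allocations are bump-pointer fast and
// contiguous, and everything is freed at once when the arena goes away.
void HandleJob() {
    char buffer[4096];  // stack-backed initial chunk; spills to heap if exceeded
    std::pmr::monotonic_buffer_resource arena(buffer, sizeof(buffer));

    // Containers pick up the arena through the polymorphic allocator.
    std::pmr::vector<std::pmr::string> tokens(&arena);
    tokens.emplace_back("request");
    tokens.emplace_back("payload");
    // ... job logic using `tokens` ...
}   // arena destroyed here: bulk release, no per-node deallocation
```

Because `monotonic_buffer_resource` never frees individual allocations, it trades memory reuse within the job for allocation speed and locality, which matches the short, bounded lifetime of a request in a service proxy.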

Memory access optimization focuses on respecting hardware locality. Sequential accesses trigger CPU prefetchers, and manual __builtin_prefetch can help when data is not naturally contiguous. Cache‑line awareness prevents false sharing, while memory‑order semantics (relaxed, acquire‑release, sequentially consistent) dictate the needed fences and store‑buffer behavior on x86, ARM, and Power architectures, enabling high‑performance lock‑free queues with acquire‑release ordering and slot‑based versioning.
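These access-level ideas combine in a minimal single-producer/single-consumer ring buffer, sketched here under simplifying assumptions (fixed capacity, one producer thread, one consumer thread): the producer publishes a slot with a release store, the consumer observes it with an acquire load, and the two indices are padded to separate cache lines to avoid false sharing.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Minimal SPSC ring buffer. `alignas(64)` keeps head_ and tail_ on
// different cache lines so the producer and consumer don't false-share.
template <typename T, size_t N>
class SpscQueue {
    std::array<T, N> buf_;
    alignas(64) std::atomic<size_t> head_{0};  // consumer index
    alignas(64) std::atomic<size_t> tail_{0};  // producer index

public:
    bool push(const T& v) {
        size_t t = tail_.load(std::memory_order_relaxed);
        if ((t + 1) % N == head_.load(std::memory_order_acquire))
            return false;  // full
        buf_[t] = v;
        // Release: the write to buf_[t] is visible before the new tail.
        tail_.store((t + 1) % N, std::memory_order_release);
        return true;
    }

    bool pop(T& v) {
        size_t h = head_.load(std::memory_order_relaxed);
        // Acquire: pairs with the producer's release store of tail_.
        if (h == tail_.load(std::memory_order_acquire))
            return false;  // empty
        v = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
```

Acquire‑release ordering here is sufficient and cheaper than sequential consistency: on x86 the stores compile to plain moves, while on ARM and Power the compiler emits the lighter-weight barriers the ordering actually requires.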

Applying these techniques—ranging from allocation‑level job arenas to access‑level prefetching and memory‑order tuning—has delivered multiplicative or even order‑of‑magnitude performance gains in Baidu’s services, demonstrating that deep system‑level understanding translates into measurable latency and cost improvements.

Tags: C++, Protobuf, Memory Allocation, cache line, job arena, Memory Access, memory order
Written by

Baidu Geek Talk
