C++ Backend Performance Optimization at Baidu: Memory Access, Allocation, and Concurrency Techniques
This article shares practical performance-optimisation techniques from Baidu's C++ engineers, covering memory-access patterns, custom allocation strategies, string and protobuf handling, cache-line considerations, and memory-order semantics used to achieve significant latency and cost reductions in large-scale backend services.
Background
Baidu runs massive C++ services across many data centers; mastering low‑level characteristics and memory behaviour is essential for extreme performance.
Re‑examining Performance Optimization
String as a Buffer
When interfacing with C-style APIs, developers often resize a std::string to an estimated size, which zero-initialises the buffer unnecessarily before it is overwritten. A custom resize_uninitialized helper avoids this overhead.
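As a minimal sketch of such a helper: C++23's std::string::resize_and_overwrite hands the callback uninitialized storage, so the zero-fill can be skipped entirely; on older standards the code below falls back to a plain (zero-filling) resize. The some_c_style_api body here is a stand-in defined only for the example.

```cpp
#include <cstring>
#include <string>

// Hypothetical C-style producer, defined here only so the sketch is runnable.
size_t some_c_style_api(char* buffer, size_t size) {
    const char msg[] = "payload";
    size_t n = sizeof(msg) - 1 < size ? sizeof(msg) - 1 : size;
    std::memcpy(buffer, msg, n);
    return n;  // bytes actually written
}

std::string some_cxx_style_function(size_t estimate_size) {
    std::string result;
#if defined(__cpp_lib_string_resize_and_overwrite)
    // C++23: the buffer handed to the callback is uninitialized, no zero-fill.
    result.resize_and_overwrite(estimate_size, [](char* buf, size_t size) {
        return some_c_style_api(buf, size);
    });
#else
    result.resize(estimate_size);  // zero-fills, but keeps the fallback correct
    result.resize(some_c_style_api(&result[0], result.size()));
#endif
    return result;
}
```

The shrinking second resize (explicit in the fallback, implicit in the callback's return value) trims the string to what the API actually produced.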
size_t some_c_style_api(char* buffer, size_t size);

void some_cxx_style_function(std::string& result) {
    result.resize(estimate_size);
    auto actual_size = some_c_style_api(result.data(), result.size());
    result.resize(actual_size);
}

Split String
Using absl::StrSplit with std::string_view enables zero-copy tokenisation, dramatically reducing temporary allocations compared with boost::split.

for (std::string_view sv : absl::StrSplit(str, '\t')) {
    direct_work_on_segment(sv);
}

Magic of Protobuf
Repeated merging of protobuf messages can be accelerated by operating on the raw wire format and appending bytes directly, avoiding full deserialization.
final_response.mutable_record(i)->append(one_sub_response.record(i));
final_response.mutable_record(i)->append(another_sub_response.record(i));

Memory Allocation Strategies
tcmalloc vs jemalloc
Both use thread‑caches; tcmalloc performs slightly better with few threads, while jemalloc scales better with many cores due to reduced lock contention.
Job Arena
Allocate a dedicated arena per job; all allocations are freed together when the job finishes, improving both contention and memory locality.
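As an illustrative sketch of the idea (not Baidu's actual allocator), a job arena can be as simple as a bump-pointer allocator over large chunks: each allocation just advances an offset, and the whole job's memory is released in one pass when the arena is destroyed. Assumptions: single-threaded use within one job, power-of-two alignment, and individual allocations no larger than the chunk size.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Minimal bump-pointer job arena (sketch only).
class JobArena {
public:
    explicit JobArena(size_t chunk_size = 64 * 1024) : chunk_size_(chunk_size) {}
    ~JobArena() {
        // Everything the job allocated is freed together.
        for (void* chunk : chunks_) std::free(chunk);
    }

    void* allocate(size_t bytes, size_t align = alignof(std::max_align_t)) {
        size_t offset = (offset_ + align - 1) & ~(align - 1);  // round up
        if (chunks_.empty() || offset + bytes > chunk_size_) {
            chunks_.push_back(std::malloc(chunk_size_));  // start a fresh chunk
            offset = 0;
        }
        offset_ = offset + bytes;
        return static_cast<char*>(chunks_.back()) + offset;
    }

private:
    size_t chunk_size_;
    size_t offset_ = 0;          // bump pointer within the current chunk
    std::vector<void*> chunks_;  // all chunks owned by this job
};
```

A production arena would also have to run destructors for non-trivial objects and handle oversized allocations; the point of the sketch is that per-allocation locking disappears and a job's data stays spatially close.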
Job Reserve
Keep the arena after a job ends, periodically compacting it to restore address continuity, which benefits long‑running services.
Memory Access Optimisation
Sequential Access
Traversing contiguous memory lets the hardware prefetchers stream data into the caches efficiently; a manual __builtin_prefetch can further help with irregular access patterns.
for (size_t i = 0; i < vecs.size(); ++i) {
    if (i + step < vecs.size()) {  // guard the tail to stay in bounds
        __builtin_prefetch(&vecs[i + step]);
    }
    dot_product(vecs[i], input);
}

Concurrent Access
False sharing is avoided by aligning hot data to separate cache lines (typically 64 B). Proper cache‑line isolation reduces unnecessary coherence traffic.
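A minimal sketch of the technique (names are illustrative): each worker thread's counter is forced onto its own 64-byte line with alignas, so relaxed concurrent increments never invalidate another core's cached line.

```cpp
#include <atomic>
#include <thread>

// Each counter occupies a full cache line (64 B assumed), so the four
// writer threads do not ping-pong a shared line between cores.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};
static_assert(sizeof(PaddedCounter) == 64, "padded to one cache line");

PaddedCounter counters[4];  // one per worker thread

void bump(int idx, long n) {
    for (long i = 0; i < n; ++i)
        counters[idx].value.fetch_add(1, std::memory_order_relaxed);
}

long run_workers(long per_thread) {
    std::thread workers[4];
    for (int t = 0; t < 4; ++t)
        workers[t] = std::thread(bump, t, per_thread);
    for (auto& w : workers) w.join();
    long total = 0;
    for (auto& c : counters) total += c.value.load();
    return total;
}
```

Without the alignas, adjacent counters would share a line and every fetch_add would trigger coherence traffic even though the threads touch logically independent data.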
Cache Consistency
The MESI protocol keeps caches coherent; store buffers let writes proceed without stalling the core, and the visibility reordering introduced by store buffers and invalidate queues is what C++ memory-order semantics control.
Memory Order
Different C++ atomic memory orders (relaxed, acquire‑release, sequentially‑consistent) provide varying guarantees. Acquire‑release is sufficient for most producer‑consumer patterns, while sequential consistency adds full ordering at higher cost.
int payload = 0;
std::atomic<int> flag{0};

void release_writer(int i) {
    payload = flag.load(std::memory_order_relaxed) + i;
    flag.store(1, std::memory_order_release);
}

int acquire_reader() {
    while (flag.load(std::memory_order_acquire) == 0) {
    }
    return payload;
}

Conclusion
By combining low‑level memory‑access tuning, custom allocation arenas, zero‑copy string handling, and appropriate atomic memory orders, Baidu engineers achieve multi‑fold performance gains in large‑scale C++ backend services.
High Availability Architecture
Official account for High Availability Architecture.