
Performance Optimization Techniques for Baidu C++ Backend Services: Memory Access, Allocation, and Concurrency

This article presents a comprehensive collection of Baidu C++ engineers' performance‑optimization practices, covering memory‑access patterns, string handling, protobuf manipulation, allocator choices, job‑level memory arenas, cache‑line considerations, and memory‑order semantics to achieve substantial latency and cost reductions in large‑scale backend services.

Baidu Intelligent Testing

In Baidu's massive C++ backend infrastructure, engineers continuously seek ways to improve latency and reduce cost, leading to a set of practical performance‑optimization techniques that focus on memory access, allocation, and concurrency.

1. Rethinking performance optimization – Starting from string handling, the article shows how writing into a std::string through data() after an over‑sized resize avoids an extra copy but still pays for redundant zero‑initialization, and introduces a custom resize_uninitialized wrapper for GCC that skips the zero‑fill.

```cpp
size_t some_c_style_api(char* buffer, size_t size);

void some_cxx_style_function(std::string& result) {
    result.resize(estimate_size);
    auto actual_size = some_c_style_api(result.data(), result.size());
    result.resize(actual_size);
}
```
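A minimal runnable sketch of the same pattern (the C‑style API below is a stand‑in invented for illustration; a real resize_uninitialized would hook non‑standard string internals, as absl's internal STLStringResizeUninitialized does, rather than fall back to plain resize):

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Illustrative stand-in for a C-style API: writes up to `size` bytes into
// `buffer` and returns the number of bytes actually written.
size_t some_c_style_api(char* buffer, size_t size) {
    const char msg[] = "hello";
    size_t n = (sizeof(msg) - 1 < size) ? sizeof(msg) - 1 : size;
    std::memcpy(buffer, msg, n);
    return n;
}

// Portable fallback: this version still zero-fills; the optimized GCC-specific
// wrapper described in the article would grow the string without zeroing it.
void resize_uninitialized(std::string& s, size_t new_size) {
    s.resize(new_size);
}

std::string call_c_api() {
    std::string result;
    resize_uninitialized(result, 64);  // estimated upper bound for the output
    size_t actual = some_c_style_api(&result[0], result.size());
    result.resize(actual);  // shrink to the size actually written
    return result;
}
```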

For string splitting, the article compares boost::split with absl::StrSplit, highlighting the zero‑copy variant and a SIMD‑accelerated babylon::split implementation that yields multi‑fold speedups on modern CPUs.

```cpp
babylon::split([](std::string_view sv) {
    direct_work_on_segment(sv);
}, str, '\t');
```
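A zero‑copy split in the same callback style can be sketched in portable C++ (this is an illustrative std::string_view version, not the SIMD‑accelerated babylon implementation):

```cpp
#include <cassert>
#include <string_view>

// Invoke `callback` on each delimiter-separated segment. No copies are made:
// every std::string_view points into the caller's original buffer.
template <typename Callback>
void split(Callback&& callback, std::string_view str, char delim) {
    size_t begin = 0;
    while (begin <= str.size()) {
        size_t end = str.find(delim, begin);
        if (end == std::string_view::npos) {
            end = str.size();
        }
        callback(str.substr(begin, end - begin));
        begin = end + 1;
    }
}
```

Because each segment is a view into the source string, the caller must keep the source alive while the segments are in use; the copy avoidance is exactly what makes the babylon variant faster than boost::split, which materializes a container of std::string.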

2. Protobuf magic – By treating protobuf fields as raw byte strings, merging can be performed without full deserialization, using a modified message definition that stores repeated records as strings and appends new data directly.

```protobuf
message ProxyResponse {
    repeated string record = 1;
    bytes error_message = 2;
}
```

```cpp
final_response.mutable_record(i)->append(one_sub_response.record(i));
final_response.mutable_record(i)->append(another_sub_response.record(i));
```

3. Memory allocation – The article contrasts tcmalloc and jemalloc, explaining how thread‑local caches reduce contention but can cause memory fragmentation (“shuffling”). It then proposes two job‑level allocation models:

Job arena: each job uses a dedicated arena, allocating continuously and releasing the whole arena at job end.

Job reserve: similar to the job arena, but retains intermediate objects and periodically compacts them to restore continuity.
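The job‑arena model maps naturally onto C++17 std::pmr. A minimal sketch (the job body and buffer size are illustrative assumptions): all of a job's temporaries bump‑allocate from one arena, and the arena is reclaimed wholesale when the job finishes, with no per‑object frees.

```cpp
#include <cassert>
#include <memory_resource>
#include <vector>

// One job = one arena: allocations are fast bump-pointer appends into
// `buffer`; everything is released at once when the arena is destroyed.
int run_job() {
    char buffer[4096];
    std::pmr::monotonic_buffer_resource arena(buffer, sizeof(buffer));

    std::pmr::vector<int> tmp(&arena);  // draws its storage from the arena
    for (int i = 0; i < 10; ++i) {
        tmp.push_back(i);
    }
    int sum = 0;
    for (int v : tmp) {
        sum += v;
    }
    return sum;
    // `tmp` and then `arena` go out of scope: the whole block is reclaimed
    // with no individual deallocations, matching the job-arena model.
}
```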

Example of using a custom memory resource in Baidu's RPC framework:

```cpp
babylon::ReusableRPCProtocol::register_protocol();
::baidu::rpc::ServerOptions options;
options.enabled_protocols = "baidu_std_reuse";

class SomeServiceImpl : public SomeService {
public:
    void some_method(...) {
        // concrete template arguments on the next lines elided in the source
        auto* closure = static_cast<...>(done);
        auto& resource = closure->memory_resource();
        std::pmr::vector<...> tmp_vector(&resource);
        // ...
    }
};
```

4. Memory access patterns – Sequential access benefits from hardware prefetchers; the article demonstrates how shuffling vector order dramatically slows down a dot‑product loop, while explicit __builtin_prefetch can recover part of the loss.

```cpp
// Large buffer of float vectors
std::vector<float> large_memory_buffer;
std::vector<float*> vecs;
std::shuffle(vecs.begin(), vecs.end(), random_engine);

for (size_t i = 0; i < vecs.size(); ++i) {
    __builtin_prefetch(vecs[i + step]);  // prefetch a later vector ahead of use
    dot_product(vecs[i], input);
}
```

5. Concurrency and cache‑line effects – False sharing is explained, along with the importance of aligning data to separate cache lines to avoid unnecessary contention.
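A common remedy is to pad or align each hot counter onto its own cache line. A sketch assuming 64‑byte lines (where available, std::hardware_destructive_interference_size from <new> can replace the literal):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// alignas(64) gives each counter its own cache line, so two threads
// incrementing different counters no longer invalidate each other's line
// (the false-sharing pattern described above).
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

PaddedCounter counters[2];

long run_counters(long iterations) {
    std::thread t0([&] {
        for (long i = 0; i < iterations; ++i) {
            counters[0].value.fetch_add(1, std::memory_order_relaxed);
        }
    });
    std::thread t1([&] {
        for (long i = 0; i < iterations; ++i) {
            counters[1].value.fetch_add(1, std::memory_order_relaxed);
        }
    });
    t0.join();
    t1.join();
    return counters[0].value.load() + counters[1].value.load();
}
```

The correctness is unchanged with or without the alignment; only the cache‑coherence traffic differs, which is why false sharing shows up as a throughput problem rather than a bug.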

6. Memory order semantics – The article walks through C++ atomic memory orders (relaxed, acquire‑release, sequentially‑consistent), showing sample code and the corresponding hardware effects on x86, ARMv8, and Power architectures.

```cpp
int payload = 0;
std::atomic<int> flag{0};

void release_writer(int i) {
    payload = flag.load(std::memory_order_relaxed) + i;
    flag.store(1, std::memory_order_release);
}

int acquire_reader() {
    while (flag.load(std::memory_order_acquire) == 0) {
    }
    return payload;
}
```
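A minimal driver for this pattern (the thread setup is an illustrative assumption, not from the article): the release store in the writer pairs with the acquire load in the reader, so once the reader observes flag == 1 it is guaranteed to see the fully written payload on any of the architectures discussed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;
std::atomic<int> flag{0};

void release_writer(int i) {
    // plain write to payload happens-before the release store below
    payload = flag.load(std::memory_order_relaxed) + i;
    flag.store(1, std::memory_order_release);
}

int acquire_reader() {
    // spin until the writer publishes; acquire pairs with the release store
    while (flag.load(std::memory_order_acquire) == 0) {
    }
    return payload;  // guaranteed to observe the writer's payload
}

int run_demo() {
    int result = 0;
    std::thread reader([&] { result = acquire_reader(); });
    std::thread writer([] { release_writer(42); });
    reader.join();
    writer.join();
    return result;  // flag started at 0, so payload is 0 + 42
}
```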

By applying the appropriate memory‑order guarantees, Baidu engineers achieve high‑throughput lock‑free data structures while keeping correctness across diverse CPU architectures.

Overall, the collected techniques illustrate how low‑level C++ knowledge, careful allocator selection, and awareness of hardware memory behavior can together deliver multi‑fold performance improvements in large‑scale backend services.

performance optimization, Backend Development, Concurrency, C++, Memory Allocation
Written by Baidu Intelligent Testing