Unlocking C++ Performance: Surprising Memory‑Access Tricks from Baidu Engineers

This article explores Baidu C++ engineers' deep‑dive into performance optimization, covering memory‑access patterns, string‑buffer handling, protobuf merging, malloc strategies, job‑arena allocation, cache‑line effects, and modern memory‑order semantics to achieve multi‑fold speedups in large‑scale backend services.

21CTO

Background

Behind Baidu's seemingly simple web UI lie massive C++ services running in data centers nationwide. Mastering low‑level C++ features and using them for performance tuning is a mandatory skill for Baidu engineers.

Re‑thinking performance optimization

Performance is not only about algorithmic complexity; constant‑factor improvements through low‑level tricks can yield order‑of‑magnitude gains.

String as a buffer

size_t some_c_style_api(char* buffer, size_t size);
void some_cxx_style_function(std::string& result) {
    // estimate_size: an upper bound on the output size, known from context
    result.resize(estimate_size);               // zero-initializes every byte
    auto actual_size = some_c_style_api(result.data(), result.size());
    result.resize(actual_size);                 // shrink to what was actually written
}

The initial resize zero‑initializes the whole buffer, which is wasteful when the C API overwrites the memory.

The fix is to resize without initializing. C++23 standardizes this as std::string::resize_and_overwrite; for toolchains that lack it, Baidu provides a portable wrapper, babylon::resize_uninitialized:

void some_cxx_style_function(std::string& result) {
    auto* buffer = babylon::resize_uninitialized(result, estimate_size);
    auto actual_size = some_c_style_api(buffer, result.size());
    result.resize(actual_size);
}
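
On a toolchain that implements C++23, the same pattern can be written with the standard std::string::resize_and_overwrite. The sketch below is a minimal illustration, not Baidu's wrapper: fill_greeting is a hypothetical stand-in for the C API, and the code falls back to the zero-initializing resize when the library's feature-test macro is absent.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a C-style API: writes at most `size` bytes
// into `buffer` and returns the number of bytes written.
std::size_t fill_greeting(char* buffer, std::size_t size) {
    static const char kText[] = "hello";
    const std::size_t length = sizeof(kText) - 1;
    const std::size_t n = length < size ? length : size;
    std::memcpy(buffer, kText, n);
    return n;
}

void read_into(std::string& result, std::size_t estimated_size) {
    assert(estimated_size > 0);
#if defined(__cpp_lib_string_resize_and_overwrite)
    // C++23: the callback gets storage that was not zero-filled and
    // returns the final length of the string.
    result.resize_and_overwrite(estimated_size, [](char* buffer, std::size_t size) {
        return fill_greeting(buffer, size);
    });
#else
    // Fallback: correct, but pays for zero-initializing the buffer first.
    result.resize(estimated_size);
    result.resize(fill_greeting(&result[0], result.size()));
#endif
}
```

Calling read_into(s, 64) leaves s holding only the bytes the API reported as written; the estimate bounds the buffer, not the final size.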

Split string

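Typical log parsing splits a line on a single delimiter. boost::split is flexible but slower than Google's absl::StrSplit, which yields std::string_view segments and so allows zero-copy processing whenever the segments do not need to outlive the input string.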

std::vector<std::string> tokens;
// boost::split(tokens, str, [] (char c){return c=='\t';});
for (std::string_view sv : absl::StrSplit(str, '\t')) {
    tokens.emplace_back(sv);          // copy
}
for (std::string_view sv : absl::StrSplit(str, '\t')) {
    direct_work_on_segment(sv);      // zero‑copy
}

Baidu’s final implementation uses a custom splitter that works directly on the buffer:

babylon::split([](std::string_view sv){
    direct_work_on_segment(sv);
}, str, '\t');
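
The callback style above can be sketched in a few lines of standard C++17. This is an illustrative reimplementation under assumed names (split_each is hypothetical and is not babylon's code): each segment is passed to the callback as a std::string_view into the original buffer, so nothing is copied or allocated.

```cpp
#include <cstddef>
#include <string_view>
#include <utility>

// Calls on_segment once per delimiter-separated segment of `text`.
// Segments are views into `text`; empty segments are reported too.
template <typename Callback>
void split_each(std::string_view text, char delimiter, Callback&& on_segment) {
    std::size_t begin = 0;
    while (true) {
        const std::size_t end = text.find(delimiter, begin);
        if (end == std::string_view::npos) {
            on_segment(text.substr(begin));  // last (or only) segment
            return;
        }
        on_segment(text.substr(begin, end - begin));
        begin = end + 1;  // skip past the delimiter
    }
}
```

Because the views alias the input, the callback must finish with each segment (or copy it) before the underlying string is modified or destroyed.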

Magic of protobuf

Protobuf is the primary data‑exchange format in Baidu services. Merging many partial records traditionally requires repeated parsing and re‑serialization, which is costly.

message Field { bytes column = 1; bytes value = 2; }
message Record { bytes key = 1; repeated Field field = 2; }
message Response { repeated Record record = 1; bytes error_message = 2; }

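Instead of parsing deeply, Baidu modifies the proxy message to declare each record as opaque bytes and concatenates the serialized sub-records. This works because a bytes field and an embedded message share the same length-delimited wire encoding, and concatenating two serialized messages is equivalent to merging them:

message ProxyResponse {
    repeated bytes record = 1;    // serialized bytes of each Record, left unparsed
    bytes error_message = 2;
}
// after fetching sub-responses (mutable_record(i) returns std::string*)
final_response.mutable_record(i)->append(one_sub_response.record(i));
final_response.mutable_record(i)->append(another_sub_response.record(i));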
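
Why plain concatenation is enough can be checked without the protobuf library. The sketch below hand-encodes length-delimited fields (wire type 2) following the documented wire format; the helper names are hypothetical, and the tiny parser assumes the buffer contains only length-delimited fields, which holds for the Record and Field messages above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Appends v as a base-128 varint (7 payload bits per byte, low bits first).
void append_varint(std::string& out, std::uint64_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<char>((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<char>(v));
}

// Encodes one length-delimited field: tag, payload length, payload.
std::string encode_bytes_field(std::uint32_t field_number, std::string_view payload) {
    std::string out;
    append_varint(out, (static_cast<std::uint64_t>(field_number) << 3) | 2);
    append_varint(out, payload.size());
    out.append(payload);
    return out;
}

// Reads a varint starting at pos and advances pos past it.
std::uint64_t read_varint(std::string_view in, std::size_t& pos) {
    std::uint64_t v = 0;
    int shift = 0;
    while (true) {
        assert(pos < in.size());
        const auto byte = static_cast<std::uint8_t>(in[pos++]);
        v |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return v;
        shift += 7;
    }
}

// Counts top-level occurrences of field_number in a serialized message.
std::size_t count_field(std::string_view in, std::uint32_t field_number) {
    std::size_t pos = 0;
    std::size_t count = 0;
    while (pos < in.size()) {
        const std::uint64_t tag = read_varint(in, pos);
        assert((tag & 7) == 2);  // sketch handles length-delimited fields only
        const std::uint64_t length = read_varint(in, pos);
        assert(length <= in.size() - pos);
        if ((tag >> 3) == field_number) ++count;
        pos += static_cast<std::size_t>(length);
    }
    return count;
}
```

Concatenating two serialized Record buffers yields a buffer in which the repeated field entries of both appear in order, which is what a parse-merge-serialize round trip would produce. A singular field such as key then appears twice, and a parser keeps the last value, so the trick assumes the sub-records being merged agree on it.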

Performance recap

Three major contributors to latency are algorithmic complexity, I/O overhead, and concurrency. Many Baidu case studies improve performance without changing algorithms, by exploiting CPU cache, SIMD, and memory‑access characteristics.

Memory allocation

tcmalloc vs jemalloc

Both allocators use per‑thread caches to reduce contention. tcmalloc employs a single page heap, while jemalloc uses many arenas. tcmalloc wins on low thread counts; jemalloc scales better with many cores and high allocation‑free rates.

Job arena

For request‑level jobs with a clear lifetime, allocate all dynamic memory from a dedicated arena. The arena is released in one step when the job finishes, eliminating per‑allocation contention and improving address locality.

// Example using protobuf arena
google::protobuf::Arena arena;
auto* msg = google::protobuf::Arena::CreateMessage<MyMessage>(&arena);
// use msg …
// every allocation is released together when `arena` is destroyed at job end
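
The same idea applies to ordinary containers, not only protobuf messages. As a stand-in for a job arena, the sketch below uses the C++17 standard library's std::pmr::monotonic_buffer_resource; this is a swapped-in illustration, not Baidu's arena, and handle_job and kToken are hypothetical names.

```cpp
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

// Long enough that each string needs heap storage rather than the
// small-string buffer.
static const char kToken[] = "token-with-a-long-enough-body-to-skip-sso";

// Handles one job: every allocation made by the containers below comes
// from `arena` and is released in one step when `arena` is destroyed.
std::size_t handle_job(std::size_t n_tokens) {
    std::pmr::monotonic_buffer_resource arena(64 * 1024);  // initial block size
    std::pmr::vector<std::pmr::string> tokens(&arena);
    tokens.reserve(n_tokens);
    for (std::size_t i = 0; i < n_tokens; ++i) {
        // The vector passes its allocator on to each string it constructs,
        // so the string bodies are bump-allocated from the arena as well.
        tokens.emplace_back(kToken);
    }
    std::size_t total = 0;
    for (const auto& token : tokens) total += token.size();
    return total;
}  // arena destructor frees everything at once
```

A monotonic resource never reuses freed blocks during the job, which is what makes allocation a pointer bump and keeps consecutive allocations adjacent in memory.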

Job reserve

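Job reserve extends the job arena: instead of releasing memory when the job ends, it keeps the structures for the next job and periodically compacts them. Containers such as a vector of strings are cleared without destroying their elements, so the element buffers are reused. Over time those buffers scatter, so the whole structure is periodically rebuilt into a new contiguous block to restore locality.

std::vector<std::string> vec;
vec.clear();               // keeps the vector's capacity, but std::vector does destroy the elements
// job reserve needs a clear that keeps the elements alive for reuse
// (protobuf's RepeatedPtrField::Clear behaves this way);
// later, rebuild into fresh memory to regain contiguity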
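
Since std::vector::clear() destroys its elements, the "clear without destroy" behavior needs a small wrapper. The following is a minimal sketch under assumed names (ReusableStrings is hypothetical, not Baidu's container): it tracks a logical size separately from the constructed elements, so a cleared slot keeps its heap buffer for the next job. The periodic compaction step is left out.

```cpp
#include <cstddef>
#include <string>
#include <vector>

class ReusableStrings {
public:
    // Returns an empty string to fill, reusing a previously constructed
    // element (and its heap buffer) when one is available.
    // The reference is invalidated by a later add() that grows the vector.
    std::string& add() {
        if (used_ == items_.size()) items_.emplace_back();
        std::string& slot = items_[used_++];
        slot.clear();  // length becomes 0; implementations keep the capacity
        return slot;
    }

    // Logical clear: no element is destroyed, nothing is deallocated.
    void clear() { used_ = 0; }

    std::size_t size() const { return used_; }
    std::size_t constructed() const { return items_.size(); }

private:
    std::vector<std::string> items_;
    std::size_t used_ = 0;
};
```

After the first few jobs the container stops allocating altogether, as long as later jobs need no more elements and no longer strings than earlier ones.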

Memory access

Sequential access

Contiguous address streams let hardware prefetchers pull data into the L1/L2 caches ahead of use, dramatically reducing access latency.

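// Large buffer of fixed-size vectors
std::vector<float> large_memory_buffer;
std::vector<float*> vecs;                        // pointers into large_memory_buffer
std::shuffle(vecs.begin(), vecs.end(), rng);     // visit in random order
for (size_t i = 0; i < vecs.size(); ++i) {
    if (i + step < vecs.size()) {                // step: prefetch distance; stay in bounds
        __builtin_prefetch(vecs[i + step]);
    }
    dot_product(vecs[i], query);
}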

According to the source's benchmarks, visiting the vectors in shuffled (non-contiguous) order with a well-tuned software prefetch can still beat plain contiguous access.
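
A self-contained version of the loop might look as follows. It is a sketch under stated assumptions: score_all, dot_product and the prefetch_distance parameter are hypothetical names, __builtin_prefetch is a GCC/Clang builtin (skipped on other compilers), and the right distance has to be found by measurement on the target machine.

```cpp
#include <cstddef>
#include <vector>

float dot_product(const float* a, const float* b, std::size_t dim) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) sum += a[i] * b[i];
    return sum;
}

// Scores every vector against `query`, visiting them in the order given
// by `vecs` (which may be arbitrary) and hinting the cache about the
// vector that will be needed `prefetch_distance` iterations later.
float score_all(const std::vector<const float*>& vecs, const float* query,
                std::size_t dim, std::size_t prefetch_distance) {
    float total = 0.0f;
    for (std::size_t i = 0; i < vecs.size(); ++i) {
        if (i + prefetch_distance < vecs.size()) {
#if defined(__GNUC__) || defined(__clang__)
            // Hint only: fetches the cache line holding the vector's start.
            __builtin_prefetch(vecs[i + prefetch_distance]);
#endif
        }
        total += dot_product(vecs[i], query, dim);
    }
    return total;
}
```

The prefetch changes timing only, never results: too small a distance leaves the load latency exposed, while too large a distance risks the line being evicted before it is used.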

Concurrent access

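The cache line (typically 64 B) is the unit of transfer between caches and cores. When threads write to unrelated data that happen to share a line, the line bounces between cores (false sharing), inflating contention. Padding or splitting the struct so that each hot field sits on its own line eliminates the problem.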
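
The padding approach can be expressed with alignas. The sketch below assumes a 64-byte line, which is typical for x86-64 but not universal; the struct and helper names are hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineSize = 64;  // assumption: typical line size

// Two counters updated by different threads. At 8 bytes each they will
// usually share one cache line, so every write invalidates the other
// thread's copy of the line.
struct SharedCounters {
    std::atomic<std::uint64_t> produced{0};
    std::atomic<std::uint64_t> consumed{0};
};

// Same counters, each forced onto its own cache line.
struct IsolatedCounters {
    alignas(kCacheLineSize) std::atomic<std::uint64_t> produced{0};
    alignas(kCacheLineSize) std::atomic<std::uint64_t> consumed{0};
};

static_assert(sizeof(IsolatedCounters) == 2 * kCacheLineSize,
              "each counter should occupy a full line");

// True if both addresses fall into the same kCacheLineSize-aligned block.
inline bool same_cache_line(const void* a, const void* b) {
    const auto line = [](const void* p) {
        return reinterpret_cast<std::uintptr_t>(p) / kCacheLineSize;
    };
    return line(a) == line(b);
}
```

The cost is memory: IsolatedCounters takes 128 bytes instead of 16, so isolation is worth applying only to fields that are written concurrently by different threads. C++17 also defines std::hardware_destructive_interference_size in the header new as a portable replacement for the hard-coded 64, although not every standard library provides it.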

Memory order

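Modern C++ provides explicit memory-order semantics. The weakest, memory_order_relaxed, guarantees only that the operation itself is atomic; it imposes no ordering on the surrounding loads and stores. A memory_order_release store paired with a memory_order_acquire load that reads its value creates a happens-before relationship, while memory_order_seq_cst additionally enforces a single global total order over all seq_cst operations.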

// relaxed example: nothing orders payload against flag, so the reader
// can see flag == 1 and still race on payload
int payload = 0;
std::atomic<int> flag{0};
void relaxed_writer(int i) {
    payload = flag.load(std::memory_order_relaxed) + i;
    flag.store(1, std::memory_order_relaxed);
}
int relaxed_reader() {
    while (flag.load(std::memory_order_relaxed) == 0) {}
    return payload;
}
// release-acquire example: the release store publishes payload to any
// reader whose acquire load observes flag == 1
void release_writer(int i) {
    payload = flag.load(std::memory_order_relaxed) + i;
    flag.store(1, std::memory_order_release);
}
int acquire_reader() {
    while (flag.load(std::memory_order_acquire) == 0) {}
    return payload;
}
// sequentially‑consistent example
void sc_writer(int i) {
    payload = i;
    flag.store(1, std::memory_order_seq_cst);
}
int sc_reader() {
    while (flag.load(std::memory_order_seq_cst) == 0) {}
    return payload;
}

Benchmarks on multi-core CPUs show that using release / acquire instead of full seq_cst can cut synchronization cost dramatically, especially once cache-line isolation has already removed false sharing.

Source: Baidu Geek Talk
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
