C++ Backend Performance Optimization at Baidu: Memory Access, Allocation, and Concurrency Techniques
This article shares practical performance-optimisation techniques from Baidu's C++ engineers, covering memory-access patterns, custom allocation strategies, string and protobuf handling, cache-line considerations, and memory-order semantics used to achieve significant latency and cost reductions in large-scale backend services.
Background
Baidu runs massive C++ services across many data centers; mastering low‑level characteristics and memory behaviour is essential for extreme performance.
Re‑examining Performance Optimization
String as a Buffer
When interfacing with C-style APIs, developers often resize a std::string to an estimated size, which zero-initialises the buffer unnecessarily before it is overwritten. A custom resize_uninitialized helper avoids this overhead.
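As a minimal sketch of such a helper: C++23's std::string::resize_and_overwrite hands the callback uninitialized storage, so the zero-fill can be skipped entirely; on older standards the code below falls back to a plain (zero-filling) resize. The some_c_style_api body here is a stand-in defined only for the example.

```cpp
#include <cstring>
#include <string>

// Hypothetical C-style producer, defined here only so the sketch is runnable.
size_t some_c_style_api(char* buffer, size_t size) {
    const char msg[] = "payload";
    size_t n = sizeof(msg) - 1 < size ? sizeof(msg) - 1 : size;
    std::memcpy(buffer, msg, n);
    return n;  // bytes actually written
}

std::string some_cxx_style_function(size_t estimate_size) {
    std::string result;
#if defined(__cpp_lib_string_resize_and_overwrite)
    // C++23: the buffer handed to the callback is uninitialized, no zero-fill.
    result.resize_and_overwrite(estimate_size, [](char* buf, size_t size) {
        return some_c_style_api(buf, size);
    });
#else
    result.resize(estimate_size);  // zero-fills, but keeps the fallback correct
    result.resize(some_c_style_api(&result[0], result.size()));
#endif
    return result;
}
```

The shrinking second resize (explicit in the fallback, implicit in the callback's return value) trims the string to what the API actually produced.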
size_t some_c_style_api(char* buffer, size_t size);

void some_cxx_style_function(std::string& result) {
    result.resize(estimate_size);
    auto actual_size = some_c_style_api(result.data(), result.size());
    result.resize(actual_size);
}

Split String
Using absl::StrSplit with std::string_view enables zero-copy tokenisation, dramatically reducing temporary allocations compared with boost::split.

for (std::string_view sv : absl::StrSplit(str, '\t')) {
    direct_work_on_segment(sv);
}

Magic of Protobuf
Repeated merging of protobuf messages can be accelerated by operating on the raw wire format and appending bytes directly, avoiding full deserialization.
final_response.mutable_record(i)->append(one_sub_response.record(i));
final_response.mutable_record(i)->append(another_sub_response.record(i));

Memory Allocation Strategies
tcmalloc vs jemalloc
Both use thread‑caches; tcmalloc performs slightly better with few threads, while jemalloc scales better with many cores due to reduced lock contention.
Job Arena
Allocate a dedicated arena per job; all allocations are freed together when the job finishes, improving both contention and memory locality.
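As an illustrative sketch of the idea (not Baidu's actual allocator), a job arena can be as simple as a bump-pointer allocator over large chunks: each allocation just advances an offset, and the whole job's memory is released in one pass when the arena is destroyed. Assumptions: single-threaded use within one job, power-of-two alignment, and individual allocations no larger than the chunk size.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Minimal bump-pointer job arena (sketch only).
class JobArena {
public:
    explicit JobArena(size_t chunk_size = 64 * 1024) : chunk_size_(chunk_size) {}
    ~JobArena() {
        // Everything the job allocated is freed together.
        for (void* chunk : chunks_) std::free(chunk);
    }

    void* allocate(size_t bytes, size_t align = alignof(std::max_align_t)) {
        size_t offset = (offset_ + align - 1) & ~(align - 1);  // round up
        if (chunks_.empty() || offset + bytes > chunk_size_) {
            chunks_.push_back(std::malloc(chunk_size_));  // start a fresh chunk
            offset = 0;
        }
        offset_ = offset + bytes;
        return static_cast<char*>(chunks_.back()) + offset;
    }

private:
    size_t chunk_size_;
    size_t offset_ = 0;          // bump pointer within the current chunk
    std::vector<void*> chunks_;  // all chunks owned by this job
};
```

A production arena would also have to run destructors for non-trivial objects and handle oversized allocations; the point of the sketch is that per-allocation locking disappears and a job's data stays spatially close.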
Job Reserve
Keep the arena after a job ends, periodically compacting it to restore address continuity, which benefits long‑running services.
Memory Access Optimisation
Sequential Access
Traversing contiguous memory lets the hardware prefetchers stream data into the caches efficiently; a manual __builtin_prefetch can further help with irregular access patterns.
for (size_t i = 0; i < vecs.size(); ++i) {
    if (i + step < vecs.size()) {  // guard the tail to stay in bounds
        __builtin_prefetch(&vecs[i + step]);
    }
    dot_product(vecs[i], input);
}

Concurrent Access
False sharing is avoided by aligning hot data to separate cache lines (typically 64 B). Proper cache‑line isolation reduces unnecessary coherence traffic.
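A minimal sketch of the technique (names are illustrative): each worker thread's counter is forced onto its own 64-byte line with alignas, so relaxed concurrent increments never invalidate another core's cached line.

```cpp
#include <atomic>
#include <thread>

// Each counter occupies a full cache line (64 B assumed), so the four
// writer threads do not ping-pong a shared line between cores.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};
static_assert(sizeof(PaddedCounter) == 64, "padded to one cache line");

PaddedCounter counters[4];  // one per worker thread

void bump(int idx, long n) {
    for (long i = 0; i < n; ++i)
        counters[idx].value.fetch_add(1, std::memory_order_relaxed);
}

long run_workers(long per_thread) {
    std::thread workers[4];
    for (int t = 0; t < 4; ++t)
        workers[t] = std::thread(bump, t, per_thread);
    for (auto& w : workers) w.join();
    long total = 0;
    for (auto& c : counters) total += c.value.load();
    return total;
}
```

Without the alignas, adjacent counters would share a line and every fetch_add would trigger coherence traffic even though the threads touch logically independent data.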
Cache Consistency
The MESI protocol keeps caches coherent; store buffers let writes proceed without stalling the core, and the visibility reordering introduced by store buffers and invalidate queues is what C++ memory-order semantics control.
Memory Order
Different C++ atomic memory orders (relaxed, acquire‑release, sequentially‑consistent) provide varying guarantees. Acquire‑release is sufficient for most producer‑consumer patterns, while sequential consistency adds full ordering at higher cost.
int payload = 0;
std::atomic<int> flag{0};

void release_writer(int i) {
    payload = flag.load(std::memory_order_relaxed) + i;
    flag.store(1, std::memory_order_release);
}

int acquire_reader() {
    while (flag.load(std::memory_order_acquire) == 0) {
    }
    return payload;
}

Conclusion
By combining low‑level memory‑access tuning, custom allocation arenas, zero‑copy string handling, and appropriate atomic memory orders, Baidu engineers achieve multi‑fold performance gains in large‑scale C++ backend services.
High Availability Architecture
Official account for High Availability Architecture.