Unlocking C++ Performance: Surprising Memory‑Access Tricks from Baidu Engineers
This article explores Baidu C++ engineers' deep‑dive into performance optimization, covering memory‑access patterns, string‑buffer handling, protobuf merging, malloc strategies, job‑arena allocation, cache‑line effects, and modern memory‑order semantics to achieve multi‑fold speedups in large‑scale backend services.
Background
Behind Baidu's seemingly simple web UI lie massive C++ services running in data centers nationwide. Mastering low‑level C++ features and using them for performance tuning is a mandatory skill for Baidu engineers.
Re‑thinking performance optimization
Performance is not only about algorithmic complexity; constant‑factor improvements through low‑level tricks can yield order‑of‑magnitude gains.
String as a buffer
size_t some_c_style_api(char* buffer, size_t size);
void some_cxx_style_function(std::string& result) {
    result.resize(estimate_size); // zero-initializes estimate_size bytes
    auto actual_size = some_c_style_api(result.data(), result.size());
    result.resize(actual_size);
}
The initial resize zero-initializes the whole buffer, which is wasted work because the C API overwrites that memory anyway.
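For comparison, standard C++23 addresses the same waste with std::string::resize_and_overwrite. The sketch below is an illustration under stated assumptions, not Baidu's code: fill_greeting is a hypothetical stand-in for the C-style API, and the fallback branch keeps the example compiling on toolchains without C++23 library support.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the C-style API: writes up to `size` bytes
// into `buffer` and returns how many bytes it actually wrote.
std::size_t fill_greeting(char* buffer, std::size_t size) {
    assert(buffer != nullptr);
    static const char text[] = "hello";
    const std::size_t n = size < 5 ? size : 5;
    std::memcpy(buffer, text, n);
    return n;
}

void fill_without_zeroing(std::string& result, std::size_t estimate_size) {
#if defined(__cpp_lib_string_resize_and_overwrite)
    // C++23: the callback receives storage that has not been zero-filled and
    // returns the final length, so no byte is written twice.
    result.resize_and_overwrite(
        estimate_size,
        [](char* buffer, std::size_t size) { return fill_greeting(buffer, size); });
#else
    // Fallback for older standard libraries: correct, but pays for the zero fill.
    result.resize(estimate_size);
    result.resize(fill_greeting(&result[0], result.size()));
#endif
}
```

Calling fill_without_zeroing(s, 64) leaves s holding exactly the bytes the API wrote, regardless of which branch was compiled.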
Baidu's fix is a resize that skips initialization. The source attributes resize_uninitialized to recent GCC versions, and Baidu wraps it portably as babylon::resize_uninitialized:
void some_cxx_style_function(std::string& result) {
    auto* buffer = babylon::resize_uninitialized(result, estimate_size);
    auto actual_size = some_c_style_api(buffer, result.size());
    result.resize(actual_size);
}
Split string
Typical log‑parsing splits on a single delimiter. Boost split is flexible but slower than Google's absl::StrSplit, especially when zero‑copy is possible.
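To make the zero-copy idea concrete, here is a minimal single-delimiter splitter built only on std::string_view. It is a sketch, not absl::StrSplit or babylon::split: the names for_each_field and total_field_length are hypothetical, and C++17 is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: calls `on_segment` once per field with a view into
// `input`, so no field is copied. Returns the number of fields.
template <typename Callback>
std::size_t for_each_field(std::string_view input, char delimiter, Callback&& on_segment) {
    std::size_t count = 0;
    std::size_t begin = 0;
    while (true) {
        const std::size_t end = input.find(delimiter, begin);
        if (end == std::string_view::npos) {
            on_segment(input.substr(begin));  // last field, may be empty
            return count + 1;
        }
        on_segment(input.substr(begin, end - begin));
        ++count;
        begin = end + 1;
    }
}

// Usage: sum the lengths of all tab-separated fields without copying any of them.
std::size_t total_field_length(std::string_view line) {
    std::size_t total = 0;
    const std::size_t fields =
        for_each_field(line, '\t', [&](std::string_view sv) { total += sv.size(); });
    assert(total + (fields - 1) == line.size());  // fields plus delimiters cover the line
    return total;
}
```

Each view is valid only while the underlying buffer is alive, which is the price of skipping the copy.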
std::vector<std::string> tokens;
// boost::split(tokens, str, [](char c) { return c == '\t'; });
for (std::string_view sv : absl::StrSplit(str, '\t')) {
    tokens.emplace_back(sv); // copies each segment into its own std::string
}
for (std::string_view sv : absl::StrSplit(str, '\t')) {
    direct_work_on_segment(sv); // zero-copy: sv points into str
}
Baidu's final implementation uses a custom splitter that works directly on the buffer:
babylon::split([](std::string_view sv) {
    direct_work_on_segment(sv);
}, str, '\t');
Magic of protobuf
Protobuf is the primary data‑exchange format in Baidu services. Merging many partial records traditionally requires repeated parsing and re‑serialization, which is costly.
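message Field { bytes column = 1; bytes value = 2; }
message Record { bytes key = 1; repeated Field field = 2; }
message Response { repeated Record record = 1; bytes error_message = 2; }
Instead of deep parsing, Baidu changes the proxy's message definition to hold each serialized sub-record as a raw string and concatenates them. This works because a length-delimited string field has the same wire encoding as an embedded message, and concatenating two serialized messages is equivalent to merging them: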
message ProxyResponse {
    repeated string record = 1; // raw bytes of each Record
    bytes error_message = 2;
}
// after fetching sub-responses; mutable_record(i) returns a std::string*
final_response.mutable_record(i)->append(one_sub_response.record(i));
final_response.mutable_record(i)->append(another_sub_response.record(i));
Performance recap
Three major contributors to latency are algorithmic complexity, I/O overhead, and concurrency. Many Baidu case studies improve performance without changing algorithms, by exploiting CPU cache, SIMD, and memory‑access characteristics.
Memory allocation
tcmalloc vs jemalloc
Both allocators use per-thread caches to reduce contention. tcmalloc backs them with a single page heap, while jemalloc spreads threads across many arenas. tcmalloc wins at low thread counts; jemalloc scales better with many cores and high allocation and free rates.
Job arena
For request‑level jobs with a clear lifetime, allocate all dynamic memory from a dedicated arena. The arena is released in one step when the job finishes, eliminating per‑allocation contention and improving address locality.
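The same idea can be sketched with standard C++17 parts, assuming std::pmr is available in the toolchain. This is an illustration rather than Baidu's arena: handle_job is a hypothetical request handler, and the 64 KiB initial size is an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

// Hypothetical request handler: every container below draws from `arena`,
// which stands in for a per-job arena.
std::size_t handle_job(std::size_t token_count) {
    std::pmr::monotonic_buffer_resource arena(64 * 1024);  // released all at once in its destructor
    std::pmr::vector<std::pmr::string> tokens(&arena);
    tokens.reserve(token_count);
    for (std::size_t i = 0; i < token_count; ++i) {
        // The vector hands its allocator to each element, so the string
        // payload also lives in the arena.
        tokens.emplace_back("token-that-is-too-long-for-small-string-optimization");
        assert(tokens.back().get_allocator().resource() == &arena);
    }
    std::size_t total = 0;
    for (const auto& token : tokens) total += token.size();
    return total;  // no per-object free: destroying `arena` drops everything
}
```

Because the monotonic resource only bumps a pointer, allocations made by one job sit next to each other in memory, which is the locality benefit described above.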
// Example using protobuf arena
google::protobuf::Arena arena;
auto* msg = google::protobuf::Arena::CreateMessage<MyMessage>(&arena);
// use msg …
// the arena frees everything at once when it is destroyed at job end
Job reserve
Job reserve extends the job arena by keeping the memory after the job ends and compacting it periodically. Containers such as std::vector<std::string> are cleared without destroying their elements, so the next job reuses the existing buffers; periodically they are rebuilt into a new contiguous block to restore locality.
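A minimal sketch of the "clear without destroying" half of that idea follows. ReusableStrings is a hypothetical container written for illustration, not Babylon's API, and it leaves out the periodic rebuild into fresh contiguous memory.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical container: strings stay constructed across jobs, so their
// heap buffers are reused instead of freed and reallocated.
class ReusableStrings {
public:
    // Logical clear: forget the elements but keep every string alive.
    void clear() { size_ = 0; }

    void push_back(std::string_view value) {
        if (size_ == slots_.size()) slots_.emplace_back();
        slots_[size_].assign(value);  // reuses the slot's existing capacity when it fits
        ++size_;
    }

    const std::string& operator[](std::size_t i) const {
        assert(i < size_);
        return slots_[i];
    }
    std::size_t size() const { return size_; }

private:
    std::vector<std::string> slots_;  // may hold more live strings than size_
    std::size_t size_ = 0;
};
```

After the first job warms up the slots, later jobs that store values of similar size typically run without touching the allocator at all.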
std::vector<std::string> vec;
vec.clear(); // plain std::vector::clear keeps capacity but destroys the strings; job reserve uses a clear that keeps them alive
// later, rebuild into fresh memory to regain contiguity
Memory access
Sequential access
Contiguous address streams let hardware prefetchers pull data into the L1/L2 caches ahead of use, which hides most of the memory latency.
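// Large buffer of fixed-size vectors
std::vector<float> large_memory_buffer;
std::vector<float*> vecs; // pointers to each fixed-size vector inside large_memory_buffer
std::shuffle(vecs.begin(), vecs.end(), rng); // randomize the visiting order
for (size_t i = 0; i < vecs.size(); ++i) {
    if (i + step < vecs.size()) { // stay inside vecs; step is the prefetch distance
        __builtin_prefetch(vecs[i + step]);
    }
    dot_product(vecs[i], query);
}
Benchmarks in the source show that shuffled (non-contiguous) access plus prefetch can still beat pure contiguous access when the prefetch distance is tuned.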
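The loop above can be fleshed out into a self-contained sketch. The names score_rows and prefetch_distance are hypothetical, the prefetch macro assumes GCC or Clang and degrades to a no-op elsewhere, and a single prefetch only requests the first cache line of each row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(address) __builtin_prefetch(address)
#else
#define PREFETCH_READ(address) ((void)0)  // no-op where the builtin is unavailable
#endif

float dot_product(const float* a, const float* b, std::size_t dim) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) sum += a[i] * b[i];
    return sum;
}

// Scores every row against `query`, visiting rows in the order given by `rows`.
// `prefetch_distance` is a tuning knob, not a value from the source article.
std::vector<float> score_rows(const std::vector<const float*>& rows,
                              const std::vector<float>& query,
                              std::size_t prefetch_distance) {
    std::vector<float> scores;
    scores.reserve(rows.size());
    for (std::size_t i = 0; i < rows.size(); ++i) {
        // Request a row we will need soon, staying inside the pointer table.
        if (i + prefetch_distance < rows.size()) {
            PREFETCH_READ(rows[i + prefetch_distance]);
        }
        assert(rows[i] != nullptr);
        scores.push_back(dot_product(rows[i], query.data(), query.size()));
    }
    return scores;
}
```

The prefetch is only a hint, so the scores are identical for any distance; what changes is how much memory latency overlaps with computation, which has to be measured on the target hardware.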
Concurrent access
A cache line (typically 64 B) is the unit of transfer between cores and memory. Placing unrelated data that different threads write in the same line causes false sharing: each write invalidates the other cores' copy of the line, inflating contention. Padding or splitting the struct eliminates the problem.
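The padding fix can be sketched with alignas. The 64-byte line size is an assumption that matches common x86-64 parts, and all type and function names here are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; check the target CPU before relying on it.
constexpr std::size_t kCacheLineSize = 64;

// Each counter gets a cache line to itself, so a writer on one core does not
// invalidate the line another core is using for its own counter.
struct alignas(kCacheLineSize) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};
static_assert(sizeof(PaddedCounter) == kCacheLineSize, "one counter per cache line");

// True when two objects start in the same cache line.
inline bool same_cache_line(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLineSize ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLineSize;
}

// Adjacent 8-byte counters: likely to share a line and to suffer false sharing.
struct PackedCounters {
    std::atomic<std::uint64_t> a{0}, b{0};
};

// Padded counters: thread 0 would update `a`, thread 1 would update `b`.
struct IsolatedCounters {
    PaddedCounter a, b;
};

// Debug-time layout check for code that depends on the isolation.
inline void check_isolated(const IsolatedCounters& counters) {
    assert(!same_cache_line(&counters.a, &counters.b));
}
```

The cost is memory: IsolatedCounters occupies two full lines for 16 bytes of payload, so padding is worth applying only to data that several threads write frequently.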
Memory order
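Modern C++ provides explicit memory-order semantics. The weakest, memory_order_relaxed, guarantees only that the operation itself is atomic; it imposes no ordering on the surrounding loads and stores. A memory_order_release store paired with a memory_order_acquire load that reads it creates a happens-before relationship, while memory_order_seq_cst additionally places all such operations in a single global total order.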
// relaxed example
int payload = 0;
std::atomic<int> flag{0};
void relaxed_writer(int i) {
    payload = flag.load(std::memory_order_relaxed) + i;
    flag.store(1, std::memory_order_relaxed);
}
int relaxed_reader() {
    while (flag.load(std::memory_order_relaxed) == 0) {}
    return payload; // not ordered after the flag: may read a stale payload (a data race)
}
// release-acquire example
void release_writer(int i) {
    payload = flag.load(std::memory_order_relaxed) + i;
    flag.store(1, std::memory_order_release); // publishes the payload write
}
int acquire_reader() {
    while (flag.load(std::memory_order_acquire) == 0) {}
    return payload; // guaranteed to see the writer's payload
}
// sequentially-consistent example
void sc_writer(int i) {
    payload = i;
    flag.store(1, std::memory_order_seq_cst);
}
int sc_reader() {
    while (flag.load(std::memory_order_seq_cst) == 0) {}
    return payload;
}
Benchmarks on multi-core CPUs show that using release / acquire instead of full seq_cst can cut synchronization overhead dramatically, especially when cache-line isolation is already applied.
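The release / acquire pairing is also the basis of lock-free hand-offs. The sketch below applies it to a single-producer/single-consumer ring buffer; SpscQueue is a hypothetical class written for illustration and is not taken from the source.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical single-producer/single-consumer queue: exactly one thread may
// call push() and exactly one other thread may call pop().
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity > 0, "capacity must be positive");

public:
    bool push(const T& value) {
        // Only the producer writes tail_, so a relaxed load of it is enough.
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        // Acquire pairs with the consumer's release: freed slots are safe to reuse.
        const std::size_t head = head_.load(std::memory_order_acquire);
        assert(tail - head <= Capacity);
        if (tail - head == Capacity) return false;  // full
        slots_[tail % Capacity] = value;
        // Release publishes the slot write before the new tail becomes visible.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    bool pop(T& out) {
        // Only the consumer writes head_, so a relaxed load of it is enough.
        const std::size_t head = head_.load(std::memory_order_relaxed);
        // Acquire pairs with the producer's release: the slot contents are visible.
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head == tail) return false;  // empty
        out = slots_[head % Capacity];
        // Release hands the slot back to the producer.
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

private:
    std::array<T, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```

No operation here needs seq_cst because each index has a single writer and every data hand-off goes through one release store and one matching acquire load. In a real deployment head_ and tail_ would also be placed on separate cache lines.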
Source: Baidu Geek Talk
21CTO