Memory Leak Investigation and Optimization of Protocol Buffers in a Core Backend Module

This case study details how a Baidu QA team diagnosed a long‑standing memory‑growth issue caused by improper use of Protocol Buffers, identified the problematic merged_data structure, applied a scoped_ptr reset fix, and validated the improvement with monitoring and unit tests.

Baidu Intelligent Testing

Dec 9, 2015

The QA team at Baidu's Quality Department compiled several bug‑analysis cases, including a memory‑growth problem in a core backend module caused by misuse of Protocol Buffers.

Protocol Buffers (protobuf) is a language-neutral, platform-neutral serialization mechanism that generates code from .proto definitions for Java, C++, and Python, offering a compact binary format and fast parsing.

Problem Symptoms : Starting in May, the module's memory usage crept upward; by September it was climbing from 40 GB to 50 GB within about 70 hours, triggering OOM alarms and frequent restarts and severely disrupting product development.

Problem Reproduction : Offline QA environments could not fully reproduce the online behavior. Online, memory kept rising until it was exhausted; offline, it stabilized after a few hours, which made debugging harder.

Initial Investigation : The team first examined recent small version upgrades but found no correlation with the memory growth. They then hypothesized that the cause was data structures growing at runtime rather than a classic leak.

Data hot‑loading was ruled out because no data files changed in the offline environment.

Module-by-Module Binary Search : The module is a chain of 13 sub-modules. The team repeatedly cut the chain in half and re-measured; discarding everything after a certain point in the chain reduced memory growth by 30%, which pointed to module A as the primary suspect.
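The halving procedure can be sketched as a bisection over the length of the module chain. This is only an illustration under simplifying assumptions: `growth_with_prefix` is a hypothetical probe that runs the chain with the first `k` sub-modules enabled and returns the observed memory growth, `threshold` is a hypothetical cutoff, and growth is assumed to be monotone in `k`. The measurements described in the article (a 30% reduction, not a clean on/off signal) were noisier than this.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical probe: runs the chain with only the first `k` sub-modules
// enabled and returns the observed memory growth (for example, MB per hour).
using GrowthProbe = std::function<double(std::size_t k)>;

// Returns the 1-based index of the first sub-module whose inclusion pushes
// growth above `threshold`, or 0 if even the full chain stays at or below it.
// Assumes growth is monotone in k: once the offending module is included,
// every longer prefix also exceeds the threshold.
std::size_t FindFirstGrowingModule(std::size_t module_count,
                                   double threshold,
                                   const GrowthProbe& growth_with_prefix) {
  if (module_count == 0 || growth_with_prefix(module_count) <= threshold) {
    return 0;  // nothing to find
  }
  std::size_t lo = 1;             // smallest prefix that could be bad
  std::size_t hi = module_count;  // known-bad prefix
  while (lo < hi) {
    std::size_t mid = lo + (hi - lo) / 2;
    if (growth_with_prefix(mid) > threshold) {
      hi = mid;      // offender is within the first `mid` modules
    } else {
      lo = mid + 1;  // offender comes after module `mid`
    }
  }
  return lo;
}
```

With 13 sub-modules, a search of this shape needs only a handful of probe runs instead of one run per sub-module, which matters when each run takes hours.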

Unit‑Test Verification : Instead of long‑running manual tests, the team wrote unit tests that exercised module A, reducing verification time from a full day to about 30 minutes per version.

Monitoring and Root Cause : Monitoring revealed that the merged_data structure inside module A continuously grew; its memory increase matched the overall module growth, indicating that merged_data was the culprit.

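Problem Analysis : In C++, protobuf's Clear() method resets fields to their default values but does not release the memory the message has already allocated; that memory is kept for reuse. For large messages whose size varies widely, each message therefore holds on to the memory needed for the largest content it has ever carried, so repeated Clear() calls never bring memory consumption back down.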
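The retention effect can be illustrated without protobuf itself. The sketch below is a stand-in, not generated protobuf code: `MergedDataStandIn` and its members are hypothetical names, and `std::vector` is used because its `clear()` likewise empties the container while keeping the buffer allocated.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a generated protobuf message with one repeated field.
// This is NOT protobuf code; std::vector is used because its clear()
// keeps capacity, which mirrors how a message keeps memory for reuse.
struct MergedDataStandIn {
  std::vector<int> items;

  // Resets the logical content but leaves the buffer allocated.
  void Clear() { items.clear(); }

  // Bytes still reserved by the container, regardless of logical size.
  std::size_t ReservedBytes() const { return items.capacity() * sizeof(int); }
};

// Fills the message with `n` items, clears it, and reports what stays reserved.
std::size_t ReservedAfterFillAndClear(MergedDataStandIn& msg, std::size_t n) {
  msg.items.assign(n, 0);
  msg.Clear();
  return msg.ReservedBytes();
}
```

Calling `ReservedAfterFillAndClear(msg, 1000000)` and then `ReservedAfterFillAndClear(msg, 10)` on the same object leaves the reservation at the size of the large request: one oversized payload sets the footprint for every later, smaller one. On a real generated message, the `SpaceUsed()` method is one way to observe the analogous figure; whether the team used it for their monitoring is not stated in the article.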

Solution : Hold the message in a scoped_ptr and, instead of calling Clear() on it, call reset() on the scoped_ptr with a newly allocated message. The reset() deletes the existing message, freeing its memory, and the module continues with a fresh one. The optimized code appears only as images in the original article.
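Since the original code is available only as images, the following is a minimal sketch of the pattern rather than the team's actual change. `ModuleAStandIn` and `MessageStandIn` are hypothetical names, the message is a plain struct instead of a generated protobuf class, and `std::unique_ptr` stands in for `scoped_ptr` because both delete the old object on `reset()`.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for the generated message type (not real protobuf code).
struct MessageStandIn {
  std::vector<int> items;
  std::size_t ReservedBytes() const { return items.capacity() * sizeof(int); }
};

// Hypothetical owner of merged_data. std::unique_ptr plays the role that
// scoped_ptr plays in the article: reset() deletes the old object.
class ModuleAStandIn {
 public:
  ModuleAStandIn() : merged_data_(new MessageStandIn) {}

  MessageStandIn& merged_data() { return *merged_data_; }

  // Before the fix (as described): reuse the object, keeping its buffers.
  void ResetByClear() { merged_data_->items.clear(); }

  // After the fix: destroy the old message and start from a fresh one, so
  // buffers sized for an earlier large payload go back to the allocator.
  void ResetByReallocate() { merged_data_.reset(new MessageStandIn); }

 private:
  std::unique_ptr<MessageStandIn> merged_data_;
};
```

The trade-off is one extra allocation and deallocation per cycle in exchange for a footprint that tracks the current payload instead of the historical maximum.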

Verification : Post-optimization graphs show memory rising markedly more slowly than in the baseline version. CPU idle time also increased, so CPU consumption went down despite the added allocation overhead.

Summary :

Protobuf's Clear() keeps a message's allocated memory cached for reuse rather than releasing it; large, variable-size messages should be deleted and recreated regularly.

Implement fine‑grained memory and CPU monitoring for each module to detect anomalies early.

Unit tests are an efficient way to validate memory‑related fixes, especially for modules where full‑system testing is costly.

Reference: http://code.google.com/p/protobuf

