
Performance Optimization Techniques: Replacing Protobuf with C++ Classes, Cache‑Friendly Structures, jemalloc, and Lock‑Free Designs

This article presents practical performance‑optimization strategies for high‑throughput C++ services, including replacing Protobuf with hand‑written classes, adopting cache‑friendly data structures, using jemalloc/tcmalloc instead of the default allocator, employing lock‑free double‑buffer designs, tailoring data formats for specific workloads, and leveraging profiling tools to measure gains.

High Availability Architecture

Performance optimization is essential for reducing costs and increasing efficiency, especially in systems handling massive API traffic; the author outlines several concrete techniques demonstrated with real‑world benchmarks.

1. Replace Protobuf with C++ classes – Protobuf’s arena allocator can cause memory fragmentation and slower allocation for large numbers of small objects. The article shows a Protobuf definition and a hand‑written equivalent C++ class, then measures copy and destruction costs, revealing a ~3× speedup for the hand‑written class.

message Param {
  optional string name = 1;
  optional string value = 2;
}

message ParamHit {
  enum Type {
    Unknown = 0;
    WhiteList = 1;
    LaunchLayer = 2;
    BaseAB = 3;
    DefaultParam = 4;
  }
  optional Param param = 1;
  optional uint64 group_id = 2;
  optional uint64 expt_id = 3;
  optional uint64 launch_layer_id = 4;
  optional string hash_key_used = 5;
  optional string hash_key_val_used = 6;
  optional Type type = 7;
  optional bool is_hit_mbox = 8;
}
class ParamHitInfo {
public:
  class Param { ... };
  ParamHitInfo();
  void Clear();
  const ParamHit ToProtobuf() const;
  // getters, setters, and utility methods omitted for brevity
private:
  ParamHit_Type type_;
  uint64_t group_id_, expt_id_, launch_layer_id_;
  std::string hash_key_used_, hash_key_val_used_;
  bool is_hit_mbox_;
  Param param_;
};

Benchmark code creates vectors of ParamHit (Protobuf) and ParamHitInfo (class) objects, copies them repeatedly, and reports execution time; the class version consistently outperforms the Protobuf version.

2. Cache‑friendly data structures – The author compares a naïve unordered_map based hash table with a hybrid design that stores frequently accessed keys in a contiguous vector, improving cache locality. Benchmarks show the vector‑based approach is faster for the same workload.

class HitContext {
public:
  inline void update_hash_key(const std::string &key, const std::string &val) { hash_keys_[key] = val; }
  inline const std::string* search_hash_key(const std::string &key) const {
    auto it = hash_keys_.find(key);
    return it != hash_keys_.end() ? &(it->second) : nullptr;
  }
private:
  std::unordered_map<std::string, std::string> hash_keys_;
};
class HitContext {
public:
  inline void update_hash_key(const std::string &key, uint32_t val) {
    if (Misc::IsSnsHashKey(key)) {
      uint32_t sns_id = Misc::FastAtoi(key.c_str() + Misc::SnsHashKeyPrefix().size());
      sns_hash_keys_.emplace_back(sns_id, val);
      return;
    }
    hash_keys_[key] = Misc::UInt32ToLittleEndianBytes(val);
  }
  inline const std::string search_hash_key(const std::string &key, bool &find) const {
    if (Misc::IsSnsHashKey(key)) {
      uint32_t sns_id = Misc::FastAtoi(key.c_str() + Misc::SnsHashKeyPrefix().size());
      auto it = std::find_if(sns_hash_keys_.rbegin(), sns_hash_keys_.rend(),
                             [sns_id](const std::pair<uint32_t, uint32_t> &v){ return v.first == sns_id; });
      find = it != sns_hash_keys_.rend();
      return find ? Misc::UInt32ToLittleEndianBytes(it->second) : "";
    }
    auto it = hash_keys_.find(key);
    find = it != hash_keys_.end();
    return find ? it->second : "";
  }
private:
  std::unordered_map<std::string, std::string> hash_keys_;
  std::vector<std::pair<uint32_t, uint32_t>> sns_hash_keys_;
};

Performance tests using gtest confirm the vector‑based version reduces lookup latency.

3. Use jemalloc/tcmalloc – Replacing the default malloc with jemalloc reduces fragmentation, improves cache friendliness, and avoids contention on a global allocator lock in multithreaded scenarios. Adding a dependency in the build file enables the allocator with minimal effort, and benchmarks of the business workload show a ~20% latency reduction.

cc_library(
    name = "mmexpt_dye_api",
    srcs = ["mmexpt_dye_api.cc"],
    hdrs = ["mmexpt_dye_api.h"],
    deps = ["//mm3rd/jemalloc:jemalloc"],
    copts = ["-O3", "-std=c++11"],
    visibility = ["//visibility:public"],
)
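For a quick experiment before touching the build, jemalloc can also be swapped in at run time via the dynamic loader; the library path below is an assumption (it is typical for Debian/Ubuntu, and the binary name is illustrative):

```shell
# Preload jemalloc without relinking; verify it is active with MALLOC_CONF.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
MALLOC_CONF=stats_print:true \
./mmexpt_dye_api_server
```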

4. Lock‑free double‑buffer design – For read‑heavy scenarios with occasional writes, a two‑buffer scheme allows readers to access one buffer while writers populate the other, eliminating locks. The article provides the shared‑memory structure, initialization, switch‑to‑write/read functions, and explains the workflow.

struct expt_api_new_shm {
  void *p_shm_data;
  volatile int *p_mem_switch; // 0:uninit, 1:mem1, 2:mem2
  uint32_t *p_crc_sum;
  expt_new_context* p_new_context;
  parameter2business* p_param2business;
  char* p_business_cache;
  HashTableWithCache hash_table;
};

Functions such as InitExptNewShmData, SwitchNewShmMemToWrite, SwitchNewShmMemToRead, and ResetExptNewShmData manage the buffers.

5. Tailor data formats to the use case – In a “dye” scenario only experiment ID, group ID, and bucket source are needed. The author replaces a bulky Protobuf‑like struct with a lightweight DyeHitInfo struct, achieving noticeable speed gains.

struct DyeHitInfo {
  int expt_id, group_id;
  uint64_t bucket_src;
  DyeHitInfo() : expt_id(0), group_id(0), bucket_src(0) {}
  DyeHitInfo(int e, int g, uint64_t b) : expt_id(e), group_id(g), bucket_src(b) {}
  bool operator<(const DyeHitInfo &hit) const {
    if (expt_id == hit.expt_id) {
      if (group_id == hit.group_id) return bucket_src < hit.bucket_src;
      return group_id < hit.group_id;
    }
    return expt_id < hit.expt_id;
  }
  bool operator==(const DyeHitInfo &hit) const { return expt_id==hit.expt_id && group_id==hit.group_id && bucket_src==hit.bucket_src; }
  std::string ToString() const {
    char buf[128];
    snprintf(buf, sizeof(buf), "expt_id: %d, group_id: %d, bucket_src: %llu",
             expt_id, group_id, (unsigned long long)bucket_src);
    return std::string(buf);
  }
};

Benchmarks show the streamlined struct reduces serialization and processing overhead.

6. Performance testing tools – The author recommends Linux perf, gprof, Valgrind, strace, Godbolt for assembly inspection, and FlameGraph for visualizing hotspots.
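A typical perf + FlameGraph workflow looks like the following; the script paths are assumptions (the stackcollapse and flamegraph scripts come from Brendan Gregg's FlameGraph repository and must be on your PATH or referenced directly):

```shell
# Sample call stacks of a running process at 99 Hz for 30 seconds.
perf record -F 99 -g -p <pid> -- sleep 30

# Fold the stacks and render an interactive SVG of the hotspots.
perf script | stackcollapse-perf.pl | flamegraph.pl > hotspots.svg
```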

Conclusion – Continuous monitoring and incremental optimization are crucial; avoid over‑optimizing at the expense of maintainability, and always weigh performance gains against development cost.

Tags: Performance Optimization, C++, Protobuf, cache-friendly, jemalloc, lock-free
Written by High Availability Architecture (official account).