Performance Optimization of Bilibili's Online Inference Service for the Effect Advertising Engine
To cope with soaring traffic on Bilibili's effect-advertising engine, the team systematically quantified latency, eliminated redundant Redis calls, migrated the data-exchange format from JSON to Protobuf, applied branch-prediction hints, loop unrolling, and 256-bit AVX SIMD, and introduced object pooling and an inverted-index request format. Together, these measures cut CPU usage by 21% and lifted peak throughput by 13%.
As a leading Chinese video platform, Bilibili (B‑Station) is experiencing rapid growth in traffic and user scale. Under limited server resources, optimizing the performance of online services and improving resource utilization has become a critical challenge for the engineering team.
This article uses the author’s commercial technology center as a case study to discuss the online inference part of the effect advertising engine. It first introduces the project background and current system status, then details optimization strategies for performance metrics quantification, service calls, CPU computation, memory management, and network I/O, and finally summarizes the experience and future directions.
Project Background
The team is responsible for the online effect advertising engine, which drives significant revenue through precise ad delivery. The engine processes large volumes of ad candidates, performing feature extraction and model scoring (LR, FM, DNN). As the business scales, the service faces higher throughput and latency demands, especially in the inference stage that consumes about 45% of CPU resources.
System Status
The engine consists of several services: Retrieval Engine, Effect Advertising Retrieval Service, Recall/Coarse‑Ranking Service, and Inference Service. The inference service is the most resource‑intensive component.
Optimization Measures
Performance Metric Quantification
Monitoring is established via instrumentation (BRPC metrics) to capture module‑level latency, percentiles, and RPC costs. Sampling is applied to reduce measurement overhead while preserving accuracy.
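The sampling idea can be sketched as follows. This is a minimal illustration with hypothetical names, not the team's actual BRPC instrumentation: only every N-th observation is recorded, which keeps measurement overhead low while still yielding usable percentiles.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sampled latency recorder (illustrative): records one out of every
// `sample_rate` observations and computes percentiles over the sample.
class LatencySampler {
public:
    explicit LatencySampler(std::size_t sample_rate) : sample_rate_(sample_rate) {}

    void Record(double latency_us) {
        if (++counter_ % sample_rate_ == 0) samples_.push_back(latency_us);
    }

    // p-th percentile (0 < p <= 100) of the sampled latencies.
    double Percentile(double p) const {
        if (samples_.empty()) return 0.0;
        std::vector<double> sorted = samples_;
        std::sort(sorted.begin(), sorted.end());
        std::size_t rank = static_cast<std::size_t>(p / 100.0 * sorted.size());
        if (rank >= sorted.size()) rank = sorted.size() - 1;
        return sorted[rank];
    }

    std::size_t sample_count() const { return samples_.size(); }

private:
    std::size_t sample_rate_;
    std::size_t counter_ = 0;
    std::vector<double> samples_;
};
```

With a 1-in-10 sampling rate, only 10% of requests pay the recording cost, yet tail percentiles remain representative for high-traffic modules.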
Service Call Optimization
Redundant Redis accesses were eliminated by moving user‑side data fetching to the retrieval service. Data exchange format was upgraded from JSON to Protobuf3, reducing payload size and serialization cost.
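As an illustration of the format change (field names here are hypothetical, not the production schema), a payload that was previously a JSON object becomes a compact binary message:

```protobuf
syntax = "proto3";

// Hypothetical user-side payload. On the wire, Protobuf replaces JSON's
// textual field names with one-byte tags and varint-encodes integers,
// shrinking both payload size and serialization cost.
message UserFeatures {
  uint64 user_id = 1;
  repeated uint64 interest_tags = 2;  // packed by default in proto3
  float ctr_estimate = 3;
}
```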
CPU Compute Optimization
Techniques such as branch prediction hints, loop unrolling, data locality improvement, and SIMD vectorization were applied. Example code snippets illustrate these practices:
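For branch-prediction hints, a minimal sketch using the GCC/Clang `__builtin_expect` builtin (the filter condition and function name are hypothetical, for illustration only):

```cpp
#include <cstddef>
#include <cstdint>

// Branch-prediction hints: tell the compiler which side of a branch is
// hot so the common path stays on the fall-through instruction stream.
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

int64_t sum_valid(const int64_t* scores, std::size_t n) {
    int64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Invalid (negative) scores are assumed rare, so hint the cold path.
        if (UNLIKELY(scores[i] < 0)) continue;
        total += scores[i];
    }
    return total;
}
```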
```cpp
// Loop unrolling: process four elements per iteration to reduce loop
// overhead and expose instruction-level parallelism.
for (uint32_t idx = start_idx; idx + 3 < end_idx; idx += 4) {
    result[value[idx]].emplace_back(feaid, ins);
    result[value[idx + 1]].emplace_back(feaid, ins);
    result[value[idx + 2]].emplace_back(feaid, ins);
    result[value[idx + 3]].emplace_back(feaid, ins);
}
// Handle the remaining (end_idx - start_idx) % 4 iterations.
for (uint32_t idx = end_idx - (end_idx - start_idx) % 4; idx < end_idx; ++idx) {
    result[value[idx]].emplace_back(feaid, ins);
}
```

Function-pointer dispatch was used to hoist conditional branches out of hot loops:
```cpp
// Member-function-pointer dispatch: resolve the field accessor once,
// outside the loop, instead of branching on field_type for every ad.
typedef int64_t (AdInfo::*field_func)(void) const;

static field_func get_field_func(int field_type) {
    switch (field_type) {
        case 1: return &AdInfo::id1;
        case 2: return &AdInfo::id2;
        case 3: return &AdInfo::id3;
        default: return nullptr;
    }
}

auto selected_func = get_field_func(field_type);
if (selected_func != nullptr) {
    for (const auto& ad_info : ad_info_list) {
        auto val = (ad_info.*selected_func)();
        // ...
    }
}
```

256-bit AVX (AVX2) SIMD intrinsics were employed for vector dot-product calculations:
```cpp
#include <immintrin.h>

float dot_product_avx256(const std::vector<float>& vec1,
                         const std::vector<float>& vec2) {
    if (vec1.size() != vec2.size()) return 0;
    size_t vec_size = vec1.size();
    size_t block_width = 8;  // eight floats per 256-bit register
    size_t loop_cnt = vec_size / block_width;
    __m256 sum = _mm256_setzero_ps();
    for (size_t i = 0; i < loop_cnt * block_width; i += block_width) {
        __m256 a = _mm256_loadu_ps(&vec1[i]);
        __m256 b = _mm256_loadu_ps(&vec2[i]);
        sum = _mm256_add_ps(sum, _mm256_mul_ps(a, b));
    }
    // Horizontally reduce the eight partial sums to a scalar.
    __m256 hsum = _mm256_hadd_ps(sum, sum);
    __m256 hsum2 = _mm256_hadd_ps(hsum, hsum);
    float result[8];
    _mm256_storeu_ps(result, hsum2);
    float dot = result[0] + result[4];
    // Scalar tail for the remaining vec_size % block_width elements.
    for (size_t i = loop_cnt * block_width; i < vec_size; ++i) {
        dot += vec1[i] * vec2[i];
    }
    return dot;
}
```

Memory Management
SessionData objects (BRPC) were pre‑allocated and pooled. Object‑pool size is dynamically adjusted, and excess objects are reclaimed probabilistically to avoid performance jitter. This reduced runtime memory usage by 15‑22%.
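The pooling-with-probabilistic-reclamation idea can be sketched as below. Names are illustrative, not the team's BRPC SessionData pool: objects are reused from a free list, and once the pool exceeds a soft limit, returned objects are dropped with some probability, spreading deallocation cost across many requests instead of producing a latency spike.

```cpp
#include <cstdlib>
#include <memory>
#include <vector>

// Object pool with probabilistic shrinking (illustrative sketch).
template <typename T>
class ObjectPool {
public:
    ObjectPool(std::size_t soft_limit, unsigned reclaim_denominator)
        : soft_limit_(soft_limit), reclaim_denominator_(reclaim_denominator) {}

    // Reuse a pooled object when available; otherwise allocate a new one.
    std::unique_ptr<T> Acquire() {
        if (free_list_.empty()) return std::make_unique<T>();
        std::unique_ptr<T> obj = std::move(free_list_.back());
        free_list_.pop_back();
        return obj;
    }

    // Return an object to the pool; past the soft limit, free it with
    // probability 1/reclaim_denominator_ instead of pooling it.
    void Release(std::unique_ptr<T> obj) {
        if (free_list_.size() >= soft_limit_ &&
            std::rand() % reclaim_denominator_ == 0) {
            return;  // unique_ptr frees the excess object here
        }
        free_list_.push_back(std::move(obj));
    }

    std::size_t free_count() const { return free_list_.size(); }

private:
    std::size_t soft_limit_;
    unsigned reclaim_denominator_;
    std::vector<std::unique_ptr<T>> free_list_;
};
```

A larger `reclaim_denominator_` reclaims excess objects more slowly, trading memory for smoother latency.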
Network I/O
Protobuf was already used, but further bandwidth reduction was achieved by redesigning request messages. Original repeated uint64 per ad was replaced with shared indexed arrays and finally with an inverted index structure, cutting network bandwidth by up to 50% and reducing request‑building latency by 14%.
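Conceptually, the inverted form sends each distinct feature value once, together with the ads that carry it, instead of repeating it per ad. The sketch below is one plausible way to build such a flattened structure; the struct layout and interpretation of the fields are our assumptions, not the exact production message.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical flattened inverted index: for the i-th key, lengths[i]
// ad positions are stored contiguously in `positions`.
struct InvertedIndex {
    std::vector<uint64_t> keys;       // distinct feature values
    std::vector<uint32_t> lengths;    // posting-list length per key
    std::vector<uint32_t> positions;  // concatenated ad indices
};

// Build the index from per-ad feature lists, deduplicating each feature
// value across ads.
InvertedIndex BuildInverted(const std::vector<std::vector<uint64_t>>& ads) {
    std::map<uint64_t, std::vector<uint32_t>> postings;
    for (std::size_t ad = 0; ad < ads.size(); ++ad)
        for (uint64_t feature : ads[ad])
            postings[feature].push_back(static_cast<uint32_t>(ad));

    InvertedIndex idx;
    for (const auto& [key, ad_list] : postings) {
        idx.keys.push_back(key);
        idx.lengths.push_back(static_cast<uint32_t>(ad_list.size()));
        idx.positions.insert(idx.positions.end(), ad_list.begin(), ad_list.end());
    }
    return idx;
}
```

When many ads share the same feature values, each value crosses the wire once rather than once per ad, which is where the bandwidth saving comes from.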
```protobuf
// Original message: each ad carries its own repeated value list.
message Input { repeated uint64 value = 1 [packed = true]; }
message Request { repeated Input inputs = 1; }

// Improved I: values flattened into shared index/value arrays.
message Request { repeated uint32 index = 1 [packed = true];
                  repeated uint64 value = 2 [packed = true]; }

// Improved II: adds a shared field for values common to every ad.
message Request { repeated uint32 index = 1 [packed = true];
                  repeated uint64 value = 2 [packed = true];
                  repeated uint64 shared = 3 [packed = true]; }

// Improved III (inverted index): each distinct key is sent once,
// together with the positions of the ads that reference it.
message Inverted { repeated uint64 key = 1 [packed = true];
                   repeated uint32 index = 2 [packed = true];
                   repeated uint32 value = 3 [packed = true];
                   repeated uint32 length = 4 [packed = true]; }
message Request { Inverted inverted = 1; }
```

Thoughts and Outlook
Continuous multi-round optimization has lowered CPU consumption by 21% and increased peak throughput by 13%, saving tens of thousands of CPU cores. The experience highlights the importance of systematic observation, problem localization, optimization, and testing, as well as cross-team collaboration.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.