Performance Optimization Practices in Bilibili's Risk Control Engine
To overcome storage, compute, and I/O bottlenecks in Bilibili's risk-control engine, the team combined prefetching with Redis caching, batched feature retrieval, asynchronous writes via Railgun, aggressive log compression, and a multi-level cache fronted by a Bloom filter. Together these changes cut core latency to sub-100 ms, reduced Redis load by over 90%, trimmed storage by ~38%, and now support million-level query throughput.
Performance optimization is a perpetual topic. As product features evolve and traffic grows, the risk control system at Bilibili encountered bottlenecks in storage, computation, and I/O. Simple horizontal scaling could not solve all problems and even increased costs. The team therefore carried out a series of optimizations—prefetch, batch, async, compression, and caching—to meet future traffic growth, reduce latency for time‑sensitive services, and lower IT costs.
Prefetch – Feature Pre‑computation
Prefetching works alongside caching: caching trades space for time, and prefetching goes a step further by loading data before it is requested, moving expensive work off the critical path. In the Gaia risk engine, the factor-calculation stage consumes more than 70% of request latency (≈250 ms). To meet a sub-100 ms requirement for e-commerce scenarios, a near-line engine was introduced that pre-reads downstream feature data from SLB real-time streams and stores it in Redis; subsequent requests read directly from Redis, avoiding the RPC calls. Feature cache hit rates exceed 90%, and latency for core e-commerce transactions dropped from 80 ms to 25 ms.
Figure 2: Request execution flow in the risk engine.
Figure 3: Feature acquisition process.
By pre‑reading SLB real‑time stream data and caching results in Redis, the system achieved >90% cache hit rate and reduced interface latency from 80 ms to 25 ms for high‑traffic e‑commerce scenarios.
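The prefetch-then-read flow can be sketched as follows. This is a minimal illustration, not Gaia's implementation: an in-memory dict stands in for Redis, `on_stream_event` stands in for the near-line SLB stream consumer, and `rpc_fetch` is a hypothetical downstream feature RPC.

```python
import time

class FeaturePrefetcher:
    """Near-line prefetch sketch: a stream consumer warms the cache
    before the risk request arrives, so the hot path avoids RPCs."""

    def __init__(self, rpc_fetch, ttl_seconds=60):
        self._rpc_fetch = rpc_fetch   # slow downstream RPC (fallback path)
        self._ttl = ttl_seconds
        self._cache = {}              # stand-in for Redis: key -> (features, expires_at)

    def on_stream_event(self, user_id):
        # Called by the near-line engine when a real-time stream event hints
        # this user is about to transact: precompute and cache the features.
        self._cache[user_id] = (self._rpc_fetch(user_id), time.time() + self._ttl)

    def get_features(self, user_id):
        # Hot path: serve from cache when fresh, fall back to RPC on a miss.
        hit = self._cache.get(user_id)
        if hit and hit[1] > time.time():
            return hit[0]
        features = self._rpc_fetch(user_id)
        self._cache[user_id] = (features, time.time() + self._ttl)
        return features
```

With a >90% hit rate, most `get_features` calls never reach the RPC, which is where the 80 ms → 25 ms reduction comes from.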
Batch – Feature Batch Retrieval
Batching reduces I/O by merging multiple reads/writes. In the Gaia engine, many blacklist checks (mid, buvid, ip, ua) caused read‑amplification on downstream services. The optimization merged multiple independent blacklist queries into a single batch request, combined with a local cache and single‑flight control, cutting downstream calls by 69%.
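The combination of batching, local caching, and single-flight deduplication can be sketched like this. It is an illustrative stand-in, not the Gaia code: `batch_rpc` is a hypothetical downstream call that accepts a list of keys, and the single-flight logic mirrors what Go's `singleflight` package provides.

```python
import threading

class BlacklistBatcher:
    """Sketch: checks across dimensions (mid, buvid, ip, ua) are merged
    into one downstream batch call; identical in-flight batches are
    deduplicated so concurrent requests share a single RPC."""

    def __init__(self, batch_rpc):
        self._batch_rpc = batch_rpc   # one RPC taking a list of keys
        self._local = {}              # local cache: key -> is_blacklisted
        self._lock = threading.Lock()
        self._inflight = {}           # batch key -> Event (single-flight)

    def check(self, keys):
        # Serve what the local cache already knows; batch the rest.
        missing = [k for k in keys if k not in self._local]
        if missing:
            self._fetch_once(tuple(sorted(missing)))
        return {k: self._local.get(k, False) for k in keys}

    def _fetch_once(self, batch_key):
        with self._lock:
            ev = self._inflight.get(batch_key)
            leader = ev is None
            if leader:
                ev = self._inflight[batch_key] = threading.Event()
        if leader:
            try:
                # One downstream call for the whole batch.
                self._local.update(self._batch_rpc(list(batch_key)))
            finally:
                ev.set()
                with self._lock:
                    del self._inflight[batch_key]
        else:
            ev.wait()   # another caller is already fetching this batch
```

Merging N independent lookups into one call is what removes the read amplification on the downstream blacklist service.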
Figure 6: Batch optimization flow for blacklist factor retrieval.
Async – Accumulated‑Factor Asynchronous Computation
Accumulated factors (e.g., count, count(distinct), sum, avg) are stored in Redis and involve multiple read/write operations per request, stressing the Redis cluster. By sacrificing strict consistency for crawler‑related traffic, the team made these writes asynchronous using Bilibili’s Railgun event platform. Writes are aggregated in memory and flushed in batches, reducing Redis QPS by >35% and lowering TP99 latency.
Figure 8: Before/after async optimization for accumulated factors.
Compression – Log Storage Optimization
Risk‑control logs (≈11 KB per request, up to tens of KB) are stored in Elasticsearch for metadata and in Bilibili’s Taishan KV for full details, compressed with gzip. To further reduce storage, experiments compared json, msgpack, protobuf encodings with gzip, xz, and zstd compression. Results showed that msgpack produced shorter encoded data but gzip‑compressed size was still larger than json. Zstd without dictionary offered modest gains; with a matching dictionary, zstd achieved significantly better compression ratios, especially for single‑log compression. Batch compression further improved overall storage efficiency, achieving up to 60% reduction.
Figure 13: Memory usage of different compression algorithms.
Based on the experiments, the team adopted batch gzip compression for log details, storing batches under a BatchId key in Taishan KV. Querying retrieves BatchIds, de‑duplicates, and fetches compressed batches, reducing KV write QPS from 8 k to ~1 k and storage by ~38%.
Figure 14: Batch storage and query process for log details.
Cache – Multi‑Level Cache + Bloom Filter
To handle the massive blacklist queries (over 30 million entries), the service originally used a Cache‑Aside pattern (local cache → Redis → MySQL). As QPS grew beyond 100 k, Redis CPU and memory became bottlenecks. Introducing a Bloom Filter (BF) as a front‑layer reduced cache‑penetration dramatically. The BF is sharded into 4×Redis slots (65 536 slices) to avoid hot‑key issues, with per‑slice keys like "{libKey_bf_12345}". A local BF cache further reduces Redis calls. The BF construction pipeline (init → async build → sync test → replace) runs via Railgun scheduled tasks, with backup and fast‑restore capabilities.
Figure 16: BF construction and recovery workflow.
Operational monitoring shows that ~95% of queries are negative (BF filters them out), and the overall system now supports million‑level QPS with dramatically reduced Redis CPU (from 42% to 4%) and memory usage (from 256 GB to 50 GB).
Summary
The article demonstrates a variety of performance‑optimization techniques applied to a large‑scale risk‑control system. Each technique brings trade‑offs; the key is to balance gains against business acceptability and maintainability, following the principle of “avoid premature or excessive optimization.”
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.