
Designing a High‑Performance Asynchronous Event System for Bilibili’s Like Service

Bilibili’s railgun platform turns the company’s high‑traffic like service into a scalable, fault‑tolerant system: writes move to an asynchronous, Kafka‑driven pipeline that applies CQRS, partitioned processing, idempotency, hot‑key isolation, and rate limiting behind a unified SDK, dramatically reducing database load and achieving ten‑fold throughput gains.

Bilibili Tech

Bilibili serves close to 100 million daily active users, resulting in extremely frequent interactions that put huge pressure on the backend. To achieve better architectural scalability, the company adopted a micro‑service + CQRS architecture. This article introduces the asynchronous event‑handling platform “railgun”, which has helped nearly 800 business applications build high‑performance, highly‑available async systems.

The discussion starts with a simple business example – a video like feature. The core requirements are counting likes and querying a user's like status.

Database schema for the like service:

CREATE TABLE `counts` (
  `id` bigint(20) UNSIGNED NOT NULL AUTO_INCREMENT COMMENT 'primary key ID',
  `video_id` bigint(20) UNSIGNED NOT NULL DEFAULT 0 COMMENT 'video ID',
  `count` bigint(20) NOT NULL DEFAULT 0 COMMENT 'like count',
  `mtime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'last modified time',
  `ctime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'creation time',
  PRIMARY KEY (`id`)
) COMMENT='count table';

CREATE TABLE `actions` (
  `id` bigint(20) UNSIGNED NOT NULL AUTO_INCREMENT COMMENT 'primary key ID',
  `video_id` bigint(20) UNSIGNED NOT NULL DEFAULT 0 COMMENT 'video ID',
  `user_id` bigint(20) UNSIGNED NOT NULL DEFAULT 0 COMMENT 'user ID',
  `state` tinyint(4) UNSIGNED NOT NULL DEFAULT 0 COMMENT 'like state',
  `mtime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'last modified time',
  `ctime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'creation time',
  PRIMARY KEY (`id`)
) COMMENT='action table';

The initial flow is straightforward: a stateless like service receives a request, updates the database, and returns the result. This works well under low traffic.

CPU resource problem

As traffic grows, the like service quickly hits CPU and memory limits. Because the service is stateless, horizontal scaling via Kubernetes (K8s) solves this part. However, the database becomes the bottleneck – its CPU is far more expensive to scale than stateless nodes. To alleviate pressure, a cache‑aside pattern is introduced: writes go to the DB first, then the cache is updated; reads preferentially hit the cache, falling back to the DB only on a miss.
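A minimal sketch of the cache‑aside pattern described above, with in‑memory maps standing in for Redis and MySQL; `LikeCount` and `AddLike` are hypothetical helper names, not part of the actual service:

```go
package main

import (
	"fmt"
	"sync"
)

// store is an in-memory stand-in for either the cache or the DB.
type store struct {
	mu   sync.Mutex
	data map[int64]int64
}

func newStore() *store { return &store{data: map[int64]int64{}} }

func (s *store) get(id int64) (int64, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.data[id]
	return v, ok
}

func (s *store) set(id, v int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[id] = v
}

// LikeCount reads with cache-aside semantics: try the cache first,
// fall back to the DB on a miss, then backfill the cache.
func LikeCount(cache, db *store, videoID int64) int64 {
	if v, ok := cache.get(videoID); ok {
		return v
	}
	v, _ := db.get(videoID) // a missing row reads as 0
	cache.set(videoID, v)
	return v
}

// AddLike writes DB-first, then refreshes the cache.
func AddLike(cache, db *store, videoID int64) {
	v, _ := db.get(videoID)
	db.set(videoID, v+1)
	cache.set(videoID, v+1)
}

func main() {
	cache, db := newStore(), newStore()
	AddLike(cache, db, 42)
	fmt.Println(LikeCount(cache, db, 42)) // served from cache: 1
}
```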

Connection‑count problem

Heavy write traffic exhausts the DB connection pool. The solution is to add a DB‑access proxy layer so that not every service instance opens a direct DB connection.

Database lock contention

When many users like the same video simultaneously, row‑level locks cause contention and timeouts. By separating commands (writes) from queries (reads) using CQRS, the write path is moved to an asynchronous pipeline:

producer, err := railgun.NewProducer("id")
// Send Bytes
err = producer.SendBytes(ctx, "test", []byte("test"))
// Send String
err = producer.SendString(ctx, "test", "test")
// Send JSON
err = producer.SendJSON(ctx, "test", struct{ Demo string }{Demo: "test"})
// Batch send
err = producer.SendBatch(ctx, []*message.Message{{Key: "test", Value: []byte("test")}})

A Job service consumes the Kafka messages, writes to MySQL and updates the cache. By using the video ID as the Kafka key and configuring the producer’s partitioning strategy for key‑ordering, only one consumer processes a given video’s likes, eliminating lock contention.
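The ordering guarantee rests on a deterministic key‑to‑partition mapping: every message with the same video ID lands on the same partition. A sketch of the idea, using FNV hashing as an illustrative choice (the partitioner Kafka actually applies may differ):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// PartitionFor maps a message key to a partition index. Because the
// mapping is deterministic, all likes for one video land on one
// partition, a single consumer serializes them, and row-lock
// contention on that video's count row disappears.
func PartitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % numPartitions
}

func main() {
	// The same video ID always maps to the same partition.
	fmt.Println(PartitionFor("video:1001", 8) == PartitionFor("video:1001", 8)) // true
}
```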

Duplicate consumption issue

Kafka rebalance can cause the same message to be delivered twice. Idempotency is ensured by checking the “actions” table before inserting a new like. If a dedicated behavior table is absent, an external cache can be used to record processed message IDs.
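A minimal sketch of that idempotency check, with an in‑memory set standing in for the actions‑table lookup (or the external cache of processed message IDs):

```go
package main

import "fmt"

// Dedup records processed message IDs so a redelivered message
// becomes a no-op instead of a double-counted like.
type Dedup struct{ seen map[string]bool }

func NewDedup() *Dedup { return &Dedup{seen: map[string]bool{}} }

// Process applies the handler only the first time a message ID is
// seen, returning whether the handler ran.
func (d *Dedup) Process(msgID string, apply func()) bool {
	if d.seen[msgID] {
		return false // duplicate delivery after a rebalance
	}
	d.seen[msgID] = true
	apply()
	return true
}

func main() {
	count := 0
	d := NewDedup()
	d.Process("like:42:7", func() { count++ })
	d.Process("like:42:7", func() { count++ }) // redelivered, skipped
	fmt.Println(count)                         // 1
}
```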

Insufficient consumption capacity

Adding more Job instances did not increase throughput, because the consumer count already matched the Kafka partition count and extra consumers sat idle. After increasing the partition count, consumption speed improved, but this cannot scale indefinitely: every additional partition adds broker‑side overhead.

Multi‑threaded processing

To boost per‑node throughput, a thread pool with multiple in‑memory queues is introduced. Each queue is bound to a dedicated worker thread, preserving order by routing messages with the same key to the same queue. However, if a node restarts, messages that were processed but not yet ACKed can be lost.

The solution is to maintain an ordered linked list of events, mark each as processed, and only ACK the Kafka offset after all preceding events have been successfully handled.

Data aggregation

Instead of updating the count table for every like, likes for the same video are aggregated in memory and flushed in batches, dramatically reducing DB write volume.
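A sketch of the aggregation step: per‑video deltas accumulate in memory and are handed to the writer as one batched update per video. The `Aggregator` type is an illustrative name:

```go
package main

import "fmt"

// Aggregator buffers like deltas per video so N like events collapse
// into one count-table write per video per flush interval.
type Aggregator struct {
	deltas map[int64]int64
}

func NewAggregator() *Aggregator { return &Aggregator{deltas: map[int64]int64{}} }

func (a *Aggregator) Add(videoID, delta int64) { a.deltas[videoID] += delta }

// Flush hands the merged deltas to a writer (e.g. a batched SQL
// UPDATE), resets the buffer, and reports how many writes were issued.
func (a *Aggregator) Flush(write func(videoID, delta int64)) int {
	n := len(a.deltas)
	for id, d := range a.deltas {
		write(id, d)
	}
	a.deltas = map[int64]int64{}
	return n
}

func main() {
	a := NewAggregator()
	for i := 0; i < 1000; i++ {
		a.Add(42, 1) // 1000 likes on one hot video
	}
	writes := a.Flush(func(id, d int64) {
		fmt.Printf("UPDATE counts SET count = count + %d WHERE video_id = %d\n", d, id)
	})
	fmt.Println(writes) // 1000 events became 1 DB write
}
```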

Reducing ACK overhead

Three optimizations are applied: asynchronous ACK, in‑memory offset tracking, and periodic batch ACKs. These changes increase processing speed by more than tenfold.
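In‑memory offset tracking plus periodic batch ACKs might look like this sketch; the commit interval of 100 is an illustrative number (a production version would also commit on a timer):

```go
package main

import "fmt"

// BatchAcker keeps the latest safe offset in memory and only issues a
// real commit every `every` acknowledgements, amortizing ACK cost.
type BatchAcker struct {
	pending int
	every   int
	latest  int64
	commit  func(offset int64)
}

func NewBatchAcker(every int, commit func(int64)) *BatchAcker {
	return &BatchAcker{every: every, commit: commit}
}

// Ack records the offset; a commit is only sent once enough
// acknowledgements have accumulated.
func (a *BatchAcker) Ack(offset int64) {
	a.latest = offset
	a.pending++
	if a.pending >= a.every {
		a.commit(a.latest)
		a.pending = 0
	}
}

func main() {
	commits := 0
	a := NewBatchAcker(100, func(off int64) { commits++ })
	for i := int64(0); i < 1000; i++ {
		a.Ack(i)
	}
	fmt.Println(commits) // 1000 acks collapsed into 10 commits
}
```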

Flow control

When the backend DB cannot keep up with the high write rate, a hybrid rate‑limiting scheme is used: a local token‑bucket limits each node’s consumption speed, while a central controller adjusts the global token generation rate via Redis, providing high‑availability global throttling.

Hotspot event problem

Heavy write traffic on a single hot video creates data skew. The system detects hot keys and routes them to dedicated hotspot queues processed by separate threads, isolating the hot workload from the rest.
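Detection can be as simple as counting per‑key traffic and diverting keys past a threshold to the hotspot path. This is a simplified sketch; a production detector would decay counts over a sliding window:

```go
package main

import "fmt"

// HotKeyRouter counts per-key traffic and, past a threshold, routes a
// key to a dedicated hotspot queue instead of the shared worker pool.
type HotKeyRouter struct {
	counts    map[string]int
	threshold int
}

func NewHotKeyRouter(threshold int) *HotKeyRouter {
	return &HotKeyRouter{counts: map[string]int{}, threshold: threshold}
}

// Route returns "hot" for keys whose traffic exceeds the threshold,
// "normal" otherwise.
func (r *HotKeyRouter) Route(key string) string {
	r.counts[key]++
	if r.counts[key] > r.threshold {
		return "hot"
	}
	return "normal"
}

func main() {
	r := NewHotKeyRouter(100)
	var last string
	for i := 0; i < 150; i++ {
		last = r.Route("video:viral") // a sudden burst on one video
	}
	fmt.Println(last, r.Route("video:ordinary")) // hot normal
}
```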

Error retry

Failures (≈0.01%) are handled with two strategies: in‑place exponential‑backoff retries and message re‑queueing to a delayed retry queue. This reduces the failure rate to near zero.

MQ failure fallback

If Kafka becomes unavailable, producers automatically switch to a degradation mode, pushing messages directly into the consumer’s in‑memory queue, ensuring continuity without a full‑blown secondary MQ.

Unified message programming model

The lessons above are abstracted into a platform that provides a unified SDK for various message systems (Kafka, Pulsar, internal Databus). Example consumer creation:

// Create processor (single or batch)
processor := single.New(unpack, do) // or batch.New(unpack, preDo, batchDo)
// Start consumer
consumer, err := railgun.NewConsumer("id", processor)

The control plane offers features such as dynamic processing mode selection, idempotency via external storage, multi‑source switching, global consumption rate management, alerting, full‑stack load testing, observability dashboards, and automated diagnostics for common issues (partition imbalance, infinite retries, consumption bottlenecks, downgrade handling, etc.). Future enhancements include finer‑grained hot‑key detection and runtime‑managed function templates.

Conclusion

By defining a unified message programming model and a flexible control plane, the railgun platform solves many common problems of high‑performance, highly‑reliable asynchronous systems.

Tags: Backend, microservices, scalability, Asynchronous, Kafka, CQRS, event-driven
Written by Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.