
How Bilibili Engineered a Scalable Live‑Commerce Platform from Zero to One

This article details Bilibili's step‑by‑step transformation of a fragmented, tightly coupled live‑commerce system into a modular, platform‑centric architecture, covering product middle‑platform construction, unified standards, storage migration, monitoring with Prometheus/Grafana, and performance gains such as a three‑fold query speedup and a cut in channel‑integration effort from 46 to 5 person‑days.


Background

Live‑commerce on Bilibili grew explosively after 2022, covering video, article and live‑stream formats. The original implementation was a monolithic set of tightly coupled services without a dedicated platform, making rapid business‑driven changes impossible.

Problem Identification

Root causes were:

Unclear business domain boundaries leading to ad‑hoc, one‑off feature development.

Heavy coupling between core services, creating chaotic dependencies.

Absence of a holistic solution design.

Four dimensions were defined for a balanced solution: foundational infrastructure, platform capabilities, unified standards, and cost efficiency.

Platform Architecture Overview

The new architecture is layered around five core questions:

Where does the product come from? → Product middle‑platform.

How is the product delivered? → Platform capabilities.

How is the product managed? → Operations platform.

How is the product exposed? → Joint product/advertising engine.

How are people, product and scene linked? → Standardized commerce chain.

The high‑level architecture separates the product middle‑platform, platform capabilities, interaction model and governance layers.

Product Middle‑Platform

Early product data was accessed via custom APIs for each external e‑commerce channel (Taobao, JD) and internal services, requiring code changes for every new product type. The middle‑platform abstracts the product domain, supports heterogeneous channels, and decouples product supply from business logic.
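
To make that abstraction concrete, here is a minimal sketch of a channel‑agnostic supply contract. All names are illustrative assumptions, not Bilibili's actual interfaces; the point is that business code depends on one shape regardless of whether the supply is Taobao, JD or an internal service.

```java
import java.util.List;
import java.util.Optional;

// Illustrative contract: business code depends on this one shape regardless
// of which channel (Taobao, JD, internal) actually supplies the product.
public interface ProductSupplyChannel {

    // Channel identifier, e.g. "taobao", "jd", "internal".
    String channelName();

    // Look up one product, already normalized into the unified shape.
    Optional<Product> findBySku(long skuId);

    // Channel-agnostic search used by business features.
    List<Product> search(String keyword, int limit);

    // Hypothetical unified shape; the real model is the four-table design below.
    record Product(String channel, long skuId, String title, long priceCents) {}
}
```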

Key technical solutions:

Factory‑configurator pattern for rapid channel onboarding (e.g., JD, Taobao).

Unified product model consisting of core, extension, application and shelf tables.

Migration from Elasticsearch (ES) to MySQL for strong consistency, then building a binlog‑driven ES index for search (sketched below).

Outcome: zero‑downtime data migration, read‑write separation, transactional control, and a 3× query‑performance improvement.
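
The binlog‑driven index can be sketched as follows. The BinlogEvent and EsIndexer types are hypothetical stand‑ins for a Canal‑style consumer and an Elasticsearch client; the design keeps MySQL as the transactional source of truth while ES is rebuilt asynchronously for search only.

```java
import java.util.Map;

// Hypothetical binlog event shape (a Canal-style consumer would supply these).
record BinlogEvent(String table, String op, Map<String, Object> row) {}

// Hypothetical thin wrapper around an Elasticsearch client.
interface EsIndexer {
    void upsert(String index, String id, Map<String, Object> doc);
    void delete(String index, String id);
}

final class ProductIndexSyncer {
    private final EsIndexer es;

    ProductIndexSyncer(EsIndexer es) { this.es = es; }

    // MySQL writes commit first; this consumer replays the binlog into ES,
    // so search lags slightly but never diverges permanently.
    void onEvent(BinlogEvent e) {
        if (!"product_core".equals(e.table())) return; // core table drives the search doc
        String id = String.valueOf(e.row().get("id"));
        switch (e.op()) {
            case "INSERT", "UPDATE" -> es.upsert("product", id, e.row());
            case "DELETE"           -> es.delete("product", id);
            default -> { /* ignore DDL and other event types */ }
        }
    }
}
```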

Channel Integration Solution

Previously each channel required a custom integration, leading to high maintenance cost and data inconsistency. The new design treats each channel as a “supply endpoint” and uses a three‑step pipeline (sketched below):

Factory creates a connection based on channel configuration.

Validator checks data integrity.

Converter standardizes the payload before persisting.
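
A minimal sketch of that three‑step pipeline, with hypothetical contracts for each step (the article names the steps but not their code shape):

```java
import java.util.List;

// Hypothetical shapes; only the three-step flow itself comes from the article.
record ChannelConfig(String channel, String endpoint, String appKey) {}
record RawProduct(String channel, String payload) {}
record StandardProduct(String channel, long skuId, String title) {}

interface ChannelFactory { ChannelConnection create(ChannelConfig cfg); }
interface ChannelConnection { List<RawProduct> fetch(); }
interface Validator { boolean isValid(RawProduct p); }
interface Converter { StandardProduct toStandard(RawProduct p); }

final class SupplyPipeline {
    private final ChannelFactory factory;
    private final Validator validator;
    private final Converter converter;

    SupplyPipeline(ChannelFactory f, Validator v, Converter c) {
        this.factory = f; this.validator = v; this.converter = c;
    }

    List<StandardProduct> ingest(ChannelConfig cfg) {
        ChannelConnection conn = factory.create(cfg); // step 1: factory builds the connection
        return conn.fetch().stream()
                .filter(validator::isValid)           // step 2: validator checks integrity
                .map(converter::toStandard)           // step 3: converter standardizes
                .toList();                            // persisted afterwards
    }
}
```

Onboarding a new channel then means writing a configuration plus, at most, a channel‑specific converter, rather than a bespoke integration.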

Platform Capabilities

Core commerce functions (natural recommendation, traffic‑driven promotion, live‑stream commerce) were originally monolithic and interdependent. Modularizing the commerce chain (person‑product‑scene) and establishing a unified client / service / common / starter layering eliminated the cross‑service chaos.
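
One way such a layering is commonly enforced in Java projects is sketched below; the module names and interfaces are assumptions for illustration, not Bilibili's repository layout.

```java
// Illustrative four-module convention (names are assumptions):
//   commerce-client  : thin interfaces + DTOs that other teams may depend on
//   commerce-common  : shared models/utilities, no business logic
//   commerce-service : implementations; depends on client and common only
//   commerce-starter : Spring Boot style auto-configuration that wires the rest

// In commerce-client: callers never see entities, only DTOs.
public interface PromotionClient {
    PromotionView byProduct(long productId);
    record PromotionView(long productId, String campaign) {}
}

// In commerce-service: the only layer allowed to hold business logic.
class PromotionService implements PromotionClient {
    @Override
    public PromotionView byProduct(long productId) {
        // Real lookup elided; downstream services bind to PromotionClient alone.
        return new PromotionView(productId, "default-campaign");
    }
}
```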

Interaction Model

Two approaches were evaluated for real‑time commerce attributes: a high‑throughput API versus a DB‑backed interaction table. The chosen design uses a DB table that decouples attribute updates from the main engine, enabling low‑latency subscription for downstream services.
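
A minimal sketch of the chosen design, assuming a hypothetical MySQL table product_interaction with a unique key on (product_id, attr_key): writers upsert attribute changes, and downstream services subscribe by polling (or tailing the table's binlog) instead of calling the engine directly.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

// Hypothetical table: product_interaction(product_id, attr_key, attr_value, updated_at)
// with a unique key on (product_id, attr_key).
public class InteractionTableDao {
    private final Connection conn;

    public InteractionTableDao(Connection conn) { this.conn = conn; }

    // Writers record attribute changes here instead of calling the engine directly.
    public void upsertAttribute(long productId, String key, String value) throws Exception {
        String sql = "INSERT INTO product_interaction (product_id, attr_key, attr_value, updated_at) "
                   + "VALUES (?, ?, ?, NOW()) "
                   + "ON DUPLICATE KEY UPDATE attr_value = VALUES(attr_value), updated_at = NOW()";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, productId);
            ps.setString(2, key);
            ps.setString(3, value);
            ps.executeUpdate();
        }
    }

    // Downstream services subscribe by polling for rows newer than their cursor.
    public ResultSet changedSince(Timestamp since) throws Exception {
        PreparedStatement ps = conn.prepareStatement(
                "SELECT product_id, attr_key, attr_value FROM product_interaction WHERE updated_at > ?");
        ps.setTimestamp(1, since);
        return ps.executeQuery();
    }
}
```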

Platform Governance & Monitoring

Stability issues included noisy alerts (≈95% useless), frequent client complaints and high‑risk deployments. A systematic monitoring stack based on Prometheus and Grafana was introduced, defining unified metrics (RT, QPS, error rate) for critical modules.

A custom SDK injected via AOP generates uniform logs. To avoid loss of high‑volume logs, storage was migrated from ES to ClickHouse, reducing cost and guaranteeing completeness.
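
A hedged sketch of how such an AOP‑injected metrics SDK is typically built, here using Spring AOP with Micrometer; the annotation and metric names are assumptions, not Bilibili's actual SDK.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

// Hypothetical marker for methods the SDK should measure.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface CommerceMetered {}

@Aspect
class CommerceMetricsAspect {
    private final MeterRegistry registry;

    CommerceMetricsAspect(MeterRegistry registry) { this.registry = registry; }

    // One timer per method+outcome: its count yields QPS, its mean yields RT,
    // and the error-tagged series divided by the total gives the error rate.
    @Around("@annotation(CommerceMetered)")
    public Object record(ProceedingJoinPoint pjp) throws Throwable {
        Timer.Sample sample = Timer.start(registry);
        String outcome = "success";
        try {
            return pjp.proceed();
        } catch (Throwable t) {
            outcome = "error";
            throw t;
        } finally {
            sample.stop(Timer.builder("commerce.request")
                    .tag("method", pjp.getSignature().toShortString())
                    .tag("outcome", outcome)
                    .register(registry));
        }
    }
}
```

Prometheus scrapes the registry's endpoint, and Grafana dashboards plot the same three unified metrics for every critical module.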

Service Stability Improvements

After migration:

Slow‑query incidents dropped from several daily cases to near zero.

Average query latency improved 3×.

Cache memory pressure fell from ~95% to ~30% by removing large keys, adding expirations and moving to a dedicated cache cluster (sketched below).
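
Those cache fixes translate into code roughly as follows; the host, key scheme and TTL are illustrative assumptions, shown with the Jedis client.

```java
import redis.clients.jedis.JedisPooled;

// Illustrative cache hygiene: every entry expires, and the former single
// big key is sharded into many small keys on a dedicated cluster.
public class ShelfCache {
    private static final long TTL_SECONDS = 30 * 60; // explicit expiration on every write
    private static final int SHARDS = 64;            // split the old big key

    private final JedisPooled redis = new JedisPooled("commerce-cache", 6379); // dedicated cluster (hypothetical host)

    public void put(long productId, String json) {
        String key = "shelf:" + (productId % SHARDS) + ":" + productId;
        redis.setex(key, TTL_SECONDS, json);
    }

    public String get(long productId) {
        return redis.get("shelf:" + (productId % SHARDS) + ":" + productId);
    }
}
```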

Decision Balancing

Business growth (GMV and traffic) forced parallel tracks: feature delivery continued while technical debt, monitoring, product middle‑platform and governance were tackled simultaneously. Results:

Channel‑integration effort reduced from 46 person‑days to 5 person‑days.

Daily incident cases fell from >3 to zero.

Future Evolution

Planned next steps aim to scale the platform from millions to tens of millions of SKUs and to deepen data‑center capabilities.

Strengthen the business gateway and data center for upcoming text‑and‑image commerce scenarios.

Scale the product middle‑platform to support ten‑million‑level SKU volumes.

Evolve domain‑driven models and close the attribution loop.

Continue high‑availability and performance engineering to match industry peers.

Tags: Monitoring, Microservices, Scalability, Platform Architecture, Bilibili, Live Commerce
Written by Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
