Design and Implementation of Bilibili Object Storage Service (BOSS): Architecture, Topology, Metadata, Erasure Coding, and Scaling
The article chronicles a 13‑day walkthrough of building BOSS, Bilibili's custom object storage service: replacing MySQL‑based routing and ID generation with replicated stores (etcd or a Raft‑based KV), modeling metadata with protobuf, adopting erasure coding on top of a Bitcask‑style engine, and implementing safe deletion, replica repair, and horizontal scaling for a resilient large‑scale system.
Background
BLOB (binary large object) storage, also known as Object Storage Service (OSS), is typically used for storing files such as videos and audio. Major cloud providers offer OSS, with Amazon S3 being the de‑facto standard. Bilibili (B‑Station) has extensive OSS needs and has built its own system called BOSS (Bilibili Object Storage Service).
Day7 – Topology Single‑Point Issue
Problem
Routing information is stored in MySQL, which makes it a single point of failure for the backend storage.
Two solutions are proposed:
Store routing data in etcd with three replicas for safety, add an in‑memory cache layer, and introduce a management node for topology changes.
Simplify the above by using braft to implement a three‑replica KV store; clients synchronize metadata from the Metaserver replicas round‑robin and use the Raft log index as a version number to reject stale data.
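The version check in the second option can be sketched as follows. This is a hypothetical illustration (class and field names are assumptions, not BOSS's actual code): clients poll Metaserver replicas round‑robin and accept a topology snapshot only when its Raft‑log‑index version is newer, so a lagging replica can never roll the cache back to stale data.

```python
from itertools import cycle

class TopologyCache:
    def __init__(self, metaservers):
        self._servers = cycle(metaservers)   # round-robin over metaserver replicas
        self.version = -1                    # raft log index of the cached snapshot
        self.topology = None

    def refresh(self, fetch):
        """fetch(server) -> (version, topology); stale responses are ignored."""
        server = next(self._servers)
        version, topology = fetch(server)
        if version > self.version:           # newer raft log index wins
            self.version = version
            self.topology = topology
        return self.topology
```

A response from a replica that has not yet applied the latest Raft entries simply fails the version comparison and leaves the cache untouched.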
Metadata
Metadata stored in the cluster includes table‑related information (shard, replica) and storage‑node information (IP, disk, zone, resource pool).
Table‑related metadata
Basic table attributes (creation time, shard count, replica count per shard).
Shard attributes (replica list, partition index).
Replica attributes (storage node details and specific disk).
These are described using protobuf files.
Storage‑node metadata
Resource pool basic info.
Number of zones per pool.
Nodes belonging to each zone.
Disks belonging to each node.
Protobuf files also describe the backend cluster distribution.
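A minimal protobuf sketch of the two metadata families described above might look like this. All message and field names here are illustrative assumptions, not BOSS's actual schema:

```protobuf
syntax = "proto3";

// Table-related metadata: table -> shards -> replicas.
message Replica {
  string node_addr = 1;          // storage node address
  string disk = 2;               // specific disk on that node
}

message Shard {
  uint32 index = 1;              // partition index
  repeated Replica replicas = 2; // replica list
}

message Table {
  string name = 1;
  int64 create_time = 2;
  uint32 shard_count = 3;
  uint32 replica_count = 4;      // replicas per shard
  repeated Shard shards = 5;
}

// Storage-node metadata: pool -> zones -> nodes -> disks.
message Node {
  string addr = 1;
  repeated string disks = 2;
}

message Zone {
  string name = 1;
  repeated Node nodes = 2;
}

message ResourcePool {
  string name = 1;
  repeated Zone zones = 2;
}
```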
Mapping between tables and storage nodes
Logical mapping: (table_name, key) → shard → [replica0, replica1, replica2] → [addr0, addr1, addr2].
Physical mapping: resource pools → zones → DataNodes → disks, with replicas pointing to specific storage units.
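The logical mapping can be sketched as a simple lookup. The hash function and the topology layout below are assumptions for illustration; the real system resolves shards from the protobuf metadata:

```python
import zlib

def locate(topology, table_name, key):
    """Resolve (table_name, key) -> shard -> replica addresses."""
    table = topology[table_name]
    shard_idx = zlib.crc32(key.encode()) % table["shard_count"]
    replicas = table["shards"][shard_idx]      # [replica0, replica1, replica2]
    return [r["addr"] for r in replicas]
```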
Day8 – Object ID Single‑Point Issue
The current object_id generation uses MySQL auto‑increment, which is a single point. Two alternatives are discussed:
Persist progress in MySQL and pre‑allocate IDs via RPC.
Replace MySQL with a Raft‑based KV store that provides an INCR operation, eliminating the MySQL single point.
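The second option can be sketched as a batch allocator. This is a hypothetical illustration (names and batch size are assumptions): the gateway issues one atomic INCR(step) to the Raft‑backed KV store per batch, then hands out ids locally until the batch is exhausted, so most allocations need no RPC at all.

```python
class IdAllocator:
    def __init__(self, kv_incr, batch=1000):
        self._incr = kv_incr      # atomic INCR provided by the raft KV store
        self._batch = batch
        self._next = 0
        self._limit = 0           # exclusive upper bound of the current batch

    def next_id(self):
        if self._next >= self._limit:          # batch exhausted: fetch another
            self._limit = self._incr(self._batch)
            self._next = self._limit - self._batch
        oid = self._next
        self._next += 1
        return oid
```

If the allocator crashes, the ids remaining in its batch are simply skipped; uniqueness is preserved because the KV counter only moves forward.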
Day10 – Reducing Replication with Erasure Coding (EC)
EC Basics
Split original data into k equal‑size data blocks.
Generate m coding blocks via matrix calculations.
All blocks have the same length.
Any k out of (k+m) blocks can reconstruct the original data.
Example: split the data into six blocks (a, b, c, d, e, f) with k=6, m=3, producing three coding blocks (g, h, i). Any 6 of the 9 blocks can recover the missing ones.
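A stripped‑down illustration of the reconstruction property, using a single XOR parity block (i.e. m=1): any one missing block can be rebuilt from the remaining k. Real erasure codes such as Reed‑Solomon generalize this to m coding blocks via Galois‑field matrix arithmetic; this sketch only demonstrates the principle.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    acc = bytes(len(blocks[0]))   # all zeros
    for b in blocks:
        acc = bytes(x ^ y for x, y in zip(acc, b))
    return acc

def encode(data_blocks):
    """k data blocks -> k data blocks + 1 parity block (m=1)."""
    return data_blocks + [xor_blocks(data_blocks)]

def reconstruct(blocks, lost_idx):
    """Recover the block at lost_idx by XOR-ing all survivors."""
    survivors = [b for i, b in enumerate(blocks) if i != lost_idx]
    return xor_blocks(survivors)
```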
Storage Overhead
Replication factor = (k+m)/k. With k=6, m=3 the factor is (6+3)/6 = 1.5, versus 3.0 for 3‑replica mode.
Write Process
In EC mode, data is encoded first, then the k+m blocks are sent to k+m distinct replicas. The write is acknowledged once at least k blocks are persisted.
Read Process
IO service computes the shard for the key.
It issues reads to the first k replicas (as listed in sn_in_shard) and returns as soon as all k respond successfully.
Backup timers are set for the remaining m replicas; they are cancelled if the first batch succeeds.
When any k blocks are available, matrix decoding reconstructs the original data.
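The read quorum above can be sketched as follows. This is a simplified, synchronous illustration (names are assumptions): it tries the first k replicas and falls back to the remaining m on failure, stopping as soon as k blocks are in hand. The real service overlaps the backups with timers rather than trying them sequentially, but the "any k of k+m" logic is the same.

```python
def read_k_blocks(replicas, k, fetch):
    """fetch(replica) -> block bytes, or None on failure.

    Collect any k blocks from the k+m replicas; decoding then
    reconstructs the original data.
    """
    got = {}
    for r in replicas:            # first k replicas, then the m backups
        block = fetch(r)
        if block is not None:
            got[r] = block
        if len(got) == k:         # any k blocks suffice
            return got
    raise IOError("fewer than k blocks available; data unrecoverable")
```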
Impact on Repair
Repair now needs to read k blocks and recompute EC, instead of a single replica.
Choosing k and m
Large k reduces the number of zones needed but increases IOPS and latency; m controls fault tolerance. Practical choices balance zone count, I/O characteristics, and desired reliability.
Day11 – Storage Engine Choices
Common engines evaluated for the backend:
Bitcask (hash‑based, append‑only).
B/B+ tree (ordered files with multi‑level indexes).
LSM tree (multiple sorted runs, compaction).
Hybrid/optimized variants.
B/B+ Tree
Constructs a global ordered dataset; uses multi‑level indexes to reduce disk seeks. Works well for sequential key access but suffers random write overhead.
LSM Tree
Writes are appended to in‑memory structures (skip list) and later flushed as sorted files (L0, L1, …). Reads may need to search across several levels unless compaction has merged them.
Bitcask
Maintains an in‑memory hash mapping keys to (file, offset, length). Writes are append‑only; reads require at most a single disk I/O. Space reclamation is costly: a merge pass must rewrite live entries into new files to drop deleted and superseded ones.
Given the random‑read nature of the object storage workload, a Bitcask‑style engine is chosen for its low read latency and simple implementation.
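The chosen design can be sketched in a few lines. This is a minimal, hypothetical illustration, not BOSS's engine: an append‑only log plus an in‑memory hash of key → (offset, length), so every read costs one lookup and one read. It omits crash recovery, hint files, and the merge process.

```python
import io

class Bitcask:
    def __init__(self, log=None):
        self.log = log or io.BytesIO()   # stand-in for an append-only data file
        self.index = {}                  # key -> (offset, length)

    def put(self, key, value):
        self.log.seek(0, io.SEEK_END)
        offset = self.log.tell()
        self.log.write(value)            # append-only write
        self.index[key] = (offset, len(value))

    def get(self, key):
        offset, length = self.index[key] # in-memory lookup, then one read
        self.log.seek(offset)
        return self.log.read(length)
```

Note that an overwrite appends a new record and repoints the index; the old record becomes garbage until a merge reclaims it, which is exactly the reclamation cost mentioned above.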
Day12 – Delete Functionality
Implementation steps:
Gateway receives Del request, obtains object_id and block ids from metadata service.
Delete entry from name table.
Remove object_id from object table.
Issue block deletions to IO service.
Problem
Repair module may resurrect a just‑deleted key.
Fix
Send a "mark‑delete" request instead of immediate delete.
Repair logic respects the mark‑delete flag and sets all replicas accordingly.
Actual deletion occurs only when all replicas are either absent or marked‑deleted.
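The fix above amounts to a small state rule per replica, sketched here with illustrative names: physical deletion is allowed only when every replica is absent or mark‑deleted, and repair propagates the mark‑delete flag instead of re‑copying data, so a just‑deleted key cannot be resurrected.

```python
ABSENT, PRESENT, MARK_DELETED = "absent", "present", "mark_deleted"

def can_physically_delete(replica_states):
    """Physical delete only when no replica still holds live data."""
    return all(s in (ABSENT, MARK_DELETED) for s in replica_states)

def repair_action(replica_states):
    """Repair respects mark-delete: spread the flag, never resurrect."""
    if MARK_DELETED in replica_states:
        return "propagate_mark_delete"
    if ABSENT in replica_states and PRESENT in replica_states:
        return "copy_from_healthy"
    return "noop"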
Day13 – Horizontal Scaling and Replica Repair
Replica Repair
When a disk fails, its replicas are lost. The system creates empty replicas on healthy disks (considering zones) and the repair module copies missing data.
Horizontal Scaling
System starts at 3 replicas.
During scaling, a new replica is added (4‑replica state).
Data copy tasks run between source and target replicas.
Source replica is removed, returning to 3 replicas.
Exception Handling
If copy fails, continue writing to original replicas.
Ensure all data is migrated before gracefully retiring the old replica.
API layer must detect topology changes (e.g., via shard distribution info in responses) and update routing.
In EC mode, source and target replicas must correspond one‑to‑one because block contents differ.
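The 3 → 4 → 3 replica transition for one shard can be sketched as follows (names are assumptions; error handling and the EC one‑to‑one pairing constraint are omitted):

```python
def scale_out(shard, source, target, copy):
    """Migrate one replica of a shard from source to target."""
    shard.append(target)       # 4-replica state during migration
    assert len(shard) == 4
    copy(source, target)       # all data must land before retiring the source
    shard.remove(source)       # back to 3 replicas
    return shard
```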
Summary
Over 13 days, starting from a simple MySQL table, the article walks through the complete design and implementation of BOSS, covering topology, metadata modeling, object‑id generation, erasure coding, storage engine selection, delete handling, replica repair, and horizontal scaling. The discussion provides practical insights for building large‑scale distributed object storage systems.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.