Design and Implementation of Bilibili Object Storage Service (BOSS): Architecture, Topology, Metadata, Erasure Coding, and Scaling
The article chronicles a 13‑day walkthrough of building BOSS, Bilibili's custom object storage service: replacing MySQL‑based routing and ID generation with replicated stores (etcd or a Raft‑based KV), modeling metadata with protobuf, adopting erasure coding on top of a Bitcask‑style engine, and implementing safe deletion, replica repair, and horizontal scaling for a resilient large‑scale system.
Background
BLOB (binary large object) storage, also known as Object Storage Service (OSS), is typically used for storing files such as videos and audio. Major cloud providers offer OSS, with Amazon S3 being the de‑facto standard. Bilibili (B‑Station) has extensive OSS needs and has built its own system called BOSS (Bilibili Object Storage Service).
Day7 – Topology Single‑Point Issue
Problem
Routing information is stored in MySQL, which makes it a single point of failure for the backend storage.
Two solutions are proposed:
Store routing data in etcd with three replicas for safety, add an in‑memory cache layer, and introduce a management node for topology changes.
Simplify the above by using braft to implement a three‑replica KV store; clients synchronize metadata from the Metaserver replicas round‑robin and use the Raft log index as a version number to reject stale data.
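The version check in the second option can be sketched as follows. This is a hypothetical illustration (class and field names are assumptions, not BOSS's actual code): clients poll Metaserver replicas round‑robin and accept a topology snapshot only when its Raft‑log‑index version is newer, so a lagging replica can never roll the cache back to stale data.

```python
from itertools import cycle

class TopologyCache:
    def __init__(self, metaservers):
        self._servers = cycle(metaservers)   # round-robin over metaserver replicas
        self.version = -1                    # raft log index of the cached snapshot
        self.topology = None

    def refresh(self, fetch):
        """fetch(server) -> (version, topology); stale responses are ignored."""
        server = next(self._servers)
        version, topology = fetch(server)
        if version > self.version:           # newer raft log index wins
            self.version = version
            self.topology = topology
        return self.topology
```

A response from a replica that has not yet applied the latest Raft entries simply fails the version comparison and leaves the cache untouched.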
Metadata
Metadata stored in the cluster includes table‑related information (shard, replica) and storage‑node information (IP, disk, zone, resource pool).
Table‑related metadata
Basic table attributes (creation time, shard count, replica count per shard).
Shard attributes (replica list, partition index).
Replica attributes (storage node details and specific disk).
These are described using protobuf files.
Storage‑node metadata
Resource pool basic info.
Number of zones per pool.
Nodes belonging to each zone.
Disks belonging to each node.
Protobuf files also describe the backend cluster distribution.
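A minimal protobuf sketch of the two metadata families described above might look like this. All message and field names here are illustrative assumptions, not BOSS's actual schema:

```protobuf
syntax = "proto3";

// Table-related metadata: table -> shards -> replicas.
message Replica {
  string node_addr = 1;          // storage node address
  string disk = 2;               // specific disk on that node
}

message Shard {
  uint32 index = 1;              // partition index
  repeated Replica replicas = 2; // replica list
}

message Table {
  string name = 1;
  int64 create_time = 2;
  uint32 shard_count = 3;
  uint32 replica_count = 4;      // replicas per shard
  repeated Shard shards = 5;
}

// Storage-node metadata: pool -> zones -> nodes -> disks.
message Node {
  string addr = 1;
  repeated string disks = 2;
}

message Zone {
  string name = 1;
  repeated Node nodes = 2;
}

message ResourcePool {
  string name = 1;
  repeated Zone zones = 2;
}
```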
Mapping between tables and storage nodes
Logical mapping: (table_name, key) → shard → [replica0, replica1, replica2] → [addr0, addr1, addr2].
Physical mapping: resource pools → zones → DataNodes → disks, with replicas pointing to specific storage units.
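The logical mapping can be sketched as a simple lookup. The hash function and the topology layout below are assumptions for illustration; the real system resolves shards from the protobuf metadata:

```python
import zlib

def locate(topology, table_name, key):
    """Resolve (table_name, key) -> shard -> replica addresses."""
    table = topology[table_name]
    shard_idx = zlib.crc32(key.encode()) % table["shard_count"]
    replicas = table["shards"][shard_idx]      # [replica0, replica1, replica2]
    return [r["addr"] for r in replicas]
```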
Day8 – Object ID Single‑Point Issue
The current object_id generation uses MySQL auto‑increment, which is a single point. Two alternatives are discussed:
Persist progress in MySQL and pre‑allocate IDs via RPC.
Replace MySQL with a Raft‑based KV store that provides an INCR operation, eliminating the MySQL single point.
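The second option can be sketched as a batch allocator. This is a hypothetical illustration (names and batch size are assumptions): the gateway issues one atomic INCR(step) to the Raft‑backed KV store per batch, then hands out ids locally until the batch is exhausted, so most allocations need no RPC at all.

```python
class IdAllocator:
    def __init__(self, kv_incr, batch=1000):
        self._incr = kv_incr      # atomic INCR provided by the raft KV store
        self._batch = batch
        self._next = 0
        self._limit = 0           # exclusive upper bound of the current batch

    def next_id(self):
        if self._next >= self._limit:          # batch exhausted: fetch another
            self._limit = self._incr(self._batch)
            self._next = self._limit - self._batch
        oid = self._next
        self._next += 1
        return oid
```

If the allocator crashes, the ids remaining in its batch are simply skipped; uniqueness is preserved because the KV counter only moves forward.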
Day10 – Reducing Replication with Erasure Coding (EC)
EC Basics
Split original data into k equal‑size data blocks.
Generate m coding blocks via matrix calculations.
All blocks have the same length.
Any k out of (k+m) blocks can reconstruct the original data.
Example: split the data into six blocks (a, b, c, d, e, f) with k=6, m=3, producing three coding blocks (g, h, i). Any 6 of the 9 blocks can recover the missing ones.
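A stripped‑down illustration of the reconstruction property, using a single XOR parity block (i.e. m=1): any one missing block can be rebuilt from the remaining k. Real erasure codes such as Reed‑Solomon generalize this to m coding blocks via Galois‑field matrix arithmetic; this sketch only demonstrates the principle.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    acc = bytes(len(blocks[0]))   # all zeros
    for b in blocks:
        acc = bytes(x ^ y for x, y in zip(acc, b))
    return acc

def encode(data_blocks):
    """k data blocks -> k data blocks + 1 parity block (m=1)."""
    return data_blocks + [xor_blocks(data_blocks)]

def reconstruct(blocks, lost_idx):
    """Recover the block at lost_idx by XOR-ing all survivors."""
    survivors = [b for i, b in enumerate(blocks) if i != lost_idx]
    return xor_blocks(survivors)
```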
Storage Overhead
Replication factor = (k+m)/k. With k=6, m=3 the factor is (6+3)/6 = 1.5, versus 3.0 for 3‑replica mode.
Write Process
In EC mode, data is encoded first, then the k+m blocks are sent to k+m distinct replicas. The write is acknowledged once at least k blocks are persisted.
Read Process
IO service computes the shard for the key.
It issues reads to the first k replicas (as listed in sn_in_shard) and returns as soon as all k respond successfully.
Backup timers are set for the remaining m replicas; they are cancelled if the first batch succeeds.
When any k blocks are available, matrix decoding reconstructs the original data.
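The read quorum above can be sketched as follows. This is a simplified, synchronous illustration (names are assumptions): it tries the first k replicas and falls back to the remaining m on failure, stopping as soon as k blocks are in hand. The real service overlaps the backups with timers rather than trying them sequentially, but the "any k of k+m" logic is the same.

```python
def read_k_blocks(replicas, k, fetch):
    """fetch(replica) -> block bytes, or None on failure.

    Collect any k blocks from the k+m replicas; decoding then
    reconstructs the original data.
    """
    got = {}
    for r in replicas:            # first k replicas, then the m backups
        block = fetch(r)
        if block is not None:
            got[r] = block
        if len(got) == k:         # any k blocks suffice
            return got
    raise IOError("fewer than k blocks available; data unrecoverable")
```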
Impact on Repair
Repair now needs to read k blocks and recompute EC, instead of a single replica.
Choosing k and m
Large k reduces the number of zones needed but increases IOPS and latency; m controls fault tolerance. Practical choices balance zone count, I/O characteristics, and desired reliability.
Day11 – Storage Engine Choices
Common engines evaluated for the backend:
Bitcask (hash‑based, append‑only).
B/B+ tree (ordered files with multi‑level indexes).
LSM tree (multiple sorted runs, compaction).
Hybrid/optimized variants.
B/B+ Tree
Constructs a global ordered dataset; uses multi‑level indexes to reduce disk seeks. Works well for sequential key access but suffers random write overhead.
LSM Tree
Writes are appended to in‑memory structures (skip list) and later flushed as sorted files (L0, L1, …). Reads may need to search across several levels unless compaction has merged them.
Bitcask
Maintains an in‑memory hash mapping keys to (file, offset, length). Writes are append‑only; reads require at most a single disk I/O. Space reclamation is costly: a merge pass must rewrite live entries into new files to drop deleted and superseded ones.
Given the random‑read nature of the object storage workload, a Bitcask‑style engine is chosen for its low read latency and simple implementation.
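The chosen design can be sketched in a few lines. This is a minimal, hypothetical illustration, not BOSS's engine: an append‑only log plus an in‑memory hash of key → (offset, length), so every read costs one lookup and one read. It omits crash recovery, hint files, and the merge process.

```python
import io

class Bitcask:
    def __init__(self, log=None):
        self.log = log or io.BytesIO()   # stand-in for an append-only data file
        self.index = {}                  # key -> (offset, length)

    def put(self, key, value):
        self.log.seek(0, io.SEEK_END)
        offset = self.log.tell()
        self.log.write(value)            # append-only write
        self.index[key] = (offset, len(value))

    def get(self, key):
        offset, length = self.index[key] # in-memory lookup, then one read
        self.log.seek(offset)
        return self.log.read(length)
```

Note that an overwrite appends a new record and repoints the index; the old record becomes garbage until a merge reclaims it, which is exactly the reclamation cost mentioned above.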
Day12 – Delete Functionality
Implementation steps:
Gateway receives Del request, obtains object_id and block ids from metadata service.
Delete entry from name table.
Remove object_id from object table.
Issue block deletions to IO service.
Problem
Repair module may resurrect a just‑deleted key.
Fix
Send a "mark‑delete" request instead of immediate delete.
Repair logic respects the mark‑delete flag and sets all replicas accordingly.
Actual deletion occurs only when all replicas are either absent or marked‑deleted.
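The fix above amounts to a small state rule per replica, sketched here with illustrative names: physical deletion is allowed only when every replica is absent or mark‑deleted, and repair propagates the mark‑delete flag instead of re‑copying data, so a just‑deleted key cannot be resurrected.

```python
ABSENT, PRESENT, MARK_DELETED = "absent", "present", "mark_deleted"

def can_physically_delete(replica_states):
    """Physical delete only when no replica still holds live data."""
    return all(s in (ABSENT, MARK_DELETED) for s in replica_states)

def repair_action(replica_states):
    """Repair respects mark-delete: spread the flag, never resurrect."""
    if MARK_DELETED in replica_states:
        return "propagate_mark_delete"
    if ABSENT in replica_states and PRESENT in replica_states:
        return "copy_from_healthy"
    return "noop"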
Day13 – Horizontal Scaling and Replica Repair
Replica Repair
When a disk fails, its replicas are lost. The system creates empty replicas on healthy disks (considering zones) and the repair module copies missing data.
Horizontal Scaling
System starts at 3 replicas.
During scaling, a new replica is added (4‑replica state).
Data copy tasks run between source and target replicas.
Source replica is removed, returning to 3 replicas.
Exception Handling
If copy fails, continue writing to original replicas.
Ensure all data is migrated before gracefully retiring the old replica.
API layer must detect topology changes (e.g., via shard distribution info in responses) and update routing.
In EC mode, source and target replicas must correspond one‑to‑one because block contents differ.
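The 3 → 4 → 3 replica transition for one shard can be sketched as follows (names are assumptions; error handling and the EC one‑to‑one pairing constraint are omitted):

```python
def scale_out(shard, source, target, copy):
    """Migrate one replica of a shard from source to target."""
    shard.append(target)       # 4-replica state during migration
    assert len(shard) == 4
    copy(source, target)       # all data must land before retiring the source
    shard.remove(source)       # back to 3 replicas
    return shard
```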
Summary
Over 13 days, starting from a simple MySQL table, the article walks through the complete design and implementation of BOSS, covering topology, metadata modeling, object‑id generation, erasure coding, storage engine selection, delete handling, replica repair, and horizontal scaling. The discussion provides practical insights for building large‑scale distributed object storage systems.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.