
OPPO CBFS: Architecture and Key Technologies of a Scalable Data Lake Storage System

This article introduces OPPO's self‑developed data lake storage system CBFS, covering the fundamentals of data lake storage, the multi‑layer CBFS architecture, its core technologies such as metadata management and erasure coding, and future directions for large‑scale, low‑cost data analytics.

DataFunTalk

Introduction

OPPO, a major smart‑device manufacturer, generates massive amounts of data from phones and IoT devices. To enable intelligent services, OPPO needs deep data mining, which requires low‑cost, high‑efficiency storage solutions. The industry trend is to adopt data lakes, and OPPO's in‑house data lake storage system CBFS addresses many of the current pain points.

1. Data Lake Overview

A data lake is a centralized repository that stores raw data in its original binary or file format, along with transformed data for reporting, visualization, advanced analytics, and machine learning. Its main values are high flexibility, support for multiple analytics workloads, low cost through object storage with hot‑cold separation, and comprehensive management and audit capabilities.

2. OPPO Data Lake Solution

OPPO builds its data lake on three layers: the bottom layer is the lake storage (CBFS) supporting S3, HDFS, and POSIX protocols; the middle layer uses Iceberg for real‑time data formats; the top layer provides compatibility with various compute engines.

3. CBFS Architecture

The CBFS architecture consists of six subsystems:

Protocol Access Layer – supports S3, HDFS, and POSIX access.

Metadata Layer – distributed hierarchical namespace and flat object namespace, sharded and linearly scalable.

Metadata Cache Layer – accelerates metadata reads.

Resource Management Layer – master nodes manage physical (data nodes, metadata nodes) and logical resources (volumes, buckets, data shards).

Multi‑Replica Layer – provides both persistent multi‑replica storage and an elastic cache for fast lake access.

Erasure‑Code Storage Layer – uses Reed‑Solomon coding to achieve EB‑scale storage with low cost and multi‑AZ deployment.

4. Key Technologies

4.1 Metadata Management

CBFS provides a hierarchical namespace where each MetaNode contains thousands of MetaPartitions, each consisting of an InodeTree (B‑Tree) and a DentryTree (B‑Tree). Dentries are indexed by parentId and name; Inodes are indexed by inodeId. Multi‑Raft ensures high availability and consistency. Dynamic scaling is achieved by splitting partitions when resource thresholds are approached.
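The dentry/inode split described above can be sketched in a few lines. This is a minimal illustration (not CBFS code): plain dictionaries stand in for the B-Trees, and the class and field names are hypothetical. The point is how a path resolves one dentry per component, keyed by (parentId, name), before the final inodeId lookup.

```python
class MetaPartition:
    """Toy stand-in for a MetaPartition: a DentryTree plus an InodeTree."""

    def __init__(self):
        self.dentry_tree = {}   # (parent_inode_id, name) -> child inode id
        self.inode_tree = {}    # inode id -> inode attributes

    def create(self, parent_id, name, inode_id, attrs):
        self.dentry_tree[(parent_id, name)] = inode_id
        self.inode_tree[inode_id] = attrs

    def lookup(self, path, root_id=1):
        """Resolve /a/b/c by following one dentry per path component."""
        cur = root_id
        for name in filter(None, path.split("/")):
            cur = self.dentry_tree[(cur, name)]   # (parentId, name) index
        return self.inode_tree[cur]               # inodeId index

mp = MetaPartition()
mp.create(1, "data", 2, {"type": "dir"})
mp.create(2, "part-0", 3, {"type": "file", "size": 128})
print(mp.lookup("/data/part-0"))   # -> {'type': 'file', 'size': 128}
```

In the real system each such partition is replicated via Multi-Raft, and a partition whose trees approach the resource threshold is split to scale out.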

To accelerate flat‑namespace object lookups, CBFS introduces a PathCache that caches the parent Dentry of an object key, dramatically reducing lookup latency for deep paths.
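The idea behind PathCache can be shown with a small sketch, assuming (hypothetically) that the cache maps a parent-directory path to its already-resolved inode id, so only the final dentry lookup remains. Names here are illustrative, not the CBFS API.

```python
import os

class PathCache:
    """Caches the resolved parent directory of an object key."""

    def __init__(self, resolver):
        self.resolver = resolver   # full path -> inode id (the slow walk)
        self.cache = {}            # parent dir path -> parent inode id

    def lookup(self, key):
        parent, name = os.path.split("/" + key.strip("/"))
        parent_id = self.cache.get(parent)
        if parent_id is None:
            parent_id = self.resolver(parent)   # one slow walk per directory
            self.cache[parent] = parent_id
        return parent_id, name                  # final dentry: (parent_id, name)

calls = []
def slow_walk(path):
    calls.append(path)
    return 42   # dummy inode id for the demo

pc = PathCache(slow_walk)
pc.lookup("a/b/c/obj1")
pc.lookup("a/b/c/obj2")        # same parent: served from cache, no second walk
assert len(calls) == 1
```

For flat object namespaces with deep key prefixes, this turns N dentry hops per key into one cached lookup plus a single final hop.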

4.2 Erasure‑Code Storage

Erasure coding (EC) reduces storage cost while maintaining high durability. CBFS adopts Reed‑Solomon (RS) coding and also supports Local Reconstruction Codes (LRC) to lower repair bandwidth. The following matrix illustrates RS encoding:

The encoding matrix B is a (k+m)×k matrix whose top k rows form the identity matrix I and whose bottom m rows form the coding matrix. Multiplying B by the vector of k data blocks yields a vector of k+m blocks: the k original data blocks followed by m parity blocks.
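Written out, the relation above takes the standard RS form (in practice the arithmetic is over a Galois field GF(2^w) rather than the reals):

```latex
B\,\mathbf{d}
= \begin{pmatrix} I_k \\ C \end{pmatrix} \mathbf{d}
= \begin{pmatrix} \mathbf{d} \\ \mathbf{p} \end{pmatrix},
\qquad
\mathbf{d} = (d_1,\dots,d_k)^{\top},\;
\mathbf{p} = (p_1,\dots,p_m)^{\top}
```

Because any k rows of B form an invertible submatrix, any k surviving blocks out of the k+m suffice to recover the original data.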

CBFS supports both offline EC (encode after all k data blocks are written) and online EC (encode and write parity blocks concurrently). It also offers cross‑AZ multi‑mode EC configurations such as 1AZ‑RS, 2AZ‑RS, and 3AZ‑LRC, allowing different redundancy levels and availability guarantees within the same cluster.
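The trade-off between these modes can be made concrete with a small sketch. The (k, m) values below are illustrative placeholders, not the actual CBFS configurations, which the article does not specify; the sketch only shows how different modes trade storage overhead against the number of AZs spanned.

```python
# Hypothetical parameters for the multi-mode EC configurations.
EC_MODES = {
    "1AZ-RS":  {"k": 10, "m": 4, "azs": 1},
    "2AZ-RS":  {"k": 6,  "m": 6, "azs": 2},
    "3AZ-LRC": {"k": 12, "m": 9, "azs": 3},
}

for name, cfg in EC_MODES.items():
    # Storage overhead = total blocks written per unit of user data.
    overhead = (cfg["k"] + cfg["m"]) / cfg["k"]
    print(f"{name}: storage overhead {overhead:.2f}x "
          f"across {cfg['azs']} AZ(s)")
```

More AZs and more parity buy availability at the cost of overhead, which is why a single cluster benefits from supporting several modes side by side.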

4.3 EC Write Path

Data is streamed, sliced into chunks, encoded into data and parity blocks, and then dispatched concurrently to storage nodes. An NRW quorum protocol ensures that a write succeeds as long as at least W of the N dispatched blocks are durably stored, providing resilience against individual node or network failures.
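The quorum write can be sketched as follows. This is a simplified illustration, not the CBFS implementation: `quorum_write` and `flaky_write` are hypothetical names, and real systems also need retries and background repair of the blocks that missed the quorum.

```python
from concurrent.futures import ThreadPoolExecutor

def quorum_write(blocks, nodes, write_fn, w):
    """Dispatch each block to its node concurrently; succeed on >= w acks."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(write_fn, n, b) for n, b in zip(nodes, blocks)]
        acks = sum(1 for f in futures if f.result())
    if acks < w:
        raise IOError(f"quorum not reached: {acks} acks, {w} required")
    return acks

def flaky_write(node, block):
    return node != "node-3"   # simulate node-3 being down

# One of three nodes fails, but W = 2 acks still make the write succeed.
acks = quorum_write([b"d0", b"d1", b"p0"],
                    ["node-1", "node-2", "node-3"],
                    flaky_write, w=2)
assert acks == 2
```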

4.4 EC Read Path

Reads follow the same NRW model: any k out of k+m blocks (data or parity) are sufficient to reconstruct the original data. The Access layer prefers nearby or low‑load EC‑Nodes to minimize latency.
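The "any k of k+m" property can be demonstrated with the simplest possible code: a single XOR parity block (m = 1). Real RS/LRC decoding instead inverts a k×k submatrix over GF(2^w), but the reconstruction principle is the same; this is a simplified stand-in, not the CBFS decoder.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # k = 3 data blocks
parity = xor_blocks(data)            # m = 1 parity block

# Lose data[1]; any k = 3 of the k+m = 4 blocks reconstruct it,
# since parity ^ data[0] ^ data[2] == data[1].
survivors = [data[0], data[2], parity]
recovered = xor_blocks(survivors)
assert recovered == data[1]
```

Preferring nearby, lightly loaded EC-Nodes for the k blocks it reads is then purely a latency optimization: correctness holds for any choice of k survivors.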

4.5 Data Lake Access Acceleration

CBFS employs multi‑level caching: a local cache co‑located with compute nodes (memory/PMem/NVMe/HDD) for ultra‑low latency, and a distributed cache with configurable replication and proactive pre‑warming. Additionally, predicate push‑down reduces data transfer between storage and compute, improving query performance.
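The read path through these tiers can be sketched as below. Dictionaries stand in for the cache tiers and the names (`TieredReader`, `prewarm`) are hypothetical; the sketch only shows the lookup order (local, then distributed, then EC storage) and the promotion of misses into the caches.

```python
class TieredReader:
    """Toy two-level cache in front of the EC storage layer."""

    def __init__(self, backend):
        self.local = {}          # compute-co-located cache (mem/PMem/NVMe/HDD)
        self.distributed = {}    # shared cache tier with replication
        self.backend = backend   # EC storage read: the slowest path

    def read(self, key):
        if key in self.local:
            return self.local[key]
        if key in self.distributed:
            val = self.distributed[key]
        else:
            val = self.backend(key)       # missed both tiers: hit EC layer
            self.distributed[key] = val   # populate the shared cache
        self.local[key] = val             # promote into the local cache
        return val

    def prewarm(self, keys, fetch):
        """Proactive pre-warming: load hot objects before queries arrive."""
        for k in keys:
            self.distributed[k] = fetch(k)

backend_calls = []
def ec_read(key):
    backend_calls.append(key)
    return b"bytes:" + key.encode()

r = TieredReader(ec_read)
r.read("obj")                  # miss: one backend read
r.read("obj")                  # local cache hit: no backend read
assert len(backend_calls) == 1
```

Predicate push-down is complementary: instead of caching bytes closer to compute, it evaluates filters inside the storage layer so fewer bytes cross the network at all.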

Future Outlook

CBFS 2.x is open‑source and supports online EC and lake acceleration. Version 3.0, expected to be released in October 2021, will add features such as direct HDFS cluster mounting (no data migration) and intelligent hot‑cold tiering, further facilitating the migration of legacy data into the lake for big‑data and AI workloads.

Thank you for reading.

Tags: cloud native · data lake · metadata management · erasure coding · big data storage · CBFS
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
