How Baidu Cloud Accelerates Data Lakes with Compute‑Storage Separation
This article explains Baidu Intelligent Cloud’s data lake acceleration solution, covering the evolution of big‑data technologies, the benefits and challenges of compute‑storage separation, the architecture of BOS object storage, and the native hierarchical namespace and RapidFS cache mechanisms that boost performance and reduce costs.
Today we share Baidu Intelligent Cloud’s data lake acceleration solution for big‑data compute‑storage separation scenarios.
1. Big Data Solution Overview
1.1 Evolution of Big Data Technologies
Big‑data processing began on single‑machine architectures, moved to parallel MPP systems, and then to Hadoop in 2006, followed by engines such as MapReduce, Hive, Spark and Flink. Since 2013, cloud‑native approaches such as data lakes and compute‑storage separation have emerged.
1.2 Baidu Intelligent Cloud Big Data Solution
The bottom layer is BOS object storage, which holds trillions of files and tens of exabytes of data. The compute engines that run on the data lake include:
BMR, a fully managed big‑data platform compatible with the open‑source ecosystem, offering rich components, cluster management and elastic scaling.
Doris, an enterprise data warehouse with materialized views, vectorized execution, a modern MPP architecture and columnar storage for PB‑scale queries.
On top sits the EasyDAP platform for one‑stop data integration, governance, development, analysis and service with unified metadata management.
The data processing flow includes data collection, storage, compute‑analysis and application. Data is ingested via Kafka, log services, real‑time or incremental sync from relational databases (Oracle, MySQL, SQL Server) and semi‑structured sources into BOS or HDFS, then processed by BMR or Doris.
2. Advantages and Challenges of Compute‑Storage Separation
Elasticity: Compute and storage can be scaled independently, avoiding resource waste.
Cost Efficiency: Object storage enables hot‑cold tiering; BOS’s six‑tier storage reduces cold data cost by up to 87.5% compared to 3‑replica HDFS. Compute resources can be provisioned dynamically and billed per use.
Lower Operations Cost: Maintenance shifts to the cloud provider, which removes the HDFS NameNode bottleneck and its scaling challenges.
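The 87.5% figure is easiest to read as a ratio: a saving of 87.5% means cold data costs one‑eighth of the 3‑replica HDFS baseline. A minimal sketch of that arithmetic follows; the function name and the baseline of 8.0 cost units are illustrative assumptions, not BOS pricing.

```python
def remaining_cost(baseline_cost: float, saving_fraction: float) -> float:
    """Return the cost left after applying a fractional saving to a baseline."""
    if not 0.0 <= saving_fraction <= 1.0:
        raise ValueError("saving_fraction must be between 0 and 1")
    return baseline_cost * (1.0 - saving_fraction)


# Hypothetical baseline: 8.0 cost units to keep a dataset on 3-replica HDFS.
hdfs_cost = 8.0
cold_cost = remaining_cost(hdfs_cost, 0.875)
print(cold_cost)  # 1.0, i.e. one-eighth of the baseline
```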
Challenges include:
Flat Namespace of Object Storage: Renaming a directory requires listing and moving every file under it, so the time grows in proportion to the file count and a failure midway can leave the directory partially renamed.
Long I/O Path: Accessing data in BOS passes through multiple layers (load balancer, Webservice, metadata lookup, storage nodes), roughly double the hops of HDFS.
High Bandwidth Consumption: Separate compute and storage clusters generate massive cross‑cluster traffic, stressing network resources.
Data Locality Loss: Compute nodes cannot sense data placement, leading to non‑optimal I/O patterns.
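The flat‑namespace rename problem can be sketched with a toy in‑memory bucket. This is an illustration only, not the BOS API: the `FlatBucket` class, its method names and the `fail_after` parameter are hypothetical. It shows that a directory rename touches every object under the prefix, and that an interruption leaves the data split between the old and new prefixes.

```python
class FlatBucket:
    """Toy flat namespace: every object is a full key in one dictionary."""

    def __init__(self):
        self.objects = {}  # full key -> bytes

    def put(self, key, data):
        self.objects[key] = data

    def rename_prefix(self, old_prefix, new_prefix, fail_after=None):
        """Rename a 'directory' by moving each object under it, one by one.

        fail_after simulates a crash once that many objects have been moved.
        Returns the number of objects moved.
        """
        # Step 1: list every key under the prefix (cost grows with file count).
        keys = sorted(k for k in self.objects if k.startswith(old_prefix))
        moved = 0
        for key in keys:
            if fail_after is not None and moved >= fail_after:
                raise RuntimeError(
                    f"interrupted after {moved} of {len(keys)} objects"
                )
            # Step 2: copy to the new key and delete the old one.
            new_key = new_prefix + key[len(old_prefix):]
            self.objects[new_key] = self.objects.pop(key)
            moved += 1
        return moved


bucket = FlatBucket()
for i in range(4):
    bucket.put(f"tmp/job/part-{i}", b"data")

try:
    bucket.rename_prefix("tmp/job/", "out/job/", fail_after=2)
except RuntimeError as exc:
    print(exc)  # interrupted after 2 of 4 objects

# Half the files are under the new prefix, half under the old one.
print(sorted(bucket.objects))
```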
3. Baidu Cloud Data Lake Acceleration Solutions
The solution has two parts:
Native hierarchical Namespace for BOS to overcome flat‑namespace performance and atomicity issues.
RapidFS, a metadata and data cache placed on compute nodes.
3.1 Native Hierarchical Namespace
The native hierarchical namespace organizes objects in a directory tree instead of a flat key space, so renaming a directory takes constant time regardless of how many files it contains. The APIs remain compatible, and users can switch between flat and hierarchical namespaces with a single click.
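Why a tree makes rename independent of file count can be shown with a small in‑memory model. This is a sketch under stated assumptions, not BOS's implementation: `DirNode` and `TreeNamespace` are hypothetical names, and the toy omits locking, persistence and checks such as moving a directory into its own subtree. A rename only detaches one entry from the source parent and attaches it to the destination parent; the files inside the directory are never visited.

```python
class DirNode:
    """Toy directory node: children are keyed by name."""

    def __init__(self):
        self.children = {}  # name -> DirNode (directory) or bytes (file)


class TreeNamespace:
    def __init__(self):
        self.root = DirNode()

    def _walk(self, parts, create=False):
        """Follow directory names from the root, optionally creating them."""
        node = self.root
        for name in parts:
            if name not in node.children:
                if not create:
                    raise FileNotFoundError("/".join(parts))
                node.children[name] = DirNode()
            node = node.children[name]
        return node

    def put(self, path, data):
        *dirs, name = path.strip("/").split("/")
        self._walk(dirs, create=True).children[name] = data

    def rename(self, old_path, new_path):
        """Move one entry between parents; cost depends on path depth only."""
        *old_dirs, old_name = old_path.strip("/").split("/")
        *new_dirs, new_name = new_path.strip("/").split("/")
        src_parent = self._walk(old_dirs)
        if old_name not in src_parent.children:
            raise FileNotFoundError(old_path)
        dst_parent = self._walk(new_dirs, create=True)
        if new_name in dst_parent.children:
            raise FileExistsError(new_path)
        dst_parent.children[new_name] = src_parent.children.pop(old_name)

    def listdir(self, path):
        return sorted(self._walk(path.strip("/").split("/")).children)


ns = TreeNamespace()
for i in range(4):
    ns.put(f"tmp/job/part-{i}", b"data")

ns.rename("tmp/job", "out/job")  # one pointer move, however many files exist
print(ns.listdir("out/job"))     # ['part-0', 'part-1', 'part-2', 'part-3']
print(ns.listdir("tmp"))         # []
```

Because the move is a single update to two parent entries, it either happens or it does not, which is what removes the partial‑rename failure mode in this model.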
The underlying storage is a distributed KV store that supports billions of objects per bucket and achieves more than 100k ops/s after optimizations such as read‑write locks, in‑memory caching, batch commit and strongly consistent reads.
3.2 RapidFS Cache Acceleration
RapidFS provides two functions: metadata acceleration and data‑plane caching. For metadata, Cache mode mirrors BOS metadata, while Block mode stores metadata locally. For data, RapidFS identifies hot file types and pre‑fetches index segments. It is accessed through a FUSE mount or a Java SDK, and a DataServer handles write‑through synchronization to BOS.
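The write‑through idea mentioned above can be illustrated with a generic cache sketch. It is not the RapidFS API or its DataServer: `BackingStore` and `WriteThroughCache` are hypothetical stand‑ins, and eviction, consistency across nodes and prefetching are left out. Writes go to the backing store first and then refresh the local copy, so later reads are served locally without a remote round trip.

```python
class BackingStore:
    """Stand-in for remote object storage; counts remote reads."""

    def __init__(self):
        self.objects = {}
        self.remote_reads = 0

    def get(self, key):
        self.remote_reads += 1
        return self.objects[key]

    def put(self, key, data):
        self.objects[key] = data


class WriteThroughCache:
    """Local cache in front of a backing store, using write-through."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def write(self, key, data):
        self.store.put(key, data)  # persist to the backing store first
        self.cache[key] = data     # then keep the local copy in sync

    def read(self, key):
        if key not in self.cache:  # miss: fetch once from the backing store
            self.cache[key] = self.store.get(key)
        return self.cache[key]


store = BackingStore()
store.put("warehouse/t1/part-0", b"cold")

cache = WriteThroughCache(store)
cache.read("warehouse/t1/part-0")           # miss, one remote read
cache.read("warehouse/t1/part-0")           # hit, served locally
cache.write("warehouse/t1/part-1", b"new")  # written to both layers
cache.read("warehouse/t1/part-1")           # hit, no remote read

print(store.remote_reads)                   # 1
```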
Performance tests show:
The hierarchical Namespace improves metadata‑intensive queries by up to 40% compared with the flat Namespace.
RapidFS adds an average performance gain of more than 15%, with I/O‑intensive queries improving by more than 30%.
4. Best Practices
Recommended usage patterns:
BOS hierarchical Namespace + RapidFS Cache mode for multiple compute clusters sharing the same data.
BOS flat Namespace + RapidFS Block mode for one‑to‑one compute‑cluster and bucket relationships.
BOS flat Namespace + RapidFS Cache mode for legacy workloads needing a non‑intrusive acceleration layer.
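The three patterns above can be written down as a small lookup helper. This is only a sketch: the `Recommendation` type, the `recommend` function and its two boolean inputs are hypothetical, and the order in which the conditions are checked is an assumption, since the recommendations do not say which pattern wins when a workload is both shared and legacy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    namespace: str     # "hierarchical" or "flat"
    rapidfs_mode: str  # "Cache" or "Block"


def recommend(shared_by_multiple_clusters: bool, legacy_workload: bool) -> Recommendation:
    """Map a deployment scenario to one of the three listed combinations."""
    if shared_by_multiple_clusters:
        # Multiple compute clusters sharing the same data.
        return Recommendation("hierarchical", "Cache")
    if legacy_workload:
        # Existing workloads that need a non-intrusive acceleration layer.
        return Recommendation("flat", "Cache")
    # One compute cluster mapped to one bucket.
    return Recommendation("flat", "Block")


print(recommend(shared_by_multiple_clusters=True, legacy_workload=False))
print(recommend(shared_by_multiple_clusters=False, legacy_workload=False))
```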
Baidu Intelligent Cloud Tech Hub