
How Ozone Scales Metadata for Massive Big Data Storage

This article explains Ozone's object storage architecture, the evolution of its metadata management toward distributed KV stores such as Apache Cassandra, and the performance optimizations—read/write separation, unlimited scaling, and partitioning—that enable high‑throughput, low‑latency handling of massive datasets.

360 Zhihui Cloud Developer

Object Storage Overview

Object storage is a cloud storage service offering massive capacity, security, low cost, and high reliability. It accepts any file type and is widely adopted in cloud computing, big data analytics, and data backup.

Ozone Overview

Ozone is an open‑source, distributed, multi‑replica object storage system optimized for big‑data scenarios. It integrates with Apache Spark, Hive, and YARN, and provides a Java API, an S3‑compatible interface, and a CLI. Its management entities are volumes, buckets, and keys, analogous to accounts, directories, and files.
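The volume/bucket/key hierarchy amounts to a simple address scheme. A minimal sketch (an illustration of the naming model only, not Ozone's actual client API):

```python
def key_path(volume: str, bucket: str, key: str) -> str:
    """Compose the canonical /volume/bucket/key address that names an object."""
    return f"/{volume}/{bucket}/{key}"

def parse_key_path(path: str):
    """Split a canonical address back into (volume, bucket, key).
    Key names may themselves contain '/', so split at most three times."""
    _, volume, bucket, key = path.split("/", 3)
    return volume, bucket, key

path = key_path("vol1", "logs", "2024/01/app.log")
assert parse_key_path(path) == ("vol1", "logs", "2024/01/app.log")
```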

Ozone separates namespace management (handled by Ozone Manager, OM) from block storage (managed by Storage Container Manager, SCM). Data resides on Datanodes with replication via the Raft protocol, and SCM+Datanode together form the Hadoop Distributed Data Store (HDDS).
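The division of labor between OM and SCM can be sketched as a toy write path: the client asks OM to create a key, OM asks SCM for block locations, and the client then writes data to those Datanodes. This is a simplified model of the separation of concerns, not the real RPC protocol; class and field names are invented for illustration:

```python
import itertools

class SCM:
    """Storage Container Manager: hands out replicated block locations (simplified)."""
    def __init__(self, datanodes):
        self._ids = itertools.count()
        self.datanodes = datanodes

    def allocate_block(self, replication=3):
        return next(self._ids), self.datanodes[:replication]

class OM:
    """Ozone Manager: owns the namespace, delegates block allocation to SCM."""
    def __init__(self, scm):
        self.scm = scm
        self.namespace = {}  # "/vol/bucket/key" -> list of (block_id, locations)

    def create_key(self, path):
        block = self.scm.allocate_block()
        self.namespace[path] = [block]
        return block  # the client writes data to these Datanodes (Raft-replicated)

scm = SCM(["dn1", "dn2", "dn3", "dn4"])
om = OM(scm)
block_id, locations = om.create_key("/vol1/logs/a.log")
assert locations == ["dn1", "dn2", "dn3"]
```

Note how OM never touches data blocks: it only records which blocks a key maps to, while SCM and the Datanodes (together, HDDS) own placement and replication.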

Metadata System Architecture Evolution

The community edition of Ozone Manager originally persisted metadata (volumes, buckets, keys) in local RocksDB and used Apache Ratis for Raft‑based state synchronization, ensuring high availability.

Limitations emerged as metadata grew: single‑node RocksDB became a capacity bottleneck, and Raft‑based snapshot synchronization added complexity.

To address this, metadata storage was moved to a distributed key‑value store, eliminating the need for OM follower synchronization and snapshot handling.

After evaluating distributed KV solutions, Apache Cassandra was selected for its multi‑table batch atomicity, ByteOrderedPartitioner (matching RocksDB ordering), leaderless architecture, tunable consistency, horizontal scalability, multi‑data‑center support, and seamless integration with compute engines like Spark.
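Why byte ordering matters: a key space sorted by raw bytes, as in RocksDB or Cassandra's ByteOrderedPartitioner, turns a List(bucket) operation into a cheap contiguous range scan over a key prefix. A minimal sketch of that property:

```python
from bisect import bisect_left, bisect_right

# Key space sorted by raw bytes, as RocksDB and ByteOrderedPartitioner maintain it.
keys = sorted(k.encode() for k in [
    "/vol1/logs/a.log", "/vol1/logs/b.log",
    "/vol1/tmp/x", "/vol2/logs/a.log",
])

def prefix_scan(keys, prefix: bytes):
    """Range scan [prefix, prefix + 0xff): exactly what listing a bucket needs."""
    lo = bisect_left(keys, prefix)
    hi = bisect_right(keys, prefix + b"\xff")
    return keys[lo:hi]

assert prefix_scan(keys, b"/vol1/logs/") == [b"/vol1/logs/a.log", b"/vol1/logs/b.log"]
```

A hash-only partitioner would scatter `/vol1/logs/*` across the ring and force a full scatter-gather per listing; preserving RocksDB's ordering keeps the existing listing semantics intact after the migration.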

Metadata System Challenges and Optimizations

3.1 Read‑Write Separation

Increasing request volume caused high RPC pressure on the OM leader and large GC pauses due to deserialization of big metadata objects. By storing metadata in Cassandra, clients can read directly from the KV store, bypassing OM follower synchronization, and cache container locations for efficient List/Head operations.
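The resulting split can be sketched as follows: writes still funnel through the OM leader, but reads go straight to the shared KV store, so read traffic no longer loads the leader at all. A toy model under those assumptions (names are invented for illustration):

```python
class KVStore:
    """Stand-in for the shared Cassandra cluster holding OM metadata."""
    def __init__(self):
        self.data = {}

class OMLeader:
    """Writes pass through the OM; reads no longer need it."""
    def __init__(self, kv):
        self.kv = kv

    def put_key(self, path, meta):
        self.kv.data[path] = meta  # single write path, no follower snapshot sync

def head_key(kv, path):
    """Clients serve Head/List directly from the KV store replicas."""
    return kv.data.get(path)

kv = KVStore()
om = OMLeader(kv)
om.put_key("/vol1/logs/a.log", {"size": 1024})
assert head_key(kv, "/vol1/logs/a.log") == {"size": 1024}
```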

3.2 Unlimited Scaling of Metadata Processing

Single OM leader hardware limits RPC throughput. By deploying multiple OM instances and routing requests via a hash‑based TokenMap in the Ozone client, the system distributes load across stateless OM groups, leveraging the shared Cassandra backend for consistency.
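Client-side routing via a hash-based TokenMap can be sketched as below. The group names and hash choice are assumptions for illustration; the point is only that routing is deterministic per bucket while spreading RPC load across groups, and that any group can serve any request because all state lives in the shared Cassandra backend:

```python
import hashlib

OM_GROUPS = ["om-group-0", "om-group-1", "om-group-2"]  # hypothetical group names

def route(volume: str, bucket: str) -> str:
    """Hash the bucket address into a token and map it onto an OM group."""
    digest = hashlib.md5(f"/{volume}/{bucket}".encode()).digest()
    token = int.from_bytes(digest[:8], "big")
    return OM_GROUPS[token % len(OM_GROUPS)]

# The same bucket always lands on the same group; different buckets spread out.
assert route("vol1", "logs") == route("vol1", "logs")
assert route("vol1", "logs") in OM_GROUPS
```

Because the OM groups are stateless over Cassandra, adding RPC capacity is just a matter of adding groups and widening the TokenMap, rather than scaling up a single leader.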

3.3 Metadata Partition Management

Data skew caused hot nodes to become bottlenecks. The solution introduced a composite partitioning scheme: first hash‑partition, then range‑partition, by prefixing metadata keys with a generated shardKey, spreading hot data across multiple nodes.
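The composite scheme can be sketched as a key-rewriting function: hash the full key path into a shardKey prefix (hash partitioning), then keep the original byte-ordered path after it (range partitioning). The shard count and hash choice here are assumptions for illustration:

```python
import hashlib

NUM_SHARDS = 16  # assumed shard count

def shard_key(volume: str, bucket: str, key: str) -> str:
    """Prefix the metadata key with a hash-derived shardKey so that a hot
    bucket's keys spread across shards, while keys stay range-ordered
    within each shard."""
    digest = hashlib.md5(f"/{volume}/{bucket}/{key}".encode()).digest()
    shard = int.from_bytes(digest[:2], "big") % NUM_SHARDS
    return f"{shard:02d}/{volume}/{bucket}/{key}"

k = shard_key("vol1", "logs", "a.log")
assert k.endswith("/vol1/logs/a.log")
assert 0 <= int(k[:2]) < NUM_SHARDS
```

The trade-off of hashing the full path is that a bucket listing must scan the same prefix range in every shard and merge the results; in exchange, no single node absorbs all traffic for a hot bucket.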

Tags: Big Data, Metadata, Distributed KV, Object Storage, Ozone, Apache Cassandra
Written by

360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
