Why Ozone Is the Next‑Generation Distributed Object Store for Big Data
This article explains how Ozone, the Hadoop community’s new distributed object‑storage system, overcomes HDFS’s small‑file limitations with a hierarchical Volume‑Bucket‑Object model, detailing its architecture, components, data flow for creating and reading objects, and the benefits of its scalable, fault‑tolerant design.
Introduction
HDFS is the de facto standard storage system for big data, widely used for its stability and easy scalability. However, all filesystem metadata must reside in the NameNode's memory, so the number of files a cluster can track is bounded by that memory. HDFS therefore handles large files well but degrades when it has to hold many small ones.
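To make the small-file pressure concrete, here is a rough back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file or block) and a 128 MiB block size; neither figure comes from this article, and real usage varies by version and configuration.

```python
# Rough NameNode heap estimate. BYTES_PER_OBJECT is a commonly quoted
# rule of thumb (an assumption), not a measured value for any cluster.
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Estimate the heap used by file and block objects in the NameNode."""
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT


TIB = 1024 ** 4
GIB = 1024 ** 3
MIB = 1024 ** 2

# The same 1 TiB of data stored two ways (128 MiB blocks assumed).
large = namenode_heap_bytes(num_files=TIB // GIB, blocks_per_file=8)  # 1 GiB files
small = namenode_heap_bytes(num_files=TIB // MIB, blocks_per_file=1)  # 1 MiB files

print(f"1 GiB files: {large / 1024:.0f} KiB of heap")
print(f"1 MiB files: {small / MIB:.0f} MiB of heap")
```

Under these assumptions the same amount of data costs a couple of hundred times more metadata memory when it is split into small files.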
The Hadoop community therefore introduced Ozone, a distributed key‑value object storage system that manages both small and large files efficiently.
Object Storage Basics
In object storage, each data unit is stored as a discrete object rather than as a file in a directory tree. Implementations nevertheless often define a few logical layers to group objects.
Ozone’s Three‑Level Hierarchy
Ozone organizes data into Volume → Bucket → Object.
Volume
A Volume is comparable to a user account in Amazon S3 and plays a role similar to a home directory. Only administrators can create Volumes, which are used for quota management and can contain any number of Buckets.
Bucket
A Bucket is a container for objects, similar to an S3 bucket or Azure container. Buckets are created under a specific Volume and cannot be renamed after creation. A Bucket name only has to be unique within its Volume, not globally.
Object
Objects reside in Buckets and follow a key‑value model: the key is the object name, the value is the object’s content. Object names must be unique within their Bucket. Each object carries metadata such as size, creation time, modification time, replication factor, and ACLs. Object size is unlimited.
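The three levels can be pictured with a small in-memory model. This is only an illustrative sketch: the class and method names are hypothetical and are not the Ozone client API.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    name: str
    objects: dict = field(default_factory=dict)  # object name (key) -> content (value)

    def put(self, key: str, value: bytes) -> None:
        # A dict maps each object name to exactly one value within the bucket.
        self.objects[key] = value


@dataclass
class Volume:
    name: str
    buckets: dict = field(default_factory=dict)  # bucket name -> Bucket

    def create_bucket(self, name: str) -> Bucket:
        # Bucket names must be unique inside one volume.
        if name in self.buckets:
            raise ValueError(f"bucket {name!r} already exists in volume {self.name!r}")
        self.buckets[name] = Bucket(name)
        return self.buckets[name]


vol_a, vol_b = Volume("sales"), Volume("hr")
vol_a.create_bucket("logs").put("2024/01/app.log", b"hello")
vol_b.create_bucket("logs")  # the same bucket name is fine in a different volume
```

Calling vol_a.create_bucket("logs") a second time raises ValueError, which mirrors the per-Volume uniqueness rule.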
Access URL Format
[scheme][bucket.volume.server:port]/key
The scheme can be o3fs (RPC) or http/https (REST API). If omitted, RPC is used. The server:port defaults to the Ozone Manager address defined in ozone-site.xml, or localhost:9862 if unspecified.
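A minimal parser for this address format might look like the sketch below. The function and type names are hypothetical, the "://" separator after the scheme is an assumption about how the format is written out, and the sketch falls back only to localhost:9862 because it does not read ozone-site.xml.

```python
from typing import NamedTuple


class OzoneAddress(NamedTuple):
    scheme: str
    volume: str
    bucket: str
    server: str
    port: int
    key: str


def parse_ozone_url(url: str, default_server: str = "localhost",
                    default_port: int = 9862) -> OzoneAddress:
    scheme, sep, rest = url.partition("://")
    if not sep:  # no scheme given: fall back to RPC
        scheme, rest = "o3fs", url
    authority, _, key = rest.partition("/")
    host, _, port = authority.partition(":")
    parts = host.split(".", 2)  # bucket, volume, then the optional server name
    if len(parts) < 2 or not all(parts[:2]):
        raise ValueError(f"expected bucket.volume in {url!r}")
    bucket, volume = parts[0], parts[1]
    server = parts[2] if len(parts) == 3 else default_server
    return OzoneAddress(scheme, volume, bucket, server,
                        int(port) if port else default_port, key)


full = parse_ozone_url("o3fs://logs.sales.om1.example.com:9862/2024/app.log")
short = parse_ozone_url("logs.sales/app.log")
print(full)
print(short)
```

Splitting the host on the first two dots keeps a dotted server name such as om1.example.com intact.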
Technical Architecture
Ozone’s architecture consists of three components:
Ozone Manager (OM) – unified metadata management.
Storage Container Manager (SCM) – block allocation and DataNode management.
Datanode – stores the actual data.
Compared with HDFS, the former NameNode responsibilities are split between OM and SCM, allowing independent scaling of metadata and data storage.
Ozone Manager (OM)
OM manages the namespace, handling creation, update, and deletion of Volumes, Buckets, and Keys. Metadata is stored in RocksDB, which avoids the need to keep all of it in memory, and is replicated via Apache Ratis (a Raft implementation) for high availability.
Storage Container Manager (SCM)
SCM functions like HDFS's Block Manager. It manages Containers, Pipelines, and Datanodes, providing block and container operations to OM. It also receives heartbeats from Datanodes and uses them to maintain the required replication levels.
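The heartbeat bookkeeping can be sketched as follows. This is a simplified illustration, not SCM's actual logic: the function names, the timeout value, and the data layout are all assumptions.

```python
def live_datanodes(last_heartbeat: dict, now: float, timeout_s: float) -> set:
    """Datanodes whose most recent heartbeat is within the timeout."""
    return {dn for dn, ts in last_heartbeat.items() if now - ts <= timeout_s}


def under_replicated(replicas: dict, live: set, required: int = 3) -> dict:
    """Map container id -> number of missing replicas, counting only live nodes."""
    missing = {}
    for container_id, nodes in replicas.items():
        alive = len(nodes & live)
        if alive < required:
            missing[container_id] = required - alive
    return missing


# Hypothetical state: dn3 last reported 60 seconds ago.
last_heartbeat = {"dn1": 100.0, "dn2": 98.0, "dn3": 40.0, "dn4": 99.0}
replicas = {1: {"dn1", "dn2", "dn3"}, 2: {"dn1", "dn2", "dn4"}}

live = live_datanodes(last_heartbeat, now=100.0, timeout_s=30.0)
todo = under_replicated(replicas, live)
print(todo)
```

Here container 1 has lost one live replica, so a manager following this scheme would schedule one new copy of it.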
Block, Container, and Pipeline
Blocks are the actual data chunks. Each Container records information about its Blocks; data is replicated at the Container level. SCM supports two pipeline types: a standalone pipeline with a single Datanode and a three‑node Apache RATIS write pipeline. Containers have two states: OPEN (writable) and CLOSED (read‑only). When a Container reaches its size limit (default 5 GB), it transitions to CLOSED.
Datanode
Datanodes store Containers and periodically send heartbeats to SCM. When a Container exceeds 90 % of its target size or a write fails, the Datanode requests SCM to close the Container. Similarly, pipeline errors trigger a pipeline close command.
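The Container lifecycle described above can be sketched as a tiny state machine. The class below is illustrative only: its names are hypothetical, the 5 GB limit and 90 % threshold are the figures quoted in this article, and the real close is coordinated by SCM rather than by a direct method call.

```python
from enum import Enum

GB = 1024 ** 3


class State(Enum):
    OPEN = "OPEN"      # writable
    CLOSED = "CLOSED"  # read-only


class Container:
    def __init__(self, container_id: int, max_size: int = 5 * GB,
                 close_threshold: float = 0.9):
        self.container_id = container_id
        self.max_size = max_size
        self.close_threshold = close_threshold
        self.used = 0
        self.state = State.OPEN
        self.close_requested = False  # set when the Datanode would ask SCM to close

    def write(self, num_bytes: int) -> None:
        if self.state is not State.OPEN:
            raise IOError(f"container {self.container_id} is CLOSED (read-only)")
        self.used += num_bytes
        if self.used > self.close_threshold * self.max_size:
            self.close_requested = True

    def close(self) -> None:
        # Stands in for the close command issued by SCM.
        self.state = State.CLOSED


c = Container(1, max_size=100)
c.write(50)
before = c.close_requested
c.write(45)  # 95 of 100 bytes used, above the 90 % threshold
after = c.close_requested
c.close()
```

After close(), any further write raises an error, matching the read-only CLOSED state.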
Hierarchical Management
The layered design lets OM, SCM, and Datanodes be scaled independently. The semantic hierarchy (Volume-Bucket-Object) is managed by OM, while the physical layout (Containers and Blocks) is managed by SCM and stored on the Datanodes.
Object Creation Process
The Ozone client contacts OM with object metadata (name, size, replication factor, etc.).
OM asks SCM to find an OPEN Container and allocate sufficient Blocks.
SCM returns the selected Container, Blocks, and the three Datanodes forming the write pipeline.
OM forwards this information to the client.
The client writes data to the first Datanode (pipeline leader) in the Container.
After the write, the client notifies OM, which updates the object’s metadata with the Container and Block locations.
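The creation steps above can be walked through with toy in-memory stand-ins for the three components. Everything here is a sketch under assumptions: the class and method names (open_key, commit_key, allocate_block) are hypothetical, a single OPEN Container is assumed, and follower replication is simulated with a plain copy rather than a consensus protocol.

```python
import itertools


class Datanode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # (container_id, block_id) -> data


class SCM:
    """Toy SCM: one OPEN container served by a three-node write pipeline."""

    def __init__(self, datanodes):
        self.pipeline = datanodes[:3]
        self.open_container_id = 1
        self._block_ids = itertools.count(1)

    def allocate_block(self):
        # Steps 2-3: pick the OPEN container, a new block, and the pipeline.
        return self.open_container_id, next(self._block_ids), self.pipeline


class OM:
    """Toy OM: keeps the key -> location table."""

    def __init__(self, scm):
        self.scm = scm
        self.key_table = {}

    def open_key(self, name, size):
        # Steps 1 and 4: accept the metadata, forward SCM's answer to the client.
        return self.scm.allocate_block()

    def commit_key(self, name, size, container_id, block_id, pipeline):
        # Step 6: record where the object's data lives.
        self.key_table[name] = {
            "size": size,
            "container": container_id,
            "block": block_id,
            "datanodes": [dn.name for dn in pipeline],
        }


def create_object(om, name, data):
    container_id, block_id, pipeline = om.open_key(name, len(data))
    leader, followers = pipeline[0], pipeline[1:]
    leader.blocks[(container_id, block_id)] = data  # step 5: write to the leader
    for dn in followers:                            # simulated replication
        dn.blocks[(container_id, block_id)] = data
    om.commit_key(name, len(data), container_id, block_id, pipeline)


nodes = [Datanode(f"dn{i}") for i in range(1, 4)]
om = OM(SCM(nodes))
create_object(om, "/sales/logs/app.log", b"hello ozone")
print(om.key_table["/sales/logs/app.log"])
```

Note that OM never touches the data itself; it only hands out and records locations, which is what lets metadata and data scale separately.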
Object Read Process
The client requests the object key from OM.
OM looks up the metadata and returns the Container, Block, and Datanode list.
If the client runs on a cluster node, OM orders the Datanode list by network proximity, allowing the client to read from the nearest Datanode, reducing latency.
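The proximity ordering in the last step can be sketched with a toy distance function. The (rack, host) locations and the 0/2/4 distance values are assumptions made for illustration; the article does not specify how OM measures network distance.

```python
def distance(client, datanode):
    """Toy topology distance: 0 for the same host, 2 for the same rack, 4 otherwise."""
    client_rack, client_host = client
    dn_rack, dn_host = datanode
    if client_rack == dn_rack and client_host == dn_host:
        return 0
    if client_rack == dn_rack:
        return 2
    return 4


def order_by_proximity(client, datanodes):
    # sorted() is stable, so nodes at equal distance keep their original order.
    return sorted(datanodes, key=lambda dn: distance(client, dn))


client = ("rack1", "host3")
replicas = [("rack2", "host7"), ("rack1", "host3"), ("rack1", "host1")]
ordered = order_by_proximity(client, replicas)
print(ordered)
```

The client would then try the first entry, falling back to later entries if a read fails.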
