Design and Optimization of the Ozone Distributed Object Storage System
This article presents a comprehensive overview of Ozone, a Hadoop‑based distributed object storage system, detailing its architecture, metadata management, scalability enhancements, small‑file handling, erasure coding, lifecycle policies, and future improvements aimed at boosting performance and reliability for large‑scale unstructured data workloads.
Background: As data volumes grow and unstructured data proliferates, traditional storage solutions become insufficient, prompting the adoption of object storage for its scalability, reliability, and cost‑effectiveness. Ozone, developed at Qihoo 360 on the open‑source Apache Ozone project, offers a multi‑tenant, high‑performance object storage platform suitable for cloud computing and big‑data analytics.
Basic Technology Overview
Ozone is a Hadoop‑based distributed object storage system that scales to billions of objects and can run in containerized environments such as Kubernetes. It provides Java APIs, S3 compatibility, and command‑line tools, and its management model consists of volumes, buckets, and keys.
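The three‑level model means every object is addressed as `/volume/bucket/key`. The sketch below illustrates that addressing scheme only; `OzoneAddress` is a hypothetical helper for this article, not part of the real Ozone client API.

```java
// Hypothetical helper illustrating Ozone's /volume/bucket/key naming model.
public final class OzoneAddress {
    public final String volume;
    public final String bucket;
    public final String key;

    public OzoneAddress(String volume, String bucket, String key) {
        this.volume = volume;
        this.bucket = bucket;
        this.key = key;
    }

    // Parse "/volume/bucket/key"; the key itself may contain further '/' segments.
    public static OzoneAddress parse(String path) {
        String[] parts = path.replaceFirst("^/", "").split("/", 3);
        if (parts.length < 3) {
            throw new IllegalArgumentException("expected /volume/bucket/key: " + path);
        }
        return new OzoneAddress(parts[0], parts[1], parts[2]);
    }
}
```

Volumes group tenants, buckets group related data, and keys are flat names within a bucket, which is why a key may itself look like a nested path.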
Architecture
The system separates namespace management (handled by the Ozone Manager, OM) from block storage (managed by the Storage Container Manager, SCM). Data resides on Datanodes, replicated via the Raft protocol with multi‑Raft pipelines. SCM and Datanodes together expose a Hadoop Distributed Data Store (HDDS) interface.
Optimizations and Improvements
Metadata Subsystem: To overcome RocksDB size limits and the complexity of Raft‑based metadata replication, metadata is moved to a distributed key‑value store (Apache Cassandra), eliminating snapshot logic and simplifying consistency.
Metadata Read/Write Separation: With metadata stored in the KV store, read requests bypass OM followers, using a client‑side cache for container locations, while write paths remain unchanged.
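A client‑side container‑location cache might look like the sketch below. The class name, the `Long` container IDs, and the datanode address strings are illustrative assumptions; the point is that repeated reads resolve locations locally, and only a miss falls through to the supplied SCM lookup.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of a client-side container-location cache (hypothetical shape).
// Reads resolve container locations locally instead of round-tripping to
// an OM follower; a cache miss falls through to the SCM lookup function.
public final class ContainerLocationCache {
    private final ConcurrentHashMap<Long, List<String>> locations = new ConcurrentHashMap<>();
    private final Function<Long, List<String>> scmLookup;

    public ContainerLocationCache(Function<Long, List<String>> scmLookup) {
        this.scmLookup = scmLookup;
    }

    public List<String> locate(long containerId) {
        return locations.computeIfAbsent(containerId, scmLookup::apply);
    }

    // Drop a stale entry, e.g. after a replica has been moved.
    public void invalidate(long containerId) {
        locations.remove(containerId);
    }
}
```

Because writes still go through the unchanged path, the cache only needs invalidation when a replica moves, not full coherence with the write side.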
TableCache Delayed Cleanup: TableCache is leveraged to reduce reads of the open‑key table, improving write throughput by caching table entries and deferring their cleanup until they can be safely evicted.
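One way to read "delayed cleanup" is that cached entries remain visible until the write batch (epoch) that produced them has been flushed to the backing store, so in‑flight writes never cause a cache miss. The sketch below is a minimal interpretation under that assumption; the epoch bookkeeping and method names are hypothetical, not Ozone's actual TableCache implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Minimal sketch of a table cache with delayed cleanup (hypothetical shape).
// Entries stay readable until the epoch that wrote them has been flushed,
// so readers see in-flight writes without touching the backing store.
public final class TableCache<K, V> {
    private final Map<K, V> entries = new HashMap<>();
    private final NavigableMap<Long, List<K>> keysByEpoch = new TreeMap<>();

    public synchronized void put(long epoch, K key, V value) {
        entries.put(key, value);
        keysByEpoch.computeIfAbsent(epoch, e -> new ArrayList<>()).add(key);
    }

    public synchronized V get(K key) {
        return entries.get(key);
    }

    // Evict only entries written in epochs already flushed to the store.
    public synchronized void cleanupUpTo(long flushedEpoch) {
        Iterator<Map.Entry<Long, List<K>>> it =
            keysByEpoch.headMap(flushedEpoch, true).entrySet().iterator();
        while (it.hasNext()) {
            for (K key : it.next().getValue()) {
                entries.remove(key);
            }
            it.remove();
        }
    }

    public synchronized int size() {
        return entries.size();
    }
}
```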
Multi‑OM Connections: Multiple OM instances are stateless and can share the same KV store, distributing RPC load via hash‑based bucket/object routing.
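Hash‑based routing over stateless OMs can be as simple as hashing the volume/bucket pair modulo the number of instances, so all traffic for one bucket lands on the same OM. The endpoint names below are illustrative.

```java
// Sketch of hash-based request routing across stateless OM instances
// that share one KV store (endpoint names are illustrative).
public final class OmRouter {
    private final String[] endpoints;

    public OmRouter(String... endpoints) {
        this.endpoints = endpoints;
    }

    // Route by volume/bucket so every request for a given bucket hits the
    // same OM, keeping that OM's cache warm for the bucket's keys.
    public String route(String volume, String bucket) {
        int hash = (volume + "/" + bucket).hashCode();
        return endpoints[Math.floorMod(hash, endpoints.length)];
    }
}
```

A fixed modulo scheme reshuffles buckets when OMs are added or removed; a production router would likely use consistent hashing to limit that churn.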
Small‑File Handling: New container types (KeyValueContainer and AppendOnlyContainer) aggregate small files into larger blocks, introduce dedicated PutSmallFile/GetSmallFile RPCs, and support EC after aggregation to avoid space waste.
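The aggregation idea can be sketched as an in‑memory append‑only container: each small file is appended to one growing block and located later through an offset/length index. Class and method names mirror the PutSmallFile/GetSmallFile RPCs from the article but the layout is an illustrative assumption, not Ozone's on‑disk format.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of an append-only container that aggregates small files into one
// block (illustrative layout; mirrors the PutSmallFile/GetSmallFile idea).
public final class AppendOnlyContainer {
    private final ByteArrayOutputStream block = new ByteArrayOutputStream();
    private final Map<String, int[]> index = new HashMap<>(); // key -> {offset, length}

    public synchronized void putSmallFile(String key, byte[] data) {
        index.put(key, new int[] {block.size(), data.length});
        block.write(data, 0, data.length);
    }

    public synchronized byte[] getSmallFile(String key) {
        int[] loc = index.get(key);
        if (loc == null) {
            return null;
        }
        return Arrays.copyOfRange(block.toByteArray(), loc[0], loc[0] + loc[1]);
    }

    // Size of the aggregated block, which becomes the replication/EC unit.
    public synchronized int aggregatedSize() {
        return block.size();
    }
}
```

Aggregating first means replication and erasure coding operate on one large block instead of thousands of tiny ones, which is what avoids the per‑object space and metadata overhead.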
Erasure Coding for Small Files: Small files are first replicated, then merged into a large object that undergoes EC, with replica management ensuring consistency.
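To make the EC step concrete, here is the simplest possible code: split the merged object into k data shards plus one XOR parity shard, so any single lost shard can be rebuilt. Real deployments use Reed‑Solomon (or LRC, as the outlook section mentions) rather than single‑parity XOR; this is a teaching sketch only.

```java
import java.util.Arrays;

// Simplest erasure-coding sketch: k data shards + 1 XOR parity shard.
// Production systems use Reed-Solomon/LRC; XOR shows the principle only.
public final class XorParity {
    // Split data into k equal shards (zero-padded) and append a parity shard.
    public static byte[][] encode(byte[] data, int k) {
        int shardLen = (data.length + k - 1) / k;
        byte[][] shards = new byte[k + 1][shardLen];
        for (int i = 0; i < data.length; i++) {
            shards[i / shardLen][i % shardLen] = data[i];
        }
        for (int s = 0; s < k; s++) {
            for (int j = 0; j < shardLen; j++) {
                shards[k][j] ^= shards[s][j];
            }
        }
        return shards;
    }

    // Rebuild one lost shard by XOR-ing all surviving shards together.
    public static byte[] recover(byte[][] shards, int lost) {
        byte[] out = new byte[shards[0].length];
        for (int s = 0; s < shards.length; s++) {
            if (s == lost) continue;
            for (int j = 0; j < out.length; j++) {
                out[j] ^= shards[s][j];
            }
        }
        return out;
    }
}
```

Replicating small files first and erasure‑coding only the merged object keeps the write path fast while still getting EC's storage savings for cold data.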
File Lifecycle Management: Cassandra's TTL feature and lifecycle flags in containers enable automatic expiration and cleanup of time‑bound data.
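The division of labor here is that the metadata store expires key records natively while a background sweep reclaims the data. A minimal sketch of the sweep side, with hypothetical names and explicit timestamps for testability:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of TTL-driven lifecycle cleanup (hypothetical shape). In the
// article's design Cassandra's native TTL expires the metadata; a sweep
// like this one identifies expired keys so the data can be reclaimed.
public final class TtlIndex {
    private final Map<String, Long> expiryMillis = new HashMap<>();

    public void put(String key, long nowMillis, long ttlMillis) {
        expiryMillis.put(key, nowMillis + ttlMillis);
    }

    public boolean isExpired(String key, long nowMillis) {
        Long deadline = expiryMillis.get(key);
        return deadline != null && nowMillis >= deadline;
    }

    // Collect and drop keys whose TTL has elapsed; callers delete the data.
    public List<String> sweep(long nowMillis) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> e : expiryMillis.entrySet()) {
            if (nowMillis >= e.getValue()) {
                expired.add(e.getKey());
            }
        }
        expired.forEach(expiryMillis::remove);
        return expired;
    }
}
```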
Future Outlook
Planned enhancements include supporting multiple SCM groups for unlimited scalability, multi‑AZ data distribution with LRC erasure coding, NVMe‑based read caching, SSD write‑back buffers, streamlined small‑file write paths, zero‑copy data transfer, configurable seek‑read optimizations, EC pipeline balancing, and multipart upload metadata indexing.
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.