Design and Optimization of the Ozone Distributed Object Storage System
This article presents a comprehensive overview of Ozone, a Hadoop‑based distributed object storage system, detailing its architecture, metadata management, scalability enhancements, small‑file handling, erasure coding, lifecycle policies, and future improvements aimed at boosting performance and reliability for large‑scale unstructured data workloads.
Background: As data volumes grow and unstructured data proliferates, traditional storage solutions become insufficient, prompting the adoption of object storage for its scalability, reliability, and cost‑effectiveness. Ozone, developed at Qihoo 360 on the open‑source Apache Ozone project, offers a multi‑tenant, high‑performance object storage platform suitable for cloud computing and big‑data analytics.
Basic Technology Overview
Ozone is a Hadoop‑based distributed object storage system that scales to billions of objects and can run in containerized environments such as Kubernetes. It provides Java APIs, S3 compatibility, and command‑line tools, and its management model consists of volumes, buckets, and keys.
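The three‑level model means every object is addressed as `/volume/bucket/key`. The sketch below illustrates that addressing scheme only; `OzoneAddress` is a hypothetical helper for this article, not part of the real Ozone client API.

```java
// Hypothetical helper illustrating Ozone's /volume/bucket/key naming model.
public final class OzoneAddress {
    public final String volume;
    public final String bucket;
    public final String key;

    public OzoneAddress(String volume, String bucket, String key) {
        this.volume = volume;
        this.bucket = bucket;
        this.key = key;
    }

    // Parse "/volume/bucket/key"; the key itself may contain further '/' segments.
    public static OzoneAddress parse(String path) {
        String[] parts = path.replaceFirst("^/", "").split("/", 3);
        if (parts.length < 3) {
            throw new IllegalArgumentException("expected /volume/bucket/key: " + path);
        }
        return new OzoneAddress(parts[0], parts[1], parts[2]);
    }
}
```

Volumes group tenants, buckets group related data, and keys are flat names within a bucket, which is why a key may itself look like a nested path.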
Architecture
The system separates namespace management (handled by the Ozone Manager, OM) from block storage (managed by the Storage Container Manager, SCM). Data resides on Datanodes, replicated via the Raft protocol with multi‑Raft pipelines. SCM and Datanodes together expose a Hadoop Distributed Data Store (HDDS) interface.
Optimizations and Improvements
Metadata Subsystem: To overcome RocksDB size limits and the complexity of Raft‑based metadata replication, metadata is moved to a distributed key‑value store (Apache Cassandra), eliminating snapshot logic and simplifying consistency.
Metadata Read/Write Separation: With metadata stored in the KV store, read requests bypass OM followers, using a client‑side cache for container locations, while write paths remain unchanged.
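A client‑side container‑location cache might look like the sketch below. The class name, the `Long` container IDs, and the datanode address strings are illustrative assumptions; the point is that repeated reads resolve locations locally, and only a miss falls through to the supplied SCM lookup.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of a client-side container-location cache (hypothetical shape).
// Reads resolve container locations locally instead of round-tripping to
// an OM follower; a cache miss falls through to the SCM lookup function.
public final class ContainerLocationCache {
    private final ConcurrentHashMap<Long, List<String>> locations = new ConcurrentHashMap<>();
    private final Function<Long, List<String>> scmLookup;

    public ContainerLocationCache(Function<Long, List<String>> scmLookup) {
        this.scmLookup = scmLookup;
    }

    public List<String> locate(long containerId) {
        return locations.computeIfAbsent(containerId, scmLookup::apply);
    }

    // Drop a stale entry, e.g. after a replica has been moved.
    public void invalidate(long containerId) {
        locations.remove(containerId);
    }
}
```

Because writes still go through the unchanged path, the cache only needs invalidation when a replica moves, not full coherence with the write side.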
TableCache Delayed Cleanup: TableCache is leveraged to reduce reads of the open‑key table, improving write throughput by caching table entries and deferring their cleanup until they can be safely evicted.
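One way to read "delayed cleanup" is that cached entries remain visible until the write batch (epoch) that produced them has been flushed to the backing store, so in‑flight writes never cause a cache miss. The sketch below is a minimal interpretation under that assumption; the epoch bookkeeping and method names are hypothetical, not Ozone's actual TableCache implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Minimal sketch of a table cache with delayed cleanup (hypothetical shape).
// Entries stay readable until the epoch that wrote them has been flushed,
// so readers see in-flight writes without touching the backing store.
public final class TableCache<K, V> {
    private final Map<K, V> entries = new HashMap<>();
    private final NavigableMap<Long, List<K>> keysByEpoch = new TreeMap<>();

    public synchronized void put(long epoch, K key, V value) {
        entries.put(key, value);
        keysByEpoch.computeIfAbsent(epoch, e -> new ArrayList<>()).add(key);
    }

    public synchronized V get(K key) {
        return entries.get(key);
    }

    // Evict only entries written in epochs already flushed to the store.
    public synchronized void cleanupUpTo(long flushedEpoch) {
        Iterator<Map.Entry<Long, List<K>>> it =
            keysByEpoch.headMap(flushedEpoch, true).entrySet().iterator();
        while (it.hasNext()) {
            for (K key : it.next().getValue()) {
                entries.remove(key);
            }
            it.remove();
        }
    }

    public synchronized int size() {
        return entries.size();
    }
}
```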
Multi‑OM Connections: Multiple OM instances are stateless and can share the same KV store, distributing RPC load via hash‑based bucket/object routing.
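Hash‑based routing over stateless OMs can be as simple as hashing the volume/bucket pair modulo the number of instances, so all traffic for one bucket lands on the same OM. The endpoint names below are illustrative.

```java
// Sketch of hash-based request routing across stateless OM instances
// that share one KV store (endpoint names are illustrative).
public final class OmRouter {
    private final String[] endpoints;

    public OmRouter(String... endpoints) {
        this.endpoints = endpoints;
    }

    // Route by volume/bucket so every request for a given bucket hits the
    // same OM, keeping that OM's cache warm for the bucket's keys.
    public String route(String volume, String bucket) {
        int hash = (volume + "/" + bucket).hashCode();
        return endpoints[Math.floorMod(hash, endpoints.length)];
    }
}
```

A fixed modulo scheme reshuffles buckets when OMs are added or removed; a production router would likely use consistent hashing to limit that churn.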
Small‑File Handling: New container types (KeyValueContainer and AppendOnlyContainer) aggregate small files into larger blocks, introduce dedicated PutSmallFile/GetSmallFile RPCs, and support EC after aggregation to avoid space waste.
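The aggregation idea can be sketched as an in‑memory append‑only container: each small file is appended to one growing block and located later through an offset/length index. Class and method names mirror the PutSmallFile/GetSmallFile RPCs from the article but the layout is an illustrative assumption, not Ozone's on‑disk format.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of an append-only container that aggregates small files into one
// block (illustrative layout; mirrors the PutSmallFile/GetSmallFile idea).
public final class AppendOnlyContainer {
    private final ByteArrayOutputStream block = new ByteArrayOutputStream();
    private final Map<String, int[]> index = new HashMap<>(); // key -> {offset, length}

    public synchronized void putSmallFile(String key, byte[] data) {
        index.put(key, new int[] {block.size(), data.length});
        block.write(data, 0, data.length);
    }

    public synchronized byte[] getSmallFile(String key) {
        int[] loc = index.get(key);
        if (loc == null) {
            return null;
        }
        return Arrays.copyOfRange(block.toByteArray(), loc[0], loc[0] + loc[1]);
    }

    // Size of the aggregated block, which becomes the replication/EC unit.
    public synchronized int aggregatedSize() {
        return block.size();
    }
}
```

Aggregating first means replication and erasure coding operate on one large block instead of thousands of tiny ones, which is what avoids the per‑object space and metadata overhead.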
Erasure Coding for Small Files: Small files are first replicated, then merged into a large object that undergoes EC, with replica management ensuring consistency.
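To make the EC step concrete, here is the simplest possible code: split the merged object into k data shards plus one XOR parity shard, so any single lost shard can be rebuilt. Real deployments use Reed‑Solomon (or LRC, as the outlook section mentions) rather than single‑parity XOR; this is a teaching sketch only.

```java
import java.util.Arrays;

// Simplest erasure-coding sketch: k data shards + 1 XOR parity shard.
// Production systems use Reed-Solomon/LRC; XOR shows the principle only.
public final class XorParity {
    // Split data into k equal shards (zero-padded) and append a parity shard.
    public static byte[][] encode(byte[] data, int k) {
        int shardLen = (data.length + k - 1) / k;
        byte[][] shards = new byte[k + 1][shardLen];
        for (int i = 0; i < data.length; i++) {
            shards[i / shardLen][i % shardLen] = data[i];
        }
        for (int s = 0; s < k; s++) {
            for (int j = 0; j < shardLen; j++) {
                shards[k][j] ^= shards[s][j];
            }
        }
        return shards;
    }

    // Rebuild one lost shard by XOR-ing all surviving shards together.
    public static byte[] recover(byte[][] shards, int lost) {
        byte[] out = new byte[shards[0].length];
        for (int s = 0; s < shards.length; s++) {
            if (s == lost) continue;
            for (int j = 0; j < out.length; j++) {
                out[j] ^= shards[s][j];
            }
        }
        return out;
    }
}
```

Replicating small files first and erasure‑coding only the merged object keeps the write path fast while still getting EC's storage savings for cold data.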
File Lifecycle Management: Cassandra's TTL feature and lifecycle flags in containers enable automatic expiration and cleanup of time‑bound data.
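The division of labor here is that the metadata store expires key records natively while a background sweep reclaims the data. A minimal sketch of the sweep side, with hypothetical names and explicit timestamps for testability:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of TTL-driven lifecycle cleanup (hypothetical shape). In the
// article's design Cassandra's native TTL expires the metadata; a sweep
// like this one identifies expired keys so the data can be reclaimed.
public final class TtlIndex {
    private final Map<String, Long> expiryMillis = new HashMap<>();

    public void put(String key, long nowMillis, long ttlMillis) {
        expiryMillis.put(key, nowMillis + ttlMillis);
    }

    public boolean isExpired(String key, long nowMillis) {
        Long deadline = expiryMillis.get(key);
        return deadline != null && nowMillis >= deadline;
    }

    // Collect and drop keys whose TTL has elapsed; callers delete the data.
    public List<String> sweep(long nowMillis) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> e : expiryMillis.entrySet()) {
            if (nowMillis >= e.getValue()) {
                expired.add(e.getKey());
            }
        }
        expired.forEach(expiryMillis::remove);
        return expired;
    }
}
```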
Future Outlook
Planned enhancements include supporting multiple SCM groups for unlimited scalability, multi‑AZ data distribution with LRC erasure coding, NVMe‑based read caching, SSD write‑back buffers, streamlined small‑file write paths, zero‑copy data transfer, configurable seek‑read optimizations, EC pipeline balancing, and multipart upload metadata indexing.
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.