
Master Distributed Storage: HDFS, Ceph, and Swift Explained

This article introduces distributed storage concepts, outlines its five key characteristics, compares major architectures such as HDFS, Ceph, and Swift, and highlights common application scenarios like big‑data processing, cloud storage, databases, and distributed file systems.

Mike Chen's Internet Architecture

Distributed Storage

Distributed storage spreads data across multiple nodes and links those nodes logically, so that together they behave like a single virtual storage device.

As internet services have grown, distributed storage has become increasingly common. It uses the network to pool scattered storage capacity into one unified whole.

Characteristics of Distributed Storage

Distributed storage has five main characteristics:

1. High reliability

Redundant copies and data distribution ensure data integrity and availability.

2. Strong scalability

Nodes can be added or removed dynamically according to storage needs.

3. Excellent performance

Because data is spread across many nodes, reads and writes can be served in parallel, which improves throughput.

4. Data redundancy

Keeping multiple replicas of each piece of data protects against data loss when a disk or node fails.

5. Cost efficiency

Using lower‑cost servers to build a cluster makes it more cost‑effective than a single storage system.

Distributed Storage Architecture Technologies

Common implementations include HDFS, Ceph, GFS, and Swift.

1. Centralized control node architecture – HDFS

HDFS is used for large‑scale data storage in the Hadoop ecosystem. It is designed to run on large clusters of inexpensive servers, which reduces deployment costs.

In HDFS, servers take one of two roles: the NameNode, which manages metadata, and DataNodes, which store the actual data.

Typical workflow:

Client requests metadata from NameNode.

NameNode returns metadata to client.

Client reads data from, or writes data to, the appropriate DataNode.

DataNodes replicate data to meet the required replica count.

DataNodes periodically send heartbeat messages to NameNode.
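The workflow above can be sketched as a small in-memory simulation. This is only an illustration of the division of labor, not the HDFS API: the class names, the round-robin placement, the tiny block size, and the method signatures are all assumptions made for the example. Heartbeats are left out to keep it short.

```python
import itertools


class DataNode:
    """Stores block contents (a stand-in for an HDFS DataNode)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def write(self, block_id, data, downstream=()):
        # Store the block, then forward it so the next replica is created.
        self.blocks[block_id] = data
        if downstream:
            downstream[0].write(block_id, data, downstream[1:])

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds metadata only: which DataNodes store each block of a file."""

    def __init__(self, datanodes, replication=3):
        self.names = [d.name for d in datanodes]
        self.replication = replication
        self.block_map = {}  # path -> [(block_id, [datanode names])]
        self._ids = itertools.count()
        self._cursor = 0

    def allocate(self, path, num_blocks):
        # Hypothetical placement policy: plain round-robin over DataNodes.
        layout = []
        for _ in range(num_blocks):
            targets = [
                self.names[(self._cursor + i) % len(self.names)]
                for i in range(self.replication)
            ]
            self._cursor += 1
            layout.append((next(self._ids), targets))
        self.block_map[path] = layout
        return layout

    def locate(self, path):
        return self.block_map[path]


class Client:
    def __init__(self, namenode, datanodes, block_size=4):
        self.namenode = namenode
        self.nodes = {d.name: d for d in datanodes}
        self.block_size = block_size

    def write(self, path, data):
        chunks = [
            data[i:i + self.block_size]
            for i in range(0, len(data), self.block_size)
        ]
        # Ask the NameNode where to put each block (metadata only).
        layout = self.namenode.allocate(path, len(chunks))
        for chunk, (block_id, targets) in zip(chunks, layout):
            pipeline = [self.nodes[name] for name in targets]
            # Send the data to the first DataNode; it replicates onward.
            pipeline[0].write(block_id, chunk, pipeline[1:])

    def read(self, path):
        # Look up block locations, then read each block from one replica.
        return b"".join(
            self.nodes[targets[0]].read(block_id)
            for block_id, targets in self.namenode.locate(path)
        )


datanodes = [DataNode(f"dn{i}") for i in range(4)]
namenode = NameNode(datanodes, replication=3)
client = Client(namenode, datanodes)
client.write("/logs/app.log", b"hello distributed world")
print(client.read("/logs/app.log"))
```

Note that file contents never pass through the NameNode in this sketch; it only answers the question of where blocks live, which is the property the workflow describes.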

HDFS architecture diagram

2. Fully decentralized architecture – Ceph

Ceph is a popular open‑source distributed storage system offering high scalability, performance, and reliability.

Unlike HDFS, Ceph has no central metadata node that clients must consult to find their data. Instead, the client computes the data placement itself using a mapping algorithm called CRUSH.

Ceph core components

Ceph components diagram

OSD (Object Storage Daemon) processes handle physical storage; typically each disk has its own OSD process, which stores, replicates, rebalances, and recovers data.

A PG (placement group) is a logical grouping of objects. Ceph places and moves data per PG rather than per object, which keeps placement manageable at scale.

Pool defines logical partitions of objects, specifying redundancy type and replica distribution, supporting replicated and erasure‑coded strategies.
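To make the difference between the two redundancy strategies concrete, here is a rough back‑of‑the‑envelope comparison. The function names and the `k`/`m` parameters (data chunks and coding chunks) are chosen for this illustration and are not Ceph commands or API calls.

```python
def replicated(size):
    """Raw capacity used per unit of usable data with `size` full copies."""
    return {"raw_per_usable": float(size), "failures_tolerated": size - 1}


def erasure_coded(k, m):
    """Each object is split into k data chunks plus m coding chunks."""
    return {"raw_per_usable": (k + m) / k, "failures_tolerated": m}


print(replicated(3))        # three full copies
print(erasure_coded(4, 2))  # four data chunks, two coding chunks
```

Under these assumptions, three‑way replication uses 3 units of raw capacity per usable unit, while a 4+2 erasure‑coded layout uses 1.5, and both survive the loss of two copies or chunks. The trade‑off is that erasure coding costs more computation on writes and during recovery.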

Relationships:

A Pool contains many PGs.

A PG holds many objects, and each object belongs to exactly one PG.

Each PG is mapped to several OSDs, which hold its replicas.
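The object → PG → OSD chain can be sketched as two deterministic hash steps. This is a simplified stand‑in, not Ceph's actual placement algorithm: real clusters use CRUSH, which also accounts for device weights and failure domains. The function names, the SHA‑256 hash, and the ranking scheme below are assumptions made for the example.

```python
import hashlib


def _hash(*parts):
    """Deterministic 64-bit hash of the given parts (illustrative choice)."""
    key = ":".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def object_to_pg(pool_id, object_name, pg_num):
    """Step 1: every object hashes to exactly one PG in its pool."""
    return _hash(pool_id, object_name) % pg_num


def pg_to_osds(pool_id, pg, osd_ids, size=3):
    """Step 2: each PG picks `size` distinct OSDs.

    OSDs are ranked by a per-PG hash score and the top `size` are taken.
    This replaces CRUSH purely for illustration.
    """
    ranked = sorted(osd_ids, key=lambda osd: _hash(pool_id, pg, osd), reverse=True)
    return ranked[:size]


def locate(pool_id, object_name, pg_num, osd_ids, size=3):
    """Compute placement on the client side, with no lookup service."""
    pg = object_to_pg(pool_id, object_name, pg_num)
    return pg, pg_to_osds(pool_id, pg, osd_ids, size)


pg, osds = locate(pool_id=1, object_name="photo.jpg", pg_num=128, osd_ids=range(10))
print(f"PG {pg} -> OSDs {osds}")
```

Because both steps are pure functions of the object name and the cluster layout, any client holding the same layout computes the same answer, which is why no central lookup is needed on the data path.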

Ceph provides three storage types:

Block storage (RBD)

Object storage (RADOS Gateway)

File system (CephFS)

3. Fully decentralized architecture – Swift

Swift, part of the OpenStack project, is an object storage system that uses consistent hashing to locate data.
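A minimal sketch of consistent hashing follows, in its generic textbook form with virtual nodes. It is not Swift's ring implementation, which differs in detail; the class name `HashRing`, the `vnodes` parameter, and the choice of SHA‑256 are assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent-hash ring with virtual nodes (illustrative)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._positions = []  # sorted hash positions on the ring
        self._owner = {}      # position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add_node(self, node):
        # Each node claims several positions to even out the distribution.
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            if pos not in self._owner:
                bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def remove_node(self, node):
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key):
        """Return the node owning the first position clockwise from the key."""
        if not self._positions:
            raise ValueError("ring is empty")
        idx = bisect.bisect(self._positions, self._hash(key)) % len(self._positions)
        return self._owner[self._positions[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.get_node("account/container/object.txt"))
```

The useful property is that adding or removing a node only remaps the keys adjacent to that node's positions, rather than reshuffling everything as a plain `hash(key) % node_count` scheme would.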

Application Scenarios of Distributed Storage

Distributed storage is used in various scenarios:

Big data processing: Enables storage and processing of massive datasets.

Cloud storage: Powers public cloud services such as Amazon S3 and Azure Blob Storage.

Databases: Underpins distributed databases such as Cassandra and MongoDB, which need high‑performance, highly available data access.

Distributed file systems: Examples include Hadoop HDFS, which is designed for storing large files.

Network storage: Provides file and object storage over networks.
