Master Distributed Storage: HDFS, Ceph, and Swift Explained
This article introduces distributed storage concepts, outlines its five key characteristics, compares major architectures such as HDFS, Ceph, and Swift, and highlights common application scenarios like big‑data processing, cloud storage, databases, and distributed file systems.
Distributed Storage
Distributed storage spreads data across multiple nodes and links those nodes logically so that together they behave like a single virtual storage device.
As internet technology has developed, distributed storage has become increasingly common. It uses the network to pool scattered storage space into one unified whole.
Characteristics of Distributed Storage
Distributed storage has five main characteristics:
1. High reliability
Redundant copies and data distribution help preserve data integrity and keep data available when individual nodes fail.
2. Strong scalability
Nodes can be added or removed dynamically according to storage needs.
3. Excellent performance
Spreading data across nodes lets reads and writes proceed in parallel, which improves read/write performance.
4. Data redundancy
Keeping multiple replicas of each piece of data guards against data loss when a disk or node fails.
5. Cost efficiency
Building a cluster from lower‑cost servers is more cost‑effective than buying a single high‑end storage system.
Distributed Storage Architecture Technologies
Common implementations include HDFS, Ceph, GFS, and Swift.
1. Centralized control node architecture – HDFS
HDFS provides large‑scale data storage for the Hadoop ecosystem and can be deployed on large clusters of inexpensive servers, which lowers deployment costs.
In HDFS, servers take one of two roles: the NameNode, which manages metadata, and DataNodes, which store the actual data.
Typical workflow:
Client requests metadata from NameNode.
NameNode returns metadata to client.
Client reads data from, or writes data to, the appropriate DataNode.
DataNodes replicate data to meet the required replica count.
DataNodes periodically send heartbeat messages to NameNode.
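The workflow above can be sketched as a small in-memory simulation. This is a toy model, not the HDFS API: the class names mirror the HDFS roles, but every method (allocate, locate, heartbeat, live_nodes, write_block, read_block) is a hypothetical name chosen for illustration, and random replica placement is an assumption rather than HDFS's real placement policy.

```python
import random


class DataNode:
    """Toy DataNode: keeps block contents in a dict."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data


class NameNode:
    """Toy NameNode: holds metadata only, never block contents."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = {dn.name: dn for dn in datanodes}
        self.replication = replication
        self.block_locations = {}  # block_id -> list of DataNode names
        self.last_heartbeat = {}   # DataNode name -> logical timestamp

    def allocate(self, block_id):
        # Assumption: pick distinct DataNodes at random for a new block.
        chosen = random.sample(sorted(self.datanodes), self.replication)
        self.block_locations[block_id] = chosen
        return chosen

    def locate(self, block_id):
        return self.block_locations[block_id]

    def heartbeat(self, name, now):
        self.last_heartbeat[name] = now

    def live_nodes(self, now, timeout):
        # A node counts as alive if it reported within the timeout window.
        return sorted(n for n, t in self.last_heartbeat.items() if now - t <= timeout)


def write_block(namenode, block_id, data):
    # Client asks the NameNode for metadata: which DataNodes should hold the block.
    targets = namenode.allocate(block_id)
    # In real HDFS the DataNodes forward the block along a pipeline;
    # this toy loop simply copies it to each target.
    for name in targets:
        namenode.datanodes[name].store(block_id, data)
    return targets


def read_block(namenode, block_id):
    # Client asks the NameNode where the block lives, then reads one replica.
    first = namenode.locate(block_id)[0]
    return namenode.datanodes[first].blocks[block_id]


if __name__ == "__main__":
    nodes = [DataNode(f"dn{i}") for i in range(5)]
    nn = NameNode(nodes, replication=3)
    print(write_block(nn, "blk_1", b"hello"))
    print(read_block(nn, "blk_1"))
```

The point of the sketch is the division of labour: the NameNode object never touches block contents, and the client talks to DataNodes directly once it has the metadata.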
2. Fully decentralized architecture – Ceph
Ceph is a popular open‑source distributed storage system offering high scalability, performance, and reliability.
Unlike HDFS, Ceph has no central metadata node in the data path; the client calculates where data lives itself, using Ceph's CRUSH mapping algorithm.
Ceph core components
OSD processes handle physical storage; typically each disk runs one OSD process that stores, replicates, rebalances, and recovers data.
PG (placement group) is a logical grouping of objects; Ceph places and replicates data per PG rather than per object.
Pool defines logical partitions of objects, specifying redundancy type and replica distribution, supporting replicated and erasure‑coded strategies.
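To make the difference between the two redundancy strategies concrete, here is a short sketch of the raw capacity each one consumes. The function names are hypothetical, and k and m (data chunks and coding chunks) are illustrative parameters, not values taken from any particular pool.

```python
def replicated_raw_bytes(logical_bytes, replicas):
    """Raw capacity used when every object is stored `replicas` times."""
    return logical_bytes * replicas


def erasure_coded_raw_bytes(logical_bytes, k, m):
    """Raw capacity used when data is split into k data chunks plus m coding chunks.

    Each chunk is 1/k of the object, and k + m chunks are stored in total.
    """
    return logical_bytes * (k + m) / k


if __name__ == "__main__":
    logical = 100  # e.g. 100 GB of user data
    print(replicated_raw_bytes(logical, replicas=3))
    print(erasure_coded_raw_bytes(logical, k=4, m=2))
```

With 3 replicas, 100 GB of user data occupies 300 GB of raw capacity; with k=4 and m=2 it occupies 150 GB while still surviving the loss of any two chunks. The trade-off is extra computation on writes and recovery.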
Relationships:
A Pool contains many PGs.
A PG holds many objects; each object belongs to exactly one PG.
PGs are distributed across OSDs, providing replica placement.
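The relationships above amount to a two-step lookup: object to PG, then PG to OSDs. The sketch below illustrates that shape only. It hashes the object name to pick a PG, then ranks OSDs by a per-PG hash (rendezvous hashing) as a simple stand-in for Ceph's real placement algorithm, CRUSH, which also accounts for device weights and failure domains. All function names are hypothetical.

```python
import hashlib


def stable_hash(text):
    """Deterministic 64-bit hash (Python's built-in hash() varies per process)."""
    return int.from_bytes(hashlib.md5(text.encode()).digest()[:8], "big")


def object_to_pg(object_name, pg_num):
    # Step 1: each object maps to exactly one PG in its pool.
    return stable_hash(object_name) % pg_num


def pg_to_osds(pool, pg_id, osds, replicas):
    # Step 2: each PG maps to an ordered set of distinct OSDs.
    # Stand-in for CRUSH: rank OSDs by a hash of (pool, pg, osd).
    ranked = sorted(osds, key=lambda osd: stable_hash(f"{pool}.{pg_id}.{osd}"), reverse=True)
    return ranked[:replicas]


def locate(pool, object_name, pg_num, osds, replicas=3):
    pg_id = object_to_pg(object_name, pg_num)
    return pg_id, pg_to_osds(pool, pg_id, osds, replicas)


if __name__ == "__main__":
    osds = [f"osd.{i}" for i in range(8)]
    print(locate("images", "photo-001.jpg", pg_num=128, osds=osds))
```

Because every step is a pure function of the object name and the cluster layout, any client holding the same layout computes the same answer without asking a central node.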
Ceph provides three storage types:
Block storage (RBD)
Object storage (RADOS Gateway)
File system (CephFS)
3. Fully decentralized architecture – Swift
Swift, part of the OpenStack project, is an object storage system. Like Ceph, it has no central metadata node: it uses consistent hashing to work out where each object is stored.
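A minimal sketch of generic consistent hashing shows why this approach suits a cluster whose membership changes. This is not Swift's actual ring implementation, which layers partitions, replicas, and zones on top of the idea; the HashRing class, its methods, and the vnodes parameter are all hypothetical names for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add_node(self, node):
        # Each node gets several positions so load spreads more evenly.
        for i in range(self.vnodes):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key):
        if not self._points:
            raise ValueError("ring is empty")
        # Walk clockwise to the first position after the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.get_node("account/container/object-1"))
```

The useful property is that adding a node only moves keys onto the new node, and removing a node only moves the keys it owned; everything else stays where it was.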
Application Scenarios of Distributed Storage
Distributed storage is used in various scenarios:
Big data processing: Enables storage and processing of massive datasets.
Cloud storage: Powers public cloud services such as Amazon S3 and Azure Blob Storage.
Databases: Underpins distributed databases such as Cassandra and MongoDB, which need high performance and high availability.
Distributed file systems: Examples include Hadoop HDFS for storing large files.
Network storage: Provides file and object storage over networks.
Mike Chen's Internet Architecture
Over ten years of BAT architecture experience, shared generously!
