Alluxio: Memory‑Centric Distributed File System for Big Data Storage and Compute

Alluxio, formerly Tachyon, is a memory‑centric distributed file system that unifies heterogeneous big‑data storage backends, optimizes small files, and provides a fast, unified data access layer between storage systems like S3 or HDFS and compute frameworks such as Spark or Hadoop.

Architects' Tech Alliance

Apr 4, 2017

Alluxio, formerly known as Tachyon, is a memory‑centric distributed file system. Its technical strengths are heterogeneous management of backend big‑data file storage, small‑file optimization, and a unified data storage service for big‑data compute frameworks and platforms. Designed around memory, Alluxio sits between storage systems such as Amazon S3, Apache HDFS, or OpenStack Swift and compute frameworks such as Apache Spark or Hadoop MapReduce, acting as middleware between the underlying distributed file systems and the upper‑level distributed compute frameworks.

For upper‑level applications, Alluxio is a middle layer that manages data access and provides fast storage. For the underlying storage, Alluxio decouples big‑data workloads from specific storage systems, hides storage heterogeneity, and serves data mainly as files held in memory or on other storage media. Supported backends include GCS, S3, Swift, GlusterFS, HDFS, MapR‑FS, secure HDFS, Alibaba OSS, and NFS.
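The idea of hiding heterogeneous backends behind one namespace can be sketched as a small mount table. This is a toy model, not Alluxio's API: the `MountTable` class, its methods, and the example URIs (`namenode:8020`, `example-bucket`) are all hypothetical names chosen for illustration.

```python
from typing import Dict


class MountTable:
    """Toy mapping from paths in one unified namespace to backend storage URIs.

    Hypothetical illustration only; this is not Alluxio's real interface.
    """

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, namespace_path: str, backend_uri: str) -> None:
        # Normalize so that "/logs/" and "/logs" are the same mount point.
        key = namespace_path.rstrip("/") or "/"
        self._mounts[key] = backend_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        """Translate a namespace path to a backend URI via longest-prefix match."""
        best = None
        for prefix in self._mounts:
            boundary = prefix.rstrip("/") + "/"
            if path == prefix or path.startswith(boundary):
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is None:
            raise KeyError(f"no backend mounted for {path}")
        suffix = path if best == "/" else path[len(best):]
        return self._mounts[best] + suffix


table = MountTable()
table.mount("/", "hdfs://namenode:8020/data")    # assumed HDFS location
table.mount("/logs", "s3://example-bucket/logs")  # assumed S3 bucket
print(table.resolve("/tables/users.parquet"))
print(table.resolve("/logs/2017/04/app.log"))
```

An application only ever sees paths such as `/logs/2017/04/app.log`; which backend actually holds the bytes is a detail of the mount table.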

Alluxio Application Scenarios

Typically, in the big‑data domain, the lowest layer consists of distributed storage systems such as Amazon S3 or Apache HDFS, while the layer above consists of distributed compute frameworks and applications such as Spark, MapReduce, HBase, and Flink. These frameworks often read and write data directly against the distributed file system, which is inefficient and adds significant performance overhead.

Alluxio sits between traditional big‑data storage (e.g., Amazon S3, Apache HDFS, OpenStack Swift) and big‑data compute frameworks (e.g., Spark, Hadoop MapReduce). Its memory‑centric optimizations can speed up big‑data applications by an order of magnitude, and its generic data access interface makes it easy to switch the underlying distributed file system.
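A minimal sketch of why a memory layer between compute and storage helps, and why a generic interface makes backends swappable. It assumes a simple read‑through cache; the `UnderStorage` and `MemoryLayer` classes are hypothetical and say nothing about how Alluxio implements caching internally.

```python
from typing import Dict


class UnderStorage:
    """Stand-in for a slow backend (for example HDFS or S3); counts its reads."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = dict(files)
        self.reads = 0

    def read(self, path: str) -> bytes:
        self.reads += 1  # each call models one expensive remote read
        return self._files[path]


class MemoryLayer:
    """Read-through cache: the first read hits the backend, later reads hit memory."""

    def __init__(self, under: UnderStorage) -> None:
        self._under = under
        self._memory: Dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        if path not in self._memory:
            self._memory[path] = self._under.read(path)
        return self._memory[path]


backend = UnderStorage({"/data/part-0": b"rows..."})
layer = MemoryLayer(backend)
for _ in range(3):
    layer.read("/data/part-0")
print(backend.reads)  # the backend was contacted once for three reads
```

Because the compute side only calls `layer.read(path)`, replacing `backend` with another object that exposes the same `read` method requires no change to the calling code.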

Alluxio Components

Logically, Alluxio is composed of a master, multiple workers, and clients. The master and workers cooperate to provide the service and are maintained by an administrator, while clients, typically big‑data applications such as Spark or MapReduce tasks, initiate data access. In practice, users usually interact only with the client, which offers a unified file access interface.

Alluxio System Architecture

As with other big‑data systems such as HDFS, HBase, and Spark, Alluxio’s master can be deployed as a single node or in HA mode with two masters. The master manages global file system metadata (e.g., the file system tree), and clients contact the master to obtain that metadata. Each worker manages the local storage resources on its own node, including memory, SSD, and HDD.
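One way to picture a worker managing memory, SSD, and HDD is as an ordered set of tiers, where data pushed out of a faster tier moves to the next slower one. The sketch below assumes a least‑recently‑written eviction rule and block‑count capacities purely for illustration; the `TieredWorkerStore` class and the tier names are hypothetical, and the article does not describe Alluxio's actual eviction policy.

```python
from collections import OrderedDict
from typing import List, Optional, Tuple


class TieredWorkerStore:
    """Toy worker storage with tiers ordered fastest first."""

    def __init__(self, capacities: List[Tuple[str, int]]) -> None:
        # Each tier is (name, max_blocks, blocks kept in insertion order).
        self._tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def put(self, block_id: str, data: bytes, level: int = 0) -> None:
        if level >= len(self._tiers):
            return  # pushed out of the slowest tier: dropped from this worker
        # Remove any stale copy so a block lives in exactly one tier.
        for _, _, blocks in self._tiers:
            blocks.pop(block_id, None)
        _, cap, blocks = self._tiers[level]
        blocks[block_id] = data
        if len(blocks) > cap:
            # Assumed policy: demote the oldest block to the next tier.
            victim_id, victim = blocks.popitem(last=False)
            self.put(victim_id, victim, level + 1)

    def locate(self, block_id: str) -> Optional[str]:
        for name, _, blocks in self._tiers:
            if block_id in blocks:
                return name
        return None


store = TieredWorkerStore([("MEM", 2), ("SSD", 2), ("HDD", 2)])
for i in range(1, 6):
    store.put(f"b{i}", b"x")
print({f"b{i}": store.locate(f"b{i}") for i in range(1, 6)})
```

The newest blocks stay in memory while older ones settle into slower media, so hot data is served from the fastest tier available.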

When an application such as HBase or Spark needs to access Alluxio, its client first communicates with the master and then with the appropriate worker to perform the actual file operations. All workers periodically send heartbeats to the master so that the master's view of them stays current and they remain registered to serve requests. As in many other distributed systems, the master only replies to requests and never initiates communication, which reduces its workload.
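The request flow described above can be sketched as three small classes. This is a simplified model under stated assumptions: the class names, the 10‑second heartbeat timeout, and the rule that a worker counts as unavailable once its heartbeat is stale are all hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from typing import Dict, List, Optional


class Master:
    """Holds metadata only; it answers calls and never contacts workers itself."""

    def __init__(self, heartbeat_timeout: float = 10.0) -> None:
        self._locations: Dict[str, str] = {}        # file path -> worker id
        self._last_heartbeat: Dict[str, float] = {}  # worker id -> timestamp
        self._timeout = heartbeat_timeout            # assumed value

    def heartbeat(self, worker_id: str, paths: List[str], now: float) -> None:
        self._last_heartbeat[worker_id] = now
        for path in paths:
            self._locations[path] = worker_id

    def locate(self, path: str, now: float) -> Optional[str]:
        worker_id = self._locations.get(path)
        if worker_id is None:
            return None
        if now - self._last_heartbeat[worker_id] > self._timeout:
            return None  # stale heartbeat: treat the worker as unavailable
        return worker_id


class Worker:
    def __init__(self, worker_id: str, master: Master) -> None:
        self.worker_id = worker_id
        self._master = master
        self._files: Dict[str, bytes] = {}

    def store(self, path: str, data: bytes) -> None:
        self._files[path] = data

    def send_heartbeat(self, now: float) -> None:
        # The worker initiates the call and reports which files it holds.
        self._master.heartbeat(self.worker_id, list(self._files), now)

    def read(self, path: str) -> bytes:
        return self._files[path]


class Client:
    def __init__(self, master: Master, workers: Dict[str, Worker]) -> None:
        self._master = master
        self._workers = workers

    def read(self, path: str, now: float) -> bytes:
        worker_id = self._master.locate(path, now)  # step 1: ask the master
        if worker_id is None:
            raise FileNotFoundError(path)
        return self._workers[worker_id].read(path)  # step 2: read from the worker


master = Master()
worker = Worker("worker-1", master)
worker.store("/data/a", b"hello")
worker.send_heartbeat(now=0.0)
client = Client(master, {"worker-1": worker})
print(client.read("/data/a", now=5.0))
```

Note that `Master` holds no reference to any `Worker`: every interaction is started by a worker or a client, which mirrors the reply‑only role described above.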

Alluxio Ecosystem

Many big‑data vendors integrate Alluxio to connect NAS devices to the Hadoop ecosystem and to offer open big‑data solutions. Dell EMC, for example, has partnered with Alluxio on its ECS product to jointly deliver storage performance optimization for big‑data applications. Huawei, HDS, HPE, NetApp, and others have similar collaborations. With Alluxio as the middle layer, compute frameworks such as Hadoop, Spark, Storm, and Samza can access any backend storage, such as AWS S3, HDFS, Ceph, Isilon, or Gluster, thereby decoupling compute from storage.
