
How JindoFS Accelerates Data Lakes: Deep Dive into Storage‑Compute Optimization

This article explains why data lake acceleration is essential, outlines the three key architectural decisions for big‑data architects, and details Alibaba Cloud's JindoFS solutions—including basic adaptation, cache acceleration, and deep‑customization modes—to boost performance and reliability of lake storage and compute.

Alibaba Cloud Developer

Sep 15, 2020


Lake acceleration is a middleware layer that adapts, optimizes, and caches data-lake storage so that different compute engines can use it in a uniform way. This article explains why the layer is needed and shares Alibaba Cloud's practice and technical solutions.

In the open‑source big‑data world, storage‑compute separation has become standard, making the data‑lake architecture the primary choice. Architects must consider three questions:

1. What storage system should be used as lake storage?

2. How should compute be accelerated and optimized after storage-compute separation (lake acceleration)?

3. Which compute engine should be chosen for each scenario (lake compute)?

Lake acceleration refers to the intermediate layer that provides adaptation, optimization, and caching for data-lake storage. Early solutions in this space include Alluxio, Hadoop's S3Guard (for the S3A connector), AWS EMRFS, Snowflake's SSD cache, Databricks DBIO/DBFS, and Alibaba Cloud EMR JindoFS.

Why is lake acceleration needed? It can be viewed in three layers:

Basic version – adapt object storage.

Standard version – provide caching.

High‑end version – deep customization.

JindoFS covers all three layers, delivering a complete lake‑acceleration solution.

Basic version: Adapting object storage

Hadoop and cloud infrastructure such as AWS EC2/S3 originally developed in separate worlds. When EMR services emerged, running Hadoop compute on S3 or OSS became a real challenge. Hadoop ecosystem tools such as Hive and Spark were built against HDFS through the Hadoop Compatible File System (HCFS) interface, so an object store has to be adapted to that interface before they can use it. JindoFS provides this adaptation for Alibaba Cloud OSS and adds performance optimizations.
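To make the adaptation problem concrete, here is a minimal sketch of how a directory listing can be emulated on top of a flat key namespace. It is an illustration only, not JindoFS code: FlatObjectStore and list_status are hypothetical names, and the in-memory dictionary stands in for a real object store.

```python
from typing import Dict, List


class FlatObjectStore:
    """Toy in-memory stand-in for an object store: a flat key -> bytes map."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def list_keys(self, prefix: str) -> List[str]:
        # Object stores list by key prefix; there are no real directories.
        return sorted(k for k in self.objects if k.startswith(prefix))


def list_status(store: FlatObjectStore, directory: str) -> List[str]:
    """Emulate a file-system directory listing on top of flat keys.

    Returns the immediate children of `directory`. Sub-directories are
    inferred from the "/" delimiter and reported with a trailing slash.
    """
    prefix = directory.rstrip("/") + "/"
    children = set()
    for key in store.list_keys(prefix):
        rest = key[len(prefix):]
        if not rest:
            continue  # skip a directory marker object, if one exists
        head, sep, _ = rest.partition("/")
        children.add(head + sep)  # "name" for files, "name/" for directories
    return sorted(children)


store = FlatObjectStore()
store.put("warehouse/t1/part-0.parquet", b"a")
store.put("warehouse/t1/part-1.parquet", b"b")
store.put("warehouse/t2/dt=2020-09-15/part-0.parquet", b"c")
store.put("warehouse/_SUCCESS", b"")

print(list_status(store, "warehouse"))     # ['_SUCCESS', 't1/', 't2/']
print(list_status(store, "warehouse/t1"))  # ['part-0.parquet', 'part-1.parquet']
```

Note that the naive version scans every key under the prefix just to find the immediate children, which hints at why listing very large directories needs dedicated optimization.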

Key challenges:

1. Massive scale – OSS offers virtually unlimited storage, leading to directories with tens of millions of files and many small files. JindoFS optimizes listing and du/count operations, achieving up to 2× faster listing and 21% faster du/count on huge directories.

2. File‑object mapping – Object storage uses a flat key namespace, so directory operations like rename require copy‑and‑delete, which is costly. Leveraging OSS fast‑copy and high concurrency, JindoFS makes rename on million‑file directories up to 3× faster than community solutions.

3. Consistency – OSS provides strong consistency, so JindoFS does not have to work around the eventual-consistency issues that S3 exhibited at the time of writing.

4. Atomicity – Directory operations are not atomic in object storage; JindoFS adds retry and rollback mechanisms to improve reliability.

5. Breaking limits – By exploiting OSS’s Concurrent MultiPart Upload (CMPU), JindoFS implements a job‑committer that avoids rename operations and supports exactly‑once semantics for frameworks like Flink.
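Challenges 2 and 5 can be contrasted with a small model. The sketch below is a simplified illustration under stated assumptions, not the JindoFS committer: ToyStore and its method names are hypothetical, and the multipart calls only mimic the general shape of a multipart-upload API (start, upload parts, complete). It shows that a rename-based commit copies every object, while a multipart-based commit uploads directly to the final key and makes the data visible only when the upload is completed.

```python
from typing import Dict, List, Tuple


class ToyStore:
    """In-memory model of an object store with a multipart-like upload API."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}  # objects visible to readers
        self.pending: Dict[str, Tuple[str, List[bytes]]] = {}  # id -> (key, parts)
        self.copies = 0  # number of server-side copies performed
        self._next_id = 0

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def copy(self, src: str, dst: str) -> None:
        self.copies += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key: str) -> None:
        del self.objects[key]

    def start_upload(self, key: str) -> str:
        upload_id = f"upload-{self._next_id}"
        self._next_id += 1
        self.pending[upload_id] = (key, [])
        return upload_id

    def upload_part(self, upload_id: str, data: bytes) -> None:
        self.pending[upload_id][1].append(data)

    def complete_upload(self, upload_id: str) -> None:
        # Only now does the object become visible under its final key.
        key, parts = self.pending.pop(upload_id)
        self.objects[key] = b"".join(parts)


def rename_dir(store: ToyStore, src: str, dst: str) -> None:
    """Directory rename on flat keys: one copy plus one delete per object."""
    src, dst = src.rstrip("/") + "/", dst.rstrip("/") + "/"
    for key in [k for k in store.objects if k.startswith(src)]:
        store.copy(key, dst + key[len(src):])
        store.delete(key)


# Rename-based commit: tasks write to a temporary directory, the job renames it.
a = ToyStore()
for i in range(3):
    a.put(f"out/_temporary/part-{i}", b"data")
rename_dir(a, "out/_temporary", "out")
print(sorted(a.objects), "copies:", a.copies)

# Multipart-based commit: tasks upload parts to the final key, the job completes them.
b = ToyStore()
uploads = []
for i in range(3):
    uid = b.start_upload(f"out/part-{i}")
    b.upload_part(uid, b"data")
    uploads.append(uid)
visible_before_commit = len(b.objects)  # nothing is visible yet
for uid in uploads:
    b.complete_upload(uid)
print(sorted(b.objects), "copies:", b.copies)
```

Both paths end with the same three output objects; the difference is that the second never copies data and exposes no partial output before the job commits. A real committer would also need to abort pending uploads when a job fails, which this sketch omits.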

Standard version: Cache acceleration

Storage-compute separation decouples storage capacity from compute elasticity, but it introduces bandwidth bottlenecks when many compute workloads (ETL, interactive analytics, ML training) access the same OSS bucket concurrently. The industry's answer is compute-side caching. JindoFS provides a distributed cache on HDD, SSD, and memory. The cache is transparent to applications and is enabled with a simple configuration switch.
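The idea behind compute-side caching can be sketched as a read-through block cache with LRU eviction. This is a single-process illustration under assumptions, not how JindoFS implements its distributed cache: BlockCache, fetch_from_remote, the REMOTE dictionary, and the tiny BLOCK_SIZE are all hypothetical and chosen for readability.

```python
from collections import OrderedDict
from typing import Callable

BLOCK_SIZE = 4  # bytes; unrealistically small so the example is easy to follow


class BlockCache:
    """Read-through cache of fixed-size blocks with LRU eviction."""

    def __init__(self, fetch: Callable[[str, int], bytes], capacity: int) -> None:
        self.fetch = fetch          # called on a miss to read from remote storage
        self.capacity = capacity    # maximum number of cached blocks
        self.blocks = OrderedDict() # (key, block index) -> bytes, oldest first
        self.hits = 0
        self.misses = 0

    def read_block(self, key: str, index: int) -> bytes:
        cache_key = (key, index)
        if cache_key in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(cache_key)  # mark as most recently used
            return self.blocks[cache_key]
        self.misses += 1
        data = self.fetch(key, index)
        self.blocks[cache_key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict the least recently used
        return data

    def read(self, key: str, offset: int, length: int) -> bytes:
        """Serve a byte range by assembling the blocks that cover it."""
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        data = b"".join(self.read_block(key, i) for i in range(first, last + 1))
        start = offset - first * BLOCK_SIZE
        return data[start:start + length]


# Hypothetical remote object standing in for a file in an OSS bucket.
REMOTE = {"bucket/table/part-0": b"0123456789abcdef"}


def fetch_from_remote(key: str, index: int) -> bytes:
    start = index * BLOCK_SIZE
    return REMOTE[key][start:start + BLOCK_SIZE]


cache = BlockCache(fetch_from_remote, capacity=2)
first_read = cache.read("bucket/table/part-0", 2, 5)   # covers blocks 0 and 1
second_read = cache.read("bucket/table/part-0", 2, 5)  # served from the cache
print(first_read, second_read, cache.hits, cache.misses)  # b'23456' b'23456' 2 2
```

The application only calls read; whether the bytes come from the cache or from remote storage is invisible to it, which is the sense in which such a cache is transparent.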

Reported performance gains include SparkSQL 27% faster, Presto up to 93% faster, and Hive ETL 42% faster on the TPC-DS 1 TB benchmark, as well as TensorFlow training 40% faster through JindoFuse on an SSD cache.

High‑end version: Deep customization and metadata management

For scenarios that require full HDFS-like semantics (rename, atomic directory operations, and advanced features such as truncate, append, snapshots, and extended attributes) or that involve migration from on-premises HDFS, JindoFS offers a block mode. Files are stored as blocks in OSS. Metadata is persisted asynchronously to Alibaba Cloud OTS and cached locally in RocksDB with an LRU policy, which supports billions of files. A Raft-based metadata service provides high availability and multi-namespace support. In benchmarks, JindoFS achieves 60% higher IOPS than HDFS, 130% faster listing on massive directories, and 33% higher read throughput.
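One way to see why a separate metadata layer restores HDFS-like semantics is that a rename becomes a change to the namespace rather than to the data. The sketch below is an assumption-laden illustration: the article does not describe JindoFS's internal structures, so Inode, Namespace, and the block IDs are hypothetical. It models files as lists of block IDs and shows a directory rename that re-links a single entry without touching any block.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Inode:
    is_dir: bool
    children: Dict[str, "Inode"] = field(default_factory=dict)  # for directories
    block_ids: List[str] = field(default_factory=list)          # for files


class Namespace:
    """Toy metadata tree: paths map to inodes, files map to block IDs."""

    def __init__(self) -> None:
        self.root = Inode(is_dir=True)

    def _walk(self, path: str, create: bool = False) -> Inode:
        node = self.root
        for name in [p for p in path.split("/") if p]:
            if name not in node.children:
                if not create:
                    raise FileNotFoundError(path)
                node.children[name] = Inode(is_dir=True)
            node = node.children[name]
        return node

    def create_file(self, path: str, block_ids: List[str]) -> None:
        parent_path, _, name = path.rstrip("/").rpartition("/")
        parent = self._walk(parent_path, create=True)
        parent.children[name] = Inode(is_dir=False, block_ids=list(block_ids))

    def rename(self, src: str, dst: str) -> None:
        """Move one directory entry; no block is copied or rewritten."""
        src_parent_path, _, src_name = src.rstrip("/").rpartition("/")
        dst_parent_path, _, dst_name = dst.rstrip("/").rpartition("/")
        src_parent = self._walk(src_parent_path)
        if src_name not in src_parent.children:
            raise FileNotFoundError(src)
        dst_parent = self._walk(dst_parent_path, create=True)
        if dst_name in dst_parent.children:
            raise FileExistsError(dst)
        dst_parent.children[dst_name] = src_parent.children.pop(src_name)

    def blocks(self, path: str) -> List[str]:
        return self._walk(path).block_ids


ns = Namespace()
ns.create_file("/warehouse/tmp/part-0", ["blk-1", "blk-2"])
ns.create_file("/warehouse/tmp/part-1", ["blk-3"])
ns.rename("/warehouse/tmp", "/warehouse/final")
print(ns.blocks("/warehouse/final/part-0"))  # ['blk-1', 'blk-2']
```

In this model the cost of renaming a directory does not depend on how many files it contains, in contrast to the copy-and-delete approach that a flat key namespace requires.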

Block mode also enables integration with engines beyond Hadoop/Spark/Presto, such as HBase, Kafka, Kudu, ClickHouse, allowing them to benefit from storage‑compute separation while retaining high performance.

Summary

Upgrading big‑data platforms to a data‑lake architecture—comprising lake storage, lake acceleration, and lake analytics—is a clear industry trend. Alibaba Cloud’s JindoFS delivers multiple acceleration solutions (basic adaptation, cache acceleration, deep customization) integrated with Data Lake Formation, EMR, and DataWorks, providing a comprehensive, high‑performance data‑lake ecosystem.
