Data Lake Storage Architecture Selection and JindoFS on Alibaba Cloud

The article explains the concept and advantages of data lakes, outlines the major storage and acceleration challenges they face, provides a checklist for ideal data‑lake solutions, and details how Alibaba Cloud's JindoFS addresses those challenges with object‑storage‑based, high‑performance, scalable features.

Big Data Technology & Architecture

Apr 25, 2021

The author, Zheng Kai (aka TieJie), a senior technical expert at Alibaba and a member of the Apache Hadoop PMC, opens by introducing his background in distributed systems and big-data platform development on Alibaba Cloud.

The talk, originally presented at the 2020 Big Data + AI Meetup in Shanghai, defines a data lake as a unified store for all enterprise data (structured, semi-structured, and multimedia) that lets BI and AI analytics run directly on the raw data.

Key benefits of a data lake are highlighted: breaking data silos, supporting diverse compute workloads, providing elasticity for cost‑effective scaling, and offering centralized management and governance.

Significant challenges are then discussed, such as massive data volumes (PB/EB scale), huge and deep directory structures, high storage costs, and the inherent separation of storage and compute that demands high‑throughput, low‑latency access.
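To make the directory-structure challenge concrete: an object store typically has no real directories, only keys that share a prefix, so renaming a "directory" generally means copying and then deleting every object under it. The sketch below is a toy in-memory model of that behaviour. It is not the OSS or JindoFS API; the `store` dictionary and the `rename_prefix` helper are hypothetical names introduced only for illustration.

```python
from typing import Dict


def rename_prefix(store: Dict[str, bytes], src: str, dst: str) -> int:
    """Rename a 'directory' in a flat key namespace.

    Returns the number of objects that had to be copied and deleted,
    which grows linearly with the size of the directory.
    """
    src = src.rstrip("/") + "/"
    dst = dst.rstrip("/") + "/"
    # Snapshot the matching keys first so we do not mutate while iterating.
    keys = [key for key in store if key.startswith(src)]
    for key in keys:
        store[dst + key[len(src):]] = store[key]  # copy to the new key
        del store[key]                            # delete the old key
    return len(keys)


# Hypothetical bucket contents: a job's temporary output plus unrelated data.
store = {
    "warehouse/tmp/part-0": b"a",
    "warehouse/tmp/part-1": b"b",
    "warehouse/final/old": b"c",
}
moved = rename_prefix(store, "warehouse/tmp", "warehouse/out")
```

In this model the cost of a rename is one copy and one delete per object, so committing a job that wrote millions of files by renaming its output directory becomes expensive. A file system that keeps directory metadata separately can instead update a single entry.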

A checklist of ideal data‑lake storage and acceleration capabilities is presented, covering large‑scale object storage, efficient large‑directory metadata operations, flexible caching acceleration, tight compute integration, support for modern table formats, archiving/compression/security, comprehensive big‑data + AI ecosystem compatibility, and robust migration tools.

The article then introduces Alibaba Cloud's JindoFS, describing three main components: (1) an optimized OSS‑based SDK for Hadoop, Spark, and AI workloads focusing on metadata and rename optimizations, IO performance, and versioning; (2) a distributed caching system ensuring metadata and data consistency, disk caching, load balancing, and LRU eviction; (3) an OSS‑backed storage extension providing metadata management, fine‑grained locking, data chunking, backup, and high horizontal scalability.
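The LRU eviction mentioned for the caching component can be illustrated with a minimal sketch. This is a generic, single-process read-through cache with a byte budget, not JindoFS's implementation; the class name `LruBlockCache`, the `fetch` callback standing in for a remote object-store read, and the block identifiers are all assumptions made for the example.

```python
from collections import OrderedDict
from typing import Callable


class LruBlockCache:
    """Read-through block cache that evicts the least recently used block."""

    def __init__(self, capacity_bytes: int, fetch: Callable[[str], bytes]):
        self.capacity_bytes = capacity_bytes
        self.fetch = fetch  # hypothetical stand-in for a remote read
        self.blocks: "OrderedDict[str, bytes]" = OrderedDict()
        self.used_bytes = 0
        self.hits = 0
        self.misses = 0

    def read(self, block_id: str) -> bytes:
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return self.blocks[block_id]
        self.misses += 1
        data = self.fetch(block_id)
        self._insert(block_id, data)
        return data

    def _insert(self, block_id: str, data: bytes) -> None:
        if len(data) > self.capacity_bytes:
            return  # block larger than the whole cache: serve it uncached
        # Evict from the least recently used end until the new block fits.
        while self.used_bytes + len(data) > self.capacity_bytes:
            _, evicted = self.blocks.popitem(last=False)
            self.used_bytes -= len(evicted)
        self.blocks[block_id] = data
        self.used_bytes += len(data)


# Usage: three 4-byte remote blocks and an 8-byte local cache.
remote = {"a": b"x" * 4, "b": b"y" * 4, "c": b"z" * 4}
cache = LruBlockCache(capacity_bytes=8, fetch=remote.__getitem__)
for block_id in ["a", "b", "a", "c"]:
    cache.read(block_id)
```

After the four reads, block "b" has been evicted because "a" was touched more recently, leaving "a" and "c" cached. A distributed cache such as the one described would additionally need to keep cached blocks consistent with the backing store and balance load across nodes, which this sketch does not attempt.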

Mapping JindoFS to the checklist shows support for massive object‑storage capacity, superior large‑directory operations, >50% performance gains from flexible caching, compute‑aware optimizations via JindoTable, compatibility with Delta, Hudi, and Iceberg table formats, archiving/compression/security features, full big‑data + AI ecosystem integration, and partial migration capabilities through an optimized JindoDistCp tool.

Overall, the article serves as a practical guide for architects evaluating data‑lake solutions and demonstrates how JindoFS meets the outlined criteria on Alibaba Cloud.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
