How JindoData Transforms Data Lake Storage for the Big Data Era

This article reviews Sun DaPeng's presentation on Alibaba Cloud's open‑source big data platform, covering the rapid growth of data, the evolution of storage architectures from HDFS to cloud‑native data lakes, and the detailed JindoData solution—including JindoFS, JindoFSx, and JindoSDK—that delivers high‑performance, cost‑effective storage for modern analytics workloads.

Alibaba Cloud Big Data AI Platform

Sep 27, 2022

Abstract: This article compiles Sun DaPeng’s talk at the Alibaba Cloud Data Lake technical session on July 17, focusing on background information and the JindoData data lake storage solution.

Background Introduction

The big data industry is booming, driven by advances in communications technology. Global data volume is projected to reach 163 ZB by 2025, an average of 27 TB per person. This growth calls for larger‑scale, higher‑value storage and is driving business transformation.

Trends in big data technology include cloudification, lightweight services, and real‑time computing. The separation of storage and compute, an idea proposed in the early days of cloud computing, reduces costs and takes advantage of high‑bandwidth networks.

Industry Storage Architecture Evolution

Early data warehouses were built on HDFS, which supports up to 100 PB and hundreds of millions of files but carries high operational costs and runs into scaling limits.

HDFS Federation extended metadata and capacity horizontally by allowing multiple NameNodes, yet operational complexity grew.
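As a rough illustration of how federation splits one namespace across several NameNodes, the sketch below routes a path to a NameNode by longest-prefix match against a mount table. It is a toy model, not HDFS code, and the mount points and NameNode names are hypothetical.

```python
# Hypothetical mount table: path prefix -> NameNode that owns that subtree.
MOUNT_TABLE = {
    "/": "nn-default",
    "/warehouse": "nn-warehouse",
    "/warehouse/logs": "nn-logs",
    "/tmp": "nn-tmp",
}


def resolve_namenode(path: str, mount_table: dict[str, str] = MOUNT_TABLE) -> str:
    """Return the NameNode responsible for `path` using longest-prefix match."""
    best_prefix = None
    for prefix in mount_table:
        # A prefix matches if it is the root, equals the path,
        # or is a parent directory of the path.
        is_match = prefix == "/" or path == prefix or path.startswith(prefix + "/")
        if is_match and (best_prefix is None or len(prefix) > len(best_prefix)):
            best_prefix = prefix
    if best_prefix is None:
        raise KeyError(f"no mount point covers {path}")
    return mount_table[best_prefix]
```

Every entry in such a table is another NameNode to deploy, monitor, and rebalance, which is where the added operational complexity comes from.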

To address these issues, the industry shifted to object storage‑based data lakes, leveraging the high availability and throughput of cloud object stores.

From Integrated Storage to Cloud‑Native Data Lakes

Open‑source big data platforms have evolved through several generations:

Generation 1 (EMR): Migrated Hadoop to the cloud using ECS local disks, simplifying deployment but not solving HDFS operational challenges.

Generation 2 (EMR): Integrated OSS/S3 object storage as storage connectors, enabling hot and cold data tiering.

Generation 3: Moved all metadata to the cloud (DLF replaces Hive Metastore), using OSS for storage and achieving a largely stateless cluster.

Data Lake Storage Evolution Roadmap

Data Lake 1.0: Compute‑storage separation, hot/cold layers, Hadoop‑centric.

Data Lake 2.0: Object‑storage‑centric, unified storage for large‑scale, high‑performance workloads.

Data Lake 3.0: Object‑storage‑centric with full compatibility, multi‑protocol support, and unified metadata.

JindoData Data Lake Storage Solution

JindoData builds on OSS to provide a high‑performance storage system. JindoFS adds file‑metadata and directory capabilities, enabling operations like rename that are difficult on pure object storage. JindoFSx accelerates bandwidth‑limited workloads.
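To see why rename is hard on a pure object store, consider the toy model below. In a flat key space, renaming a "directory" means rewriting every key under the prefix, while a separate directory tree can do it with a single pointer move. This is an illustration only, not how JindoFS is implemented, and all class and method names are made up.

```python
class FlatObjectStore:
    """Toy object store: a flat mapping from full key to data."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def rename_prefix(self, src, dst):
        """Emulate a directory rename by moving every object under `src`.

        Returns the number of objects touched.
        """
        keys = [k for k in self.objects if k.startswith(src + "/")]
        for key in keys:
            self.objects[dst + key[len(src):]] = self.objects.pop(key)
        return len(keys)


class TreeNamespace:
    """Toy metadata service: directories are nested dicts, files map to data."""

    def __init__(self):
        self.root = {}

    def _parent_and_name(self, path, create=False):
        parts = path.strip("/").split("/")
        node = self.root
        for part in parts[:-1]:
            node = node.setdefault(part, {}) if create else node[part]
        return node, parts[-1]

    def put(self, path, data):
        parent, name = self._parent_and_name(path, create=True)
        parent[name] = data

    def rename(self, src, dst):
        """Move one directory entry. Returns the number of entries touched."""
        src_parent, src_name = self._parent_and_name(src)
        dst_parent, dst_name = self._parent_and_name(dst, create=True)
        dst_parent[dst_name] = src_parent.pop(src_name)
        return 1
```

With 1,000 files under one directory, `rename_prefix` touches 1,000 objects while `rename` touches one entry. The flat version is also not atomic in this model: a failure partway through leaves some keys under the old prefix and some under the new one.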

Both HDFS API and POSIX API are exposed, allowing seamless integration with Flink, Hadoop, Spark, and AI training pipelines (e.g., storing TFRecord files via POSIX).

JindoSDK offers ecosystem tools for integration, including JindoDistCP for data migration, JindoTable for table‑level operations, and JindoShell for user interaction. Underlying data sources can be HDFS, NAS, or object storage such as Alibaba Cloud OSS.

JindoSDK: Super Data Lake SDK

The SDK provides native C++ core APIs (ObjectStore, DataStream, FileSystem) wrapped for Python (Cython), Hadoop (JNI), and Jindo Fuse (C SDK), delivering consistent performance improvements across metadata operations, I/O paths, security, and STS configuration.

Performance tests show that JindoSDK averages a 2.2× speedup over the OSS Java SDK.

JindoFS: High‑Performance Storage Built on OSS

JindoFS addresses the scalability limits of large HDFS clusters by replacing the multi‑service architecture with a Raft‑based metadata service consisting of three NamespaceService nodes, enabling minute‑level restarts and reducing operational overhead.
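A three-node group is the smallest size that survives a node failure under Raft's majority rule, where a write commits once a majority of nodes acknowledge it. The arithmetic can be sketched as follows (function names are illustrative):

```python
def quorum_size(nodes: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return nodes - quorum_size(nodes)


def is_committed(acks: int, nodes: int) -> bool:
    """A log entry is committed once a majority has acknowledged it."""
    return acks >= quorum_size(nodes)
```

With three nodes the quorum is two, so the service keeps accepting writes with one node down. A fourth node would raise the quorum to three without tolerating any additional failure.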

It leverages OSS for underlying storage, offering hot/cold tiering (standard, infrequent access, archive, cold archive) to optimize cost.
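A tiering policy of this kind can be sketched as a function from idle time to storage class. The class names mirror the four OSS tiers mentioned above; the day thresholds are made-up values for illustration, not OSS or JindoFS defaults.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds (minimum idle days, storage class),
# ordered from coldest to warmest so the first match wins.
TIER_THRESHOLDS = [
    (365, "cold archive"),
    (180, "archive"),
    (30, "infrequent access"),
]


def choose_storage_class(last_access: datetime, now: datetime) -> str:
    """Pick a storage class based on how long the data has been idle."""
    idle_days = (now - last_access).days
    for min_idle_days, storage_class in TIER_THRESHOLDS:
        if idle_days >= min_idle_days:
            return storage_class
    return "standard"
```

A real policy would also weigh retrieval fees and minimum storage durations for the colder classes, which this sketch ignores.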

JindoFSx: Storage Acceleration System

Deployed on worker nodes, JindoFSx provides distributed caching, reducing network latency and bandwidth constraints for remote object storage. It includes a NamespaceService for metadata acceleration and integrates with JindoSDK to serve ETL, interactive analytics, real‑time computing, and machine learning workloads.
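The basic idea behind a worker-side cache can be sketched as a read-through LRU cache in front of a slow remote store. This is a single-process toy, not JindoFSx's distributed design, and `fetch_remote` is a hypothetical stand-in for an object storage read.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Serve reads from local memory when possible, else fetch and keep a copy."""

    def __init__(self, fetch_remote, capacity_blocks):
        self.fetch_remote = fetch_remote      # callable: key -> bytes
        self.capacity_blocks = capacity_blocks
        self.blocks = OrderedDict()           # least recently used entry first
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.blocks:
            # Cache hit: mark as most recently used and skip the network.
            self.blocks.move_to_end(key)
            self.hits += 1
            return self.blocks[key]
        # Cache miss: pay the remote round trip once, then cache the block.
        self.misses += 1
        data = self.fetch_remote(key)
        self.blocks[key] = data
        if len(self.blocks) > self.capacity_blocks:
            self.blocks.popitem(last=False)   # evict the least recently used block
        return data
```

Repeated reads of the same blocks, common in interactive analytics and multi-epoch training, are the cases where such a cache saves the most remote bandwidth.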

Ecosystem Tools and Use Cases

JindoTable enables table‑level hot/cold data conversion and lifecycle management across OSS and HDFS. JindoDistCP handles heterogeneous checksum verification and dynamic, parallel copying to improve bandwidth utilization.
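The two ideas named for JindoDistCP, dynamic work assignment and checksum verification, can be sketched with a thread pool whose workers pick up the next file as soon as they are free, followed by a digest comparison of source and destination. This toy copies local files and uses SHA-256 on both sides, so it does not reproduce how JindoDistCP reconciles different checksum algorithms across storage systems. All function names are illustrative.

```python
import hashlib
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src: Path, dst: Path) -> bool:
    """Copy one file and report whether the destination matches the source."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
    return sha256_of(src) == sha256_of(dst)


def parallel_copy(src_dir: Path, dst_dir: Path, workers: int = 4) -> dict[str, bool]:
    """Copy a directory tree in parallel; returns relative path -> verified flag."""
    files = [p for p in src_dir.rglob("*") if p.is_file()]
    # Submit the largest files first so one big file is less likely to straggle.
    files.sort(key=lambda p: p.stat().st_size, reverse=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Workers pull tasks from a shared queue, so fast workers take more files.
        futures = {
            p.relative_to(src_dir).as_posix(): pool.submit(
                copy_and_verify, p, dst_dir / p.relative_to(src_dir)
            )
            for p in files
        }
        return {name: future.result() for name, future in futures.items()}
```

Because files are handed out one at a time rather than split into fixed batches up front, a worker that finishes small files quickly keeps taking new ones, which is what keeps bandwidth utilization high when file sizes are uneven.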

Additional optimizations include fast copy (metadata‑only indexing), atomic rename support on OSS via OTS, and flush‑like guarantees for Flink and Flume writers.

Q&A

Q: How does JindoFS performance compare to HDFS? A: JindoFS shows advantages in rename and delete operations, especially on large directories, where HDFS latency grows with file count while JindoFS remains stable.

Q: How many files can a three‑node JindoFS cluster support? A: The early three‑node Raft setup can handle billions of files, roughly double HDFS capacity; the cloud‑native 3.0 architecture can horizontally scale metadata services.

Q: How does JindoFS differ from Alluxio? A: JindoFS focuses on metadata management and OSS‑specific optimizations, whereas Alluxio provides a general distributed caching layer; JindoFSx offers similar caching with native OSS performance benefits.

Q: Is deploying HDFS on Kubernetes practical? A: HDFS requires stateful nodes and complex decommissioning and balancing; running it on Kubernetes adds operational overhead without clear advantages.
