Multi-Engine Support and Future Directions of Alibaba Cloud Data Lake Building Service
The article explains how Alibaba Cloud's Data Lake Building Service enables fine‑grained lake management by integrating multiple compute engines—including EMR, MaxCompute, Blink, Hologres, PAI, and open‑source Hive, Spark, and Presto—through unified metadata and OSS storage, while outlining current features, special format support, and planned future enhancements.
Xin Yong, a technical expert at Ant Group and contributor to Apache Hadoop and Spark, focuses on compute engines, storage structures, and big‑data cloudization.
Data lakes are evolving toward fine‑grained management, which requires a shift away from engines accessing storage directly toward standardized access methods. Since no industry‑wide standard exists yet, the breadth of engine support becomes a key metric for a lake‑building service.
Alibaba Cloud Data Lake Building Service supports a rich set of compute engines, including Alibaba Cloud products such as EMR, MaxCompute (in development), Blink (in development), Hologres (in development), PAI (in development), as well as open‑source engines like Hive, Spark, and Presto.
The integration focuses on two aspects: metadata and storage. Metadata is exposed through a unified access interface, with custom clients per engine, tenant isolation, and authentication services. Storage relies on the user's OSS buckets; engines need OSS access, which is generally straightforward for Alibaba Cloud services and HDFS‑compatible engines. An optional acceleration service can be enabled by simply replacing the OSS path on the engine side.
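Because the acceleration service is opt‑in purely through the path the engine reads, the switch can be a mechanical URI rewrite. The sketch below illustrates the idea; the accelerated scheme name (`oss-accel`) is hypothetical, since the article does not specify the actual replacement path.

```python
# Hypothetical sketch: enabling the acceleration service by swapping the OSS
# path on the engine side. The "oss-accel" scheme name is illustrative only.
from urllib.parse import urlparse, urlunparse

def to_accelerated_path(oss_path: str, accel_scheme: str = "oss-accel") -> str:
    """Rewrite an oss:// URI to the (hypothetical) accelerated scheme,
    leaving bucket and object key untouched."""
    parts = urlparse(oss_path)
    if parts.scheme != "oss":
        raise ValueError(f"not an OSS path: {oss_path}")
    return urlunparse(parts._replace(scheme=accel_scheme))

print(to_accelerated_path("oss://my-bucket/warehouse/db/table"))
# oss-accel://my-bucket/warehouse/db/table
```

The point is that no engine‑side code changes are needed: the table location string is the only thing that changes.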
Multi‑Engine Support
EMR: When creating an EMR cluster, users can directly select the Data Lake metadata service as the metastore, with deep authentication integration. The required SDK is pre‑installed, and Spark, Hive, and Presto on EMR are already compatible, providing a seamless lake analysis experience. Existing metastore contents can be migrated into the service later, and OSS access is enhanced by JindoFS, offering better performance. EMR also supports AK‑less access via MetaService, which assumes a role to obtain temporary credentials, reducing secret key exposure.
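The AK‑less pattern can be sketched as a credential provider that assumes a role and caches the short‑lived token, refreshing it before expiry. All names below are illustrative; the actual MetaService STS call and credential shape are not specified in the article.

```python
# Hypothetical sketch of MetaService-style AK-less access: assume a RAM role,
# cache the temporary STS credentials, and refresh shortly before expiry.
import time
from dataclasses import dataclass

@dataclass
class TempCredentials:
    access_key_id: str
    access_key_secret: str
    security_token: str
    expires_at: float  # epoch seconds

class MetaServiceCredentialProvider:
    def __init__(self, assume_role_fn, refresh_margin: float = 60.0):
        self._assume_role = assume_role_fn  # injected STS call (hypothetical)
        self._margin = refresh_margin       # refresh this many seconds early
        self._cached = None

    def get(self) -> TempCredentials:
        now = time.time()
        if self._cached is None or self._cached.expires_at - now < self._margin:
            self._cached = self._assume_role()  # only hit STS when needed
        return self._cached
```

The engine never holds a long‑lived secret key; it only ever sees the short‑lived token, which is the "reduced secret key exposure" the article refers to.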
Alibaba Cloud Services: MaxCompute, Real‑time Compute, Hologres, and PAI can read/write data through the Data Lake service, enabling a single dataset to serve multiple scenarios such as data warehousing, streaming‑batch integration, high‑performance analytics, and model training.
Open‑Source Engines: Users can apply a patch to the engine's source to embed the Data Lake metadata client. The OSSFileSystem contributed by the EMR team enables OSS access for any engine that supports the HDFS interface, while authentication is unified via Alibaba Cloud RAM.
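For an HDFS‑compatible engine, wiring up OSS typically comes down to a handful of core‑site.xml properties. The sketch below generates such a fragment; the property names follow the Apache Hadoop `hadoop-aliyun` module, and the EMR team's OSSFileSystem build may use different keys, so treat them as illustrative.

```python
# Sketch: the core-site.xml entries an HDFS-compatible engine needs for OSS.
# Property names follow the Apache hadoop-aliyun module; EMR's OSSFileSystem
# distribution may differ, so these keys are illustrative.
import xml.etree.ElementTree as ET

OSS_PROPS = {
    "fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem",
    "fs.oss.endpoint": "oss-cn-hangzhou.aliyuncs.com",  # example region endpoint
    "fs.oss.accessKeyId": "<your-access-key-id>",
    "fs.oss.accessKeySecret": "<your-access-key-secret>",
}

def to_core_site(props: dict) -> str:
    """Render a property dict as a Hadoop core-site.xml <configuration> block."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

print(to_core_site(OSS_PROPS))
```

With RAM unifying authentication, the static key entries above can in practice be replaced by a temporary‑credential provider rather than embedded secrets.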
Special Format Support
The service also supports table formats like Delta Lake, Hudi, and Iceberg. For Delta Lake, metadata resides on OSS and is mirrored in the metadata service (Delta Table). A hook tool synchronizes OSS Delta logs to the metadata service on each transaction, ensuring consistency and allowing engines such as Spark, Hive, and Presto to read the same metadata without maintaining separate copies.
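The hook described above can be pictured as a small translator: after each transaction, it reads the newest Delta commit file (newline‑delimited JSON actions under `_delta_log/`) and mirrors the result into the metadata service. This is a minimal sketch; `push_to_metastore` stands in for the unspecified metadata‑service client.

```python
# Hypothetical sketch of the Delta-log sync hook: parse one Delta commit file
# (one JSON action per line) and mirror it into the metadata service so that
# Spark, Hive, and Presto all read the same Delta Table metadata.
import json

def parse_delta_commit(commit_json_lines: str) -> dict:
    """Collect add/remove/metaData actions from one Delta commit file."""
    added, removed, metadata = [], [], None
    for line in commit_json_lines.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        if "add" in action:
            added.append(action["add"]["path"])
        elif "remove" in action:
            removed.append(action["remove"]["path"])
        elif "metaData" in action:
            metadata = action["metaData"]
    return {"added": added, "removed": removed, "metadata": metadata}

def sync_commit(commit_json_lines: str, push_to_metastore) -> None:
    # Called per transaction, keeping the mirrored Delta Table consistent
    # with the OSS Delta log without per-engine metadata copies.
    push_to_metastore(parse_delta_commit(commit_json_lines))
```

Because the sync runs on every transaction, engines querying through the metadata service never see a view that lags behind the OSS Delta log by more than one in‑flight commit.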
Future Work
Enhance Alibaba Cloud service integration for smoother user experience.
Expand support for more open‑source engines, e.g., Impala and Flink.
Add richer features such as comprehensive statistics and transaction interfaces.
Improve performance to surpass local metastore and HDFS storage benchmarks.
For further discussion, readers are invited to join the Alibaba Data Lake technical DingTalk group and explore recommended articles linked at the end of the original document.