Mobile Cloud LakeHouse: Cloud‑Native Big Data Analytics Architecture and Practices

This article introduces China Mobile Cloud's cloud‑native LakeHouse solution: the lake‑warehouse integration concept, the overall architecture, the core functions (storage‑compute separation, one‑click data ingestion, intelligent metadata discovery, serverless execution, JDBC support, and incremental updates), and typical application scenarios in public and private clouds.

DataFunTalk

Dec 9, 2021

In the era of cloud‑native big data, the explosive growth of business data and the demand for low latency have driven the evolution from traditional data warehouses to data lakes and finally to lake‑warehouse integration (LakeHouse). The concept originated from Databricks and combines the flexibility of data lakes with the governance and performance of data warehouses.

1. Lake‑Warehouse Overview

The LakeHouse architecture enables seamless, automated data flow between lake and warehouse without manual intervention, and automatically caches and moves data according to defined rules to support agile analytics and deep intelligence.

Key points:

Bidirectional data/metadata flow without user intervention.

Automatic caching and movement between lake and warehouse, supporting advanced analytics.

2. Mobile Cloud LakeHouse Practice

The solution adopts a compute‑and‑storage‑separated architecture built on Mobile Cloud Object Storage (EOS) and an internal HDFS layer with Hudi for upsert capabilities. Spark provides interactive queries.

Components:

Data sources: RDB, Kafka, HDFS, EOS, FTP – ingested via FlinkX.

Data storage (lake): HDFS + EOS with Hudi for near‑real‑time incremental updates; Alluxio is used for caching to accelerate SQL queries.

Compute engine: Serverless Spark/Presto/Flink running on Kubernetes, scheduled by YuniKorn (a YARN‑like resource scheduler for Kubernetes).

Intelligent metadata: Automatic discovery and Hive‑style metadata management.

Data development: SQL Console, SDK, and JDBC/ODBC, with DevIDE support planned.

Core Functions

Storage‑Compute Separation: Storage and compute scale elastically and independently. Object storage holds unstructured and cold data, HDFS holds structured data, and multiple engines (Spark, Presto, Flink) run in a serverless fashion.

One‑Click Ingestion: Connects to a range of databases, storage systems, and message queues. Ingestion is automated, keeps the load on the source system low (under 10%), and supports incremental updates via Hudi.

Intelligent Metadata Discovery: Automatically identifies structured and semi‑structured files, exposes a unified Hive‑like API, authenticates dynamically per bucket for object storage, and enforces fine‑grained permission control.

Pay‑Per‑Use Compute: Storage is billed by usage, compute supports multiple billing modes, and resources scale elastically at the tenant level.
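The metadata discovery item above can be illustrated with a minimal sketch of schema inference over a CSV sample. This is not the platform's implementation: the function names, the three‑type ladder (bigint, double, string), and the sample data are assumptions made for illustration only.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest Hive-style type that fits every non-empty value."""
    def all_parse(cast):
        try:
            for v in values:
                if v != "":
                    cast(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "bigint"
    if all_parse(float):
        return "double"
    return "string"


def discover_schema(csv_text):
    """Infer (column, type) pairs from a CSV sample that has a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    # Transpose rows into columns so each column can be typed on its own.
    columns = list(zip(*data)) if data else [()] * len(header)
    return [(name, infer_type(col)) for name, col in zip(header, columns)]


# Hypothetical sample file content.
sample = "id,price,city\n1,9.5,Beijing\n2,12,Suzhou\n"
schema = discover_schema(sample)
print(schema)
```

A real discovery service would also have to sample large files, handle nested semi‑structured formats, and register the result in the metastore; the sketch covers only the type‑inference step.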

3. Logical View Based on RBF

Each Hive database maps to its own path in the HDFS Router‑Based Federation (RBF) namespace, which provides multi‑tenant logical isolation and balances load across NameNodes. The following command adds a mount point, /hivedbdir, that maps to /data in two namespaces (ns1 and ns2):

$ hdfs dfsrouteradmin -add /hivedbdir ns1,ns2 /data -order HASH_ALL
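To make the HASH_ALL order in the command above more concrete, here is a simplified sketch of how a path under a multi‑namespace mount point could be assigned to one namespace. It is an assumption‑laden model, not the router's code: the real router uses consistent hashing, while this sketch uses an MD5 digest modulo the number of namespaces, and the function name and example paths are hypothetical.

```python
import hashlib


def pick_namespace(path, namespaces, order="HASH_ALL", mount_point="/hivedbdir"):
    """Pick a destination namespace for a path under a multi-namespace mount."""
    relative = path[len(mount_point):].strip("/")
    if order == "HASH":
        # HASH: only the first level below the mount point decides the namespace.
        key = relative.split("/")[0]
    else:
        # HASH_ALL: the whole path below the mount point is hashed.
        key = relative
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return namespaces[int(digest, 16) % len(namespaces)]


namespaces = ["ns1", "ns2"]
for p in ("/hivedbdir/db1.db/t1/part-0", "/hivedbdir/db1.db/t2/part-0"):
    print(p, "->", pick_namespace(p, namespaces))
```

The design point is that hashing spreads files across the NameNodes behind the mount point, while clients keep seeing a single logical path.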

4. Multi‑Tenant Hive on Object Storage

By adding S3 authentication parameters to table properties, Hive can access multiple buckets without restarting services. Example DDL:

create external table testcephtbl(id int) location 's3a://bucket1/tmp/testlocation' tblproperties('fs.s3a.access.key'='xxx','fs.s3a.endpoint'='xxx','fs.s3a.secret.key'='xxx');
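The idea behind the DDL above is that credentials travel with the table instead of living only in the service configuration. The sketch below models that overlay, assuming table properties take precedence over service‑wide defaults; the function name, the default values, and the second bucket are hypothetical, and this is not Hive's actual code path.

```python
S3A_KEYS = ("fs.s3a.access.key", "fs.s3a.secret.key", "fs.s3a.endpoint")


def effective_s3a_conf(service_conf, tbl_properties):
    """Overlay a table's fs.s3a.* properties on the service-wide defaults."""
    conf = dict(service_conf)  # copy so the shared defaults are never mutated
    for key in S3A_KEYS:
        if key in tbl_properties:
            conf[key] = tbl_properties[key]
    return conf


# Hypothetical service defaults and two tenants' table properties.
defaults = {"fs.s3a.endpoint": "default-endpoint"}
table_a = {"fs.s3a.access.key": "keyA", "fs.s3a.secret.key": "secretA"}
table_b = {"fs.s3a.access.key": "keyB", "fs.s3a.secret.key": "secretB",
           "fs.s3a.endpoint": "endpoint-b"}

conf_a = effective_s3a_conf(defaults, table_a)
conf_b = effective_s3a_conf(defaults, table_b)
print(conf_a)
print(conf_b)
```

Because the configuration is resolved per table at access time, a new tenant's bucket needs only a new table definition, which is consistent with the "no service restart" claim.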

5. Serverless Execution

Spark jobs run in isolated Kubernetes namespaces; resources are allocated on demand and released after execution. Users submit tasks via SQL Console or JDBC, and the engine dynamically provisions Spark clusters via Kyuubi.

Example Beeline command that passes the object storage credentials as parameters in the JDBC URL:

$SPARK_HOME/bin/beeline -u "jdbc:hive2://host:port/default?fs.s3a.access.key=xxx;fs.s3a.endpoint=xxx" -e "select a.id from test1 a join test2 b on a.id=b.id"

The same query submitted without explicit authentication parameters in the URL:

$SPARK_HOME/bin/beeline -u "jdbc:hive2://host:port/default" -e "select a.id from test1 a join test2 b on a.id=b.id"
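The two Beeline commands differ only in the configuration list appended to the JDBC URL. As a small sketch, the helper below assembles both forms: a `jdbc:hive2://` URL with an optional `?key=value;key=value` configuration list. The function name and the port number are hypothetical; the `xxx` values are the placeholders used in the commands above.

```python
def hive_jdbc_url(host, port, database="default", hive_conf=None):
    """Build a jdbc:hive2 URL, optionally with a ';'-separated conf list."""
    base = f"jdbc:hive2://{host}:{port}/{database}"
    if not hive_conf:
        return base
    conf_list = ";".join(f"{k}={v}" for k, v in hive_conf.items())
    return f"{base}?{conf_list}"


# Port 10009 is an assumed value for illustration.
with_auth = hive_jdbc_url("host", 10009, hive_conf={
    "fs.s3a.access.key": "xxx",
    "fs.s3a.endpoint": "xxx",
})
without_auth = hive_jdbc_url("host", 10009)
print(with_auth)
print(without_auth)
```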

6. JDBC Support via Kyuubi

Kyuubi provides multi‑tenant, high‑availability JDBC services with dynamic resource management on Kubernetes, integrated with Mobile Cloud’s AccessKey/SecretKey authentication and fine‑grained permission control.
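The on‑demand allocation and release of per‑tenant compute can be pictured with a toy model: an engine is launched the first time a tenant submits work, reused while it stays active, and reclaimed after an idle period. This is a conceptual sketch only, not Kyuubi's implementation; the class, its methods, and the timeout value are all assumptions.

```python
class EnginePool:
    """Toy model of per-tenant engines created lazily and reclaimed when idle."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout
        self.engines = {}  # tenant -> timestamp of last use

    def submit(self, tenant, now):
        """Return 'launched' if a new engine was needed, else 'reused'."""
        state = "reused" if tenant in self.engines else "launched"
        self.engines[tenant] = now
        return state

    def reap(self, now):
        """Release engines idle longer than idle_timeout; return their tenants."""
        idle = [t for t, last in self.engines.items()
                if now - last > self.idle_timeout]
        for tenant in idle:
            del self.engines[tenant]
        return idle


pool = EnginePool(idle_timeout=60)
events = [
    pool.submit("tenant_a", now=0),
    pool.submit("tenant_a", now=10),
    pool.submit("tenant_b", now=50),
]
released = pool.reap(now=100)
print(events, released)
```

Keeping one engine per tenant gives isolation between tenants, and reclaiming idle engines is what lets compute be billed only while it is in use.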

7. Incremental Updates with Hudi

Hudi provides ACID semantics on top of HDFS or object storage and supports both Copy‑On‑Write (COW) and Merge‑On‑Read (MOR) table types. In a typical real‑time scenario, Kafka or MySQL binlog streams are ingested via DeltaStreamer or CDC, the table metadata is synced with Hive, and users then query the tables.
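The difference between the two table types is mainly where the merge cost is paid. The sketch below is an in‑memory model of that trade‑off and does not use the Hudi API; the record key `id`, the ordering field `ts`, and the class names are assumptions chosen for illustration.

```python
def upsert(base, batch, key="id", ordering="ts"):
    """Merge a batch into base by record key, keeping the newest row per key."""
    merged = {row[key]: row for row in base}
    for row in batch:
        current = merged.get(row[key])
        if current is None or row[ordering] >= current[ordering]:
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])


class CopyOnWriteTable:
    def __init__(self):
        self.base = []

    def write(self, batch):
        # Merge cost is paid at write time: the base data is rewritten.
        self.base = upsert(self.base, batch)

    def read(self):
        return self.base


class MergeOnReadTable:
    def __init__(self):
        self.base = []
        self.log = []

    def write(self, batch):
        # Writes are cheap appends to a delta log.
        self.log.extend(batch)

    def read(self):
        # Merge cost is paid at read time.
        return upsert(self.base, self.log)

    def compact(self):
        # Fold the log into the base so later reads are cheap again.
        self.base, self.log = upsert(self.base, self.log), []


cow, mor = CopyOnWriteTable(), MergeOnReadTable()
batches = [
    [{"id": 1, "ts": 1, "v": "a"}, {"id": 2, "ts": 1, "v": "b"}],
    [{"id": 2, "ts": 2, "v": "b2"}, {"id": 3, "ts": 2, "v": "c"}],
]
for batch in batches:
    cow.write(batch)
    mor.write(batch)
print(cow.read() == mor.read())
```

Both tables return the same rows; COW favors read‑heavy workloads, while MOR favors frequent, low‑latency writes such as binlog ingestion.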

Application Scenarios

The LakeHouse platform ingests diverse data sources (application logs, database extracts, and so on) and provides offline batch, real‑time, and interactive analytics, so users need to invest less in hardware, development, and operations. In private clouds, it can be added as a component to existing Hadoop clusters, where it offers unified metadata views and multi‑tenant isolation.

Overall, Mobile Cloud LakeHouse delivers a cloud‑native, serverless, and cost‑effective big data analytics solution.
