Mobile Cloud LakeHouse: Cloud‑Native Big Data Analytics Architecture and Practices
This article introduces the cloud‑native LakeHouse solution from China Mobile Cloud. It covers the lake‑warehouse integration concept, the overall architecture, the core functions (storage‑compute separation, one‑click data ingestion, intelligent metadata discovery, serverless execution, JDBC support, and incremental updates), and typical application scenarios in public and private clouds.
In the era of cloud‑native big data, the explosive growth of business data and the demand for low latency have driven the evolution from traditional data warehouses to data lakes and finally to lake‑warehouse integration (LakeHouse). The concept originated from Databricks and combines the flexibility of data lakes with the governance and performance of data warehouses.
1. Lake‑Warehouse Overview
The LakeHouse architecture enables seamless, automated data flow between lake and warehouse without manual intervention, and automatically caches and moves data according to defined rules to support agile analytics and deep intelligence.
Key points:
Bidirectional data/metadata flow without user intervention.
Automatic caching and movement between lake and warehouse, supporting advanced analytics.
2. Mobile Cloud LakeHouse Practice
The solution adopts a compute‑and‑storage‑separated architecture built on Mobile Cloud Object Storage (EOS) and an internal HDFS layer with Hudi for upsert capabilities. Spark provides interactive queries.
Components:
Data sources: RDB, Kafka, HDFS, EOS, FTP – ingested via FlinkX.
Data storage (lake): HDFS + EOS with Hudi for near‑real‑time incremental updates; Alluxio is used for caching to accelerate SQL queries.
Compute engine: Serverless Spark/Presto/Flink running on Kubernetes, scheduled by YuniKorn (YARN‑like).
Intelligent metadata: Automatic discovery and Hive‑style metadata management.
Data development: SQL Console, SDK, JDBC/ODBC, future DevIDE support.
Core Functions
Storage‑Compute Separation: Independent elastic scaling, object storage for unstructured/cold data, HDFS for structured data, multi‑engine support (Spark, Presto, Flink) in a serverless fashion.
One‑Click Ingestion: Connect to various databases, storage systems, and message queues; automated ingestion with low source load (<10%); supports incremental updates via Hudi.
Intelligent Metadata Discovery: Automatic identification of structured and semi‑structured files, unified Hive‑like API, dynamic bucket authentication for object storage, and fine‑grained permission control.
Pay‑Per‑Use Compute: Storage billed by usage, compute supports multiple billing modes, elastic tenant‑level resource scaling.
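To make the Intelligent Metadata Discovery item more concrete, here is a minimal sketch of how a schema might be inferred from semi‑structured (JSON Lines) data and turned into Hive‑style DDL. It is not Mobile Cloud's implementation: the function names, the type mapping, and the table name discovered_tbl are all hypothetical, and a real table over JSON files would also need an appropriate row format or SerDe clause.

```python
import json
from typing import Any, Dict, Iterable


def hive_type(value: Any) -> str:
    """Map a sampled JSON value to a Hive column type (assumed mapping)."""
    if isinstance(value, bool):  # bool must be checked before int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"  # strings, and nested values kept as raw text


def widen(a: str, b: str) -> str:
    """Pick a type that can hold values of both a and b."""
    if a == b:
        return a
    if {a, b} == {"bigint", "double"}:
        return "double"
    return "string"


def infer_schema(lines: Iterable[str]) -> Dict[str, str]:
    """Scan JSON Lines records and infer one Hive type per field."""
    schema: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        for key, value in json.loads(line).items():
            if value is None:
                continue  # nulls carry no type information
            t = hive_type(value)
            schema[key] = widen(schema[key], t) if key in schema else t
    return schema


def to_ddl(table: str, location: str, schema: Dict[str, str]) -> str:
    """Render the inferred schema as a bare external-table statement."""
    cols = ", ".join(f"`{name}` {typ}" for name, typ in schema.items())
    return f"create external table {table} ({cols}) location '{location}'"


sample = [
    '{"id": 1, "price": 9.5, "name": "a"}',
    '{"id": 2, "price": 10, "name": "b", "active": true}',
]
schema = infer_schema(sample)
ddl = to_ddl("discovered_tbl", "s3a://bucket1/tmp/testlocation", schema)
print(ddl)
```

The widening step is the design choice worth noting: sampling several records and promoting conflicting types (here bigint to double) avoids committing to a schema that the next file would break.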
3. Logical View Based on RBF
Each Hive database maps to its own isolated schema path in RBF (HDFS Router‑Based Federation), which provides multi‑tenant logical isolation and load‑balanced NameNode access. Example command that mounts a Hive directory across two namespaces:
$ hdfs dfsrouteradmin -add /hivedbdir ns1,ns2 /data -order HASH_ALL
4. Multi‑Tenant Hive on Object Storage
By adding S3 authentication parameters to table properties, Hive can access multiple buckets without restarting services. Example DDL:
create external table testcephtbl(id int)
location 's3a://bucket1/tmp/testlocation'
tblproperties('fs.s3a.access.key'='xxx','fs.s3a.endpoint'='xxx','fs.s3a.secret.key'='xxx');
5. Serverless Execution
Spark jobs run in isolated Kubernetes namespaces; resources are allocated on demand and released after execution. Users submit tasks via SQL Console or JDBC, and the engine dynamically provisions Spark clusters via Kyuubi.
Example Spark‑Beeline command with embedded authentication parameters:
$SPARK_HOME/bin/beeline -u "jdbc:hive2://host:port/default?fs.s3a.access.key=xxx;fs.s3a.endpoint=xxx" -e "select a.id from test1 a join test2 b on a.id=b.id"
Final submission without explicit auth parameters:
$SPARK_HOME/bin/beeline -u "jdbc:hive2://host:port/default" -e "select a.id from test1 a join test2 b on a.id=b.id"
6. JDBC Support via Kyuubi
Kyuubi provides multi‑tenant, high‑availability JDBC services with dynamic resource management on Kubernetes, integrated with Mobile Cloud’s AccessKey/SecretKey authentication and fine‑grained permission control.
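The connection strings used with Beeline follow the hive2 JDBC URL layout, where configuration entries come after "?" and are separated by ";". A small sketch of assembling such a URL is shown here; the helper name build_hive2_url, the host kyuubi.example.com, and the port 10009 are illustrative assumptions, and the xxx values are placeholders as in the article.

```python
from typing import Dict, Optional


def build_hive2_url(host: str, port: int, database: str = "default",
                    confs: Optional[Dict[str, str]] = None) -> str:
    """Assemble a hive2 JDBC URL; confs are appended as ?k1=v1;k2=v2.

    Values are inserted verbatim (no URL encoding), so this sketch assumes
    they contain no ';', '?' or '#' characters.
    """
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if confs:
        url += "?" + ";".join(f"{key}={value}" for key, value in confs.items())
    return url


# Hypothetical endpoint; replace with the address of your own JDBC service.
with_auth = build_hive2_url(
    "kyuubi.example.com", 10009,
    confs={"fs.s3a.access.key": "xxx", "fs.s3a.endpoint": "xxx"},
)
without_auth = build_hive2_url("kyuubi.example.com", 10009)
print(with_auth)
print(without_auth)
```

When the platform injects credentials on the server side, as the article describes, the client only needs the second, shorter form.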
7. Incremental Updates with Hudi
Hudi enables ACID semantics on top of HDFS or object storage, supporting both Copy‑On‑Write (COW) and Merge‑On‑Read (MOR) table types. Typical real‑time scenarios include ingesting Kafka/MySQL binlog streams via DeltaStreamer/CDC, syncing Hive metadata, and user queries.
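The difference between the two table types can be illustrated with a toy in‑memory model. This is not Hudi's API or file layout; the classes and method names are hypothetical and only mimic the trade‑off: Copy‑On‑Write pays the merge cost at write time, while Merge‑On‑Read appends updates to a log and merges them when the table is read or compacted.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


class CopyOnWriteTable:
    """Toy COW: each upsert rewrites the base snapshot; reads are direct."""

    def __init__(self) -> None:
        self.base: Dict[int, Record] = {}
        self.rewrites = 0  # counts how often the base was rewritten

    def upsert(self, records: List[Record]) -> None:
        new_base = dict(self.base)  # copy, then apply changes
        for record in records:
            new_base[record["id"]] = record
        self.base = new_base
        self.rewrites += 1

    def read(self) -> Dict[int, Record]:
        return dict(self.base)


class MergeOnReadTable:
    """Toy MOR: upserts append to a log; reads merge base and log."""

    def __init__(self) -> None:
        self.base: Dict[int, Record] = {}
        self.log: List[Record] = []

    def upsert(self, records: List[Record]) -> None:
        self.log.extend(records)  # cheap append, no rewrite

    def read(self) -> Dict[int, Record]:
        merged = dict(self.base)
        for record in self.log:  # later log entries win
            merged[record["id"]] = record
        return merged

    def compact(self) -> None:
        """Fold the log into the base, as a background compaction would."""
        self.base = self.read()
        self.log = []


batches = [
    [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    [{"id": 1, "v": "a2"}],  # update to an existing key
]
cow, mor = CopyOnWriteTable(), MergeOnReadTable()
for batch in batches:
    cow.upsert(batch)
    mor.upsert(batch)

print(cow.read() == mor.read())  # both expose the same latest snapshot
```

In this model the write‑heavy, latency‑sensitive ingestion paths mentioned above (Kafka or binlog streams) favor the MOR shape, while read‑heavy tables favor COW.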
Application Scenarios
The LakeHouse platform supports diverse data sources (application logs, database extracts, etc.) and provides offline batch, real‑time, and interactive analytics, which reduces the hardware, development, and operations investment users would otherwise need. In private clouds, it can be added as a component to existing Hadoop clusters, offering unified metadata views and multi‑tenant isolation.
Overall, Mobile Cloud LakeHouse delivers a cloud‑native, serverless, and cost‑effective big data analytics solution.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.