Practical Implementations of Data Lakes: Huawei Production Scenario, Real-Time Financial Data Lake, and Soul's Delta Lake
This article surveys three data lake implementations: Huawei's production-scenario platform, a real-time financial data lake built on Kafka, Flink, and Iceberg, and Soul's Delta Lake practice with Spark, Hive, and a custom ETL tool. It highlights design choices, processing flows, and operational considerations.
Huawei Production Scenario Data Lake Platform Practice
A data lake is an evolving, scalable infrastructure for big data storage, processing, and analysis. It supports full acquisition, storage, multi-model processing, and lifecycle management of data from any source and at any speed, scale, or type.
The platform is organized into three logical modules and supports various data application scenarios:
Structured data: batch-processed and virtually mirrored into Hive, pre-computed by Kylin into cubes, and exposed through a REST API that serves high-concurrency, sub-second queries for monitoring material quality.
IoT data: sensor data is ingested into MQS, streamed through Storm into HBase, and then processed by algorithm models for early-warning monitoring of ICT materials.
Barcode data: loaded by an ETL loader into the IQ columnar lake, then cleaned and processed to support trillion-scale barcode scanning operations.
Unstructured quality‑inspection images are uploaded via a web front‑end and API, stored in HBase with unique IDs as rowkeys, enabling fast storage and retrieval. Asset construction includes unified indexing, metadata fields for maintenance/update time, and custom metadata for various media types.
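The rowkey and metadata layout described above can be sketched in a few lines. This is only an illustration under stated assumptions: the column names (`data:content`, `meta:update_time`, `meta:media_type`, `custom:*`) and the helper `build_image_row` are hypothetical, and a plain dictionary stands in for the HBase table.

```python
import uuid
from datetime import datetime, timezone


def build_image_row(image_bytes, media_type, custom_metadata=None):
    """Return (rowkey, columns) for one uploaded file.

    Column names are hypothetical; the source only says that a unique ID
    is the rowkey and that update time and custom metadata are stored.
    """
    rowkey = uuid.uuid4().hex  # unique ID used as the rowkey
    columns = {
        "data:content": image_bytes,
        "meta:media_type": media_type,
        "meta:update_time": datetime.now(timezone.utc).isoformat(),
    }
    # Custom metadata varies by media type, so it gets its own prefix.
    for name, value in (custom_metadata or {}).items():
        columns[f"custom:{name}"] = value
    return rowkey, columns


# In-memory stand-in for an HBase table: rowkey -> columns.
table = {}
rowkey, columns = build_image_row(b"\x89PNG...", "image/png", {"station": "line-3"})
table[rowkey] = columns

# Retrieval is a single point lookup by rowkey.
stored = table[rowkey]
```

Because every read and write addresses a single rowkey, lookups stay point lookups regardless of how many images are stored, which is consistent with the fast storage and retrieval the text mentions.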
Real-Time Financial Data Lake Application
The architecture comprises six functional layers:
Data sources: structured, semi-structured, and unstructured data.
Unified data ingestion.
Storage: intelligent distribution across cold, warm, and hot tiers.
Data development: task creation, scheduling, monitoring, and visual programming.
Data services: interactive queries, APIs, SQL quality assessment, and metadata and lineage management.
Data applications: digital marketing, risk control, operations, and customer profiling.
Logically, the real-time financial data lake consists of four layers:
Storage: an MPP warehouse and an OSS/HDFS lake with intelligent management.
Compute: a unified metadata service.
Service: a federated query engine and data service APIs that enable cross-database queries.
Product: intelligent services such as RPA, document recognition, language analysis, and recommendation, together with business analytics and data development platforms.
In the real‑time scenario, data from sources is ingested into Kafka, processed by Flink, and written to the lake built on open‑source components (HDFS/S3 storage, Iceberg table format). Processed results can be queried via engines like Flink, Spark, or Presto.
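One way to express the Kafka to Flink to Iceberg flow is with Flink SQL submitted from PyFlink. The sketch below is not the architecture's actual job: the table names (`kafka_events`, `lake.finance.events`), the column list, the consumer group, and the Hadoop-type Iceberg catalog are all assumptions chosen for illustration. It assumes the Flink Kafka connector and the Iceberg Flink runtime are available on the cluster.

```python
def build_statements(brokers, topic, warehouse):
    """Return Flink SQL statements for a Kafka -> Iceberg pipeline.

    All table, catalog, and column names here are illustrative assumptions.
    """
    columns = "account_id STRING, amount DOUBLE, event_time TIMESTAMP(3)"
    return [
        # Source: a Kafka topic carrying JSON records.
        f"""CREATE TABLE kafka_events ({columns}) WITH (
            'connector' = 'kafka',
            'topic' = '{topic}',
            'properties.bootstrap.servers' = '{brokers}',
            'properties.group.id' = 'lake-ingest',
            'scan.startup.mode' = 'latest-offset',
            'format' = 'json'
        )""",
        # Lake: an Iceberg catalog whose warehouse lives on HDFS or S3.
        f"""CREATE CATALOG lake WITH (
            'type' = 'iceberg',
            'catalog-type' = 'hadoop',
            'warehouse' = '{warehouse}'
        )""",
        "CREATE DATABASE IF NOT EXISTS lake.finance",
        f"CREATE TABLE IF NOT EXISTS lake.finance.events ({columns})",
        # Continuous job: stream Kafka records into the Iceberg table.
        "INSERT INTO lake.finance.events SELECT * FROM kafka_events",
    ]


def run(t_env, statements):
    """Submit each statement through a PyFlink TableEnvironment."""
    for sql in statements:
        t_env.execute_sql(sql)
```

With PyFlink installed, `t_env` would be created with `TableEnvironment.create(EnvironmentSettings.in_streaming_mode())`. Iceberg commits data when Flink checkpoints complete, so checkpointing needs to be enabled for written rows to become visible to Spark or Presto readers.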
Soul's Delta Lake Data Lake Application Practice
Data from various endpoints is reported to Kafka. Spark jobs then write minute-level Delta files to HDFS, and Hive automatically creates mapping tables for the Delta tables so that they can be queried through Hive on MR or Tez, Presto, and other engines.
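The Kafka to Delta step can be approximated with Spark Structured Streaming. This is a minimal sketch, not Soul's actual job: the function name `start_delta_ingest` and its parameters are hypothetical, the one-minute trigger is an assumed reading of "minute-level", and the code assumes the Spark Kafka source and the Delta Lake package are on the classpath. Creating the Hive mapping tables is not shown.

```python
def start_delta_ingest(spark, brokers, topic, table_path, checkpoint_path):
    """Start a streaming query that appends Kafka records to a Delta table.

    `spark` is an existing SparkSession; paths and names are placeholders.
    """
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", brokers)
        .option("subscribe", topic)
        .load()
    )
    # Kafka delivers the payload as bytes; keep it as a JSON string for now.
    events = raw.selectExpr(
        "CAST(value AS STRING) AS json",
        "timestamp AS kafka_time",
    )
    return (
        events.writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime="1 minute")  # one micro-batch per minute
        .start(table_path)
    )
```

Each micro-batch produces a new Delta commit, so the trigger interval controls both data freshness and the number of small files that later need compaction.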
Soul built a generic ETL tool on Spark that enables configuration‑driven data ingestion without coding. Key features include:
Hidden partition functionality similar to Iceberg, allowing creation of new partition columns (e.g., extracting year from a date column).
Regex validation for dynamic partitions to filter dirty data (e.g., rejecting non‑ASCII partition values).
Custom event‑time field selection for proper partitioning and avoiding data drift.
Configurable nested JSON parsing depth, flattening nested fields into single columns.
SQL‑based dynamic partition configuration to mitigate data skew and improve real‑time task performance.
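Several of the features above can be illustrated in plain Python. The sketch below is not Soul's tool, which runs on Spark: the config keys (`event_time_field`, `partition_field`, `max_depth`), the derived `year` and `dt` partition columns, and the ASCII-only pattern are assumptions made for illustration. It covers depth-limited JSON flattening, regex validation of the dynamic partition value, and partitions derived from a chosen event-time field; SQL-based partition configuration is not shown.

```python
import json
import re
from datetime import datetime, timezone

# Hypothetical rule: partition values must be ASCII letters, digits, _ or -.
VALID_PARTITION = re.compile(r"[A-Za-z0-9_\-]+")


def flatten(obj, max_depth, prefix="", depth=1):
    """Flatten nested dicts up to max_depth levels.

    Objects nested deeper than max_depth are kept as JSON strings.
    """
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict) and depth < max_depth:
            flat.update(flatten(value, max_depth, f"{name}_", depth + 1))
        elif isinstance(value, dict):
            flat[name] = json.dumps(value, sort_keys=True)
        else:
            flat[name] = value
    return flat


def to_row(raw_json, config):
    """Return (row, partitions), or None when the partition value is dirty."""
    row = flatten(json.loads(raw_json), config["max_depth"])
    partition_value = str(row.get(config["partition_field"], ""))
    if not VALID_PARTITION.fullmatch(partition_value):
        return None  # reject dirty data instead of creating a bad partition
    # Partition by the configured event time (epoch seconds), not arrival time.
    event_time = datetime.fromtimestamp(
        row[config["event_time_field"]], tz=timezone.utc
    )
    partitions = {
        "year": event_time.strftime("%Y"),  # derived, hidden-partition style
        "dt": event_time.strftime("%Y-%m-%d"),
        config["partition_field"]: partition_value,
    }
    return row, partitions


config = {"event_time_field": "client_ts", "partition_field": "event_type", "max_depth": 2}
record = json.dumps({
    "event_type": "login",
    "client_ts": 1700000000,
    "device": {"os": "ios", "screen": {"w": 390}},
})
row, partitions = to_row(record, config)
# partitions == {"year": "2023", "dt": "2023-11-14", "event_type": "login"}
```

With `max_depth` set to 2, `device.os` becomes the column `device_os`, while the deeper `device.screen` object is kept as a JSON string in `device_screen`. Deriving `dt` from the record's own timestamp keeps late-arriving events in the partition for the day they occurred.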
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
