Data Lake Architecture, Ingestion Options, Real-time Optimization, and Query Practices
This article presents a comprehensive overview of a unified data lake architecture, evaluates three ingestion solutions, details real‑time ingestion optimizations for Flink‑Hudi pipelines, and describes how Kyuubi enables unified query access across multiple engines, offering practical guidance for large‑scale data processing.
The article introduces a data lake as a unified storage pool that integrates multiple data sources and computation engines, aiming to simplify architecture, improve development efficiency, and reduce operational overhead.
Overall Architecture – A stream‑batch integrated design based on Hudi and Flink replaces the complex Lambda architecture, covering data collection, ETL, scheduling, monitoring, metadata management, and permission control. Data originates from MySQL, MongoDB, Tablestore, and Hana, and is stored in Hudi and Doris, serving OLAP, machine learning, API, and BI workloads via Kyuubi.
Ingestion Options – Three schemes are compared:
Flink SQL with the MySQL‑CDC and Hudi connectors, where each synchronized table opens its own binlog reading thread.
Dinky‑based whole‑database sync, which merges all table sources into a single source node before routing records to their individual Hudi tables.
Spark‑based DeltaStreamer with schema‑evolution support, consuming Debezium change streams via Confluent Schema Registry.
Each scheme’s advantages, drawbacks, and suitable scenarios are summarized, with a decision flowchart guiding selection based on compute engine, data volume, and schema stability.
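As a concrete illustration of the first scheme, here is a minimal Flink SQL sketch: one CDC source table and one Hudi sink per business table. All hostnames, database/table names, and paths below are hypothetical placeholders, and connector option names may vary across Flink CDC and Hudi versions.

```sql
-- Hypothetical per-table pipeline for scheme 1 (Flink SQL + MySQL-CDC + Hudi).
CREATE TABLE orders_src (
  id BIGINT,
  amount DECIMAL(10, 2),
  updated_at TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql-host',
  'port' = '3306',
  'username' = 'flink',
  'password' = '******',
  'database-name' = 'shop',
  'table-name' = 'orders'
);

CREATE TABLE orders_hudi (
  id BIGINT,
  amount DECIMAL(10, 2),
  updated_at TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///lake/shop/orders',
  'table.type' = 'MERGE_ON_READ'
);

-- Each such INSERT job holds its own binlog connection, which is
-- what limits this scheme when the number of tables grows large.
INSERT INTO orders_hudi SELECT * FROM orders_src;
```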
Real‑time Ingestion Optimization – The pipeline reads Pulsar streams via the Flink DataStream API and writes to Hudi MOR (Merge‑on‑Read) tables using PartialUpdateAvroPayload. Key challenges include index choice, compaction bottlenecks, duplicate files, and piled‑up pending compaction plans. Optimizations applied:
Switch from the Bloom index to the Bucket index to eliminate per‑record index lookups and improve write throughput.
Set the compaction‑commit operator's parallelism to 1 to avoid compaction stalls.
Use the Hudi‑CLI repair deduplicate command to clean partitions containing duplicate files.
Roll back problematic compaction instants and run offline compaction via sh bin/hudi-compactor.sh, removing stuck plans with the compaction unschedule command.
Tune Flink resources (5 GB jobmanager memory, 50 GB taskmanager memory, slot count), write rate limits, batch sizes, and compaction memory to balance latency against resource usage.
These adjustments resulted in stable long‑running jobs, minute‑level checkpoints, and reduced YARN resource consumption.
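The write‑side knobs above can be collected in the Hudi sink's WITH clause. This is an illustrative sketch, not the authors' actual configuration: option keys follow the Flink‑Hudi connector but exact names and defaults vary by Hudi version, and every value here is a placeholder.

```sql
-- Illustrative Hudi sink DDL gathering the tuning options discussed above.
CREATE TABLE events_hudi (
  id BIGINT,
  payload STRING,
  updated_at TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///lake/events',        -- placeholder path
  'table.type' = 'MERGE_ON_READ',
  'payload.class' = 'org.apache.hudi.common.model.PartialUpdateAvroPayload',
  'index.type' = 'BUCKET',               -- Bucket index instead of Bloom
  'hoodie.bucket.index.num.buckets' = '64',
  'write.rate.limit' = '30000',          -- cap ingest rate (records/s)
  'write.batch.size' = '256',            -- buffered MB before flush
  'compaction.tasks' = '1',              -- serialize compaction work
  'compaction.max_memory' = '1024'       -- MB for the compaction merge
);
```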
Query on the Data Lake – Before Kyuubi, queries went through disparate clients (JDBC, Beeline, Spark/Flink clients), producing a fragmented experience. Integrating Kyuubi established a unified entry point with LDAP authentication, Ranger‑based authorization, and metadata catalog integration, enabling cross‑engine SQL queries, metadata‑driven table creation, and support for both batch and streaming workloads.
The article concludes with a summary of the optimization steps and a roadmap for further improvements.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.