Data Lake Architecture, Ingestion Options, Real-time Optimization, and Query Practices
This article presents a comprehensive overview of a unified data lake architecture, evaluates three ingestion solutions, details real‑time ingestion optimizations for Flink‑Hudi pipelines, and describes how Kyuubi enables unified query access across multiple engines, offering practical guidance for large‑scale data processing.
The article introduces a data lake as a unified storage pool that integrates multiple data sources and computation engines, aiming to simplify architecture, improve development efficiency, and reduce operational overhead.
Overall Architecture – A stream‑batch integrated design based on Hudi and Flink replaces the complex Lambda architecture, covering data collection, ETL, scheduling, monitoring, metadata management, and permission control. Data originates from MySQL, MongoDB, Tablestore, and Hana, and is stored in Hudi and Doris, serving OLAP, machine learning, API, and BI workloads via Kyuubi.
Ingestion Options – Three schemes are compared:
Flink SQL with the MySQL‑CDC and Hudi connectors, where each synchronized table opens its own binlog reading thread.
Dinky‑based whole‑database sync, which merges all table sources into a single source node before routing records to their individual Hudi tables.
Spark‑based DeltaStreamer with schema‑evolution support, consuming Debezium change streams via Confluent Schema Registry.
Each scheme’s advantages, drawbacks, and suitable scenarios are summarized, with a decision flowchart guiding selection based on compute engine, data volume, and schema stability.
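As a concrete illustration of the first scheme, here is a minimal Flink SQL sketch: one CDC source table and one Hudi sink per business table. All hostnames, database/table names, and paths below are hypothetical placeholders, and connector option names may vary across Flink CDC and Hudi versions.

```sql
-- Hypothetical per-table pipeline for scheme 1 (Flink SQL + MySQL-CDC + Hudi).
CREATE TABLE orders_src (
  id BIGINT,
  amount DECIMAL(10, 2),
  updated_at TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql-host',
  'port' = '3306',
  'username' = 'flink',
  'password' = '******',
  'database-name' = 'shop',
  'table-name' = 'orders'
);

CREATE TABLE orders_hudi (
  id BIGINT,
  amount DECIMAL(10, 2),
  updated_at TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///lake/shop/orders',
  'table.type' = 'MERGE_ON_READ'
);

-- Each such INSERT job holds its own binlog connection, which is
-- what limits this scheme when the number of tables grows large.
INSERT INTO orders_hudi SELECT * FROM orders_src;
```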
Real‑time Ingestion Optimization – The pipeline reads Pulsar streams via the Flink DataStream API and writes to Hudi MOR (Merge‑on‑Read) tables using PartialUpdateAvroPayload. Key challenges include index choice, compaction bottlenecks, duplicate files, and piled‑up pending compaction plans. Optimizations applied:
Switch from the Bloom index to the Bucket index to eliminate per‑record index lookups and improve write throughput.
Set the compaction‑commit operator's parallelism to 1 to avoid compaction stalls.
Use the Hudi‑CLI repair deduplicate command to clean partitions containing duplicate files.
Roll back problematic compaction instants and run offline compaction via sh bin/hudi-compactor.sh, removing stuck plans with the compaction unschedule command.
Tune Flink resources (5 GB jobmanager memory, 50 GB taskmanager memory, slot count), write rate limits, batch sizes, and compaction memory to balance latency against resource usage.
These adjustments resulted in stable long‑running jobs, minute‑level checkpoints, and reduced YARN resource consumption.
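The write‑side knobs above can be collected in the Hudi sink's WITH clause. This is an illustrative sketch, not the authors' actual configuration: option keys follow the Flink‑Hudi connector but exact names and defaults vary by Hudi version, and every value here is a placeholder.

```sql
-- Illustrative Hudi sink DDL gathering the tuning options discussed above.
CREATE TABLE events_hudi (
  id BIGINT,
  payload STRING,
  updated_at TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///lake/events',        -- placeholder path
  'table.type' = 'MERGE_ON_READ',
  'payload.class' = 'org.apache.hudi.common.model.PartialUpdateAvroPayload',
  'index.type' = 'BUCKET',               -- Bucket index instead of Bloom
  'hoodie.bucket.index.num.buckets' = '64',
  'write.rate.limit' = '30000',          -- cap ingest rate (records/s)
  'write.batch.size' = '256',            -- buffered MB before flush
  'compaction.tasks' = '1',              -- serialize compaction work
  'compaction.max_memory' = '1024'       -- MB for the compaction merge
);
```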
Query on the Data Lake – Before Kyuubi, queries went through disparate clients (JDBC, Beeline, Spark/Flink clients), producing a fragmented experience. Integrating Kyuubi established a unified entry point with LDAP authentication, Ranger‑based authorization, and metadata catalog integration, enabling cross‑engine SQL queries, metadata‑driven table creation, and support for both batch and streaming workloads.
The article concludes with a summary of the optimization steps and a roadmap for further improvements.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.