
Applying Apache Hudi in Medical Big Data: Architecture, Synchronization, Storage Choices, and Future Directions

This article examines the use of Apache Hudi for building a hospital‑wide medical big‑data platform, covering construction background, reasons for selecting Hudi, data synchronization methods, storage mode choices, query optimizations, and future development considerations.


1. Construction Background

Our company builds hospital‑wide big‑data platforms, extracting data from many systems such as HIS (hospital information), LIS (laboratory information), EMR (electronic medical records), radiology, and others. This introduces several challenges: heterogeneous source databases, unified data modeling, large variance in data volume across tables, and real‑time requirements.

2. Why Choose Hudi

The previous solution was a long pipeline: binlog → JSON → Kafka → Spark Streaming → HBase → DataX → Hadoop → Impala → Greenplum. It suffered from pipeline complexity, difficult data validation, storage redundancy, heavy query load, and high latency.

Hudi was selected for its dual write modes (Copy‑On‑Write and Merge‑On‑Read), support for multiple query engines (Hive, Spark SQL, Presto, Impala), rich indexing (HBase, InMemory, Bloom, Global Bloom), and Parquet columnar storage with small‑file merging.
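As a concrete illustration of these features, the write mode and index type are chosen through Hudi's datasource options. The sketch below shows a plausible option set for a Copy‑On‑Write table with a Bloom index; the table name, field names, and path are hypothetical examples, not taken from the article.

```python
# Illustrative Hudi write options: Copy-On-Write table type, Bloom index,
# and small-file merging via the Parquet small-file limit.
# "patient_visits", "visit_id", "update_ts", "hospital_dept" are made-up names.
hudi_options = {
    "hoodie.table.name": "patient_visits",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",    # vs. MERGE_ON_READ
    "hoodie.datasource.write.recordkey.field": "visit_id",    # key used for upserts
    "hoodie.datasource.write.precombine.field": "update_ts",  # newest record wins on merge
    "hoodie.datasource.write.partitionpath.field": "hospital_dept",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.index.type": "BLOOM",                             # or HBASE / GLOBAL_BLOOM / INMEMORY
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),  # merge files under ~100 MB
}

# In a Spark job these options would be applied roughly like:
# df.write.format("hudi").options(**hudi_options).mode("append").save("/data/hudi/patient_visits")
```

The small‑file limit is what lets Hudi fold small incremental writes back into larger Parquet files instead of accumulating them.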

3. Hudi Data Synchronization

Data synchronization takes two paths. Offline full loads use DataX with multithreaded JDBC extraction. For online near‑real‑time sync, the source tables write JSON change records to Kafka, Flink lands them in HDFS partitions, and a scheduling service then triggers the Hudi merge jobs.
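The near‑real‑time path hinges on routing each Kafka change record to the right HDFS staging partition before the merge job picks it up. A minimal sketch of that routing step follows; the `partition_path` helper, the `table`/`op_ts` field names, and the path layout are all assumptions for illustration, not the article's actual code.

```python
import json
from datetime import datetime, timezone

def partition_path(raw: str, base: str = "/warehouse/staging") -> str:
    """Map one JSON change record from Kafka to an HDFS staging partition.

    Assumes each record carries a source "table" name and an "op_ts" epoch
    timestamp; a Flink sink would use a path like this for daily partitions,
    and the downstream service would trigger a Hudi merge per partition.
    """
    rec = json.loads(raw)
    day = datetime.fromtimestamp(rec["op_ts"], tz=timezone.utc).strftime("%Y%m%d")
    return f"{base}/{rec['table']}/dt={day}"
```

Partitioning by source table and day keeps each Hudi merge job scoped to a small, well-defined slice of new data.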

4. Storage Type Selection and Query Optimization

We adopted the Copy‑On‑Write mode to keep query latency low, using both read‑optimized and incremental views. Query performance is further tuned through Spark SQL partitioning, job parallelism, broadcasting small tables, and mitigating data skew; Presto queries ran roughly three times faster after incremental view support was enabled.
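Of the tactics listed, data‑skew mitigation is commonly done by salting hot join keys so a few high‑volume values do not pile into one Spark partition. A minimal sketch, assuming the `salted_key` helper, the department names, and the bucket count are all hypothetical:

```python
def salted_key(key: str, row_id: int, hot_keys: set, buckets: int = 8) -> str:
    """Spread a skewed join key across several buckets.

    Rows carrying a hot key (e.g. a handful of high-volume departments) get a
    per-row salt suffix, so they hash to different partitions. The small
    dimension side is then replicated once per bucket (key#0 .. key#7) before
    the join; non-hot keys are left untouched.
    """
    if key in hot_keys:
        return f"{key}#{row_id % buckets}"  # per-row salt, not per-key
    return key
```

The salt must vary per row, not per key: a salt derived only from the key would send every row of the hot key back to a single partition.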

5. Future Work and Thoughts

Plans include integrating FlinkX‑style offline sync, improving multi‑output Spark consumption, enhancing Hudi support in Flink, and deeper community involvement.

Tags: Data Synchronization, Presto, Spark, Apache Hudi, Copy-On-Write, Medical Big Data
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies
