
Applying Apache Hudi in Medical Big Data: Architecture, Synchronization, Storage Choices, and Future Directions

This article examines the use of Apache Hudi for building a hospital‑wide medical big‑data platform, covering construction background, reasons for selecting Hudi, data synchronization methods, storage mode choices, query optimizations, and future development considerations.


1. Construction Background

Our company builds hospital‑wide big‑data platforms, extracting data from many systems such as HIS (hospital information), LIS (laboratory information), EMR (electronic medical records), radiology, and others. This introduces several challenges: heterogeneous source databases, unified data modeling, large variance in data volume across tables, and real‑time requirements.

2. Why Choose Hudi

The previous solution was a long pipeline: binlog → JSON → Kafka → Spark Streaming → HBase → DataX → Hadoop → Impala → Greenplum. It suffered from pipeline complexity, difficult data validation, storage redundancy, heavy query load, and high latency.

Hudi was selected for its dual write modes (Copy‑On‑Write and Merge‑On‑Read), support for multiple query engines (Hive, Spark SQL, Presto, Impala), rich indexing (HBase, InMemory, Bloom, Global Bloom), and Parquet columnar storage with small‑file merging.
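As a concrete illustration of these features, the write mode and index type are chosen through Hudi's datasource options. The sketch below shows a plausible option set for a Copy‑On‑Write table with a Bloom index; the table name, field names, and path are hypothetical examples, not taken from the article.

```python
# Illustrative Hudi write options: Copy-On-Write table type, Bloom index,
# and small-file merging via the Parquet small-file limit.
# "patient_visits", "visit_id", "update_ts", "hospital_dept" are made-up names.
hudi_options = {
    "hoodie.table.name": "patient_visits",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",    # vs. MERGE_ON_READ
    "hoodie.datasource.write.recordkey.field": "visit_id",    # key used for upserts
    "hoodie.datasource.write.precombine.field": "update_ts",  # newest record wins on merge
    "hoodie.datasource.write.partitionpath.field": "hospital_dept",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.index.type": "BLOOM",                             # or HBASE / GLOBAL_BLOOM / INMEMORY
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),  # merge files under ~100 MB
}

# In a Spark job these options would be applied roughly like:
# df.write.format("hudi").options(**hudi_options).mode("append").save("/data/hudi/patient_visits")
```

The small‑file limit is what lets Hudi fold small incremental writes back into larger Parquet files instead of accumulating them.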

3. Hudi Data Synchronization

Data synchronization takes two paths. Offline full loads use DataX with multithreaded JDBC extraction. For online near‑real‑time sync, the source tables write JSON change records to Kafka, Flink lands them in HDFS partitions, and a scheduling service then triggers the Hudi merge jobs.
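The near‑real‑time path hinges on routing each Kafka change record to the right HDFS staging partition before the merge job picks it up. A minimal sketch of that routing step follows; the `partition_path` helper, the `table`/`op_ts` field names, and the path layout are all assumptions for illustration, not the article's actual code.

```python
import json
from datetime import datetime, timezone

def partition_path(raw: str, base: str = "/warehouse/staging") -> str:
    """Map one JSON change record from Kafka to an HDFS staging partition.

    Assumes each record carries a source "table" name and an "op_ts" epoch
    timestamp; a Flink sink would use a path like this for daily partitions,
    and the downstream service would trigger a Hudi merge per partition.
    """
    rec = json.loads(raw)
    day = datetime.fromtimestamp(rec["op_ts"], tz=timezone.utc).strftime("%Y%m%d")
    return f"{base}/{rec['table']}/dt={day}"
```

Partitioning by source table and day keeps each Hudi merge job scoped to a small, well-defined slice of new data.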

4. Storage Type Selection and Query Optimization

We adopted the Copy‑On‑Write mode to keep query latency low, using both read‑optimized and incremental views. Query performance is further tuned through Spark SQL partitioning, job parallelism, broadcasting small tables, and mitigating data skew; Presto queries ran roughly three times faster after incremental view support was enabled.
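Of the tactics listed, data‑skew mitigation is commonly done by salting hot join keys so a few high‑volume values do not pile into one Spark partition. A minimal sketch, assuming the `salted_key` helper, the department names, and the bucket count are all hypothetical:

```python
def salted_key(key: str, row_id: int, hot_keys: set, buckets: int = 8) -> str:
    """Spread a skewed join key across several buckets.

    Rows carrying a hot key (e.g. a handful of high-volume departments) get a
    per-row salt suffix, so they hash to different partitions. The small
    dimension side is then replicated once per bucket (key#0 .. key#7) before
    the join; non-hot keys are left untouched.
    """
    if key in hot_keys:
        return f"{key}#{row_id % buckets}"  # per-row salt, not per-key
    return key
```

The salt must vary per row, not per key: a salt derived only from the key would send every row of the hot key back to a single partition.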

5. Future Work and Thoughts

Plans include integrating FlinkX‑style offline sync, improving multi‑output Spark consumption, enhancing Hudi support in Flink, and deeper community involvement.

Tags: Data Synchronization, Presto, Spark, Apache Hudi, Copy-On-Write, Medical Big Data
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies
