How Taikang Life Built a Scalable Lakehouse with Apache Hudi for Big Health Data
This article details Taikang Life's end‑to‑end design and implementation of a lakehouse‑style distributed data platform built on Apache Hudi, covering background, technical selection, architecture, custom Hudi extensions for the health insurance domain, performance benchmarks, real‑world results, and future work.
Abstract
This article presents Taikang Life's technical selection, overall architecture design, and implementation of a lakehouse-integrated distributed data processing platform based on Apache Hudi, including domain-specific extensions for the big-health sector, practice insights, and application outcomes.
Company Profile
Taikang Life Insurance Co., Ltd. is a national joint-stock life insurer founded in 1996, headquartered in Beijing, and part of the Taikang Insurance Group. The group has grown into a large insurance-financial services enterprise covering insurance, asset management, and health care, ranking in the Fortune Global 500 with managed assets exceeding CNY 2.8 trillion.
Background
In response to the national "Healthy China 2030" initiative and rapid growth of the big‑health industry, Taikang Life needed a modern data foundation to break data silos and support strategic initiatives. Existing on‑premise physical servers and commercial databases caused fragmented data and high management costs.
Concepts
Data Lake : A centralized repository that ingests and stores raw data of any type at scale.
Data Warehouse : An enterprise system that aggregates structured data from multiple sources into curated schemas to support reporting and business analytics.
Data Lakehouse : A platform that combines the best aspects of data lakes and warehouses, providing fast ingestion, transaction support, and unified data services.
Technical Selection
Three open‑source lake components were evaluated: Apache Iceberg, Apache Hudi, and Delta Lake. Evaluation dimensions included community momentum, feature set, and performance benchmarks using an internal insurance‑policy dataset (~74 million rows, 183 GB).
Community Momentum
Metrics such as GitHub stars, forks, contributors, PRs, and issue activity showed Apache Hudi leading in community activity and Chinese contributor participation, followed by Iceberg, while Delta Lake exhibited slower growth.
Features
Feature comparison demonstrated that Apache Hudi best satisfied Taikang's functional requirements, especially for streaming ingestion, upserts, and Flink integration.
Performance
Benchmark tests on the insurance‑policy dataset indicated Delta Lake slightly ahead of Hudi, with Iceberg lagging behind both.
Selection Result
Active community with diverse contributors and strong development momentum.
Key lake features and seamless Flink compatibility.
Performance meeting core requirements.
Consequently, Apache Hudi was chosen as the lake component.
Lakehouse Architecture Design
The architecture consists of the following layers:
Data Sources : Enterprise databases (e.g., IBM DB2) and messaging systems (Kafka) providing raw business data.
Data Processing : Primarily Apache Flink with auxiliary Spark jobs, leveraging Flink’s unified batch‑stream model and rich connectors.
Infrastructure : HDFS cluster for distributed storage and Taikang Cloud OSS for object storage.
Lake Platform : Apache Hudi as the streaming data‑lake layer, offering transactional support, upserts, and fast ingestion.
Data Modeling : Tables built on Hudi format, enabling multi‑field upserts without losing existing column values.
Data Access : Hive metastore (with Kerberos security), Trino, ClickHouse, and REST APIs for discovery, governance, and query.
Data Applications : BI, reporting, analytics, and other health‑insurance services.
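The data-modeling layer above centers on Hudi's primary-key upsert model. A minimal sketch of the kind of write configuration involved is shown below; the option names are standard Apache Hudi configs, while the table, field, and path names are hypothetical placeholders, not Taikang's actual settings.

```python
# Illustrative Hudi write options for an upsert table keyed on policy ID.
# Option names are standard Hudi configs; table/field names are hypothetical.
hudi_options = {
    "hoodie.table.name": "policy_detail",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",     # favors fast streaming ingest
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "policy_id",    # primary key for upserts
    "hoodie.datasource.write.precombine.field": "event_time",  # newest record wins on collision
    "hoodie.datasource.write.partitionpath.field": "branch_code",
}

# With Spark and the Hudi bundle on the classpath, these options would be
# passed to a DataFrame writer, e.g.:
#   df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

MERGE_ON_READ is sketched here because it trades some read cost for low-latency ingestion, which matches the streaming-first architecture described above.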
Implementation Highlights
Version Selection
All component versions were carefully aligned to ensure compatibility and stability (details omitted for brevity).
Typical Problems & Solutions
Synchronizing Hudi table metadata to Hive Metastore to avoid governance overhead.
Mitigating small‑file explosion by using Hudi’s asynchronous clustering after bulk ingestion.
Enabling Kerberos authentication for Hudi by patching source code to support secure HDFS/Hive access.
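The first two remedies above map onto standard Hudi configuration. The sketch below shows plausible settings for Hive Metastore sync and asynchronous clustering; the option names come from Hudi's documented config surface, but the database/table names and size thresholds are illustrative assumptions, not production values.

```python
# Illustrative options for metadata sync and small-file mitigation.
# Option names are standard Hudi configs; names and thresholds are hypothetical.
sync_and_clustering_options = {
    # Sync table metadata to the Hive Metastore so Trino/Hive can
    # discover the table without manual DDL.
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.database": "lakehouse",
    "hoodie.datasource.hive_sync.table": "policy_detail",
    # Asynchronous clustering: rewrite files below the small-file limit
    # into larger target files after bulk ingestion.
    "hoodie.clustering.async.enabled": "true",
    "hoodie.clustering.plan.strategy.small.file.limit": str(64 * 1024 * 1024),        # 64 MB
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024), # 1 GB
}
```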
Custom Hudi Extensions
Domain‑Specific Improvements
Major extensions were developed to meet Taikang's strategic "three-loop" model:
Multi‑field partitioned upsert based on primary key, allowing real‑time integration of policy, pension, and medical data without overwriting unrelated columns.
Multi‑event‑time validation to guarantee the latest record is persisted regardless of ingestion order, ensuring 100% data accuracy for time‑sensitive insurance workflows.
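The partial-update semantics described in the first extension can be illustrated with a small sketch. This is hypothetical code mimicking the merge behavior, not Taikang's actual Hudi payload implementation: an incoming record updates only the fields it carries, leaving the other columns of the stored row untouched.

```python
# Hypothetical sketch of multi-field partial upsert: an incoming record
# overwrites only the fields it actually supplies (None = "not supplied"),
# so unrelated columns written by other source systems survive the merge.

def partial_upsert(stored: dict, incoming: dict) -> dict:
    """Merge `incoming` into `stored`, keeping stored values for
    any field the incoming record does not provide."""
    merged = dict(stored)
    for field, value in incoming.items():
        if value is not None:
            merged[field] = value
    return merged

# A policy row already holds pension data; a medical-claims event arrives
# later and must not wipe out the pension column.
stored = {"policy_id": "P001", "pension_plan": "A", "claim_count": None}
incoming = {"policy_id": "P001", "pension_plan": None, "claim_count": 3}
print(partial_upsert(stored, incoming))
# → {'policy_id': 'P001', 'pension_plan': 'A', 'claim_count': 3}
```

In Hudi terms this behavior lives in a custom record payload's merge logic; the sketch only conveys the field-level semantics.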
Application Scenario
The extensions were validated on the "real‑time policy acceptance" use case, which requires high‑throughput, low‑latency processing of policy, agent, and medical information.
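The out-of-order arrival problem this scenario raises can be illustrated with a short sketch. The function and field names are hypothetical; the point is only the event-time comparison: whichever record carries the newer event time wins, no matter which one was ingested first.

```python
# Hypothetical sketch of event-time validation: late-arriving records with
# older event times must not overwrite newer state.

def resolve_latest(existing: dict, incoming: dict, ts_field: str = "event_time") -> dict:
    """Keep the record with the newer event time, regardless of
    ingestion order; stale late arrivals are discarded."""
    return incoming if incoming[ts_field] >= existing[ts_field] else existing

# A correction (event_time=200) is ingested before the original acceptance
# event (event_time=100); the correction must still win.
correction = {"policy_id": "P001", "status": "amended", "event_time": 200}
late = {"policy_id": "P001", "status": "accepted", "event_time": 100}
state = resolve_latest(correction, late)  # late event processed second
print(state)
# → {'policy_id': 'P001', 'status': 'amended', 'event_time': 200}
```

In Hudi this comparison typically maps onto the precombine/ordering field of a record payload; extending it to multiple event-time fields is what the validation extension describes.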
Results
After deployment, the platform processes over 600 million policy updates daily with 100% data correctness, supporting more than 100 streaming jobs and 1,200 batch ETL tasks, with total stored data approaching 300 TB.
Future Work
Integrate additional components (ML, recommendation engines) while preserving usability.
Enhance monitoring, fault‑tolerance, and disaster‑recovery mechanisms.
Continue customizing Hudi for big‑health specific workloads (e.g., custom filters, payloads).
Summary
This article has described the end-to-end construction of Taikang Life's lakehouse data platform, the strategic customizations of Apache Hudi for the big-health domain, and the substantial operational benefits achieved, positioning the platform as a core enabler of the company's long-term data-driven strategy.
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.