How Taikang Life Built a Scalable Lakehouse with Apache Hudi for Big Health Data
This article details Taikang Life's end‑to‑end design and implementation of a lakehouse‑style distributed data platform built on Apache Hudi, covering background, technical selection, architecture, custom Hudi extensions for the health insurance domain, performance benchmarks, real‑world results, and future work.
Abstract
This article presents Taikang Life's technical selection, overall architecture design, and implementation of a lakehouse-integrated distributed data processing platform based on Apache Hudi, including domain-specific extensions for the big-health sector, practice insights, and application outcomes.
Company Profile
Taikang Life Insurance Co., Ltd. is a national joint-stock life insurer founded in 1996, headquartered in Beijing, and part of the Taikang Insurance Group. The group has grown into a large insurance-financial services enterprise covering insurance, asset management, and health care, ranking in the Fortune Global 500 with managed assets exceeding CNY 2.8 trillion.
Background
In response to the national "Healthy China 2030" initiative and rapid growth of the big‑health industry, Taikang Life needed a modern data foundation to break data silos and support strategic initiatives. Existing on‑premise physical servers and commercial databases caused fragmented data and high management costs.
Concepts
Data Lake : A centralized repository that ingests and stores raw data of any type at scale.
Data Warehouse : An enterprise system that aggregates structured data from multiple sources into curated schemas to support reporting and business analytics.
Data Lakehouse : A platform that combines the best aspects of data lakes and warehouses, providing fast ingestion, transaction support, and unified data services.
Technical Selection
Three open‑source lake components were evaluated: Apache Iceberg, Apache Hudi, and Delta Lake. Evaluation dimensions included community momentum, feature set, and performance benchmarks using an internal insurance‑policy dataset (~74 million rows, 183 GB).
Community Momentum
Metrics such as GitHub stars, forks, contributors, PRs, and issue activity showed Apache Hudi leading in community activity and Chinese contributor participation, followed by Iceberg, while Delta Lake exhibited slower growth.
Features
Feature comparison demonstrated that Apache Hudi best satisfied Taikang's functional requirements, especially for streaming ingestion, upserts, and Flink integration.
Performance
Benchmark tests on the insurance‑policy dataset indicated Delta Lake slightly ahead of Hudi, with Iceberg lagging behind both.
Selection Result
Active community with diverse contributors and strong development momentum.
Key lake features and seamless Flink compatibility.
Performance meeting core requirements.
Consequently, Apache Hudi was chosen as the lake component.
Lakehouse Architecture Design
The architecture consists of the following layers:
Data Sources : Enterprise databases (e.g., IBM DB2) and messaging systems (Kafka) providing raw business data.
Data Processing : Primarily Apache Flink with auxiliary Spark jobs, leveraging Flink’s unified batch‑stream model and rich connectors.
Infrastructure : HDFS cluster for distributed storage and Taikang Cloud OSS for object storage.
Lake Platform : Apache Hudi as the streaming data‑lake layer, offering transactional support, upserts, and fast ingestion.
Data Modeling : Tables built on Hudi format, enabling multi‑field upserts without losing existing column values.
Data Access : Hive metastore (with Kerberos security), Trino, ClickHouse, and REST APIs for discovery, governance, and query.
Data Applications : BI, reporting, analytics, and other health‑insurance services.
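The data-modeling layer above centers on Hudi's primary-key upsert model. A minimal sketch of the kind of write configuration involved is shown below; the option names are standard Apache Hudi configs, while the table, field, and path names are hypothetical placeholders, not Taikang's actual settings.

```python
# Illustrative Hudi write options for an upsert table keyed on policy ID.
# Option names are standard Hudi configs; table/field names are hypothetical.
hudi_options = {
    "hoodie.table.name": "policy_detail",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",     # favors fast streaming ingest
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "policy_id",    # primary key for upserts
    "hoodie.datasource.write.precombine.field": "event_time",  # newest record wins on collision
    "hoodie.datasource.write.partitionpath.field": "branch_code",
}

# With Spark and the Hudi bundle on the classpath, these options would be
# passed to a DataFrame writer, e.g.:
#   df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

MERGE_ON_READ is sketched here because it trades some read cost for low-latency ingestion, which matches the streaming-first architecture described above.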
Implementation Highlights
Version Selection
All component versions were carefully aligned to ensure compatibility and stability (details omitted for brevity).
Typical Problems & Solutions
Synchronizing Hudi table metadata to Hive Metastore to avoid governance overhead.
Mitigating small‑file explosion by using Hudi’s asynchronous clustering after bulk ingestion.
Enabling Kerberos authentication for Hudi by patching source code to support secure HDFS/Hive access.
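The first two remedies above map onto standard Hudi configuration. The sketch below shows plausible settings for Hive Metastore sync and asynchronous clustering; the option names come from Hudi's documented config surface, but the database/table names and size thresholds are illustrative assumptions, not production values.

```python
# Illustrative options for metadata sync and small-file mitigation.
# Option names are standard Hudi configs; names and thresholds are hypothetical.
sync_and_clustering_options = {
    # Sync table metadata to the Hive Metastore so Trino/Hive can
    # discover the table without manual DDL.
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.database": "lakehouse",
    "hoodie.datasource.hive_sync.table": "policy_detail",
    # Asynchronous clustering: rewrite files below the small-file limit
    # into larger target files after bulk ingestion.
    "hoodie.clustering.async.enabled": "true",
    "hoodie.clustering.plan.strategy.small.file.limit": str(64 * 1024 * 1024),        # 64 MB
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024), # 1 GB
}
```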
Custom Hudi Extensions
Domain‑Specific Improvements
Major extensions were developed to meet Taikang's strategic "three-loop" model:
Multi‑field partitioned upsert based on primary key, allowing real‑time integration of policy, pension, and medical data without overwriting unrelated columns.
Multi‑event‑time validation to guarantee the latest record is persisted regardless of ingestion order, ensuring 100% data accuracy for time‑sensitive insurance workflows.
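The partial-update semantics described in the first extension can be illustrated with a small sketch. This is hypothetical code mimicking the merge behavior, not Taikang's actual Hudi payload implementation: an incoming record updates only the fields it carries, leaving the other columns of the stored row untouched.

```python
# Hypothetical sketch of multi-field partial upsert: an incoming record
# overwrites only the fields it actually supplies (None = "not supplied"),
# so unrelated columns written by other source systems survive the merge.

def partial_upsert(stored: dict, incoming: dict) -> dict:
    """Merge `incoming` into `stored`, keeping stored values for
    any field the incoming record does not provide."""
    merged = dict(stored)
    for field, value in incoming.items():
        if value is not None:
            merged[field] = value
    return merged

# A policy row already holds pension data; a medical-claims event arrives
# later and must not wipe out the pension column.
stored = {"policy_id": "P001", "pension_plan": "A", "claim_count": None}
incoming = {"policy_id": "P001", "pension_plan": None, "claim_count": 3}
print(partial_upsert(stored, incoming))
# → {'policy_id': 'P001', 'pension_plan': 'A', 'claim_count': 3}
```

In Hudi terms this behavior lives in a custom record payload's merge logic; the sketch only conveys the field-level semantics.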
Application Scenario
The extensions were validated on the "real‑time policy acceptance" use case, which requires high‑throughput, low‑latency processing of policy, agent, and medical information.
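The out-of-order arrival problem this scenario raises can be illustrated with a short sketch. The function and field names are hypothetical; the point is only the event-time comparison: whichever record carries the newer event time wins, no matter which one was ingested first.

```python
# Hypothetical sketch of event-time validation: late-arriving records with
# older event times must not overwrite newer state.

def resolve_latest(existing: dict, incoming: dict, ts_field: str = "event_time") -> dict:
    """Keep the record with the newer event time, regardless of
    ingestion order; stale late arrivals are discarded."""
    return incoming if incoming[ts_field] >= existing[ts_field] else existing

# A correction (event_time=200) is ingested before the original acceptance
# event (event_time=100); the correction must still win.
correction = {"policy_id": "P001", "status": "amended", "event_time": 200}
late = {"policy_id": "P001", "status": "accepted", "event_time": 100}
state = resolve_latest(correction, late)  # late event processed second
print(state)
# → {'policy_id': 'P001', 'status': 'amended', 'event_time': 200}
```

In Hudi this comparison typically maps onto the precombine/ordering field of a record payload; extending it to multiple event-time fields is what the validation extension describes.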
Results
After deployment, the platform processes over 600 million policy updates daily with 100% data correctness, supporting more than 100 streaming jobs and 1,200 batch ETL tasks, with total stored data approaching 300 TB.
Future Work
Integrate additional components (ML, recommendation engines) while preserving usability.
Enhance monitoring, fault‑tolerance, and disaster‑recovery mechanisms.
Continue customizing Hudi for big‑health specific workloads (e.g., custom filters, payloads).
Summary
This article has described the end-to-end construction of Taikang Life's lakehouse data platform, the strategic customizations of Apache Hudi for the big-health domain, and the substantial operational benefits achieved, positioning the platform as a core enabler of the company's long-term data-driven strategy.
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.