How Baidu’s Turing 3.0 Leverages Apache Iceberg to Boost Data Lake Performance

This article explains how Baidu’s next‑generation data platform Turing 3.0 integrates Apache Iceberg to solve the inefficiencies of the legacy MEG stack, detailing ecosystem components, migration strategies from Hive, table‑level optimizations, and future roadmap for high‑frequency, low‑latency analytics.

Baidu Geek Talk

Jun 30, 2025

Overview of the Turing 3.0 Ecosystem

The previous generation of Baidu MEG’s big‑data products suffered from platform fragmentation and poor usability, leading to low development efficiency, high learning costs, and slow business response. To address these issues, Baidu built the Turing 3.0 ecosystem, which includes the Turing Data Engine (TDE) compute & storage engine, Turing Data Studio (TDS) data‑development governance platform, and Turing Data Analysis (TDA) visual BI product.

Iceberg, an open‑source data‑lake table format, is adopted within this ecosystem to improve real‑time data ingestion, historical table updates, and overall data‑management efficiency.

Core Components

TDE (Turing Data Engine) : Compute and storage engine that combines Spark‑based processing over Hive and Iceberg tables with a high‑performance ClickHouse engine.

TDS (Turing Data Studio) : One‑stop data‑development and governance platform.

TDA (Turing Data Analysis) : Next‑generation visual BI tool.

The article focuses on the application and practice of Iceberg within the Turing 3.0 ecosystem.

Why Iceberg?

Hive‑based data warehouses in MEG face three main problems: costly full‑table rewrites for incremental updates, limited real‑time update capabilities, and poor query performance due to metadata loading and file‑system scans. Iceberg provides row‑level updates, minute‑level data freshness, full ACID transactions, file‑based metadata management, and time‑travel capabilities.

Feature Comparison

Key differences between Hive and Iceberg:

Row‑level update : Hive – not supported; Iceberg – supports MERGE INTO and UPSERT.

Timeliness : Hive – hour/day level; Iceberg – minute level.

Transaction : Hive – partial ACID; Iceberg – full ACID with snapshot isolation.

Metadata : Hive – stored in MySQL; Iceberg – stored alongside data files.

Version control : Hive – none; Iceberg – supports time‑travel via snapshots.

Iceberg Architecture

Iceberg organizes files into a metadata layer (version‑hint, metadata file, snapshot/manifest‑list, manifest file) and a data layer (Parquet data files). This structure enables efficient metadata queries and versioned data access.
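The layering described above can be pictured with a small in-memory model. This is only an illustrative sketch: the class names (`DataFile`, `Manifest`, `Snapshot`, `TableMetadata`) and the `plan_scan` helper are made up for this example and are not Iceberg's API, and real manifest lists keep partition summaries rather than the plain set used here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DataFile:
    path: str          # a Parquet file in the data layer
    partition: str     # partition value, e.g. "dt=2025-06-01"
    record_count: int


@dataclass(frozen=True)
class Manifest:
    files: Tuple[DataFile, ...]

    def partitions(self) -> set:
        # Toy stand-in for the partition summary kept for each manifest.
        return {f.partition for f in self.files}


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifest_list: Tuple[Manifest, ...]  # manifests visible in this snapshot


@dataclass
class TableMetadata:
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    current_snapshot_id: Optional[int] = None  # what the version hint resolves to

    def commit(self, snapshot: Snapshot) -> None:
        # A commit adds a snapshot and moves the current pointer to it.
        self.snapshots[snapshot.snapshot_id] = snapshot
        self.current_snapshot_id = snapshot.snapshot_id

    def plan_scan(self, partition: str, snapshot_id: Optional[int] = None) -> List[str]:
        # Passing snapshot_id reads an older version (time travel).
        sid = self.current_snapshot_id if snapshot_id is None else snapshot_id
        paths: List[str] = []
        for manifest in self.snapshots[sid].manifest_list:
            if partition not in manifest.partitions():
                continue  # whole manifest skipped without touching the file system
            paths.extend(f.path for f in manifest.files if f.partition == partition)
        return paths


m1 = Manifest((DataFile("data/dt=2025-06-01/a.parquet", "dt=2025-06-01", 100),))
m2 = Manifest((DataFile("data/dt=2025-06-02/b.parquet", "dt=2025-06-02", 50),))
table = TableMetadata()
table.commit(Snapshot(1, (m1,)))
table.commit(Snapshot(2, (m1, m2)))

latest = table.plan_scan("dt=2025-06-02")
as_of_first = table.plan_scan("dt=2025-06-02", snapshot_id=1)
```

Here `latest` contains the single file added by the second commit, while `as_of_first` is empty because snapshot 1 never referenced that partition. The file list comes entirely from metadata, which is the property that avoids directory scans.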

Migration from Hive to Iceberg

Two migration approaches are presented:

Method 1 – CALL migrate : Uses Iceberg’s built‑in migrate procedure, for example

CALL catalog_name.system.migrate('db.sample', map('foo','bar'))

to convert a Hive table in place. This is simple and reversible, but it renames the original Hive table and its data path, which causes downstream read failures and mount conflicts.

Method 2 – Metadata‑only migration : Builds Iceberg metadata that reuses the existing Hive partitions, keeping the original data path unchanged. After metadata construction, data is validated, and the table property is switched to Iceberg, allowing seamless read/write with unchanged table names.
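The flow of Method 2 (build metadata over existing files, validate, then switch the table property) can be sketched roughly as follows. The article does not show Baidu's implementation, so everything here is hypothetical: the function names, the dictionary-based manifest entries, the paths, and the `table_type` property key are assumptions chosen for illustration. Iceberg's Spark `add_files` procedure follows a similar idea of registering existing files without rewriting them.

```python
from typing import Dict, List


def build_manifest_entries(hive_partitions: Dict[str, List[str]]) -> List[dict]:
    """Create metadata entries that point at the existing Hive files in place."""
    entries = []
    for partition, file_paths in sorted(hive_partitions.items()):
        for path in sorted(file_paths):
            # No data is copied or moved; only the path is recorded.
            entries.append({"partition": partition, "file_path": path})
    return entries


def validate(hive_partitions: Dict[str, List[str]], entries: List[dict]) -> bool:
    """Check that every Hive file is registered exactly once and nothing extra."""
    hive_files = sorted(p for paths in hive_partitions.values() for p in paths)
    iceberg_files = sorted(e["file_path"] for e in entries)
    return hive_files == iceberg_files


# Hypothetical partition listing for a Hive table named db.sample.
hive_partitions = {
    "dt=2025-06-01": ["/warehouse/db/sample/dt=2025-06-01/part-0.parquet"],
    "dt=2025-06-02": [
        "/warehouse/db/sample/dt=2025-06-02/part-0.parquet",
        "/warehouse/db/sample/dt=2025-06-02/part-1.parquet",
    ],
}

entries = build_manifest_entries(hive_partitions)
table_props = {"table_type": "HIVE"}  # assumed property key, for illustration only

# Switch the table over only after validation passes.
if validate(hive_partitions, entries):
    table_props["table_type"] = "ICEBERG"
```

Because the entries reuse the original paths, the table name and data location stay the same, and the property switch is the only step downstream readers see.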

Iceberg Performance Optimizations

Two update strategies are used:

COW (Copy‑On‑Write) : Fast reads, slower writes; suited for read‑heavy workloads.

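MOR (Merge‑On‑Read) : Faster writes, slower reads; changes are written as delete files and merged at query time. Position Delete files mark rows by data file and row position, while Equality Delete files mark rows by column values, so each suits a different delete pattern.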

Choosing the appropriate strategy improves query latency and write throughput based on workload characteristics.
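The trade-off between the two strategies can be seen in a toy model of deleting one row. This is a simplified sketch, not Iceberg code: data files are plain Python lists, the function names are invented for this example, and details such as sequence numbers that scope equality deletes are left out.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Row = Tuple[int, str]  # (id, value)


def cow_delete(files: Dict[str, List[Row]], file_name: str, row_id: int) -> Dict[str, List[Row]]:
    """Copy-on-write: rewrite the affected file without the deleted row."""
    new_files = {name: rows for name, rows in files.items() if name != file_name}
    new_files[file_name + ".rewritten"] = [r for r in files[file_name] if r[0] != row_id]
    return new_files


def mor_position_delete(files: Dict[str, List[Row]], file_name: str, row_id: int) -> Set[Tuple[str, int]]:
    """Merge-on-read: record (file, position) pairs and leave the data file untouched."""
    return {(file_name, pos) for pos, row in enumerate(files[file_name]) if row[0] == row_id}


def mor_read(
    files: Dict[str, List[Row]],
    position_deletes: Set[Tuple[str, int]],
    equality_deletes: FrozenSet[int] = frozenset(),
) -> List[Row]:
    """Apply both kinds of delete markers while reading."""
    out = []
    for name in sorted(files):
        for pos, row in enumerate(files[name]):
            if (name, pos) in position_deletes or row[0] in equality_deletes:
                continue  # row is masked by a delete file
            out.append(row)
    return out


files = {"f1": [(1, "a"), (2, "b")], "f2": [(3, "c")]}

cow_files = cow_delete(files, "f1", 2)           # write cost: f1 is rewritten
pos_deletes = mor_position_delete(files, "f1", 2)  # write cost: one small marker
mor_rows = mor_read(files, pos_deletes)          # read cost: filtering at scan time
```

With COW the delete pays for rewriting `f1` up front and later reads need no extra work. With MOR the write only records `("f1", 1)`, and every read filters against the markers until compaction folds them back into the data files.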

Lifecycle Management & Optimization

Iceberg tables are managed through automated tasks that handle partition expiration, snapshot cleanup (time‑based and count‑based), and orphan file removal. Performance‑boosting operators include small‑file compaction and Z‑order sorting to improve query speed.
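How the time‑based and count‑based snapshot rules interact is easy to get wrong, so a small sketch may help. It assumes the two rules combine the way the `older_than` and `retain_last` arguments of Iceberg's `expire_snapshots` procedure do: a snapshot is expired only if it is older than the cutoff and not among the most recent ones to retain. The function name and tuple layout are made up for this example, and the article does not say which values Baidu's automated tasks use.

```python
from typing import List, Tuple

SnapshotInfo = Tuple[int, int]  # (snapshot_id, commit_time_in_epoch_seconds)


def select_expired(snapshots: List[SnapshotInfo], older_than: int, retain_last: int) -> List[int]:
    """Return IDs of snapshots older than the cutoff and outside the newest `retain_last`."""
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    protected = {sid for sid, _ in newest_first[:retain_last]}
    return sorted(
        sid for sid, ts in snapshots
        if ts < older_than and sid not in protected
    )


snapshots = [(1, 100), (2, 200), (3, 300), (4, 400)]

# Snapshots 1, 2 and 3 are older than the cutoff, but 3 is protected
# because it is one of the two newest.
expired = select_expired(snapshots, older_than=350, retain_last=2)
```

In this run `expired` is `[1, 2]`. The count‑based rule acts as a floor: even a very aggressive time cutoff always leaves the newest `retain_last` snapshots available for time travel and rollback.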

Future Plans

The team plans to keep extending Iceberg’s role in the platform, covering intelligent governance, query optimization, smart indexing, and broader business coverage, while maintaining low‑cost migration paths and high‑performance analytics across Baidu’s data platforms.
