
How Baidu’s Turing 3.0 Leverages Apache Iceberg to Boost Data Lake Performance

This article explains how Baidu’s next‑generation data platform, Turing 3.0, integrates Apache Iceberg to solve the inefficiencies of the legacy MEG stack. It covers the ecosystem components, migration strategies from Hive, table‑level optimizations, and the future roadmap for high‑frequency, low‑latency analytics.

Baidu Geek Talk

Overview of the Turing 3.0 Ecosystem

The previous generation of Baidu MEG’s big‑data products suffered from platform fragmentation and poor usability, leading to low development efficiency, high learning costs, and slow business response. To address these issues, Baidu built the Turing 3.0 ecosystem, which includes the Turing Data Engine (TDE) compute & storage engine, Turing Data Studio (TDS) data‑development governance platform, and Turing Data Analysis (TDA) visual BI product.

Iceberg, an open‑source data‑lake table format, is adopted within this ecosystem to improve real‑time data ingestion, historical table updates, and overall data‑management efficiency.

Core Components

TDE (Turing Data Engine) : Compute and storage engine that combines Spark‑based processing over Hive and Iceberg tables with ClickHouse as a high‑performance query engine.

TDS (Turing Data Studio) : One‑stop data‑development and governance platform.

TDA (Turing Data Analysis) : Next‑generation visual BI tool.

The article focuses on the application and practice of Iceberg within the Turing 3.0 ecosystem.

Why Iceberg?

Hive‑based data warehouses in MEG face three main problems: costly full‑table rewrites for incremental updates, limited real‑time update capabilities, and poor query performance due to metadata loading and file‑system scans. Iceberg provides row‑level updates, minute‑level data freshness, full ACID transactions, file‑based metadata management, and time‑travel capabilities.

Feature Comparison

Key differences between Hive and Iceberg:

Row‑level update : Hive – not supported; Iceberg – supports MERGE INTO and UPSERT.

Timeliness : Hive – hour/day level; Iceberg – minute level.

Transaction : Hive – partial ACID; Iceberg – full ACID with snapshot isolation.

Metadata : Hive – stored in MySQL; Iceberg – stored alongside data files.

Version control : Hive – none; Iceberg – supports time‑travel via snapshots.
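To make the row-level update entry concrete, here is a minimal sketch that renders a Spark SQL MERGE INTO statement of the kind Iceberg accepts. The helper build_merge_sql and all table and column names are hypothetical and chosen only for illustration; the article does not show the statements Turing 3.0 actually runs.

```python
def build_merge_sql(target, source, key_cols, update_cols):
    """Render a Spark SQL MERGE INTO statement for an Iceberg target table."""
    if not key_cols or not update_cols:
        raise ValueError("key_cols and update_cols must be non-empty")
    # Rows are matched on the key columns; matched rows are updated in place,
    # unmatched source rows are inserted as new rows.
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in update_cols)
    return "\n".join([
        f"MERGE INTO {target} t",
        f"USING {source} s",
        f"ON {on_clause}",
        f"WHEN MATCHED THEN UPDATE SET {set_clause}",
        "WHEN NOT MATCHED THEN INSERT *",
    ])


# Hypothetical table and column names, for illustration only.
merge_sql = build_merge_sql(
    target="lake.db.user_profile",
    source="lake.db.user_profile_delta",
    key_cols=["user_id"],
    update_cols=["city", "updated_at"],
)
print(merge_sql)
```

With Hive, the same change would typically require rewriting the whole table or partition, which is the cost the comparison above refers to.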

Iceberg Architecture

Iceberg organizes files into a metadata layer (version‑hint, metadata file, snapshot/manifest‑list, manifest file) and a data layer (Parquet data files). This structure enables efficient metadata queries and versioned data access.

Iceberg architecture
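The layering described above can be modeled in a few lines. This is a simplified sketch, not Iceberg's real metadata format: the class names are hypothetical, the manifest list is reduced to a plain Python list, and statistics, partition specs, and delete files are omitted. It shows how a reader resolves a snapshot to its data files, and how time travel selects an older snapshot.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Manifest:
    data_files: List[str]  # paths of the Parquet data files this manifest tracks


@dataclass
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    manifests: List[Manifest]  # stands in for the snapshot's manifest list


@dataclass
class TableMetadata:
    snapshots: List[Snapshot] = field(default_factory=list)

    def current(self) -> Snapshot:
        # Snapshots are assumed to be appended in commit order.
        return self.snapshots[-1]

    def as_of(self, timestamp_ms: int) -> Optional[Snapshot]:
        """Time travel: latest snapshot committed at or before the timestamp."""
        candidates = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        return max(candidates, key=lambda s: s.timestamp_ms) if candidates else None


def plan_files(snapshot: Snapshot) -> List[str]:
    """Resolve a snapshot to its data files using metadata only."""
    return [path for m in snapshot.manifests for path in m.data_files]


m1 = Manifest(["a.parquet"])
m2 = Manifest(["b.parquet"])
meta = TableMetadata([Snapshot(1, 1000, [m1]), Snapshot(2, 2000, [m1, m2])])

print(plan_files(meta.current()))    # files visible in the latest snapshot
print(plan_files(meta.as_of(1500)))  # files visible as of an earlier time
```

Because the file list comes from metadata files rather than a directory listing, planning a query does not need to scan the file system.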

Migration from Hive to Iceberg

Two migration approaches are presented:

Method 1 – CALL migrate : Uses Iceberg’s CALL catalog_name.system.migrate('db.sample', map('foo','bar')) procedure to convert a Hive table in place. This is simple and reversible, but it renames the original Hive table and its data path, causing downstream read failures and mount conflicts.

Method 2 – Metadata‑only migration : Builds Iceberg metadata that reuses the existing Hive partitions, keeping the original data path unchanged. After metadata construction, data is validated, and the table property is switched to Iceberg, allowing seamless read/write with unchanged table names.

Migration workflow
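The two approaches can be sketched as generated Spark SQL plus a validation check. The article does not say how Turing 3.0 builds the Iceberg metadata for Method 2. The sketch below assumes Iceberg's documented add_files procedure, which registers existing files in an Iceberg table without copying them, as one possible stand-in. The helper names, catalog, table names, and row counts are all hypothetical.

```python
def migrate_in_place_sql(catalog, hive_table):
    """Method 1: in-place conversion with Iceberg's migrate procedure."""
    return f"CALL {catalog}.system.migrate('{hive_table}')"


def add_files_sql(catalog, iceberg_table, hive_table):
    """Assumed stand-in for Method 2: register existing Hive files in an
    Iceberg table's metadata, leaving the data files where they are."""
    return (
        f"CALL {catalog}.system.add_files("
        f"table => '{iceberg_table}', source_table => '{hive_table}')"
    )


def mismatched_partitions(hive_counts, iceberg_counts):
    """Validation step: partitions whose row counts differ between the tables."""
    partitions = set(hive_counts) | set(iceberg_counts)
    return sorted(
        p for p in partitions if hive_counts.get(p) != iceberg_counts.get(p)
    )


print(migrate_in_place_sql("spark_catalog", "db.sample"))
print(add_files_sql("spark_catalog", "db.sample_iceberg", "db.sample"))

# Hypothetical per-partition row counts gathered from each table.
hive = {"dt=20240101": 100, "dt=20240102": 250}
iceberg = {"dt=20240101": 100, "dt=20240102": 249}
print(mismatched_partitions(hive, iceberg))
```

Only when the validation returns an empty list would the switch to Iceberg be safe to make.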

Iceberg Performance Optimizations

Two update strategies are used:

COW (Copy‑On‑Write) : Fast reads, slower writes; suited for read‑heavy workloads.

MOR (Merge‑On‑Read) : Faster writes, slower reads; supports Equality Delete and Position Delete files for different delete patterns.

Choosing the appropriate strategy improves query latency and write throughput based on workload characteristics.
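In open-source Iceberg, the strategy is selected per table through the write.delete.mode, write.update.mode, and write.merge.mode table properties. Merge-on-read relies on delete files, which require table format version 2. The helper below is a hypothetical sketch that renders the corresponding ALTER TABLE statement. The table names are illustrative, and the article does not state which properties Turing 3.0 sets.

```python
VALID_MODES = ("copy-on-write", "merge-on-read")


def write_mode_sql(table, mode):
    """Render an ALTER TABLE that sets one strategy for delete, update and merge."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    props = ", ".join(
        f"'write.{op}.mode'='{mode}'" for op in ("delete", "update", "merge")
    )
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({props})"


# Hypothetical tables: a read-heavy report table and a frequently updated one.
print(write_mode_sql("lake.db.daily_report", "copy-on-write"))
print(write_mode_sql("lake.db.user_events", "merge-on-read"))
```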

Update strategy comparison
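The read-side cost of MOR comes from applying delete files while scanning. The following is a simplified in-memory sketch of that merge: position deletes identify a row by file path and row position, while equality deletes identify rows by column value. The function name and data layout are hypothetical. Real Iceberg also scopes equality deletes by sequence number, so they apply only to older data files, which this sketch leaves out.

```python
def read_merge_on_read(data_files, position_deletes, equality_deletes, key):
    """Return the live rows after applying delete files at read time.

    data_files:       dict mapping file path -> list of row dicts
    position_deletes: set of (file path, row position) pairs
    equality_deletes: set of values of `key` whose rows are deleted
    """
    live_rows = []
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            if (path, pos) in position_deletes:
                continue  # removed by a position delete file
            if row[key] in equality_deletes:
                continue  # removed by an equality delete file
            live_rows.append(row)
    return live_rows


data = {
    "f1.parquet": [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    "f2.parquet": [{"id": 3, "v": "c"}],
}
live = read_merge_on_read(
    data,
    position_deletes={("f1.parquet", 0)},  # delete the first row of f1
    equality_deletes={3},                  # delete every row with id == 3
    key="id",
)
print(live)
```

Under COW, the writer would instead rewrite f1.parquet and f2.parquet without the deleted rows, so readers skip this filtering entirely.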

Lifecycle Management & Optimization

Iceberg tables are managed through automated tasks that handle partition expiration, snapshot cleanup (time‑based and count‑based), and orphan file removal. Performance‑boosting operators include small‑file compaction and Z‑order sorting to improve query speed.
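The combination of time-based and count-based snapshot cleanup can be illustrated with a small selection function. It loosely mirrors the older_than and retain_last arguments of Iceberg's expire_snapshots procedure, under the simplifying assumption of a linear snapshot history. The function name and sample values are hypothetical, and the article does not describe the exact policy Turing 3.0 applies.

```python
def snapshots_to_expire(snapshots, older_than_ms, retain_last):
    """Pick snapshot ids that are both too old and outside the retained tail.

    snapshots: list of (snapshot_id, timestamp_ms) tuples
    """
    ordered = sorted(snapshots, key=lambda s: s[1])
    # Count-based rule: the newest `retain_last` snapshots are always kept.
    protected = {sid for sid, _ in ordered[-retain_last:]} if retain_last > 0 else set()
    # Time-based rule: only snapshots older than the cutoff are candidates.
    return [
        sid for sid, ts in ordered
        if ts < older_than_ms and sid not in protected
    ]


history = [(1, 100), (2, 200), (3, 300), (4, 400)]
print(snapshots_to_expire(history, older_than_ms=350, retain_last=2))
print(snapshots_to_expire(history, older_than_ms=350, retain_last=1))
```

Data files that no remaining snapshot references can then be deleted, while orphan file removal covers files that no snapshot ever referenced.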

Lifecycle management flow
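The two performance operators can be sketched independently of any engine. The first function greedily groups small files into compaction batches, and the second computes a Z-order value by interleaving the bits of two integer columns, so that rows close in both columns end up close together after sorting. Both are illustrative assumptions: the function names and thresholds are hypothetical, and production implementations such as Iceberg's rewrite_data_files procedure use more elaborate planning.

```python
def plan_compaction(file_sizes_mb, target_mb=512, small_mb=64):
    """Group files smaller than small_mb into batches of at most target_mb."""
    groups, current, current_size = [], [], 0
    for name, size in sorted(file_sizes_mb.items(), key=lambda kv: kv[1]):
        if size >= small_mb:
            continue  # large enough already, leave it alone
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if len(current) > 1:  # rewriting a single file alone gains nothing
        groups.append(current)
    return groups


def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y (x takes the even positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


files = {"a": 10, "b": 20, "c": 600, "d": 30}
print(plan_compaction(files, target_mb=50))

points = [(1, 1), (0, 1), (1, 0), (0, 0)]
print(sorted(points, key=lambda p: z_value(*p)))
```

Sorting rows by the Z-order value before writing tends to narrow each file's min/max range on both columns, which lets more files be skipped when a query filters on either one.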

Future Plans

Work on Iceberg within Turing 3.0 will continue in four directions: intelligent governance, query optimization, smart indexing, and broader business coverage. The goal is to keep migration costs low while delivering high‑performance analytics across Baidu’s data platforms.

Future roadmap