How Baidu’s Turing 3.0 Leverages Apache Iceberg to Boost Data Lake Performance
This article explains how Baidu’s next‑generation data platform Turing 3.0 integrates Apache Iceberg to solve the inefficiencies of the legacy MEG stack, detailing ecosystem components, migration strategies from Hive, table‑level optimizations, and future roadmap for high‑frequency, low‑latency analytics.
Overview of the Turing 3.0 Ecosystem
The previous generation of Baidu MEG’s big‑data products suffered from platform fragmentation and poor usability, leading to low development efficiency, high learning costs, and slow business response. To address these issues, Baidu built the Turing 3.0 ecosystem, which includes the Turing Data Engine (TDE) compute & storage engine, Turing Data Studio (TDS) data‑development governance platform, and Turing Data Analysis (TDA) visual BI product.
Iceberg, an open‑source data‑lake table format, is adopted within this ecosystem to improve real‑time data ingestion, historical table updates, and overall data‑management efficiency.
Core Components
TDE (Turing Data Engine) : Spark‑based compute engine that processes data stored in Hive and Iceberg tables, complemented by a ClickHouse engine for high‑performance queries.
TDS (Turing Data Studio) : One‑stop data‑development and governance platform.
TDA (Turing Data Analysis) : Next‑generation visual BI tool.
The article focuses on the application and practice of Iceberg within the Turing 3.0 ecosystem.
Why Iceberg?
Hive‑based data warehouses in MEG face three main problems: costly full‑table rewrites for incremental updates, limited real‑time update capabilities, and poor query performance due to metadata loading and file‑system scans. Iceberg provides row‑level updates, minute‑level data freshness, full ACID transactions, file‑based metadata management, and time‑travel capabilities.
Feature Comparison
Key differences between Hive and Iceberg:
Row‑level update : Hive – not supported; Iceberg – supports MERGE INTO and UPSERT.
Timeliness : Hive – hour/day level; Iceberg – minute level.
Transaction : Hive – partial ACID; Iceberg – full ACID with snapshot isolation.
Metadata : Hive – stored in MySQL; Iceberg – stored alongside data files.
Version control : Hive – none; Iceberg – supports time‑travel via snapshots.
Iceberg Architecture
Iceberg organizes files into a metadata layer (version‑hint, metadata file, snapshot/manifest‑list, manifest file) and a data layer (Parquet data files). This structure enables efficient metadata queries and versioned data access.
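The layering described above can be illustrated with a small in-memory model. This is a minimal sketch, not Iceberg's real metadata schema: the class names (TableMetadata, Snapshot, Manifest) and the files_as_of helper are hypothetical, and the manifest list is modeled as a plain tuple of manifests. It shows why versioned reads are cheap: each commit adds a snapshot that reuses earlier manifests, and a time-travel read only has to pick the right snapshot.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Manifest:
    """Toy manifest file: a list of data files (here, just Parquet paths)."""
    data_files: tuple


@dataclass(frozen=True)
class Snapshot:
    """Toy snapshot: its manifest list is modeled as a tuple of manifests."""
    snapshot_id: int
    timestamp_ms: int
    manifests: tuple


@dataclass
class TableMetadata:
    """Hypothetical stand-in for the metadata file; not Iceberg's real schema."""
    snapshots: List[Snapshot] = field(default_factory=list)
    current_snapshot_id: Optional[int] = None

    def append(self, timestamp_ms, new_files):
        """Commit new data files as a new snapshot that reuses older manifests."""
        previous = self.snapshots[-1].manifests if self.snapshots else ()
        snapshot = Snapshot(
            snapshot_id=len(self.snapshots) + 1,
            timestamp_ms=timestamp_ms,
            manifests=previous + (Manifest(tuple(new_files)),),
        )
        self.snapshots.append(snapshot)
        self.current_snapshot_id = snapshot.snapshot_id
        return snapshot

    def files_as_of(self, timestamp_ms):
        """Time travel: data files visible in the latest snapshot at or before a timestamp."""
        visible = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        if not visible:
            return []
        latest = max(visible, key=lambda s: s.timestamp_ms)
        return [f for m in latest.manifests for f in m.data_files]
```

In this model, a reader asking for the table as of an earlier timestamp never touches files committed later, because they are simply not reachable from the chosen snapshot.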
Migration from Hive to Iceberg
Two migration approaches are presented:
Method 1 – CALL migrate : Uses Iceberg’s CALL catalog_name.system.migrate('db.sample', map('foo','bar')) procedure to convert a Hive table in place. It is simple and reversible, but it renames the original Hive table and its data path, causing downstream read failures and mount conflicts.
Method 2 – Metadata‑only migration : Builds Iceberg metadata that reuses the existing Hive partitions, keeping the original data path unchanged. After metadata construction, data is validated, and the table property is switched to Iceberg, allowing seamless read/write with unchanged table names.
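The idea behind the metadata‑only approach can be sketched as follows. This is an illustrative model under stated assumptions, not Baidu's tooling or an Iceberg API: the function names and the dictionary layout of a manifest entry are hypothetical. The point it shows is that existing file paths are registered as they are (nothing is copied or renamed), and that validation compares what Hive lists against what the new metadata references.

```python
from typing import Dict, List


def build_manifest_entries(hive_partitions: Dict[str, List[str]]) -> List[dict]:
    """Register existing Hive data files in place: no copy, no rename.

    hive_partitions maps a partition spec such as "dt=2024-01-01/hour=00"
    to the data file paths already stored under that partition.
    """
    entries = []
    for partition_spec, file_paths in sorted(hive_partitions.items()):
        # "dt=2024-01-01/hour=00" -> {"dt": "2024-01-01", "hour": "00"}
        partition_values = dict(kv.split("=", 1) for kv in partition_spec.split("/"))
        for path in sorted(file_paths):
            # The original path is recorded unchanged, so Hive readers keep working.
            entries.append({"file_path": path, "partition": partition_values})
    return entries


def validate_migration(hive_partitions: Dict[str, List[str]], entries: List[dict]) -> bool:
    """Check that the new metadata references exactly the files Hive listed."""
    hive_files = {p for files in hive_partitions.values() for p in files}
    iceberg_files = {e["file_path"] for e in entries}
    return hive_files == iceberg_files and len(entries) == len(hive_files)
```

Only after a check like validate_migration passes would the table be switched over, which matches the order of steps described above (build metadata, validate, then switch).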
Iceberg Performance Optimizations
Two update strategies are used:
COW (Copy‑On‑Write) : Fast reads, slower writes; suited for read‑heavy workloads.
MOR (Merge‑On‑Read) : Faster writes, slower reads; supports Equality Delete and Position Delete files for different delete patterns.
Choosing the appropriate strategy improves query latency and write throughput based on workload characteristics.
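The trade‑off between the two strategies can be seen in a toy model. This is a simplified sketch, not Iceberg's implementation: data files are plain lists of row dictionaries, the function names are hypothetical, and equality deletes are applied to every file (real Iceberg also checks sequence numbers to decide which files a delete applies to). COW pays at write time by rewriting affected files; MOR writes only small delete records and pays at read time by filtering.

```python
from typing import Dict, Iterable, List, Set, Tuple

DataFile = List[dict]  # a data file modeled as a list of row dicts


def cow_delete(files: Dict[str, DataFile], key: str, value) -> Dict[str, DataFile]:
    """Copy-on-write: rewrite every file that contains a matching row."""
    result = {}
    for name, rows in files.items():
        if any(r[key] == value for r in rows):
            # The whole file is rewritten without the deleted rows (expensive write).
            result[name + ".rewritten"] = [r for r in rows if r[key] != value]
        else:
            result[name] = rows  # untouched files are carried over as they are
    return result


def mor_position_delete(files: Dict[str, DataFile], key: str, value) -> Set[Tuple[str, int]]:
    """Merge-on-read: record (file, row position) pairs; data files stay untouched."""
    return {
        (name, pos)
        for name, rows in files.items()
        for pos, r in enumerate(rows)
        if r[key] == value
    }


def mor_read(files: Dict[str, DataFile],
             position_deletes: Set[Tuple[str, int]],
             equality_deletes: Iterable[Tuple[str, object]] = ()) -> List[dict]:
    """Readers pay the cost: apply both kinds of delete records while scanning."""
    equality_deletes = list(equality_deletes)
    out = []
    for name, rows in files.items():
        for pos, r in enumerate(rows):
            if (name, pos) in position_deletes:
                continue  # removed by a position delete
            if any(r[k] == v for k, v in equality_deletes):
                continue  # removed by an equality delete (column == value)
            out.append(r)
    return out
```

A position delete requires knowing where the row lives, whereas an equality delete only needs the key value, which is why the two suit different delete patterns.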
Lifecycle Management & Optimization
Iceberg tables are managed through automated tasks that handle partition expiration, snapshot cleanup (time‑based and count‑based), and orphan file removal. Performance‑boosting operators include small‑file compaction and Z‑order sorting to improve query speed.
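Two of these tasks are easy to illustrate in isolation. The sketch below is hypothetical and does not reproduce Baidu's scheduler or Iceberg's own procedures: snapshots_to_expire assumes the time‑based and count‑based rules are combined as "expire anything older than a threshold, but always keep the newest N", and z_order_key shows the bit interleaving that underlies Z‑order sorting for two non‑negative integer columns.

```python
from typing import List, Tuple

SnapshotInfo = Tuple[int, int]  # (snapshot_id, commit timestamp in ms)


def snapshots_to_expire(snapshots: List[SnapshotInfo], now_ms: int,
                        max_age_ms: int, min_to_keep: int) -> List[int]:
    """Return IDs of snapshots older than max_age_ms, always keeping the newest min_to_keep."""
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    protected = {sid for sid, _ in newest_first[:min_to_keep]}  # count-based rule
    return sorted(
        sid for sid, ts in snapshots
        if sid not in protected and now_ms - ts > max_age_ms  # time-based rule
    )


def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single sort key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return key
```

Sorting rows by such an interleaved key tends to place rows that are close in both columns into the same files, so per‑file min/max statistics can prune more files for filters on either column.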
Future Plans
Future work on Iceberg within Turing 3.0 targets intelligent governance, query optimization, smart indexing, and broader business coverage, while keeping migration paths low‑cost and analytics high‑performance across Baidu’s data platforms.
Baidu Geek Talk