Amoro Mixed Hive: A Unified Lakehouse Solution for Real‑Time and Batch Data Processing

This article describes how NetEase Youdao replaced its Doris-based real-time data warehouse with Amoro Mixed Hive. It covers the architectural challenges, the Mixed Hive design, the implementation steps, performance optimizations, community contributions, and the future roadmap. The goal is a unified lakehouse with minute-level freshness and lower development and operational costs.

DataFunTalk

Dec 27, 2023

Author Introduction

Xie Yi is a senior data development engineer at NetEase Youdao who focuses on real-time computing and lakehouse development. Wang Tao is a senior platform engineer who works on big data and lakehouse platform construction.

Business Background

Youdao's data architecture has two layers: an offline layer (Hive, Spark) and a real-time layer (Flink + Doris). The real-time layer has several problems: high development and operational costs, limited support for combined full and incremental reads, the same data stored twice, expensive SSD storage for Doris, and unstable data synchronization between Hive and Doris.

Introducing Amoro Mixed Hive

Amoro Mixed Hive provides Hive-compatible reads and writes together with self-optimizing capabilities. It offers two freshness levels: minute-level freshness through merge-on-read, and hour-level freshness through direct Hive reads. Existing Hive tables can be migrated to Mixed Hive without rewriting their data, and they gain ACID, upsert, and time-travel support. Its main properties are:

Schema, partition, and types are consistent with Hive.

Hive connector treats Mixed Hive tables as regular Hive tables.

Zero‑downtime upgrade of Hive tables to Mixed Hive.

Lakehouse features such as primary‑key upsert, streaming read/write, ACID, and time‑travel.
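The two freshness levels can be illustrated with a toy model. The sketch below is not Amoro's implementation. It assumes a table made of base rows (what a plain Hive read would see) plus a log of change records that have not yet been compacted, and it shows how a merge-on-read query could apply those changes by primary key. The names `merge_on_read`, `Change`, and the `"upsert"`/`"delete"` operation labels are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]
# Hypothetical change record: (sequence number, operation, row).
Change = Tuple[int, str, Row]


def merge_on_read(base_rows: Iterable[Row], changes: Iterable[Change], key: str) -> List[Row]:
    """Apply change records to base rows by primary key, in sequence order."""
    merged: Dict[object, Row] = {row[key]: row for row in base_rows}
    for _, op, row in sorted(changes, key=lambda c: c[0]):
        if op == "upsert":
            # Insert a new key or overwrite the existing row for that key.
            merged[row[key]] = row
        elif op == "delete":
            merged.pop(row[key], None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return sorted(merged.values(), key=lambda r: r[key])


base = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
changes = [
    (2, "delete", {"id": 2}),
    (1, "upsert", {"id": 2, "v": "b2"}),
    (3, "upsert", {"id": 3, "v": "c"}),
]

# A direct read sees only the base rows; a merge-on-read view also sees the changes.
fresh_view = merge_on_read(base, changes, key="id")
```

In this model, the fresher view costs extra work at query time, while the direct read is cheaper but only as fresh as the last compaction of changes into the base data.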

Implementation Plan

The migration has four parts: moving data pipelines from hand-written Flink SQL to a real-time lake platform, building CDC and log-based ingestion into Mixed Hive, replacing Doris, and enabling minute-level reporting. The article reports that this reduces development cost by 65% and operational cost by 40%.

Real-Time Data Lake Platform Co-Construction

A platform hides the underlying storage changes from developers. It offers one-click upgrades of Hive tables and end-to-end data ingestion, which lowers the learning curve.

Query Optimization

Three optimizations were applied to Trino queries on Mixed Hive:

Rewriting the query plan so that sequence numbers are fetched efficiently.

Splitting tasks for delete files at a finer granularity to mitigate data skew.

Accessing HDFS directly to reduce RPC latency.

Together, these reduced query time by up to 92%.
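The idea behind finer-grained task splitting can be sketched as follows. This is an illustration under assumptions, not the Trino or Amoro code: it assumes delete files can be read in byte ranges, and the function names, file names, and sizes are hypothetical. One large delete file handled by a single task dominates the slowest task; cutting it into bounded splits lets the work spread evenly.

```python
import heapq
from typing import List, Tuple

Split = Tuple[str, int, int]  # (file path, start offset, length in bytes)


def split_files(files: List[Tuple[str, int]], max_split_bytes: int) -> List[Split]:
    """Cut each file into consecutive byte ranges no larger than max_split_bytes."""
    splits: List[Split] = []
    for path, size in files:
        offset = 0
        while offset < size:
            length = min(max_split_bytes, size - offset)
            splits.append((path, offset, length))
            offset += length
    return splits


def assign(splits: List[Split], num_tasks: int) -> List[List[Split]]:
    """Greedy balancing: hand the next-largest split to the least-loaded task."""
    heap = [(0, i) for i in range(num_tasks)]
    heapq.heapify(heap)
    tasks: List[List[Split]] = [[] for _ in range(num_tasks)]
    for split in sorted(splits, key=lambda s: s[2], reverse=True):
        load, i = heapq.heappop(heap)
        tasks[i].append(split)
        heapq.heappush(heap, (load + split[2], i))
    return tasks


def max_load(tasks: List[List[Split]]) -> int:
    """Bytes handled by the busiest task, a rough proxy for the slowest task."""
    return max(sum(s[2] for s in t) for t in tasks)


# Hypothetical, skewed input: one large delete file and one small one.
files = [("delete-1", 900), ("delete-2", 100)]

coarse = max_load(assign(split_files(files, max_split_bytes=10**9), num_tasks=4))
fine = max_load(assign(split_files(files, max_split_bytes=250), num_tasks=4))
```

With whole-file splits, the busiest of the four tasks handles all 900 bytes of the large file. With 250-byte splits, the busiest task handles 250 bytes.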

Community Contributions

Youdao has contributed 13 PRs to the Amoro open-source project. They include ORC support for the Mixed Table Format, Flink DDL enhancements, Hadoop proxy support in Trino, optimizer group creation via HTTP, optimizations for deleting tables, and failover reporting for the Flink optimizer.

Future Plans

The team plans to keep improving query performance (for example, Z-order in full optimizations), to raise resource utilization by sharing Flink session resources among ingestion tasks, and to integrate Paimon tables through Amoro's Unified Catalog to extend the lakehouse's capabilities.
