
Amoro Mixed Hive: A Unified Lakehouse Solution for Real‑Time and Batch Data Processing

This article describes how NetEase Youdao replaced its Doris‑based real‑time data warehouse with Amoro Mixed Hive. It covers the architectural challenges, the Mixed Hive design, the implementation steps, performance optimizations, community contributions, and the future roadmap toward a unified lakehouse with minute‑level freshness and lower development and operational costs.

DataFunTalk

Author Introduction

Xie Yi, a senior data development engineer at NetEase Youdao, focuses on real‑time computing and lakehouse development. Wang Tao, a senior platform engineer, works on big data and lakehouse platform construction.

Business Background

Youdao’s data architecture consists of an offline layer (Hive, Spark) and a real‑time layer (Flink + Doris). The real‑time layer suffers from high development and operation costs, limited support for combined full and incremental reads, duplicated storage, expensive SSD storage for Doris, and unstable data synchronization between Hive and Doris.

Introducing Amoro Mixed Hive

Amoro Mixed Hive provides Hive‑compatible reads and writes with self‑optimizing capabilities. It offers two freshness levels: minute‑level via merge‑on‑read and hour‑level via direct Hive reads. Existing Hive tables can be migrated to Mixed Hive without rewriting data, and the migrated tables support ACID, upsert, and time travel.

Schemas, partitions, and data types are consistent with Hive.

Hive connectors treat Mixed Hive tables as regular Hive tables.

Hive tables can be upgraded to Mixed Hive with zero downtime.

Lakehouse features are available, such as primary‑key upsert, streaming read/write, ACID, and time travel.
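The two freshness levels can be pictured with a small in‑memory model. The sketch below is only an illustration of the general merge‑on‑read idea, not Amoro's implementation: the `Change` record, the `seq` field, and both reader functions are hypothetical names. It assumes the Hive‑visible files hold the last optimized state and that newer writes sit in a separate change log until optimizing folds them in.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Change:
    """One change-log entry (hypothetical layout, for illustration only)."""
    seq: int              # monotonically increasing sequence number
    key: int              # primary key of the affected row
    row: Optional[dict]   # new row image, or None to delete the key


def read_base(base_rows: Iterable[dict]) -> List[dict]:
    """Hour-level view: read only the optimized, Hive-visible rows."""
    return sorted(base_rows, key=lambda r: r["id"])


def read_merged(base_rows: Iterable[dict], changes: Iterable[Change]) -> List[dict]:
    """Minute-level view: overlay pending changes on the base at read time."""
    current: Dict[int, dict] = {r["id"]: r for r in base_rows}
    # Apply changes in sequence order so the latest write per key wins.
    for change in sorted(changes, key=lambda c: c.seq):
        if change.row is None:
            current.pop(change.key, None)   # delete; ignore unknown keys
        else:
            current[change.key] = change.row  # insert or update (upsert)
    return [current[k] for k in sorted(current)]


base = [{"id": 1, "score": 10}, {"id": 2, "score": 20}]
changes = [
    Change(seq=1, key=2, row={"id": 2, "score": 25}),  # update existing key
    Change(seq=2, key=3, row={"id": 3, "score": 30}),  # insert new key
    Change(seq=3, key=1, row=None),                    # delete key 1
]

hour_view = read_base(base)
minute_view = read_merged(base, changes)
print(hour_view)
print(minute_view)
```

In this model a plain Hive reader sees only `hour_view`, while a merge‑on‑read reader pays the extra merge cost to see `minute_view`.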

Implementation Plan

The migration consists of upgrading data pipelines from Flink SQL to a real‑time lake platform, building CDC and log‑based ingestion into Mixed Hive, replacing Doris, and enabling minute‑level reporting. The team reports that this reduces development cost by 65% and operation cost by 40%.

Real‑Time Data Lake Platform Co‑Construction

The platform hides the underlying storage changes from developers. It offers one‑click Hive table upgrades and end‑to‑end data ingestion, which lowers the learning curve.

Query Optimization

Three optimizations were applied to Trino queries on Mixed Hive: (1) rewriting the query plan to fetch sequence numbers efficiently, (2) splitting tasks for delete files at a finer granularity to mitigate data skew, and (3) accessing HDFS directly to reduce RPC latency. These changes reduced query time by up to 92%.
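To see why finer‑grained splitting helps with skew, consider a generic scheduling sketch. It is not the Trino or Amoro planner: the file names, sizes, `max_split_bytes`, and the greedy assignment are all assumptions made for illustration, and it assumes a delete file can be cut into independent byte ranges. When one oversized delete file is handled as a single task, one worker becomes a straggler; cutting it into bounded splits lets the load spread out.

```python
import heapq
from typing import List, Tuple

Split = Tuple[str, int, int]  # (file path, start offset, length in bytes)


def split_file(path: str, size: int, max_split_bytes: int) -> List[Split]:
    """Cut one file into byte-range splits no larger than max_split_bytes."""
    return [
        (path, start, min(max_split_bytes, size - start))
        for start in range(0, size, max_split_bytes)
    ]


def assign(splits: List[Split], workers: int) -> List[int]:
    """Greedy largest-first assignment; returns the bytes given to each worker."""
    loads = [0] * workers
    heap = [(0, w) for w in range(workers)]
    heapq.heapify(heap)
    for _, _, length in sorted(splits, key=lambda s: s[2], reverse=True):
        load, w = heapq.heappop(heap)      # least-loaded worker so far
        loads[w] = load + length
        heapq.heappush(heap, (loads[w], w))
    return loads


# Hypothetical delete files: one is far larger than the others.
delete_files = [("d1", 900), ("d2", 50), ("d3", 50)]

# Coarse plan: one task per file, however large.
coarse = assign([(p, 0, size) for p, size in delete_files], workers=4)

# Fine plan: every file is cut into splits of at most 250 bytes first.
fine_splits = [s for p, size in delete_files for s in split_file(p, size, 250)]
fine = assign(fine_splits, workers=4)

print("coarse loads:", coarse, "slowest worker:", max(coarse))
print("fine loads:  ", fine, "slowest worker:", max(fine))
```

The total work is the same in both plans; only the largest per‑worker share changes, and that share is what bounds the query's wall‑clock time.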

Community Contributions

Youdao has contributed 13 PRs to the Amoro open‑source project, including ORC support for the Mixed Table Format, Flink DDL enhancements, Trino Hadoop‑proxy support, optimizer group creation via HTTP, delete‑table optimizations, and Flink optimizer failover reporting.

Future Plans

The team plans to keep improving query performance (for example, by applying Z‑order in full optimization), to increase resource utilization by sharing Flink session resources among ingestion tasks, and to integrate Paimon tables through Amoro’s Unified Catalog to extend lakehouse capabilities.
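Z‑order (Morton order) sorts rows by a value built from the interleaved bits of several columns, so rows that are close in all of those columns tend to land in the same files and can be skipped together. A minimal sketch of the idea for two non‑negative integer columns follows; the function names are illustrative and do not come from Amoro.

```python
from typing import List, Tuple


def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton code."""
    if x < 0 or y < 0:
        raise ValueError("this sketch only handles non-negative integers")
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x supplies the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y supplies the odd bit positions
    return z


def z_order_sort(points: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Sort (x, y) pairs by their Morton code."""
    return sorted(points, key=lambda p: z_value(p[0], p[1]))


points = [(3, 3), (0, 1), (2, 0), (1, 0), (0, 0), (1, 1)]
ordered = z_order_sort(points)
print(ordered)
```

Compared with sorting on one column only, this ordering keeps both columns reasonably clustered, which is why it suits queries that filter on either column.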

Tags: big data, Flink, real‑time data, Hive, lakehouse, Amoro, Mixed Hive