Why Traditional Data Warehouses Fail and How a Real‑Time Lakehouse Solves the Pain
This article analyzes the shortcomings of mainstream data‑warehouse and data‑lake architectures, explains the design of ByteDance's real‑time/offline unified lakehouse solution, and demonstrates its practical applications and future roadmap across streaming, multi‑dimensional analysis, and batch‑stream reuse scenarios.
Mainstream Data Warehouse Architectures
The dominant Lambda architecture uses separate real‑time and batch pipelines: a streaming path that delivers low‑latency results and a batch path that delivers complete, accurate results, with the two merged at query time for ad‑hoc analysis. Its main advantages are:
Clear responsibility boundaries: streaming handles incremental data, batch handles full data.
Fault tolerance: batch results can overwrite streaming results, fixing errors.
Complexity isolation: batch consumes data that is already complete offline, while streaming absorbs the more complex real‑time processing.
However, Lambda also brings serious challenges:
Inconsistent results between real‑time and batch calculations.
Duplicated code for batch and streaming, increasing maintenance.
Separate storage and compute resources double costs.
These issues share a single root cause: the real‑time and offline pipelines do not use unified computation and storage layers.
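The query‑time merge that Lambda requires can be sketched as follows. This is an illustrative toy, not any specific ByteDance system: the function names, the watermark convention, and the data shapes are all hypothetical. The point is that the serving layer must stitch a stale‑but‑complete batch view together with a fresh‑but‑approximate speed view, which is exactly where metric misalignment creeps in.

```python
# Hypothetical Lambda serving-layer merge (illustrative only).

def merge_views(batch_view: dict, speed_view: dict, batch_watermark: str) -> dict:
    """Answer queries from the batch view up to its watermark, then
    overlay fresher results from the streaming (speed) view."""
    merged = dict(batch_view)
    for key, (event_time, value) in speed_view.items():
        # Only events newer than the last completed batch run are taken
        # from the speed layer; older ones are trusted to the batch layer.
        if event_time > batch_watermark:
            merged[key] = value
    return merged

batch = {"clicks:page_a": 100}                        # full, but hours old
speed = {"clicks:page_a": ("2024-01-01T12:05", 103)}  # fresh, approximate
print(merge_views(batch, speed, "2024-01-01T12:00"))
```

If the two pipelines compute the metric even slightly differently, the merged answer changes depending on where the watermark falls, which is the "inconsistent results" problem listed above.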
Data Lake Solutions
Hudi, an open‑source data‑lake framework, offers streaming source/sink capabilities, minute‑level data visibility, offline batch updates with Insert/Overwrite, Upsert/Delete, and good integration with Spark, Flink, and Presto.
Although Hudi unifies storage for real‑time and offline workloads, its minute‑level data visibility falls short of the second‑level latency a real‑time warehouse requires, so it cannot serve as a standard real‑time warehouse solution on its own.
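The Upsert/Delete semantics mentioned above can be sketched in plain Python. This is a logical model of what a primary‑keyed lake table exposes, not Hudi's actual API or storage format:

```python
# Hedged sketch of upsert/delete semantics on a primary-keyed table,
# mimicking what a Hudi table exposes logically (not Hudi's real API).

def apply_changes(table: dict, changes: list) -> dict:
    for op, key, row in changes:
        if op == "upsert":
            table[key] = row          # insert new key or overwrite old row
        elif op == "delete":
            table.pop(key, None)      # delete is a no-op if key is absent
    return table

table = {1: {"name": "a", "v": 1}}
changes = [("upsert", 1, {"name": "a", "v": 2}),   # update existing key
           ("upsert", 2, {"name": "b", "v": 1}),   # insert new key
           ("delete", 1, None)]                    # remove key 1
print(apply_changes(table, changes))   # {2: {'name': 'b', 'v': 1}}
```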
Lakehouse Unified Demand
A unified lakehouse storage must support high‑throughput batch reads (at least on par with Hive), second‑level low‑latency streaming writes at millions of records per second with Exactly‑Once or At‑Least‑Once semantics, and integration with multiple query engines.
Proposed Lakehouse Architecture
The design focuses on overall architecture, data distribution, data model, read/write mechanisms, and the BTS architecture.
Overall Architecture
ByteDance built an in‑memory service on top of a data lake to achieve high throughput, high concurrency, and second‑level latency. The architecture consists of a persistence layer reusing Hudi, a metadata layer managing tables, partitions, and snapshots, and a service layer (BTS and TMS) handling memory‑accelerated reads/writes and background compaction.
Data Distribution
Physical data distribution follows Hudi concepts: a Table contains Partitions; each Partition contains FileGroups (a base file plus log files); incoming writes land first in in‑memory Blocks backed by a WAL log, with each writing task mapped to the blocks it owns.
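The hierarchy above can be sketched with a few dataclasses. These are hypothetical structures for illustration only; the real metadata layout is internal to Hudi and BTS:

```python
# Sketch of the physical layout: a Partition holds FileGroups (one base
# file plus log files), and in-memory Blocks are made durable by a WAL.
# All class and field names here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class FileGroup:
    base_file: str                                 # columnar base file
    log_files: list = field(default_factory=list)  # row-oriented deltas

@dataclass
class Partition:
    name: str
    file_groups: list = field(default_factory=list)

@dataclass
class Block:
    block_id: int
    wal_path: str                  # WAL entry that makes this block durable
    rows: list = field(default_factory=list)

part = Partition("date=2024-01-01",
                 [FileGroup("fg0.parquet", ["fg0.log.1", "fg0.log.2"])])
print(len(part.file_groups[0].log_files))  # 2
```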
Data Model
Each lakehouse table exposes two views: an Append‑Only incremental view for real‑time computation and a snapshot view for offline batch processing. The incremental view records every change; the snapshot view retains only the latest state per primary key.
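The relationship between the two views can be sketched over a single changelog. This is illustrative code, not the engine's internals: the incremental view is the ordered change stream itself, and the snapshot view is what remains after collapsing it per primary key:

```python
# Illustrative sketch: deriving the snapshot view from the incremental view.

def snapshot_view(changelog: list) -> dict:
    """Collapse an ordered change stream to the latest state per key."""
    state = {}
    for change in changelog:        # incremental view = every change
        state[change["pk"]] = change
    return state                    # snapshot view = latest row per key

changelog = [
    {"pk": 1, "v": "a"},
    {"pk": 2, "v": "b"},
    {"pk": 1, "v": "a2"},   # later change to pk=1 supersedes the first
]
print(len(changelog), len(snapshot_view(changelog)))  # 3 2
```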
Read/Write
Load separation isolates streaming workloads (latency‑sensitive, accelerated by BTS) from batch workloads (throughput‑oriented, reading persistent storage directly). Consistency guarantees ensure that streaming writes and batch writes never block each other.
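A minimal sketch of this read‑path split, under the assumption that BTS behaves like an in‑memory cache of recent, not‑yet‑flushed rows (all names here are hypothetical):

```python
# Hedged sketch of load separation: streaming reads go through an
# in-memory service (a plain dict stands in for BTS), while batch reads
# scan persistent storage directly and never touch BTS memory.

def read(table: str, mode: str, bts_cache: dict, storage: dict) -> list:
    if mode == "streaming":
        # Serve recent in-memory rows first, then the persisted rest.
        return bts_cache.get(table, []) + storage.get(table, [])
    if mode == "batch":
        # Batch scans bypass the cache, so they do not compete with
        # latency-sensitive streaming traffic for BTS resources.
        return storage.get(table, [])
    raise ValueError(f"unknown mode: {mode}")

storage = {"dwd.events": [1, 2, 3]}
cache = {"dwd.events": [4]}
print(read("dwd.events", "streaming", cache, storage))  # [4, 1, 2, 3]
print(read("dwd.events", "batch", cache, storage))      # [1, 2, 3]
```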
BTS Architecture
BTS consists of a Master (Block Load Balancer, Block Metadata Manager, Transaction Manager) and Table Servers (Session Manager, DataService, Transaction Manager, MemStore, WAL). It provides RPC interfaces, column pruning, predicate push‑down, in‑memory caching, and durable WAL for recovery.
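The durable‑WAL recovery path can be sketched as follows. This is a hypothetical simplification, not BTS's actual implementation: each write is appended and fsynced to a write‑ahead log before it becomes visible in the MemStore, so a crashed Table Server can rebuild its in‑memory state by replaying the log:

```python
# Minimal WAL sketch: durable append first, then MemStore; recovery
# replays the log from the beginning. Illustrative code only.
import json
import os
import tempfile

def write(wal_path: str, memstore: list, record: dict) -> None:
    with open(wal_path, "a") as wal:
        wal.write(json.dumps(record) + "\n")  # durable first ...
        wal.flush()
        os.fsync(wal.fileno())
    memstore.append(record)                   # ... then visible in memory

def recover(wal_path: str) -> list:
    """Rebuild the MemStore by replaying the WAL."""
    with open(wal_path) as wal:
        return [json.loads(line) for line in wal]

wal_path = os.path.join(tempfile.mkdtemp(), "bts.wal")
memstore = []
write(wal_path, memstore, {"pk": 1, "v": "a"})
write(wal_path, memstore, {"pk": 2, "v": "b"})
print(recover(wal_path) == memstore)  # True
```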
Practical Scenarios
Streaming Data Computation: Replacing complex component chains with Hudi lakehouse tables reduces dependencies, simplifies debugging, and enables low‑cost historical data replay.
Real‑time Multi‑dimensional Analysis: Eliminates the need for ClickHouse, stores data cheaply in Hudi, and serves queries via Presto, saving expensive OLAP resources.
Batch‑Stream Data Reuse: Sharing Hudi DWD tables between real‑time and offline warehouses removes duplicate compute/storage and speeds up data readiness.
Future Roadmap
Plans focus on engine performance (concurrent writes, multi‑WAL merging, async flush), stability (node health detection, multi‑region deployment, disaster recovery), and business features (Kafka‑like partitioning and consumer groups).
The lakehouse solution is already available as LAS (Lakehouse Analytics Service) on Volcano Engine, a serverless, Spark‑ and Presto‑compatible platform for intelligent real‑time lakehouse deployments.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.