Data Lake and Data Warehouse Architectures: Expert Insights from Industry Leaders
The article summarizes a roundtable discussion where experts compare four lake‑warehouse architectural patterns, explain their suitability for different business scenarios, contrast them with traditional data warehouses, and highlight practical considerations for choosing and evolving data platforms.
This excerpt is taken from a roundtable on data lake technology maturity, featuring experts from Kuaishou, Ping An Insurance, a former Tencent engineer, and others.
Host Jin Guowei asks about the four lake‑warehouse architectures—lake‑warehouse integration, lake‑on‑warehouse, warehouse‑on‑lake, and lake‑warehouse fusion—and invites Shao Saisei to explain each model, the core problems it solves, and the scenarios it suits.
Shao Saisei explains that the choice of a lake‑warehouse pattern is determined by the existing data platform architecture.
For many internet companies that built their data platforms on Hadoop, the Hadoop ecosystem acts as a data lake where all data resides in HDFS or object storage. When warehouse features such as schema evolution, snapshots, or time‑travel are needed, table‑format technologies like Iceberg or Hudi can be introduced on top of the lake, creating a “warehouse‑on‑lake” architecture.
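To make the table‑format idea concrete, here is a minimal conceptual sketch in Python. It is not Iceberg's or Hudi's actual API; all class and method names are illustrative. The point is the mechanism: a snapshot log layered over immutable files in the lake, where each commit records the full set of data files making up the table at that moment, which is exactly what enables time‑travel reads.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    """One committed table version: an id plus an immutable file list."""
    snapshot_id: int
    data_files: tuple  # paths in HDFS or object storage; files are never mutated

class TableFormat:
    """Toy stand-in for a table format's metadata layer (illustrative only)."""

    def __init__(self):
        self.snapshots = []

    def commit(self, data_files):
        """Append-only commit: a new snapshot references the current file set."""
        snap = Snapshot(len(self.snapshots) + 1, tuple(data_files))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def read(self, snapshot_id=None):
        """Read the latest snapshot, or time-travel to an older one by id."""
        snap = (self.snapshots[-1] if snapshot_id is None
                else self.snapshots[snapshot_id - 1])
        return list(snap.data_files)

table = TableFormat()
v1 = table.commit(["s3://lake/t/part-0.parquet"])
v2 = table.commit(["s3://lake/t/part-0.parquet", "s3://lake/t/part-1.parquet"])

print(table.read())    # latest snapshot: two data files
print(table.read(v1))  # time-travel: the one-file version still readable
```

Because old snapshots only reference files that are never rewritten, historical versions stay readable until their files are garbage-collected, which is the essence of snapshot isolation and time‑travel in lake table formats.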
Traditional enterprises, especially large ones in finance, often rely on commercial data warehouses such as Teradata or Snowflake, which provide integrated storage and compute. When these companies need additional scenarios like interactive queries or machine learning, they may augment the warehouse with data‑lake formats, forming a “lake‑on‑warehouse” approach. Snowflake now supports Iceberg catalogs, and systems like StarRocks also enable similar extensions.
This hybrid approach allows enterprises to move beyond pure reporting to support machine learning and data exploration workloads.
The ideal goal is a fully integrated lake‑warehouse architecture where a single data system satisfies both warehouse and lake requirements; the path chosen depends on the organization’s current platform and technology stack.
Jin Guowei then asks about the differences between current lake‑warehouse architectures and traditional warehouses.
Shao points out that traditional warehouses offer extensive native database capabilities, including transaction processing, high‑performance CRUD operations, optimized storage layouts, caching mechanisms, and rich indexing structures that accelerate queries.
In contrast, data lakes can provide some warehouse‑like features such as UPSERT, DELETE, time‑travel, and schema evolution, but they lack many enterprise‑grade services such as caching layers, indexing, and statistical metadata, which leads to performance gaps. Closing these gaps requires additional work at the engine level: faster query engines plus enhanced indexing and caching.
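The UPSERT/DELETE capability mentioned above can be sketched in a few lines. This is a hedged, conceptual illustration (the function name and data shapes are invented, not any engine's API) of the common "merge‑on‑read" idea: instead of rewriting large immutable base files on every change, the table logs row‑level changes and merges them with the base data at query time, trading some read performance for cheap writes.

```python
def merge_on_read(base_rows, change_log):
    """Merge a row-level change log into base data at read time.

    base_rows:  {key: value} as stored in immutable base files
    change_log: list of (op, key, value) where op is 'upsert' or 'delete'
    """
    merged = dict(base_rows)  # never mutate the base files themselves
    for op, key, value in change_log:
        if op == "upsert":
            merged[key] = value          # insert new key or overwrite existing
        elif op == "delete":
            merged.pop(key, None)        # tombstone: hide the row from readers
    return merged

base = {1: "alice", 2: "bob"}
log = [("upsert", 2, "bobby"), ("delete", 1, None), ("upsert", 3, "carol")]
print(merge_on_read(base, log))  # {2: 'bobby', 3: 'carol'}
```

This read‑time merge is also why the engine‑level work Shao mentions matters: without indexing, caching, and periodic compaction of the change log back into base files, every query pays the merge cost.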
The discussion references additional Q&A articles that further explore data‑lake scenarios and selection criteria.
In conclusion, the choice of architecture is closely tied to a company’s existing data platform and technology decisions; there is no perfect architecture, only the most suitable one for the given context.
DataFun announces its new "Data Lake Practical Workshop," which details various data architecture constructions and usage scenarios to help practitioners make informed selections, and invites readers to inquire for more information.
Images illustrating the discussion are included in the original source.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.