Understanding Data Lakes: Definitions, Benefits, Architectures, and Technology Choices
Data lakes, a concept that drew wide attention around 2020, are centralized repositories that store structured and unstructured data at any scale and support flexible analytics, but they need robust management to avoid becoming data swamps. This article covers definitions, advantages, and typical architectures, and compares cloud and open‑source solutions such as AWS Lake Formation, Alibaba Cloud, Delta Lake, Iceberg, and Hudi.
Some Casual Thoughts
Around mid‑2020 the term "data lake" started appearing frequently, but many people used it without fully understanding the concept.
The author notes that in China the trend is to hype buzzwords first, publish articles, and only later start the actual work.
Below are excerpts from Wikipedia and AWS, as well as Alibaba Cloud’s own description of a data lake.
Wikipedia defines it as follows: A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi‑structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). A data swamp is a deteriorated and unmanaged data lake that is either inaccessible to its intended users or is providing little value.
In short, a data lake is a storage system that keeps data in its natural or raw format, typically as object blobs or files. It serves as a single repository for all enterprise data, including raw copies from source systems and transformed data for reporting, visualization, advanced analytics, and machine learning. It can hold structured, semi‑structured, unstructured, and binary data. A poorly managed data lake becomes a "data swamp".
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as‑is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real‑time analytics, and machine learning to guide better decisions.
Alibaba Cloud describes it this way: A data lake is a centralized repository that can store structured and unstructured data at any scale and supports big data and AI computing. Data Lake Formation (DLF) is the core of Alibaba Cloud's cloud‑native data lake architecture, providing unified metadata management, enterprise‑grade permission control, and seamless integration with multiple compute engines, which helps break down data silos and surface business value.
The common theme is the ability to store data of any scale, both structured and unstructured.
Assuming the above definitions are correct, a data lake can be seen as a new generation of data architecture that goes beyond three earlier stages:
Hadoop‑centric offline data warehouse (first stage)
Lambda‑based batch‑stream unified architecture (second stage)
Kappa‑based data consistency architecture (third stage)
These three approaches are already mature ways to build data platforms or data warehouses.
Why are data lakes considered good?
AWS provides a clear answer:
Data warehouses are optimized databases for analyzing relational data from transactional systems and business applications. They require predefined schemas for fast SQL queries and serve as a trusted “single source of truth”. Data lakes differ because they store both relational data from business applications and non‑relational data from mobile apps, IoT devices, and social media. No schema is defined when data is ingested, allowing you to store everything without knowing future query needs. Various analytics (SQL, big‑data processing, full‑text search, real‑time analytics, machine learning) can be applied to gain insights.
This comparison highlights the concept of schema‑on‑read (a "read‑time schema"). A traditional data warehouse requires extensive schema design before any data is loaded, while a data lake stores the data first and applies a schema later, when the data is read.
Because of this flexibility, strong data‑management capabilities are essential; otherwise a data lake degrades into a “data swamp”.
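The schema‑on‑read idea can be illustrated with a minimal Python sketch. It is not taken from AWS or any vendor; the sample records, the `read_with_schema` helper, and the two schemas are hypothetical and exist only to show raw data being stored first and interpreted later.

```python
import io
import json

# Raw events are stored as-is (JSON lines); nothing enforces a schema at write time.
# These records are made-up sample data.
RAW_EVENTS = """\
{"user": "a", "amount": "12.50", "ts": "2020-06-01T10:00:00"}
{"user": "b", "amount": 7, "device": "ios"}
{"user": "c", "note": "free-text record with no amount"}
"""


def read_with_schema(raw, schema):
    """Apply a schema while reading: select fields and cast them, tolerating gaps."""
    rows = []
    for line in io.StringIO(raw):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None instead of failing the whole read.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Two consumers project different schemas onto the same raw data.
billing_schema = {"user": str, "amount": float}
device_schema = {"user": str, "device": str}

billing_rows = read_with_schema(RAW_EVENTS, billing_schema)
device_rows = read_with_schema(RAW_EVENTS, device_schema)
```

Neither schema existed when the events were written, and records that lack a field are still readable. That tolerance is also what makes an unmanaged lake easy to fill with data nobody can interpret.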
Possible Data Lake Architectures
The author shows a typical Alibaba Cloud demo diagram:
The architecture uses OSS as the central storage and connects it to many compute engines, which underlines the need for an extensible storage layer: object storage such as OSS, or a distributed file system such as HDFS.
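The "one central store, many compute engines" pattern can be sketched in a few lines of Python. This is only an illustration under loud assumptions: a temporary local directory stands in for the object store, the file layout is invented, and the two functions are toy stand‑ins for real batch and search engines.

```python
import csv
import json
import tempfile
from pathlib import Path

# A temporary directory stands in for the object store (hypothetical layout).
lake = Path(tempfile.mkdtemp())
(lake / "orders").mkdir()
(lake / "logs").mkdir()

# Data lands in its original format: CSV for orders, JSON lines for logs.
(lake / "orders" / "2020-06.csv").write_text("order_id,amount\n1,10.0\n2,5.5\n")
(lake / "logs" / "app.jsonl").write_text(
    '{"level": "ERROR", "msg": "payment timeout"}\n'
    '{"level": "INFO", "msg": "order created"}\n'
)


def batch_total(store):
    """Toy 'batch engine': sum order amounts across all CSV files."""
    total = 0.0
    for path in sorted((store / "orders").glob("*.csv")):
        with path.open(newline="") as f:
            total += sum(float(row["amount"]) for row in csv.DictReader(f))
    return total


def search_logs(store, keyword):
    """Toy 'search engine': scan log files for a keyword in the message."""
    hits = []
    for path in sorted((store / "logs").glob("*.jsonl")):
        for line in path.read_text().splitlines():
            record = json.loads(line)
            if keyword in record["msg"]:
                hits.append(record)
    return hits


# Both engines read the same store directly; no data is copied between systems.
total = batch_total(lake)
errors = search_logs(lake, "timeout")
```

The point of the sketch is the shape of the design: storage is shared and format‑agnostic, and each engine brings its own way of reading it.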
Other typical architectures from various companies are shown in the following images:
Caption: XiaoHongShu
Caption: Zhongyuan Bank
Caption: Tencent Data Platform
Technology Choices
How major vendors implement data‑lake solutions:
AWS
AWS introduced AWS Lake Formation in 2018; the solution can be deployed in minutes using AWS CloudFormation templates.
Alibaba Cloud
Alibaba Cloud's solution uses OSS as the central storage and offers a Data Lake Formation (DLF) service similar to AWS Lake Formation.
Other cloud providers (Huawei, Tencent, etc.) offer comparable solutions.
Open‑source solutions
The three most prominent open‑source data‑lake frameworks are Delta Lake, Apache Iceberg, and Apache Hudi. The author refers to a previous article for a detailed comparison.
Iceberg and Hudi are widely adopted because they are fully open source and have active communities.
My Confusion
The author’s confusion lies not in the data‑lake technology itself but in the varying definitions offered by different companies.
The term was originally coined by James Dixon, CTO of Pentaho, and today every vendor has its own definition without a clear consensus.
Jing Xuan of Alibaba Cloud argues that a data lake should not be viewed solely as a technical platform. The maturity of a data‑lake solution should be judged by its data‑management capabilities: metadata, data catalog, data source integration, processing tasks and their orchestration, lifecycle, governance, permission management, and integration with external ecosystems.
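To make the data‑management point concrete, here is a minimal Python sketch of a catalog check that flags datasets lacking basic management metadata. Everything in it is an assumption made for illustration: the `CatalogEntry` fields, the `REQUIRED` list, and the `oss://lake/...` paths are hypothetical and do not reflect any vendor's catalog API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one dataset in the lake."""
    name: str
    location: str                          # illustrative path, not a real bucket
    owner: Optional[str] = None            # governance: who is accountable
    schema: Optional[dict] = None          # metadata: column names and types
    retention_days: Optional[int] = None   # lifecycle: how long to keep the data
    readers: list = field(default_factory=list)  # permission control


# Management fields every dataset is expected to carry (an assumed policy).
REQUIRED = ("owner", "schema", "retention_days", "readers")


def missing_metadata(entry):
    """Return the management fields that are unset or empty for an entry."""
    return [name for name in REQUIRED if not getattr(entry, name)]


catalog = [
    CatalogEntry(
        "orders",
        "oss://lake/orders/",
        owner="finance",
        schema={"order_id": "int", "amount": "double"},
        retention_days=365,
        readers=["analysts"],
    ),
    # Data dumped into the lake with no management information at all.
    CatalogEntry("raw_clicks", "oss://lake/raw_clicks/"),
]

# Datasets with a non-empty list are candidates for becoming a "data swamp".
report = {entry.name: missing_metadata(entry) for entry in catalog}
```

A real solution would cover far more (task orchestration, source integration, ecosystem hooks), but even this small check shows why maturity depends on management features and not on storage alone.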
These capabilities are already covered by traditional real‑time or batch data warehouses, so the timing of adopting a data lake requires careful consideration.
In summary, the author suggests “learning now but not using immediately” and waiting for major vendors to fill the gaps before adopting a data‑lake approach.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
