
Why Data Lakes Are Redefining Enterprise Data Architecture

This article explains the origins, core features, logical architecture, and advantages of data lakes, contrasts them with traditional data warehouses, outlines a modern data architecture that combines lakes and warehouses, and introduces the DCE intelligent data lake platform with practical Q&A.

dbaplus Community

Dec 26, 2016


Data Lake Concept

The term “data lake” is generally credited to James Dixon, then CTO of Pentaho, and reached a wide audience through a 2011 Forbes article. The argument was that traditional data warehouses store pre-processed, structured data and cannot keep up with the volume, variety, and growth rate of big-data workloads. A data lake is a large-scale storage and processing repository that can ingest any type of data and scale elastically, which makes it a core component for scientific research and enterprise decision-making.

Key characteristics of a data lake include:

Collect Everything – Ingest and store data of any format (structured, semi‑structured, unstructured) from both internal and external sources.

Dive In Anywhere – Provide a multi‑engine analytics platform where users can explore, refine, and analyze data across business units.

Flexible Access – Offer a shared infrastructure for storage and analysis, supporting diverse big‑data scenarios.

Logical Architecture of a Data Lake

All structured, semi‑structured, and unstructured data from internal and external sources are stored in the Persistent Layer.

Data scientists and analysts access the Persistent Layer via an Analytics Sandbox, creating curated data sets for business analysts.

Business analysts further refine curated data with the data‑management team, placing the results in the Operational Layer for broader consumption.
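The flow through the three layers can be sketched in a few lines of Python. This is a toy in-memory model, not how any real lake is implemented; the class name `DataLake` and the methods `ingest`, `curate`, and `publish` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataLake:
    """Toy in-memory model of the three logical layers (all names are illustrative)."""
    persistent: list = field(default_factory=list)   # raw records, any shape
    sandbox: dict = field(default_factory=dict)      # curated data sets, by name
    operational: dict = field(default_factory=dict)  # refined, published data sets

    def ingest(self, record):
        # Persistent Layer: store the record as-is, with no schema enforced.
        self.persistent.append(record)

    def curate(self, name, predicate):
        # Analytics Sandbox: select raw records into a named curated data set.
        self.sandbox[name] = [r for r in self.persistent if predicate(r)]
        return self.sandbox[name]

    def publish(self, name, transform):
        # Operational Layer: refine a curated set and publish it for broad use.
        self.operational[name] = [transform(r) for r in self.sandbox[name]]
        return self.operational[name]


lake = DataLake()
lake.ingest({"type": "order", "amount": "19.90", "region": "EU"})  # structured
lake.ingest({"type": "click", "url": "/home"})                     # semi-structured
lake.ingest("free-text support ticket")                            # unstructured

lake.curate("orders", lambda r: isinstance(r, dict) and r.get("type") == "order")
published = lake.publish(
    "orders", lambda r: {"region": r["region"], "amount": float(r["amount"])}
)
print(published)  # [{'region': 'EU', 'amount': 19.9}]
```

Note that the raw records stay in the Persistent Layer untouched; curation and publishing only produce derived copies.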

Differences Between Data Lakes and Data Warehouses

Data warehouses require extensive upfront work: understanding business processes, modeling the data, and designing highly structured schemas. Because warehouse storage is expensive, they tend to keep only the data needed for current reporting. In contrast, data lakes keep all data, whether it is used today, might be used in the future, or may never be used, so analysts can go back to any point in time. Running on commodity hardware keeps this kind of scaling cost-effective.

Data warehouses primarily serve business users who need well‑structured reports, while data lakes also accommodate analysts, statisticians, and data scientists who require raw or semi‑processed data for deeper exploration.

Because data lakes retain data in its original form until needed, they can adapt more quickly to changing requirements, provide faster insight, and avoid the heavy engineering effort required to remodel a warehouse for new use cases.
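Keeping data in its original form and applying structure only when a question is asked is often called schema-on-read. A minimal sketch, assuming raw events arrive as JSON lines; the sample records and the helper `read_with_schema` are made up for illustration.

```python
import json

# Raw events are kept exactly as they arrived (one JSON document per line).
raw_lines = [
    '{"user": "a", "amount": "12.5", "ts": "2016-12-01"}',
    '{"user": "b", "amount": "7", "ts": "2016-12-02", "coupon": "X1"}',
    '{"user": "c", "ts": "2016-12-03"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: select and cast fields, skipping rows that lack them."""
    for line in lines:
        doc = json.loads(line)
        if all(name in doc for name in schema):
            yield {name: cast(doc[name]) for name, cast in schema.items()}


# Today's question: revenue per user.
revenue_rows = list(read_with_schema(raw_lines, {"user": str, "amount": float}))

# A later question: which users redeemed coupons. The raw data is reused as-is.
coupon_rows = list(read_with_schema(raw_lines, {"user": str, "coupon": str}))

print(revenue_rows)
print(coupon_rows)
```

The second question needed a field (`coupon`) that the first schema ignored. Because the raw lines were never reshaped, answering it required a new schema rather than a new load.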

Modern Data Architecture

A modern architecture combines the strengths of data lakes with traditional relational warehouses and OLAP engines, delivering fast queries, interactive analysis, and support for all levels of data consumers.

The architecture, as described by Hortonworks, consists of four layers, listed here from bottom to top:

Data Acquisition Layer – Extracts and moves data from relational systems, user‑generated sources, semi‑structured files, external feeds, and streaming data into the lake.

Data Curation Layer – Organizes, standardizes, masks, cleans, transforms, and manages data for downstream consumption.

Data Provisioning Layer – Stores curated data in traditional formats (OLAP, data warehouse, data marts) to simplify consumption and enable fast interactive queries.

Data Consumption Layer – Exposes interfaces for diverse users and tools, supporting a wide range of analytics and reporting needs.
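The four layers can be read as stages of a pipeline. The sketch below models each layer as a plain Python function over in-memory records. It is not Hortonworks code; the function names, field names, and sample data are all assumptions made for illustration.

```python
def acquire(sources):
    """Data Acquisition: pull records from every source into one raw collection."""
    return [record for source in sources for record in source]


def curate(raw):
    """Data Curation: standardize names, mask a sensitive field, drop incomplete rows."""
    curated = []
    for rec in raw:
        if "customer" not in rec or "amount" not in rec:
            continue  # cleaning: skip records that cannot be used downstream
        curated.append({
            "customer": rec["customer"].strip().lower(),  # standardizing
            "email": "***",                               # masking
            "amount": float(rec["amount"]),               # transforming
        })
    return curated


def provision(curated):
    """Data Provisioning: aggregate into a mart-like table keyed by customer."""
    mart = {}
    for rec in curated:
        mart[rec["customer"]] = mart.get(rec["customer"], 0.0) + rec["amount"]
    return mart


def consume(mart, customer):
    """Data Consumption: a simple lookup interface for reports or tools."""
    return mart.get(customer.lower(), 0.0)


relational = [{"customer": " Alice ", "email": "a@example.com", "amount": "10"}]
stream = [
    {"customer": "alice", "email": "a@example.com", "amount": 5},
    {"event": "heartbeat"},  # no customer or amount, dropped during curation
]

mart = provision(curate(acquire([relational, stream])))
print(consume(mart, "Alice"))  # 15.0
```

Each stage only depends on the output of the one below it, which is the property the layered diagram is meant to convey.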

DCE Intelligent Data Lake Platform

The DCE platform, built on container technology, adopts a “PaaS + Data Lake” dual‑engine model that bridges underlying hardware resources with top‑level applications. During peak business hours, compute resources are allocated to serve applications; at night, they shift to data‑processing workloads, achieving elastic, cost‑effective utilization.
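The day and night shift of compute resources can be illustrated with a small allocation function. The article does not describe DCE's actual scheduler, so everything here is an assumption: the function `allocate`, the peak window of 08:00 to 20:00, and the 80/20 split are hypothetical values chosen only to show the idea.

```python
def allocate(hour, total_cores, peak_start=8, peak_end=20, peak_app_share=0.8):
    """Split cores between applications and data processing by hour of day.

    All defaults are hypothetical; a real platform would derive them from load.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    in_peak = peak_start <= hour < peak_end
    app_share = peak_app_share if in_peak else 1 - peak_app_share
    app_cores = round(total_cores * app_share)
    return {"applications": app_cores, "data_processing": total_cores - app_cores}


print(allocate(10, 100))  # {'applications': 80, 'data_processing': 20}
print(allocate(2, 100))   # {'applications': 20, 'data_processing': 80}
```

The same pool of cores serves both workloads, so neither needs hardware sized for its own peak.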

By integrating cloud-computing and big-data technologies, the platform improves both compute power (the “gold-panning technique”) and storage capacity (the “amount of sand” available to pan). This shortens the path from data generation to value realization and supports continuous business innovation.

Q&A

Q1: What is the relationship between data lakes and stream processing? Can streaming be performed inside a data lake?

A1: A data lake includes distributed storage and processing engines; stream processing (e.g., Storm, Spark Streaming) is one of the capabilities that can run within the lake.
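To make "stream processing" concrete, here is a minimal sketch of one common streaming operation, a tumbling-window count, in plain Python. It does not use the Storm or Spark Streaming APIs, and it runs over a finite list rather than an unbounded stream. The function name and the sample events are made up for illustration.

```python
from collections import Counter, defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_in_seconds, key) pairs.
    """
    windows = defaultdict(Counter)
    for timestamp, key in events:
        # Every timestamp maps to exactly one window, identified by its start time.
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


events = [(0, "login"), (3, "click"), (9, "click"), (10, "login"), (14, "click")]
result = tumbling_window_counts(events, window_seconds=10)
print(result)
```

A real engine would emit each window's result as soon as the window closes and would have to deal with late or out-of-order events, which this sketch ignores.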

Q2: When querying MySQL, what are the side effects of using SET to assign variables before the query?

A2: Variable binding improves code reuse and readability and can reduce hard parsing. If MySQL has a mechanism similar to Oracle’s Shared Pool, the impact on performance is expected to be positive.
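The general idea of binding values to a reusable statement can be sketched with Python's standard-library `sqlite3` module. This is a stand-in used only because it runs without a server: it shows parameter binding, not MySQL's `SET` syntax, and it says nothing about how MySQL parses or caches statements. The table and data are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("EU", 10.0), ("EU", 5.5), ("US", 7.0)],
)

# One statement text, reused with different bound values.
query = "SELECT SUM(amount) FROM orders WHERE region = ?"
totals = {
    region: conn.execute(query, (region,)).fetchone()[0]
    for region in ("EU", "US")
}
conn.close()

print(totals)  # {'EU': 15.5, 'US': 7.0}
```

The statement text never changes between calls, which is what makes reuse possible; whether the database benefits from that depends on its own parsing and caching behavior.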
