Why Data Lakes Are Essential for Modern Data Platforms: Goals, Architecture, and Governance
This article explains the origins and purpose of data lakes, outlines four key construction goals, details common ingestion methods and storage technologies, and describes essential governance practices such as cataloging, data quality, and regulatory compliance.
Origin and Role of Data Lakes
Data lakes were introduced to store massive volumes of raw logs and unstructured data that traditional relational warehouses could not handle. They retain the original data from OLTP systems, server logs, IoT devices, and third‑party sources without a predefined schema, so data scientists and engineers can explore it later. In a data middle platform, the lake sits upstream of data warehouses and data marts, providing raw inputs for complex petabyte‑scale SQL queries and, in some cases, serving data directly to applications.
Four Core Goals of Data‑Lake Construction
Comprehensive Ingestion – Capture and store as much useful data as possible (ODS snapshots, CDC streams, logs, dynamic event data, third‑party datasets) so that past business states can be reconstructed.
Support for Data Warehouses – Provide raw, petabyte‑scale data that can be queried with engines such as Hive, Presto, or Impala and that feeds downstream warehouses.
Data Exploration & Sharing – Enable analysts and engineers to run ad‑hoc SQL, discover patterns, and share results across the organization.
Machine‑Learning Foundation – Offer a unified, enterprise‑wide data source for training and serving ML models.
To meet these goals, a lake must offer comprehensive data sources, secure access, timely data, and a diverse tool set.
Ingestion Methods and Storage Technologies
Typical Sources
ODS (Operational Data Store) – periodic snapshots or CDC‑based captures of OLTP tables.
Server logs – HTTP access logs, error logs, audit trails.
Dynamic data – recommendation outputs, click‑stream events, user‑behavior tracking events (instrumentation points embedded in the application).
Third‑party data – credit scores, ad‑click data, app‑store download statistics.
Common Ingestion Techniques
Batch extraction with Sqoop or DataX (full or incremental).
Real‑time streaming via Kafka and Kafka Connect.
Log collection using Flume or Logstash.
Web crawlers for site‑wide data.
HTTP/Web‑service APIs with custom client scripts.
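Incremental batch extraction, as listed above, is often built around a high‑watermark column. The following is a minimal sketch under stated assumptions, not a description of how Sqoop or DataX work internally: the `orders` table, the `updated_at` column, and the `extract_incremental` helper are hypothetical names, and an in‑memory SQLite database stands in for the OLTP source.

```python
import sqlite3


def extract_incremental(conn, table, watermark_col, last_watermark):
    """Return rows changed since last_watermark, plus the new watermark.

    Hypothetical helper: table and column names are interpolated into the
    SQL text, so they are assumed to be trusted identifiers.
    """
    cur = conn.execute(
        f"SELECT * FROM {table} WHERE {watermark_col} > ? ORDER BY {watermark_col}",
        (last_watermark,),
    )
    columns = [d[0] for d in cur.description]
    rows = [dict(zip(columns, r)) for r in cur.fetchall()]
    # Rows are ordered by the watermark column, so the last row holds the maximum.
    new_watermark = rows[-1][watermark_col] if rows else last_watermark
    return rows, new_watermark


# Stand-in OLTP source (assumption: ISO-8601 timestamps stored as text).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2024-01-01T00:00:00"),
        (2, 25.5, "2024-01-02T00:00:00"),
        (3, 7.25, "2024-01-03T00:00:00"),
    ],
)

# First run: an empty watermark selects every row (full extraction).
full_rows, watermark = extract_incremental(conn, "orders", "updated_at", "")

# A row changes in the source after the first run.
conn.execute(
    "UPDATE orders SET amount = 30.0, updated_at = '2024-01-04T00:00:00' WHERE id = 2"
)

# Later run: only rows modified after the stored watermark are returned.
delta_rows, watermark = extract_incremental(conn, "orders", "updated_at", watermark)
```

A watermark query of this kind does not see hard deletes in the source table, which is one reason CDC‑based capture is often preferred for tables where rows are removed.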
Storage Options
HDFS – General‑purpose file system for logs and bulk files.
Hive – Structured storage for ODS and relational imports.
Key‑Value and Wide‑Column Stores – Cassandra, HBase for low‑latency reads/writes; ClickHouse, a column‑oriented OLAP database, for fast analytical queries.
Document Stores – MongoDB, Couchbase for semi‑structured JSON documents.
Graph Stores – Neo4j, JanusGraph for relationship‑heavy workloads.
Object Stores – Ceph, Amazon S3 for large immutable objects (images, videos, backups).
Non‑Functional Requirements
Scalability – Elastic expansion across clusters or hybrid‑cloud environments; consider multi‑data‑center designs to avoid frequent data migrations when a site runs out of capacity.
High Availability – Replicated storage (e.g., HDFS replication factor ≥3, Ceph/GlusterFS) to guarantee continuous access.
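Storage Efficiency – Use columnar formats (ORC, Parquet) and compression (LZO, Snappy), which can shrink data to roughly one‑sixth or one‑seventh of its raw size depending on the data; columnar storage also lets query engines skip columns a query does not need.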
Durability – Immutable writes, multi‑site backup, and disaster‑recovery planning.
Security – End‑to‑end encryption at rest and in transit, fine‑grained IAM, and compliance‑driven immutability.
Small‑File Handling – Consolidate files well below the HDFS block size (128 MB by default) into larger containers such as HAR files or SequenceFiles to reduce NameNode memory pressure.
Governance & Auditing – Central metadata catalog, data‑quality checks, and audit logs for regulatory compliance.
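The small‑file consolidation described above can be planned before any data is rewritten. The sketch below is one possible approach, assuming a 128 MB target and a greedy first‑fit‑decreasing grouping; `FileInfo`, `plan_compaction`, and the `/lake/logs/...` paths are hypothetical, and the actual merge into a HAR file or SequenceFile is left to a separate job.

```python
from dataclasses import dataclass

MB = 1024 * 1024
TARGET_BYTES = 128 * MB  # assumed target, matching the 128 MB threshold above


@dataclass(frozen=True)
class FileInfo:
    path: str
    size: int  # bytes


def plan_compaction(files, target_bytes=TARGET_BYTES):
    """Group small files into batches whose total size stays <= target_bytes.

    Files at or above the target are left alone. This is a greedy
    first-fit-decreasing pass, not an optimal bin packing.
    """
    small = sorted(
        (f for f in files if f.size < target_bytes),
        key=lambda f: f.size,
        reverse=True,
    )
    batches = []  # each entry: [remaining_capacity, [FileInfo, ...]]
    for f in small:
        for batch in batches:
            if batch[0] >= f.size:
                batch[0] -= f.size
                batch[1].append(f)
                break
        else:
            # No existing batch has room, so open a new one.
            batches.append([target_bytes - f.size, [f]])
    # A batch holding a single file gains nothing from being rewritten.
    return [b[1] for b in batches if len(b[1]) > 1]


files = [
    FileInfo("/lake/logs/part-0", 200 * MB),  # already large, skipped
    FileInfo("/lake/logs/part-1", 60 * MB),
    FileInfo("/lake/logs/part-2", 50 * MB),
    FileInfo("/lake/logs/part-3", 40 * MB),
    FileInfo("/lake/logs/part-4", 30 * MB),
]
plan = plan_compaction(files)
for batch in plan:
    print([f.path for f in batch], sum(f.size for f in batch) // MB, "MB")
```

Each resulting batch replaces several NameNode entries with one, which is the memory saving the requirement is after.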
Data‑Lake Governance
Even though a lake stores raw data, governance is mandatory to keep the data usable, trustworthy, and compliant.
Metadata Catalog – Maintain a central catalog (e.g., Apache Atlas, AWS Glue) that records schema, lineage, and access permissions for all lake assets.
Data Quality – Implement schema validation on ODS loads, enforce consistency rules, and consider a table format such as Delta Lake or Apache Iceberg for ACID transactional guarantees.
Compliance – Apply masking or encryption to personally identifiable information (PII) to help satisfy GDPR, HIPAA, ISO/IEC 27001, and similar requirements; embed privacy checks in the ingestion pipeline.
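The schema validation and PII masking steps above could be combined in a single ingestion hook. This is a minimal sketch rather than a prescribed design: `EXPECTED_SCHEMA`, `PII_FIELDS`, and the `validate`, `mask_pii`, and `ingest` functions are hypothetical, and the keyed SHA‑256 hash is only one of several masking choices.

```python
import hashlib
import hmac

# Hypothetical schema for an ODS table: field name -> required Python type.
EXPECTED_SCHEMA = {"user_id": int, "email": str, "amount": float}
PII_FIELDS = {"email"}  # assumed set of fields treated as PII


def validate(record, schema=EXPECTED_SCHEMA):
    """Return a list of schema violations (empty if the record is valid)."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return errors


def mask_pii(record, secret, pii_fields=PII_FIELDS):
    """Replace PII values with a keyed hash so equal values map to equal tokens."""
    masked = dict(record)
    for field in pii_fields:
        if field in masked:
            value = str(masked[field]).encode("utf-8")
            masked[field] = hmac.new(secret, value, hashlib.sha256).hexdigest()
    return masked


def ingest(records, secret):
    """Split records into masked accepted rows and rejected rows with reasons."""
    accepted, rejected = [], []
    for record in records:
        errors = validate(record)
        if errors:
            # Rejected rows still hold raw PII and need a protected quarantine.
            rejected.append((record, errors))
        else:
            accepted.append(mask_pii(record, secret))
    return accepted, rejected


records = [
    {"user_id": 1, "email": "a@example.com", "amount": 9.5},
    {"user_id": "2", "email": "b@example.com", "amount": 3.0},  # wrong type
    {"user_id": 3, "amount": 1.0},  # missing email
]
accepted, rejected = ingest(records, secret=b"demo-key-not-for-production")
```

Keyed hashing keeps masked columns joinable across tables, but it is pseudonymization rather than anonymization, so the key itself needs the same protection as the original data.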
By integrating these governance components from the outset, a data lake can support both exploratory analytics and production workloads while meeting legal and security obligations.
ITPUB
