
How to Build a Data Lake Quickly: Strategies, Tools, and Real‑World Cases

This article explains the origins and market growth of data lakes, compares them with traditional data warehouses, showcases major implementations like Amazon Galaxy and Club Factory, and provides practical guidance on choosing open‑source or commercial cloud solutions to construct a data lake efficiently while minimizing risk.

ITPUB

What Is a Data Lake?

A data lake is a storage architecture that holds raw data of any format (structured, semi‑structured, or unstructured) without requiring upfront transformation. The concept was first introduced in 2011 and is built on the principle of separating storage from compute, typically using inexpensive, scalable storage such as the Hadoop Distributed File System (HDFS) or an object store such as Amazon S3.

Data Lake vs. Data Warehouse

Data warehouses originate from relational databases and are optimized for structured data and predefined schemas. In contrast, a data lake can ingest any data type in its native form, enabling flexible analytics and lower storage costs. Engines such as Presto, Elasticsearch, and Amazon Athena can work with data in the lake, and SQL engines like Presto and Athena query it in place, reducing the need for extensive ETL pipelines.
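The contrast can be made concrete with a small sketch of the "schema-on-read" idea commonly associated with data lakes: records are stored exactly as they arrive, and a schema is applied only when someone reads them. This is an illustration under stated assumptions, not a production pattern; the event fields and the `read_with_schema` helper are hypothetical, and an in-memory list stands in for lake storage.

```python
import json

# Raw events kept exactly as produced; fields vary from record to record.
RAW_LINES = [
    '{"user_id": "u1", "event": "click", "ts": "2023-01-05T10:00:00"}',
    '{"user_id": "u2", "event": "view", "device": "ios"}',
    '{"user_id": 3, "event": "click", "ts": "2023-01-06T08:30:00"}',
]

# Schema chosen by the reader at query time: column name -> caster.
SCHEMA = {"user_id": str, "event": str, "ts": str}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project them onto the given schema.

    Missing fields become None; present fields are cast to the schema type.
    Fields not named in the schema are ignored, but stay in the raw data.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for column, caster in schema.items():
            value = record.get(column)
            row[column] = caster(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, SCHEMA)
clicks = [r for r in rows if r["event"] == "click"]
print(len(rows), "rows,", len(clicks), "clicks")
```

A warehouse would typically reject or reshape the second and third records at load time; here they are stored untouched, and a different reader could later apply a schema that includes `device`.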

Market Scale

The global data‑lake market was valued at US$3.74 billion in 2019 and is projected to reach US$17.6 billion by 2025, a compound annual growth rate (CAGR) of 29.9 %.
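As a rough sanity check on the growth rate, the CAGR implied by two endpoint figures depends only on their ratio and the number of compounding years. The sketch below assumes a ratio of about 4.7 between the 2025 and 2019 figures and six compounding years; the `implied_cagr` helper is hypothetical. The result lands in the same range as the quoted 29.9 %, and the small gap is likely down to the base year and rounding used by the original forecast.

```python
def implied_cagr(growth_multiple: float, years: int) -> float:
    """Return the compound annual growth rate implied by a total growth multiple."""
    if growth_multiple <= 0 or years <= 0:
        raise ValueError("growth_multiple and years must be positive")
    return growth_multiple ** (1.0 / years) - 1.0


# Assumptions: end value / start value is roughly 4.7,
# and 2019 -> 2025 is treated as six compounding years.
rate = implied_cagr(4.7, 6)
print(f"Implied CAGR: {rate:.1%}")
```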

Real‑World Deployments

Amazon Galaxy (internal code name) was launched in 2019. It uses Amazon S3 as the storage layer, Amazon Redshift and Redshift Spectrum for analytics, and Amazon EMR for large‑scale processing. Storage grew from 50 PB to 100 PB, supporting up to 600 000 daily analytical tasks across recommendation, operations, inventory, and pricing domains.

Club Factory, a Chinese cross‑border e‑commerce platform, built its data lake on AWS. It processes 1.5 billion user‑behavior events per day, supports more than 80 data engineers, runs 180 active analysis jobs, synchronizes over 4 000 data sets daily, and stores roughly 600 TB of data.

Approaches to Building a Data Lake

There are two primary approaches:

Open‑source solutions: No licensing fees, but they require a skilled team to integrate fragmented components (e.g., HDFS, Hive, Presto, Airflow). Complexity and operational risk are high for first‑time adopters.

Commercial cloud solutions: Provide managed services, cost monitoring, mature tooling, scalability, and built‑in security. Public‑cloud offerings enable rapid construction with less operational overhead.

Quick‑Start Stack on AWS

A minimal, production‑ready data lake on AWS typically includes:

Amazon S3 for durable object storage.

AWS Glue for serverless ETL, schema discovery, and a centralized metadata catalog.

Amazon Athena for interactive, serverless SQL queries directly on data stored in S3.

AWS Lake Formation can further automate lake creation, permissions management, and data ingestion, allowing a secure data lake to be provisioned in days rather than months (note: not yet available in mainland China).
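To show how the three core services fit together, the sketch below builds the kind of `CREATE EXTERNAL TABLE` statement that can be run in Athena to register Parquet files in S3 as a table in the Glue Data Catalog. It only generates the SQL text and does not call AWS. The `build_external_table_ddl` helper, the bucket name `example-data-lake`, and the column list are assumptions made for illustration.

```python
def build_external_table_ddl(database, table, columns, location):
    """Build an Athena CREATE EXTERNAL TABLE statement for Parquet data in S3.

    columns is a list of (name, hive_type) pairs; location is an S3 prefix.
    """
    if not location.startswith("s3://") or not location.endswith("/"):
        raise ValueError("location must be an s3:// prefix ending with '/'")
    if not columns:
        raise ValueError("at least one column is required")
    # Backticks guard column names that collide with reserved words.
    column_sql = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {column_sql}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


ddl = build_external_table_ddl(
    database="analytics",
    table="user_events",
    columns=[("user_id", "string"), ("event_timestamp", "timestamp")],
    location="s3://example-data-lake/raw/user_events/",  # hypothetical bucket
)
print(ddl)
```

The table is "external" because Athena stores only metadata: the files stay in S3, and dropping the table leaves them in place.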

Example: Querying a Parquet Table with Athena

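-- Top 10 (user, day) pairs by event count in January 2023
SELECT user_id,
       COUNT(*) AS event_cnt,
       DATE(event_timestamp) AS event_date
FROM "my_glue_catalog"."analytics"."user_events"
-- A SELECT alias is not visible in WHERE, so filter on the expression itself
WHERE DATE(event_timestamp) BETWEEN DATE '2023-01-01' AND DATE '2023-01-31'
-- Aliases cannot be used in GROUP BY either; repeat the expression
GROUP BY user_id, DATE(event_timestamp)
ORDER BY event_cnt DESC
LIMIT 10;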

This query runs on Athena without provisioning any servers; you are billed only for the scanned data (e.g., $5 per terabyte).
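Because billing follows the bytes scanned, a quick estimate shows why columnar formats and narrower scans matter. The sketch below uses the example rate of $5 per terabyte; the 10 MB per-query minimum, the rounding up to whole megabytes, and the binary definition of a terabyte are assumptions to check against the current Athena pricing page, and `estimate_query_cost` is a hypothetical helper.

```python
import math

PRICE_PER_TB_USD = 5.0               # example rate from the text above
BYTES_PER_MB = 1024 ** 2
BYTES_PER_TB = 1024 ** 4             # assumption: binary terabyte
MIN_BILLED_BYTES = 10 * BYTES_PER_MB  # assumption: 10 MB minimum per query


def estimate_query_cost(bytes_scanned: int) -> float:
    """Estimate the USD cost of one query from the bytes it scans."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    # Round up to whole megabytes, then apply the per-query minimum.
    rounded = math.ceil(bytes_scanned / BYTES_PER_MB) * BYTES_PER_MB
    billed = max(rounded, MIN_BILLED_BYTES)
    return billed / BYTES_PER_TB * PRICE_PER_TB_USD


full_scan = estimate_query_cost(200 * 1024 ** 3)  # e.g. scanning 200 GB of raw data
narrow_scan = estimate_query_cost(6 * 1024 ** 3)  # e.g. reading 6 GB of needed columns
print(f"full scan:   ${full_scan:.4f}")
print(f"narrow scan: ${narrow_scan:.4f}")
```

The 200 GB and 6 GB figures are made up for the example; the point is that cost scales linearly with scanned bytes, so reading fewer columns or fewer files lowers the bill directly.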

AWS Glue and Amazon Athena Details

AWS Glue is a fully managed extract‑transform‑load (ETL) service that automatically discovers schemas, creates a persistent metadata catalog, and generates Spark‑based ETL scripts. It simplifies loading data into data warehouses, data lakes, or downstream analytics platforms.
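The idea behind schema discovery can be illustrated with a toy example that samples JSON records and infers a column-to-type mapping. This is not how Glue's crawlers are implemented; it is a minimal sketch of the concept, and the `infer_schema` helper, the type mapping, and the sample records are all hypothetical.

```python
import json

# Python type -> Hive-style type name (illustrative mapping only).
TYPE_NAMES = {bool: "boolean", int: "bigint", float: "double", str: "string"}


def infer_schema(json_lines):
    """Infer a column -> type mapping from sample JSON records.

    Integers and floats in the same column widen to "double";
    any other type conflict falls back to "string".
    Null values carry no type information and are skipped.
    """
    schema = {}
    for line in json_lines:
        for column, value in json.loads(line).items():
            if value is None:
                continue
            inferred = TYPE_NAMES.get(type(value), "string")
            previous = schema.get(column)
            if previous is None or previous == inferred:
                schema[column] = inferred
            elif {previous, inferred} == {"bigint", "double"}:
                schema[column] = "double"
            else:
                schema[column] = "string"
    return schema


sample = [
    '{"user_id": "u1", "amount": 10, "is_new": true}',
    '{"user_id": "u2", "amount": 12.5, "is_new": false}',
    '{"user_id": "u3", "amount": 7, "country": "DE"}',
]
schema = infer_schema(sample)
for column, dtype in schema.items():
    print(f"{column}: {dtype}")
```

A catalog entry built from such a mapping is what lets a query engine treat a folder of files as a table with named, typed columns.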

Amazon Athena provides a serverless, ANSI‑SQL‑compatible query engine. It reads data directly from S3 in formats such as CSV, JSON, Parquet, ORC, and Avro. Because Athena leverages the Glue Data Catalog for schema information, you can query the same tables used by other AWS analytics services.

Regional Availability

At the time of writing, AWS Glue and Amazon Athena are available in the AWS China (Ningxia) region operated by NWCD (Ningxia Western Cloud Data), enabling the same serverless data‑lake workflow for customers in mainland China.

