Designing and Planning a Data Lake on Azure Data Lake Storage Gen2
This article provides a comprehensive guide to planning, structuring, securing, and managing a data lake on Azure Data Lake Storage Gen2, covering zone architecture, folder hierarchy, access control, file formats, scalability considerations, and best‑practice recommendations for big‑data workloads.
Building a data lake can feel overwhelming at first, but many decisions—such as lake architecture, file formats, number of lakes, and security—can be refined over time through experimentation and iteration.
The article begins with an overview of data lake planning, emphasizing the importance of structure, governance, and security based on the lake's scale and complexity. It advises considering what data will be stored, how it will arrive, be transformed, and who will access it, as well as long‑term access‑control strategies.
Four logical zones are described:
Raw zone : immutable, source‑system‑organized data stored in its original format (e.g., JSON, CSV) or in compressed formats such as Avro (row‑based) or the columnar Parquet and Delta Lake formats. Lifecycle management can move data to cooler tiers.
Cleansed zone : filtered data where columns are removed, data types are standardized, and enrichment may occur. Organization is often business‑driven.
Curated zone : consumption layer optimized for analytics, typically stored as denormalized data marts or star schemas using tools like Spark or Data Factory.
Laboratory zone : sandbox for data scientists and engineers to prototype and experiment, with read‑write permissions scoped to teams or projects.
A visual diagram of concepts, tools, and personas in the data lake is provided, noting that a separate sensitive zone may be required for restricted data.
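The zone layout above can be captured as a small, version‑controllable configuration. The sketch below is illustrative only: the container path and security‑group names (`grp-ingestion`, `grp-data-engineers`, and so on) are hypothetical placeholders, not names from the article.

```python
# Illustrative zone layout for an ADLS Gen2 container.
# Group names are hypothetical; substitute your own AAD groups.
ZONES = {
    "raw":       {"path": "raw",       "writers": ["grp-ingestion"],      "readers": ["grp-data-engineers"]},
    "cleansed":  {"path": "cleansed",  "writers": ["grp-data-engineers"], "readers": ["grp-analysts"]},
    "curated":   {"path": "curated",   "writers": ["grp-data-engineers"], "readers": ["grp-bi-consumers"]},
    "laboratory":{"path": "laboratory","writers": ["grp-data-science"],   "readers": ["grp-data-science"]},
}

def zone_path(zone: str, *parts: str) -> str:
    """Build a path inside a zone, e.g. zone_path('raw', 'sales', '2024')."""
    return "/".join([ZONES[zone]["path"], *parts])
```

Keeping this mapping in source control gives every pipeline and permission script a single authoritative list of zones and the groups allowed in each.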
The recommended folder hierarchy should be simple yet expressive, using human‑readable naming, appropriate permission granularity, partitioning strategies, and consistent schemas per folder. Example paths include:
\Raw\DataSource\Entity\YYYY\MM\DD\File.extension

When scaling, consider multiple storage accounts or subscriptions to handle high request rates (up to 20,000 req/s per account) and avoid throttling.
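The date‑partitioned Raw‑zone convention above is easy to encode as a small helper, so every ingestion job produces identical paths. This is a minimal sketch; the function name and the example source/entity values are invented for illustration.

```python
from datetime import date

def raw_path(data_source: str, entity: str, d: date, file_name: str) -> str:
    """Build a Raw-zone path following Raw/DataSource/Entity/YYYY/MM/DD/File."""
    return f"Raw/{data_source}/{entity}/{d:%Y/%m/%d}/{file_name}"

print(raw_path("salesforce", "account", date(2024, 3, 5), "accounts.json"))
# -> Raw/salesforce/account/2024/03/05/accounts.json
```

Zero‑padded year/month/day folders keep listings sorted and make it trivial to scope lifecycle rules or Spark partition filters to a date range.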
Access control in ADLS Gen2 relies on hierarchical namespace (HNS) with both RBAC (account‑level) and ACLs (folder/file‑level). ACLs should be assigned to groups rather than individuals to stay within the 32‑entry limit per object.
Managing permissions is best done via version‑controlled scripts rather than the Azure Storage Explorer UI, ensuring execute rights are granted on every parent folder.
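A version‑controlled permission script can run a pre‑flight check before applying ACLs, catching the two problems called out above: exceeding the 32‑entry limit and assigning entries to individual users instead of groups. The sketch below validates POSIX‑style `type:id:permissions` ACL strings; it is a local check only and does not call the Azure SDK.

```python
MAX_ACL_ENTRIES = 32  # ADLS Gen2 limit on access ACL entries per file or directory

def validate_acl(entries: list[str]) -> list[str]:
    """Return a list of problems found in a proposed ACL.

    Entries use the POSIX form 'type:id:permissions', e.g.
    'group:aad-object-id:r-x' or 'user::rwx' (owner entry, empty id).
    """
    problems = []
    if len(entries) > MAX_ACL_ENTRIES:
        problems.append(f"{len(entries)} entries exceeds the {MAX_ACL_ENTRIES}-entry limit")
    for entry in entries:
        kind, principal, perms = entry.split(":")
        # A named 'user' entry (non-empty id) targets an individual:
        # prefer groups so membership changes don't consume ACL slots.
        if kind == "user" and principal:
            problems.append(f"named user entry '{entry}': assign to a group instead")
    return problems
```

Remember that reading a file also requires execute (`x`) permission on every parent folder, so a script applying these entries should walk the path from the container root down.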
File format selection balances storage cost, performance, and tooling. Parquet is the preferred columnar format for most analytics workloads, while Avro or compressed JSON may be used in the raw zone. Fewer, larger files are more cost‑effective than many small files, since each read transaction covers at most 4 MB of data.
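The small‑file penalty can be made concrete with a little arithmetic. Assuming reads are billed in increments of up to 4 MiB (the transaction unit mentioned above), scanning the same total volume as many tiny files costs far more transactions than scanning it as a few large ones:

```python
import math

MB = 1024 * 1024
READ_UNIT = 4 * MB  # assumed billable read increment (up to 4 MiB per transaction)

def read_transactions(file_sizes_bytes: list[int]) -> int:
    """Billable read transactions needed to scan every file once."""
    return sum(math.ceil(size / READ_UNIT) for size in file_sizes_bytes)

# ~1 GiB as one file vs. 10,000 files of ~105 KiB each:
one_big = read_transactions([1024 * MB])               # 256 transactions
many_small = read_transactions([105 * 1024] * 10_000)  # 10,000 transactions
```

The same asymmetry shows up in query engines: listing and opening thousands of small files dominates runtime, which is why periodic compaction jobs in the cleansed and curated zones pay for themselves.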
The conclusion stresses that there is no one‑size‑fits‑all design; organizations should start simple, iterate, and align lake architecture with ingestion, consumption, security, and governance requirements to avoid a “data swamp.”
An appendix lists ADLS Gen2 limits (e.g., 5 PiB per storage account, 20 k requests/sec, 32 ACL entries per file or directory) and points to further documentation.
Architects Research Society