Designing and Planning a Data Lake on Azure Data Lake Storage Gen2
This article provides a comprehensive guide to planning, structuring, securing, and managing a data lake on Azure Data Lake Storage Gen2, covering zone architecture, folder hierarchy, access control, file formats, scalability considerations, and best‑practice recommendations for big‑data workloads.
Building a data lake can feel overwhelming at first, but many decisions—such as lake architecture, file formats, number of lakes, and security—can be refined over time through experimentation and iteration.
The article begins with an overview of data lake planning, emphasizing the importance of structure, governance, and security based on the lake's scale and complexity. It advises considering what data will be stored, how it will arrive, be transformed, and who will access it, as well as long‑term access‑control strategies.
Four logical zones are described:
Raw zone : immutable, source‑system‑organized data stored in its original format (e.g., JSON, CSV) or in compressed formats such as Avro (row‑based) or the columnar Parquet and Delta Lake formats. Lifecycle management can move data to cooler tiers.
Cleansed zone : filtered data where columns are removed, data types are standardized, and enrichment may occur. Organization is often business‑driven.
Curated zone : consumption layer optimized for analytics, typically stored as denormalized data marts or star schemas using tools like Spark or Data Factory.
Laboratory zone : sandbox for data scientists and engineers to prototype and experiment, with read‑write permissions scoped to teams or projects.
A visual diagram of concepts, tools, and personas in the data lake is provided, noting that a separate sensitive zone may be required for restricted data.
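The zone layout above can be captured as a small, version‑controllable configuration. The sketch below is illustrative only: the container path and security‑group names (`grp-ingestion`, `grp-data-engineers`, and so on) are hypothetical placeholders, not names from the article.

```python
# Illustrative zone layout for an ADLS Gen2 container.
# Group names are hypothetical; substitute your own AAD groups.
ZONES = {
    "raw":       {"path": "raw",       "writers": ["grp-ingestion"],      "readers": ["grp-data-engineers"]},
    "cleansed":  {"path": "cleansed",  "writers": ["grp-data-engineers"], "readers": ["grp-analysts"]},
    "curated":   {"path": "curated",   "writers": ["grp-data-engineers"], "readers": ["grp-bi-consumers"]},
    "laboratory":{"path": "laboratory","writers": ["grp-data-science"],   "readers": ["grp-data-science"]},
}

def zone_path(zone: str, *parts: str) -> str:
    """Build a path inside a zone, e.g. zone_path('raw', 'sales', '2024')."""
    return "/".join([ZONES[zone]["path"], *parts])
```

Keeping this mapping in source control gives every pipeline and permission script a single authoritative list of zones and the groups allowed in each.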
The recommended folder hierarchy should be simple yet expressive, using human‑readable naming, appropriate permission granularity, partitioning strategies, and consistent schemas per folder. Example paths include:
\Raw\DataSource\Entity\YYYY\MM\DD\File.extension

When scaling, consider multiple storage accounts or subscriptions to handle high request rates (up to 20,000 req/s per account) and avoid throttling.
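The date‑partitioned Raw‑zone convention above is easy to encode as a small helper, so every ingestion job produces identical paths. This is a minimal sketch; the function name and the example source/entity values are invented for illustration.

```python
from datetime import date

def raw_path(data_source: str, entity: str, d: date, file_name: str) -> str:
    """Build a Raw-zone path following Raw/DataSource/Entity/YYYY/MM/DD/File."""
    return f"Raw/{data_source}/{entity}/{d:%Y/%m/%d}/{file_name}"

print(raw_path("salesforce", "account", date(2024, 3, 5), "accounts.json"))
# -> Raw/salesforce/account/2024/03/05/accounts.json
```

Zero‑padded year/month/day folders keep listings sorted and make it trivial to scope lifecycle rules or Spark partition filters to a date range.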
Access control in ADLS Gen2 relies on hierarchical namespace (HNS) with both RBAC (account‑level) and ACLs (folder/file‑level). ACLs should be assigned to groups rather than individuals to stay within the 32‑entry limit per object.
Managing permissions is best done via version‑controlled scripts rather than the Azure Storage Explorer UI, ensuring execute rights are granted on every parent folder.
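A version‑controlled permission script can run a pre‑flight check before applying ACLs, catching the two problems called out above: exceeding the 32‑entry limit and assigning entries to individual users instead of groups. The sketch below validates POSIX‑style `type:id:permissions` ACL strings; it is a local check only and does not call the Azure SDK.

```python
MAX_ACL_ENTRIES = 32  # ADLS Gen2 limit on access ACL entries per file or directory

def validate_acl(entries: list[str]) -> list[str]:
    """Return a list of problems found in a proposed ACL.

    Entries use the POSIX form 'type:id:permissions', e.g.
    'group:aad-object-id:r-x' or 'user::rwx' (owner entry, empty id).
    """
    problems = []
    if len(entries) > MAX_ACL_ENTRIES:
        problems.append(f"{len(entries)} entries exceeds the {MAX_ACL_ENTRIES}-entry limit")
    for entry in entries:
        kind, principal, perms = entry.split(":")
        # A named 'user' entry (non-empty id) targets an individual:
        # prefer groups so membership changes don't consume ACL slots.
        if kind == "user" and principal:
            problems.append(f"named user entry '{entry}': assign to a group instead")
    return problems
```

Remember that reading a file also requires execute (`x`) permission on every parent folder, so a script applying these entries should walk the path from the container root down.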
File format selection balances storage cost, performance, and tooling. Parquet is the preferred columnar format for most analytics workloads, while Avro or compressed JSON may be used in the raw zone. Fewer, larger files are more cost‑effective than many small files, since each read transaction covers at most 4 MB of data.
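The small‑file penalty can be made concrete with a little arithmetic. Assuming reads are billed in increments of up to 4 MiB (the transaction unit mentioned above), scanning the same total volume as many tiny files costs far more transactions than scanning it as a few large ones:

```python
import math

MB = 1024 * 1024
READ_UNIT = 4 * MB  # assumed billable read increment (up to 4 MiB per transaction)

def read_transactions(file_sizes_bytes: list[int]) -> int:
    """Billable read transactions needed to scan every file once."""
    return sum(math.ceil(size / READ_UNIT) for size in file_sizes_bytes)

# ~1 GiB as one file vs. 10,000 files of ~105 KiB each:
one_big = read_transactions([1024 * MB])               # 256 transactions
many_small = read_transactions([105 * 1024] * 10_000)  # 10,000 transactions
```

The same asymmetry shows up in query engines: listing and opening thousands of small files dominates runtime, which is why periodic compaction jobs in the cleansed and curated zones pay for themselves.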
The conclusion stresses that there is no one‑size‑fits‑all design; organizations should start simple, iterate, and align lake architecture with ingestion, consumption, security, and governance requirements to avoid a “data swamp.”
An appendix lists ADLS Gen2 limits (e.g., 5 PiB per storage account, 20 k requests/sec, 32 ACL entries per file or directory) and points to further documentation.
Architects Research Society