Azure Data Lake Storage Gen2: Design Guide, Best Practices, and Operational Considerations
This guide provides a comprehensive overview of Azure Data Lake Storage Gen2, covering when to use it, key design considerations, data organization strategies, access control models, file formats, cost‑optimization techniques, monitoring approaches, and performance‑tuning tips for large‑scale big‑data workloads.
ADLS Gen2: When Is It the Right Choice for Your Data Lake?
Enterprise data lakes serve as a central repository for unstructured, semi‑structured, and structured data used in big‑data platforms. Their goal is to eliminate data silos and provide a single storage layer that meets diverse organizational data needs. For details on choosing the right storage solution, see the Azure article on selecting big‑data storage technologies.
A common question is when to use a data warehouse versus a data lake. Treat them as complementary: a data lake stores raw data from many sources, while a data warehouse holds highly structured data for specific analytics. Both can be used together to derive insights, build machine‑learning models, and support BI reporting.
ADLS Gen2 is an enterprise‑grade, hyperscale data store optimized for big‑data analytics. It offers hierarchical namespace, faster performance, Hadoop‑compatible access, fine‑grained ACLs, native Azure AD integration, and cost‑effective security. It is suited for workloads that ingest massive volumes (multiple petabytes) and require high throughput (hundreds of Gbps).
Key Design Considerations for a Data Lake
When building an enterprise data lake on ADLS Gen2, consider the following use‑case questions:
What data will be stored?
How much data will be stored?
Which parts of the data will be used for analytics?
Who needs access to which parts?
What analytics workloads will run?
What transaction patterns do the workloads exhibit?
What is the budget?
For each topic, the remainder of this document presents the available options with their pros and cons, the factors to consider, recommended patterns, and anti-patterns.
Terminology
The following key terms are used throughout this guide; the definitions assume you are working within an Azure subscription.
Resource: Manageable Azure objects such as VMs, storage accounts, or virtual networks.
Subscription: Logical container for managing resources and billing.
Resource group: Logical container for related Azure resources.
Storage account: Azure resource that holds blobs, files, queues, tables, and disks. An ADLS Gen2 account is a Blob storage account with the hierarchical namespace enabled.
Container: Organizes a set of objects (or files). There is no limit on the number of containers per account.
Folder/Directory: Organizes objects within a container; supports both access and default ACLs.
Object/File: The actual data entity; carries an access ACL only.
Organizing and Managing Data in the Lake
Customers often use a single logical data lake that may be implemented as one or multiple ADLS Gen2 accounts, depending on governance, billing, and isolation requirements.
Example scenario: a large retail customer (Contoso.com) builds a data‑lake strategy to support various predictive‑analytics use cases.
Centralized vs. Federated Implementation
Two models are available: a fully centralized lake managed by a single team, or a federated model where each business unit owns its own lake while a central team governs security and policies. Both can be realized with a single storage account or multiple accounts.
Key Considerations
Single account simplifies RBAC, firewall, and lifecycle management.
Multiple accounts enable data isolation, separate billing, and per‑unit governance.
Global Enterprise Data Lake
When data must be shared globally, distinguish between globally shared data and region‑restricted data (e.g., personal data subject to sovereignty). Use region‑specific accounts for restricted data while maintaining a logical central lake.
Customer‑Specific Isolation
Multi‑tenant scenarios may require separate lakes per customer to enforce distinct governance and cost models.
Recommendations
Create separate accounts for dev and prod (preferably in different subscriptions).
Identify logical data sets and decide on unified vs. isolated management.
Start with a single account and add more only when isolation or regional requirements demand it.
Remember that other resources (VM cores, Azure Data Factory instances) also have subscription limits.
Anti‑Patterns
Avoid excessive lake management: Do not provision more accounts than needed; each additional account adds operational overhead without corresponding ROI.
Uncontrolled data replication: Replicating data across accounts creates source-of-truth ambiguity and increases transaction costs.
Unverified scalability assumptions: Validate that the account can handle petabyte-scale data and high-throughput workloads; engage Microsoft support if you anticipate more than 10,000 cores of compute or hundreds of Gbps of throughput.
How to Organize My Data?
Data can be organized hierarchically using containers, folders, and files. Common zones include:
Raw data: Ingested as-is; used by data-engineering pipelines for cleaning and enrichment.
Enriched data: Cleaned and combined with other sources; stored in formats like Parquet.
Curated data: High-value, structured data for BI and data-science consumption.
Workspace data: User-provided datasets used alongside curated data for ad-hoc analysis.
Archive data: Long-term retention for compliance; stored in the Cool or Archive tier.
Key Considerations
Choose a folder structure that reflects data semantics and consumer access patterns (e.g., /raw/, /enriched/, /curated/, /workspace/).
Recommendations
Create separate containers or folders for each zone.
Within a zone, partition by logical attributes such as date, region, or business unit.
Consider access‑control models when designing folder hierarchies.
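To make the zone and partitioning convention concrete, here is a minimal Python sketch. The `lake_path` helper and its exact layout are illustrative assumptions, not an Azure API; it simply composes a `/<zone>/<dataset>/<yyyy>/<mm>/<dd>` path in line with the recommendations above:

```python
from datetime import date

# Hypothetical helper: build a zone-scoped, date-partitioned path
# following the /<zone>/<dataset>/<yyyy>/<mm>/<dd> convention.
def lake_path(zone: str, dataset: str, day: date) -> str:
    allowed = {"raw", "enriched", "curated", "workspace"}
    if zone not in allowed:
        raise ValueError(f"unknown zone: {zone}")
    return f"/{zone}/{dataset}/{day:%Y/%m/%d}"

print(lake_path("raw", "sensordata", date(2024, 3, 15)))
# -> /raw/sensordata/2024/03/15
```

Keeping path construction in one place like this makes it easy to enforce the zone vocabulary across all ingestion pipelines.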
| Consideration | Raw | Enriched | Curated | Workspace |
| --- | --- | --- | --- | --- |
| Consumers | Data-engineering team | Data-engineering team + temporary data-science/BI access | Data engineers, BI analysts, data scientists | Data scientists / BI analysts |
| Access control | Locked to engineering team | Full control for engineers, read for analysts | Full control for engineers, read/write for analysts | Full control for engineers and analysts |
| Lifecycle management | Move to cooler tier after enrichment | Move older data to cooler tier | Move older data to cooler tier | Apply policy-driven cleanup (e.g., DLM) |
| Folder hierarchy | Reflects ingestion pattern | Reflects business-unit organization | Reflects business-unit organization | Reflects workspace team structure |
| Example paths | /raw/sensordata, /raw/lobappdata, /raw/userclickdata | /enriched/sales, /enriched/manufacturing | /curated/sales, /curated/manufacturing | /workspace/salesBI, /workspace/manufacturing, /workspace/datascience |
Container vs. Folder
| Consideration | Container | Folder |
| --- | --- | --- |
| Level | Can contain folders or files | Can contain sub-folders or files |
| Azure AD access control | RBAC at container level (coarse-grained) | ACL at folder level (fine-grained) |
| Non-AAD access control | Supports anonymous or SAS-key access | Does not support non-AAD access |
Anti‑Pattern: Unrelated Data Growth
Without lifecycle policies, data can grow rapidly. Two common patterns:
Retaining many versions of refreshed data (e.g., keeping every daily refresh in a 30-day rolling window) multiplies storage and transaction costs rapidly.
Workspace data accumulates when users leave unused datasets in the lake.
How Do I Manage Access to My Data?
ADLS Gen2 supports a combined RBAC and ACL model. RBAC is applied at the storage‑account or container level, while ACLs provide fine‑grained permissions on files and directories. SAS tokens and shared keys are also supported.
Key Considerations
| Consideration | RBAC | ACLs |
| --- | --- | --- |
| Scope | Storage accounts and containers; cross-resource RBAC at subscription or resource-group level | Files and directories |
| Limits | 2,000 role assignments per subscription | 32 ACL entries per file or folder (effectively 28 usable), for both access and default ACLs |
| Supported permission levels | Built-in or custom RBAC roles | ACL permissions (r, w, x) |
When using RBAC alone at the container level, be aware of the 2000‑role‑assignment limit, especially in environments with many containers.
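The combined model can be sketched in a few lines of Python. This is a simplified model of the documented evaluation order (RBAC role assignments are checked first, and ACLs are consulted only when no role grants the operation); all names and data structures here are illustrative, not an SDK API:

```python
# Simplified model of ADLS Gen2 authorization: RBAC is evaluated first;
# ACLs are only consulted when no role assignment grants the operation.
def is_authorized(principal, operation, rbac_grants, acl_entries):
    # rbac_grants: set of (principal, operation) pairs from role assignments
    if (principal, operation) in rbac_grants:
        return True  # RBAC short-circuits; ACLs are not evaluated
    # acl_entries: mapping of principal -> permission string like "r-x"
    perms = acl_entries.get(principal, "---")
    needed = {"read": "r", "write": "w", "execute": "x"}[operation]
    return needed in perms

rbac = {("etl-sp", "write")}
acls = {"analyst-group": "r-x"}
print(is_authorized("etl-sp", "write", rbac, acls))         # True via RBAC
print(is_authorized("analyst-group", "read", rbac, acls))   # True via ACL
print(is_authorized("analyst-group", "write", rbac, acls))  # False
```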
Recommendations
Create security groups for directories and assign them via ACLs rather than creating individual ACL entries for each principal.
Example: a /logs directory with two groups – LogsWriter (rwx) and LogsReader (r‑x). Add ADF, Databricks, and service principals to the appropriate groups.
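The payoff of group-based assignment is that ACL entry counts stay constant as membership changes, which matters given the 32-entry limit per file or directory. A small illustrative sketch (the LogsWriter/LogsReader groups come from the example above; the member names are hypothetical):

```python
# Illustrative sketch: assigning groups (not individuals) keeps the number
# of ACL entries on /logs constant as team membership grows.
# ADLS Gen2 allows at most 32 ACL entries per file or directory.
groups = {
    "LogsWriter": {"adf-sp", "databricks-sp", "etl-team"},
    "LogsReader": {"bi-analysts", "auditors"},
}
acl = {"LogsWriter": "rwx", "LogsReader": "r-x"}  # one ACL entry per group

groups["LogsWriter"].add("new-service-principal")  # membership change only
assert len(acl) == 2  # the ACL on /logs is untouched
print(len(acl))
```

Onboarding a new service is a group-membership operation in Azure AD, not a recursive ACL update across millions of files.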
Which Data Format Should I Choose?
Data may arrive as JSON, CSV, XML, compressed binaries, large files, or many small IoT events. While ADLS Gen2 can store any format, selecting an appropriate format improves pipeline efficiency and cost.
Key Considerations
Avro: Row‑oriented, good for write‑heavy or streaming scenarios (e.g., Event Hub, Kafka).
Parquet / ORC: Column‑oriented, ideal for read‑heavy analytical queries that target subsets of columns.
How Do I Manage My Data Lake Costs?
ADLS Gen2 aims to reduce total cost of ownership. Key cost‑management tactics include lifecycle policies, tiered storage, and appropriate replication choices.
Key Considerations
Use lifecycle policies to delete or tier data after a retention period (e.g., 5‑year compliance).
Choose the correct replication option: GRS for high availability, LRS for development environments.
Transactions are billed per 4 MiB; batch data to increase transaction size and reduce cost.
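As a back-of-the-envelope sketch of the 4 MiB billing increment above (Python; the helper is illustrative, not an Azure API), batching many small appends into one large write cuts the billed transaction units dramatically:

```python
import math

MIB = 1024 * 1024

# Each transaction is billed in 4 MiB increments, so a payload of
# `size_bytes` counts as ceil(size / 4 MiB) billable units.
def billable_units(size_bytes: int) -> int:
    return max(1, math.ceil(size_bytes / (4 * MIB)))

# 1,000 separate 64 KiB appends: 1,000 transactions = 1,000 billed units.
small = 1000 * billable_units(64 * 1024)
# One batched write of the same data (62.5 MiB): 16 billed units.
big = billable_units(1000 * 64 * 1024)
print(small, big)  # 1000 16
```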
How Do I Monitor My Data Lake?
Telemetry is available via Azure Monitor storage logs, which can be routed to Log Analytics, Event Hub, or another storage account.
Key Considerations
For near‑real‑time analysis, send logs to a Log Analytics workspace and query the StorageBlobLogs table with KQL.
For long‑term retention, configure diagnostic settings to archive logs to a storage account.
For third‑party SIEMs (e.g., Splunk), forward logs to Event Hub.
Sample KQL Queries
Frequent operations:

```kusto
StorageBlobLogs
| where TimeGenerated > ago(3d)
| summarize count() by OperationName
| sort by count_ desc
| render piechart
```

High-latency operations:

```kusto
StorageBlobLogs
| where TimeGenerated > ago(3d)
| top 10 by DurationMs desc
| project TimeGenerated, OperationName, DurationMs, ServerLatencyMs, ClientLatencyMs = DurationMs - ServerLatencyMs
```

Operations causing the most errors:

```kusto
StorageBlobLogs
| where TimeGenerated > ago(3d) and StatusText !contains "Success"
| summarize count() by OperationName
| top 10 by count_ desc
```
Optimizing Your Data Lake for Scale and Performance
Performance tuning focuses on two pillars: maximizing transaction throughput (larger per‑transaction payloads) and reducing unnecessary file scans.
File Size and File Count
Many small files increase metadata overhead and reduce throughput. Aim for files of at least 100 MiB; combine small files during ingestion or use streaming services to write larger batches.
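A compaction pass can be sketched as a simple batching problem. The `plan_batches` helper below is a hypothetical illustration (not an Azure or Spark API) of grouping small files into write batches of at least 100 MiB:

```python
# Hypothetical compaction pass: group many small files into batches of at
# least 100 MiB, so each batch can be written to the lake as one object.
TARGET = 100 * 1024 * 1024  # 100 MiB

def plan_batches(file_sizes):
    batches, current, current_size = [], [], 0
    for size in file_sizes:
        current.append(size)
        current_size += size
        if current_size >= TARGET:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)  # leftover batch may be under target
    return batches

# 500 small 1 MiB files collapse into 5 objects instead of 500.
sizes = [1024 * 1024] * 500
print(len(plan_batches(sizes)))  # 5
```

In practice this role is often played by an ingestion service or a scheduled Spark compaction job rather than hand-written code.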
File Format
Parquet is a columnar format that enables predicate push‑down, column pruning, and efficient compression, reducing both I/O and storage cost.
Partitioning Scheme
Effective partitioning groups similar data together, allowing queries to skip irrelevant partitions. Example partition keys include datetime, sensor ID, or geographic region.
Option 1: /<sensorId>/<datetime>/<temperature>
Option 2: /<datetime>/<sensorId>/<temperature>
Option 3: /temperature/<datetime>/<sensorId>
Choose the layout that aligns with the most common query patterns.
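For example, with a datetime-first layout (Option 2), a date-bounded query reduces to listing a handful of prefixes, while a sensor-bounded query would favor Option 1. A small Python sketch (the `/telemetry` root and helper are illustrative assumptions):

```python
from datetime import date, timedelta

# Illustrative prefix pruning for a /<datetime>/<sensorId>/ layout:
# a date-range query only needs to enumerate one prefix per day.
def date_prefixes(start: date, end: date, root: str = "/telemetry"):
    prefixes, day = [], start
    while day <= end:
        prefixes.append(f"{root}/{day:%Y/%m/%d}/")
        day += timedelta(days=1)
    return prefixes

print(date_prefixes(date(2024, 3, 1), date(2024, 3, 3)))
# ['/telemetry/2024/03/01/', '/telemetry/2024/03/02/', '/telemetry/2024/03/03/']
```

Every partition outside these prefixes is skipped entirely, which is the whole point of aligning the layout with the dominant query pattern.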
Query Acceleration (Preview)
ADLS Gen2 Query Acceleration lets you specify predicates and column projections on unstructured data, reducing data transferred and lowering compute costs.
By filtering data early, you pay for fewer storage transactions and less compute.
Additional Resources
For the full article, visit: https://jiagoushi.pro/hitchhikers-guide-data-lake