Azure Data Lake Storage Gen2: Design Guide, Best Practices, and Operational Considerations
This guide provides a comprehensive overview of Azure Data Lake Storage Gen2, covering when to use it, key design considerations, data organization strategies, access control models, file formats, cost‑optimization techniques, monitoring approaches, and performance‑tuning tips for large‑scale big‑data workloads.
ADLS Gen2: When Is It the Right Choice for Your Data Lake?
Enterprise data lakes serve as a central repository for unstructured, semi‑structured, and structured data used in big‑data platforms. Their goal is to eliminate data silos and provide a single storage layer that meets diverse organizational data needs. For details on choosing the right storage solution, see the Azure article on selecting big‑data storage technologies.
A common question is when to use a data warehouse versus a data lake. Treat them as complementary: a data lake stores raw data from many sources, while a data warehouse holds highly structured data for specific analytics. Both can be used together to derive insights, build machine‑learning models, and support BI reporting.
ADLS Gen2 is an enterprise‑grade, hyperscale data store optimized for big‑data analytics. It offers hierarchical namespace, faster performance, Hadoop‑compatible access, fine‑grained ACLs, native Azure AD integration, and cost‑effective security. It is suited for workloads that ingest massive volumes (multiple petabytes) and require high throughput (hundreds of Gbps).
Key Design Considerations for a Data Lake
When building an enterprise data lake on ADLS Gen2, consider the following use‑case questions:
What data will be stored?
How much data will be stored?
Which parts of the data will be used for analytics?
Who needs access to which parts?
What analytics workloads will run?
What transaction patterns do the workloads exhibit?
What is the budget?
For each topic, the remainder of this document presents the available options with their pros and cons, the factors to consider, recommended patterns, and anti-patterns.
Terminology
The following key terms are used throughout this guide; the definitions assume you are working within an Azure subscription.
Resource: Manageable Azure objects such as VMs, storage accounts, or virtual networks.
Subscription: Logical container for managing resources and billing.
Resource group: Logical container for related Azure resources.
Storage account: Azure resource that holds blobs, files, queues, tables, and disks. An ADLS Gen2 account is a Blob storage account with the hierarchical namespace enabled.
Container: Organizes a set of objects (or files). There is no limit on the number of containers per account.
Folder/Directory: Organizes objects within a container; supports both access and default ACLs.
Object/File: The actual data entity; carries an access ACL only.
Organizing and Managing Data in the Lake
Customers often use a single logical data lake that may be implemented as one or multiple ADLS Gen2 accounts, depending on governance, billing, and isolation requirements.
Example scenario: a large retail customer (Contoso.com) builds a data‑lake strategy to support various predictive‑analytics use cases.
Centralized vs. Federated Implementation
Two models are available: a fully centralized lake managed by a single team, or a federated model where each business unit owns its own lake while a central team governs security and policies. Both can be realized with a single storage account or multiple accounts.
Key Considerations
Single account simplifies RBAC, firewall, and lifecycle management.
Multiple accounts enable data isolation, separate billing, and per‑unit governance.
Global Enterprise Data Lake
When data must be shared globally, distinguish between globally shared data and region‑restricted data (e.g., personal data subject to sovereignty). Use region‑specific accounts for restricted data while maintaining a logical central lake.
Customer‑Specific Isolation
Multi‑tenant scenarios may require separate lakes per customer to enforce distinct governance and cost models.
Recommendations
Create separate accounts for dev and prod (preferably in different subscriptions).
Identify logical data sets and decide on unified vs. isolated management.
Start with a single account and add more only when isolation or regional requirements demand it.
Remember that other resources (VM cores, Azure Data Factory instances) also have subscription limits.
Anti‑Patterns
Avoid excessive lake management: Do not provision more accounts than needed; each additional account adds operational overhead without corresponding ROI.
Uncontrolled data replication: Replicating data across accounts creates source-of-truth ambiguity and increases transaction costs.
Unverified scalability assumptions: Validate that the account can handle petabyte-scale data and high-throughput workloads; engage Microsoft support if you anticipate more than 10,000 cores of compute or hundreds of Gbps of throughput.
How to Organize My Data?
Data can be organized hierarchically using containers, folders, and files. Common zones include:
Raw data: Ingested as-is; used by data-engineering pipelines for cleaning and enrichment.
Enriched data: Cleaned and combined with other sources; stored in formats like Parquet.
Curated data: High-value, structured data for BI and data-science consumption.
Workspace data: User-provided datasets used alongside curated data for ad-hoc analysis.
Archive data: Long-term retention for compliance; stored in the Cool or Archive tier.
Key Considerations
Choose a folder structure that reflects data semantics and consumer access patterns (e.g., /raw/, /enriched/, /curated/, /workspace/).
Recommendations
Create separate containers or folders for each zone.
Within a zone, partition by logical attributes such as date, region, or business unit.
Consider access‑control models when designing folder hierarchies.
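To make the zone and partitioning convention concrete, here is a minimal Python sketch. The `lake_path` helper and its exact layout are illustrative assumptions, not an Azure API; it simply composes a `/<zone>/<dataset>/<yyyy>/<mm>/<dd>` path in line with the recommendations above:

```python
from datetime import date

# Hypothetical helper: build a zone-scoped, date-partitioned path
# following the /<zone>/<dataset>/<yyyy>/<mm>/<dd> convention.
def lake_path(zone: str, dataset: str, day: date) -> str:
    allowed = {"raw", "enriched", "curated", "workspace"}
    if zone not in allowed:
        raise ValueError(f"unknown zone: {zone}")
    return f"/{zone}/{dataset}/{day:%Y/%m/%d}"

print(lake_path("raw", "sensordata", date(2024, 3, 15)))
# -> /raw/sensordata/2024/03/15
```

Keeping path construction in one place like this makes it easy to enforce the zone vocabulary across all ingestion pipelines.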
| Consideration | Raw | Enriched | Curated | Workspace |
| --- | --- | --- | --- | --- |
| Consumers | Data-engineering team | Data-engineering team + temporary data-science/BI access | Data engineers, BI analysts, data scientists | Data scientists / BI analysts |
| Access control | Locked to engineering team | Full control for engineers, read for analysts | Full control for engineers, read/write for analysts | Full control for engineers and analysts |
| Lifecycle management | Move to cooler tier after enrichment | Move older data to cooler tier | Move older data to cooler tier | Apply policy-driven cleanup (e.g., DLM) |
| Folder hierarchy | Reflects ingestion pattern | Reflects business-unit organization | Reflects business-unit organization | Reflects workspace team structure |
| Example paths | /raw/sensordata, /raw/lobappdata, /raw/userclickdata | /enriched/sales, /enriched/manufacturing | /curated/sales, /curated/manufacturing | /workspace/salesBI, /workspace/manufacturing, /workspace/datascience |
Container vs. Folder
| Consideration | Container | Folder |
| --- | --- | --- |
| Level | Can contain folders or files | Can contain sub-folders or files |
| Azure AD access control | RBAC at container level (coarse-grained) | ACL at folder level (fine-grained) |
| Non-AAD access control | Supports anonymous or SAS-key access | Does not support non-AAD access |
Anti‑Pattern: Unrelated Data Growth
Without lifecycle policies, data can grow rapidly. Two common patterns:
Retaining many versions of refreshed data (e.g., keeping every daily refresh in a 30-day rolling window) multiplies storage and transaction costs rapidly.
Workspace data accumulates when users leave unused datasets in the lake.
How Do I Manage Access to My Data?
ADLS Gen2 supports a combined RBAC and ACL model. RBAC is applied at the storage‑account or container level, while ACLs provide fine‑grained permissions on files and directories. SAS tokens and shared keys are also supported.
Key Considerations
| Consideration | RBAC | ACLs |
| --- | --- | --- |
| Scope | Storage accounts and containers; cross-resource RBAC at subscription or resource-group level | Files and directories |
| Limits | 2,000 role assignments per subscription | 32 ACL entries per file or folder (effectively 28 usable), for both access and default ACLs |
| Supported permission levels | Built-in or custom RBAC roles | ACL permissions (r, w, x) |
When using RBAC alone at the container level, be aware of the 2000‑role‑assignment limit, especially in environments with many containers.
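The combined model can be sketched in a few lines of Python. This is a simplified model of the documented evaluation order (RBAC role assignments are checked first, and ACLs are consulted only when no role grants the operation); all names and data structures here are illustrative, not an SDK API:

```python
# Simplified model of ADLS Gen2 authorization: RBAC is evaluated first;
# ACLs are only consulted when no role assignment grants the operation.
def is_authorized(principal, operation, rbac_grants, acl_entries):
    # rbac_grants: set of (principal, operation) pairs from role assignments
    if (principal, operation) in rbac_grants:
        return True  # RBAC short-circuits; ACLs are not evaluated
    # acl_entries: mapping of principal -> permission string like "r-x"
    perms = acl_entries.get(principal, "---")
    needed = {"read": "r", "write": "w", "execute": "x"}[operation]
    return needed in perms

rbac = {("etl-sp", "write")}
acls = {"analyst-group": "r-x"}
print(is_authorized("etl-sp", "write", rbac, acls))         # True via RBAC
print(is_authorized("analyst-group", "read", rbac, acls))   # True via ACL
print(is_authorized("analyst-group", "write", rbac, acls))  # False
```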
Recommendations
Create security groups for directories and assign them via ACLs rather than creating individual ACL entries for each principal.
Example: a /logs directory with two groups – LogsWriter (rwx) and LogsReader (r‑x). Add ADF, Databricks, and service principals to the appropriate groups.
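The payoff of group-based assignment is that ACL entry counts stay constant as membership changes, which matters given the 32-entry limit per file or directory. A small illustrative sketch (the LogsWriter/LogsReader groups come from the example above; the member names are hypothetical):

```python
# Illustrative sketch: assigning groups (not individuals) keeps the number
# of ACL entries on /logs constant as team membership grows.
# ADLS Gen2 allows at most 32 ACL entries per file or directory.
groups = {
    "LogsWriter": {"adf-sp", "databricks-sp", "etl-team"},
    "LogsReader": {"bi-analysts", "auditors"},
}
acl = {"LogsWriter": "rwx", "LogsReader": "r-x"}  # one ACL entry per group

groups["LogsWriter"].add("new-service-principal")  # membership change only
assert len(acl) == 2  # the ACL on /logs is untouched
print(len(acl))
```

Onboarding a new service is a group-membership operation in Azure AD, not a recursive ACL update across millions of files.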
Which Data Format Should I Choose?
Data may arrive as JSON, CSV, XML, compressed binaries, large files, or many small IoT events. While ADLS Gen2 can store any format, selecting an appropriate format improves pipeline efficiency and cost.
Key Considerations
Avro: Row‑oriented, good for write‑heavy or streaming scenarios (e.g., Event Hub, Kafka).
Parquet / ORC: Column‑oriented, ideal for read‑heavy analytical queries that target subsets of columns.
How Do I Manage My Data Lake Costs?
ADLS Gen2 aims to reduce total cost of ownership. Key cost‑management tactics include lifecycle policies, tiered storage, and appropriate replication choices.
Key Considerations
Use lifecycle policies to delete or tier data after a retention period (e.g., 5‑year compliance).
Choose the correct replication option: GRS for high availability, LRS for development environments.
Transactions are billed per 4 MiB; batch data to increase transaction size and reduce cost.
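As a back-of-the-envelope sketch of the 4 MiB billing increment above (Python; the helper is illustrative, not an Azure API), batching many small appends into one large write cuts the billed transaction units dramatically:

```python
import math

MIB = 1024 * 1024

# Each transaction is billed in 4 MiB increments, so a payload of
# `size_bytes` counts as ceil(size / 4 MiB) billable units.
def billable_units(size_bytes: int) -> int:
    return max(1, math.ceil(size_bytes / (4 * MIB)))

# 1,000 separate 64 KiB appends: 1,000 transactions = 1,000 billed units.
small = 1000 * billable_units(64 * 1024)
# One batched write of the same data (62.5 MiB): 16 billed units.
big = billable_units(1000 * 64 * 1024)
print(small, big)  # 1000 16
```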
How Do I Monitor My Data Lake?
Telemetry is available via Azure Monitor storage logs, which can be routed to Log Analytics, Event Hub, or another storage account.
Key Considerations
For near‑real‑time analysis, send logs to a Log Analytics workspace and query the StorageBlobLogs table with KQL.
For long‑term retention, configure diagnostic settings to archive logs to a storage account.
For third‑party SIEMs (e.g., Splunk), forward logs to Event Hub.
Sample KQL Queries
Frequent operations:

```kusto
StorageBlobLogs
| where TimeGenerated > ago(3d)
| summarize count() by OperationName
| sort by count_ desc
| render piechart
```

High-latency operations:

```kusto
StorageBlobLogs
| where TimeGenerated > ago(3d)
| top 10 by DurationMs desc
| project TimeGenerated, OperationName, DurationMs, ServerLatencyMs, ClientLatencyMs = DurationMs - ServerLatencyMs
```

Operations causing the most errors:

```kusto
StorageBlobLogs
| where TimeGenerated > ago(3d) and StatusText !contains "Success"
| summarize count() by OperationName
| top 10 by count_ desc
```
Optimizing Your Data Lake for Scale and Performance
Performance tuning focuses on two pillars: maximizing transaction throughput (larger per‑transaction payloads) and reducing unnecessary file scans.
File Size and File Count
Many small files increase metadata overhead and reduce throughput. Aim for files of at least 100 MiB; combine small files during ingestion or use streaming services to write larger batches.
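A compaction pass can be sketched as a simple batching problem. The `plan_batches` helper below is a hypothetical illustration (not an Azure or Spark API) of grouping small files into write batches of at least 100 MiB:

```python
# Hypothetical compaction pass: group many small files into batches of at
# least 100 MiB, so each batch can be written to the lake as one object.
TARGET = 100 * 1024 * 1024  # 100 MiB

def plan_batches(file_sizes):
    batches, current, current_size = [], [], 0
    for size in file_sizes:
        current.append(size)
        current_size += size
        if current_size >= TARGET:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)  # leftover batch may be under target
    return batches

# 500 small 1 MiB files collapse into 5 objects instead of 500.
sizes = [1024 * 1024] * 500
print(len(plan_batches(sizes)))  # 5
```

In practice this role is often played by an ingestion service or a scheduled Spark compaction job rather than hand-written code.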
File Format
Parquet is a columnar format that enables predicate push‑down, column pruning, and efficient compression, reducing both I/O and storage cost.
Partitioning Scheme
Effective partitioning groups similar data together, allowing queries to skip irrelevant partitions. Example partition keys include datetime, sensor ID, or geographic region.
Option 1: /<sensorId>/<datetime>/<temperature>
Option 2: /<datetime>/<sensorId>/<temperature>
Option 3: /temperature/<datetime>/<sensorId>
Choose the layout that aligns with the most common query patterns.
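For example, with a datetime-first layout (Option 2), a date-bounded query reduces to listing a handful of prefixes, while a sensor-bounded query would favor Option 1. A small Python sketch (the `/telemetry` root and helper are illustrative assumptions):

```python
from datetime import date, timedelta

# Illustrative prefix pruning for a /<datetime>/<sensorId>/ layout:
# a date-range query only needs to enumerate one prefix per day.
def date_prefixes(start: date, end: date, root: str = "/telemetry"):
    prefixes, day = [], start
    while day <= end:
        prefixes.append(f"{root}/{day:%Y/%m/%d}/")
        day += timedelta(days=1)
    return prefixes

print(date_prefixes(date(2024, 3, 1), date(2024, 3, 3)))
# ['/telemetry/2024/03/01/', '/telemetry/2024/03/02/', '/telemetry/2024/03/03/']
```

Every partition outside these prefixes is skipped entirely, which is the whole point of aligning the layout with the dominant query pattern.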
Query Acceleration (Preview)
ADLS Gen2 Query Acceleration lets you specify predicates and column projections on unstructured data, reducing data transferred and lowering compute costs.
By filtering data early, you pay for fewer storage transactions and less compute.
Additional Resources
For the full article, visit: https://jiagoushi.pro/hitchhikers-guide-data-lake