
How to Build a Scalable Big Data Access Control System with Hive, Presto, and Ranger

This article details the design and implementation of a comprehensive big data permission system that integrates Hive, Presto, Hadoop, and Metabase, covering data access methods, authentication choices, Ranger-based authorization, policy management, and automated workflow integration to balance security and efficiency.

IT Architects Alliance

Data Access Methods

Data stored in the offline warehouse can be accessed through:

Hive – Hive CLI (legacy) or the recommended Beeline CLI (HiveServer2 JDBC client) and the HiveServer2 JDBC API.

Presto – Presto JDBC API.

Hadoop – HDFS command‑line interface.

Flink – HDFS client API.

Access Channels

These methods are exposed via:

Client tools installed on cluster nodes.

Metabase (BI platform) – configured with Hive or Presto data sources; Metabase has its own internal authorization.

DolphinScheduler – abstracts data sources (primarily HiveServer2 JDBC) and runs shell‑based workflows that rely on installed CLI tools.

Real‑time development platform – integrates with Hive for streaming tasks.

Design Goals

Tighten permissions for non‑warehouse teams, especially Metabase users who currently share a single high‑privilege data source.

Gradually replace Hive CLI with Beeline CLI to standardize access.

Provide a unified permission portal that can manage authorizations across Hive, Presto, HDFS and Metabase.

User Authentication

Supported authentication mechanisms:

HiveServer2 – Kerberos, NONE (plain SASL), NOSASL, LDAP, PAM, Custom.

Presto – Kerberos, LDAP, Password file.

Hadoop – Kerberos only (the alternative, "simple" mode, trusts the user name supplied by the client).

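To avoid the operational overhead of Kerberos, the system reuses the LDAP directory that was already in place. LDAP authenticates users at HiveServer2 and at the Presto coordinator, and it also supplies the user and group information that Hadoop relies on. Hadoop components themselves remain unauthenticated, including at the data‑node level.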
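From a client's point of view, LDAP authentication means supplying the LDAP user name and password when opening a JDBC connection. The following is a minimal sketch of that idea, not code from the described system: the class name `LdapJdbcSketch` and its helper methods are hypothetical, and `connect` only works when the matching Hive or Presto JDBC driver is on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

/** Illustrative only: connecting to HiveServer2 or Presto with LDAP credentials. */
final class LdapJdbcSketch {

    /** Builds a HiveServer2 JDBC URL of the form jdbc:hive2://host:port/database. */
    static String hiveUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    /** Builds a Presto JDBC URL of the form jdbc:presto://host:port/catalog/schema. */
    static String prestoUrl(String host, int port, String catalog, String schema) {
        return "jdbc:presto://" + host + ":" + port + "/" + catalog + "/" + schema;
    }

    /** Opens a connection; the server checks the credentials against LDAP. */
    static Connection connect(String url, String ldapUser, String ldapPassword) throws SQLException {
        Properties props = new Properties();
        props.setProperty("user", ldapUser);
        props.setProperty("password", ldapPassword);
        if (url.startsWith("jdbc:presto:")) {
            // The Presto driver only sends a password over an HTTPS connection.
            props.setProperty("SSL", "true");
        }
        return DriverManager.getConnection(url, props);
    }
}
```

A caller would pass, for example, `LdapJdbcSketch.hiveUrl(host, 10000, "default")` together with the user's own LDAP credentials, so that every query runs under a personal identity rather than a shared account.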

Authorization

HiveServer2 supports SQL‑standard fine‑grained (column‑level) authorization. The implementation is plug‑in based; Apache Ranger was selected (alternatives include Apache Sentry). Presto and Hadoop also use Ranger plugins.

System Design

Authentication and authorization are both applied at HiveServer2 and at the Presto coordinator; the HDFS NameNode applies authorization only. The Ranger plugins rely on Hadoop Group Mapping, which in turn reads group membership from LDAP.

System architecture diagram

HiveServer2 LDAP Configuration Example

hive.server2.authentication=LDAP
hive.server2.authentication.ldap.baseDN=ou=People,dc=ipalfish,dc=com
hive.server2.authentication.ldap.url=ldap://*****:389
hive.server2.authentication.ldap.userDNPattern=cn=%s,ou=bigdata_user,ou=People,dc=ipalfish,dc=com
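
To make the role of `userDNPattern` concrete: the login name replaces `%s` to form the distinguished name used for the LDAP bind. The sketch below mimics that substitution in plain Java. It is an illustration, not Hive's internal code; `LdapBindDnSketch` and `candidateBindDns` are hypothetical names, and the colon-separated handling reflects the documented format of the property.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: expands a userDNPattern value into candidate bind DNs. */
final class LdapBindDnSketch {

    /**
     * Substitutes the user name for "%s" in every colon-separated pattern.
     * A server would try to bind with each candidate DN in turn.
     * Note: a real implementation must also escape special characters in the user name.
     */
    static List<String> candidateBindDns(String userDnPattern, String userName) {
        List<String> candidates = new ArrayList<>();
        for (String pattern : userDnPattern.split(":")) {
            String trimmed = pattern.trim();
            if (!trimmed.isEmpty()) {
                candidates.add(trimmed.replace("%s", userName));
            }
        }
        return candidates;
    }
}
```

With the pattern from the configuration above, a login as `alice` would bind as `cn=alice,ou=bigdata_user,ou=People,dc=ipalfish,dc=com`.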

User and Group Model

Introducing user groups reduces the cost of granting a permission to N users from N policy changes (O(N)) to a single change on their shared group (O(1)). In Ranger, a User can belong to multiple Groups; a policy matches if the User or any of its Groups appears in the policy.
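
The matching rule can be sketched in a few lines of Java. This is a simplified model for illustration, not Ranger's policy engine; the class `GroupPolicySketch` and its method names are hypothetical.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: a policy that lists users and groups. */
final class GroupPolicySketch {
    private final Set<String> users;
    private final Set<String> groups;

    GroupPolicySketch(Set<String> users, Set<String> groups) {
        this.users = new HashSet<>(users);
        this.groups = new HashSet<>(groups);
    }

    /** The policy applies if the user is listed directly or shares at least one group with it. */
    boolean appliesTo(String user, Set<String> userGroups) {
        if (users.contains(user)) {
            return true;
        }
        return !Collections.disjoint(groups, userGroups);
    }
}
```

Granting access to a new analyst then means adding that person to the group in LDAP; the policy itself does not change.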

Group information is obtained via Hadoop’s Group Mapping mechanism, which can pull data from LDAP. The following Java fragment shows how the Ranger Presto plugin retrieves user and group data:

private RangerPrestoAccessRequest createAccessRequest(RangerPrestoResource resource, SystemSecurityContext context, PrestoAccessType accessType) {
    String userName = null;
    Set<String> userGroups = null;
    if (useUgi) {
        // Resolve groups through Hadoop's UserGroupInformation, i.e. the configured Group Mapping.
        UserGroupInformation ugi = UserGroupInformation.createRemoteUser(context.getIdentity().getUser());
        userName = ugi.getShortUserName();
        String[] groups = ugi != null ? ugi.getGroupNames() : null;
        if (groups != null && groups.length > 0) {
            userGroups = new HashSet<>(Arrays.asList(groups));
        }
    } else {
        // Otherwise use the user and groups carried by the Presto identity.
        userName = context.getIdentity().getUser();
        userGroups = context.getIdentity().getGroups();
    }
    // The request carries both, so a policy can match on the user or on any of its groups.
    return new RangerPrestoAccessRequest(resource, userName, userGroups, accessType);
}

Permission Levels

Columns are classified into three sensitivity levels:

P0 – most sensitive, longest approval workflow.

P1 – medium sensitivity.

P2 – low sensitivity.

Requests are made at column granularity; the level determines the approval flow.
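
One way to picture the link between level and approval flow is an enum whose constants carry their approval chain. The sketch below is hypothetical: the approver role names are invented for illustration, and the rule that a multi-column request follows its strictest column is an assumption, not something the article states.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

/** Illustrative only: sensitivity levels and hypothetical approval chains. */
enum SensitivityLevelSketch {
    // Declared from most to least sensitive, so the natural enum order is strictest first.
    P0("team-lead", "data-owner", "security-officer"),
    P1("team-lead", "data-owner"),
    P2("team-lead");

    private final List<String> approvalChain;

    SensitivityLevelSketch(String... approvers) {
        this.approvalChain = Collections.unmodifiableList(Arrays.asList(approvers));
    }

    /** Approver roles that must sign off, in order. */
    List<String> approvalChain() {
        return approvalChain;
    }

    /** Assumption: a request covering several columns follows the strictest level among them. */
    static SensitivityLevelSketch strictest(Collection<SensitivityLevelSketch> levels) {
        return Collections.min(levels);
    }
}
```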

Policy Types

Access – grants read/write rights to specific columns. A naïve approach would require a separate policy per column, which is impractical for wide tables.

Mask – supplemental policy that masks data for columns without explicit access. Typically a single Access policy with wildcard "*" is combined with per‑column Mask policies.
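
The combination of one wildcard Access policy with per-column Mask policies can be modeled as follows. This is a deliberately simplified sketch, not Ranger's evaluation logic: `ColumnMaskSketch` is a hypothetical class, and it assumes that a masked column is shown in clear text only to users who were explicitly approved for it.

```java
import java.util.Map;
import java.util.Set;

/** Illustrative only: wildcard access plus per-column masking. */
final class ColumnMaskSketch {
    enum Decision { DENY, MASKED, CLEAR }

    // Users covered by the single Access policy on "*" (all columns of the table).
    private final Set<String> tableReaders;
    // Column name -> users exempt from that column's Mask policy.
    private final Map<String, Set<String>> unmaskedReaders;

    ColumnMaskSketch(Set<String> tableReaders, Map<String, Set<String>> unmaskedReaders) {
        this.tableReaders = tableReaders;
        this.unmaskedReaders = unmaskedReaders;
    }

    Decision decide(String user, String column) {
        if (!tableReaders.contains(user)) {
            return Decision.DENY; // no Access policy covers this user
        }
        Set<String> exempt = unmaskedReaders.get(column);
        if (exempt == null) {
            return Decision.CLEAR; // no Mask policy on this column
        }
        return exempt.contains(user) ? Decision.CLEAR : Decision.MASKED;
    }
}
```

In this model, a table with hundreds of columns needs one Access policy plus one Mask policy per sensitive column, instead of one Access policy per column.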

Each Ranger plugin runs background threads that periodically pull the policies configured in Ranger Admin; the plugin then enforces them at runtime.
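
The pull-and-enforce pattern can be sketched with a scheduled task that swaps in a fresh policy snapshot. This is not the Ranger plugin's code: `PolicyRefresherSketch` is a hypothetical class, the `Supplier` stands in for the call to the policy server, and keeping the last good snapshot when a pull fails is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Illustrative only: periodic policy refresh with an atomically swapped snapshot. */
final class PolicyRefresherSketch {
    private final Supplier<List<String>> policySource; // hypothetical stand-in for the policy server
    private final AtomicReference<List<String>> snapshot =
            new AtomicReference<>(Collections.<String>emptyList());

    PolicyRefresherSketch(Supplier<List<String>> policySource) {
        this.policySource = policySource;
    }

    /** Pulls once; keeps serving the previous snapshot if the pull fails. */
    void refreshOnce() {
        try {
            List<String> fresh = new ArrayList<>(policySource.get());
            snapshot.set(Collections.unmodifiableList(fresh));
        } catch (RuntimeException e) {
            // Policy server unreachable: leave the last good snapshot in place.
        }
    }

    /** Starts a daemon thread that refreshes at a fixed delay; the caller shuts it down. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "policy-refresher");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleWithFixedDelay(this::refreshOnce, 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }

    /** Read by the enforcement path on every access check. */
    List<String> currentPolicies() {
        return snapshot.get();
    }
}
```

Because enforcement reads a local snapshot, a policy change takes effect only after the next pull, so grants and revocations are visible with a delay of up to one refresh period.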

Policy Configuration Workflow

Because tables can be created through multiple channels (data‑modeling platform, direct CLI, etc.), synchronous policy creation is difficult. The system adopts an asynchronous approach:

Table‑creation events are emitted to a message queue.

An Apache Atlas‑based metadata center captures these events via Hive hooks.

The Atlas server forwards the events to a dedicated topic that the permission system listens to, triggering policy initialization.

Modifications to column permission levels are also captured by the metadata center, causing automatic policy updates.
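
The asynchronous flow above can be sketched with an in-memory queue standing in for the message topic. Everything here is hypothetical: the article does not describe the event format, so `TableCreated`, its fields, and the textual policy descriptions are assumptions made for illustration. The consumer derives one wildcard Access policy per table and one Mask policy per sensitive column.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Illustrative only: initializing policies from table-creation events. */
final class PolicyInitSketch {

    /** Hypothetical event payload; the real message format is not given in the article. */
    static final class TableCreated {
        final String database;
        final String table;
        final List<String> sensitiveColumns;

        TableCreated(String database, String table, List<String> sensitiveColumns) {
            this.database = database;
            this.table = table;
            this.sensitiveColumns = sensitiveColumns;
        }
    }

    // Stand-in for the dedicated topic the permission system listens to.
    private final BlockingQueue<TableCreated> topic = new LinkedBlockingQueue<>();

    void publish(TableCreated event) {
        topic.add(event);
    }

    /** Drains all queued events and returns descriptions of the policies to create. */
    List<String> initializePendingPolicies() {
        List<String> policies = new ArrayList<>();
        TableCreated event;
        while ((event = topic.poll()) != null) {
            String name = event.database + "." + event.table;
            policies.add("ACCESS " + name + ".*");
            for (String column : event.sensitiveColumns) {
                policies.add("MASK " + name + "." + column);
            }
        }
        return policies;
    }
}
```

Because the consumer is decoupled from table creation, it does not matter which channel created the table, as long as the event reaches the topic.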

Metabase Integration

Metabase’s native permission model is cumbersome. The integrated system provides two automated flows:

Report development permission – a synchronized Metabase account is created for the user; a data source using the user’s credentials is provisioned, allowing the user to develop reports against an authorized source.

Report view permission – the user is added to a Metabase group; the group is granted report access via an automated workflow. Approval of a report‑view request triggers the group‑level permission grant.

Both processes are driven by the same ticket‑based permission‑workflow engine, so once a ticket is approved the grant is applied automatically, with no manual configuration in Metabase.
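
The report-view flow can be modeled in memory as follows. This sketch does not call Metabase; `ReportViewGrantSketch` and its methods are hypothetical and only show the bookkeeping that an approved ticket would trigger: the user joins a group, and the group is granted the report.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: group-level report grants triggered by ticket approval. */
final class ReportViewGrantSketch {
    private final Map<String, Set<String>> groupMembers = new HashMap<>();
    private final Map<String, Set<String>> groupReports = new HashMap<>();

    /** Called when a report-view ticket is approved. */
    void onTicketApproved(String user, String group, String report) {
        groupMembers.computeIfAbsent(group, g -> new HashSet<>()).add(user);
        groupReports.computeIfAbsent(group, g -> new HashSet<>()).add(report);
    }

    /** A user can view a report if any group they belong to has been granted it. */
    boolean canView(String user, String report) {
        for (Map.Entry<String, Set<String>> entry : groupMembers.entrySet()) {
            Set<String> reports =
                    groupReports.getOrDefault(entry.getKey(), Collections.<String>emptySet());
            if (entry.getValue().contains(user) && reports.contains(report)) {
                return true;
            }
        }
        return false;
    }
}
```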

Summary

The described big‑data permission system consolidates authentication (LDAP) and fine‑grained authorization (Apache Ranger) for Hive, Presto, HDFS and Metabase, introduces a ticket‑based request flow, and achieves high automation while maintaining column‑level security controls.

