Designing a Scalable Big Data Permission System: From Hive to Metabase

BanYu’s early data warehouse lacked any access controls, prompting the creation of a comprehensive big‑data permission system that integrates authentication and authorization across Hive, Presto, HDFS, and Metabase using LDAP, Ranger policies, workflow automation, and both synchronous and asynchronous policy initialization.

21CTO

Dec 9, 2021

In BanYu’s early stage, the data warehouse operated without any permission checks or auditing, prioritizing efficiency over security.

Background

Data Access Methods

Data in the offline warehouse can be accessed via:

Hive: Hive CLI (deprecated) or Beeline CLI (HiveServer2 JDBC client).

Presto: Presto JDBC API.

Hadoop: HDFS CLI.

Flink: HDFS Client API.

Data Access Channels

The above methods are used through:

Client tools installed on cluster nodes.

Metabase (BI platform) with Hive or Presto data sources and its own authorization.

DolphinScheduler (offline development platform) using Hive data source and CLI‑based workflows.

Real‑time development platform integrated with Hive.

Design Goals

Balance control and efficiency, aiming to:

Tighten permissions for non‑data‑warehouse teams, especially Metabase users.

Gradually replace Hive CLI with Beeline CLI.

Unify permission operations across systems via a single big‑data permission platform.

Research

Permission control involves two aspects: authentication and authorization.

User Authentication

Supported authentication methods for key components:

HiveServer2: NONE (plain SASL), NOSASL, KERBEROS, LDAP, PAM, CUSTOM.

Presto: Kerberos, LDAP, Password File.

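Hadoop components: Kerberos is the only strong authentication option; the default simple mode trusts the user name supplied by the client.

To avoid introducing a new component, LDAP is chosen for user authentication on HiveServer2 and the Presto Coordinator, the two services that support it. HDFS is covered by authorization only.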

Authorization

HiveServer2: SQL‑standard‑style, column‑level access control, implemented via the Apache Ranger Hive plugin.

Presto and Hadoop: also use Apache Ranger plugins.

System Design

The relationship between components and permission control is illustrated below:

Authentication and authorization are applied on HiveServer2 and Presto Coordinator, while only authorization is applied on HDFS NameNode. Ranger plugins rely on Hadoop Group Mapping, which in turn uses LDAP.

User Authentication Configuration

Example configuration for HiveServer2:

<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>ou=People,dc=ipalfish,dc=com</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldap://*****:389</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.userDNPattern</name>
  <value>cn=%s,ou=bigdata_user,ou=People,dc=ipalfish,dc=com</value>
</property>
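
To make the role of userDNPattern concrete, the sketch below expands the pattern into the DN that HiveServer2 would try to bind as. It assumes the documented behavior that the property is a colon‑separated list of patterns in which %s is replaced by the login name; the function name and the user "alice" are hypothetical, and no escaping of special DN characters is attempted.

```python
def candidate_bind_dns(user_dn_pattern, username):
    """Expand a colon-separated userDNPattern into candidate bind DNs.

    Illustrative only: HiveServer2 performs this substitution internally.
    """
    patterns = [p for p in user_dn_pattern.split(":") if p]
    return [p.replace("%s", username) for p in patterns]


# Pattern taken from the configuration above; "alice" is a made-up user.
pattern = "cn=%s,ou=bigdata_user,ou=People,dc=ipalfish,dc=com"
for dn in candidate_bind_dns(pattern, "alice"):
    print(dn)
```

Authentication succeeds when an LDAP bind with one of these DNs and the supplied password is accepted.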

Authorization Details

User and Group

Granting through groups reduces the work of a permission change from one operation per user, O(N) for N users, to a single operation on their group, O(1). In Ranger, a User can belong to multiple Groups; a policy item matches if it lists the User or any of the User’s Groups.

Group information is obtained via Hadoop’s Group Mapping mechanism, which can source data from LDAP.
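
The matching rule can be sketched in a few lines. This is a simplified model rather than Ranger's evaluator: the PolicyItem class, the group name, and the user names are all hypothetical, and deny rules, conditions, and resource matching are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyItem:
    """Simplified stand-in for one allow rule inside a policy."""
    users: set = field(default_factory=set)
    groups: set = field(default_factory=set)


def is_allowed(item, user, user_groups):
    """A rule matches if it lists the user or any of the user's groups."""
    return user in item.users or bool(item.groups & set(user_groups))


# One grant to the group covers every member, however many there are.
item = PolicyItem(groups={"bi_analysts"})
membership = {"alice": ["bi_analysts"], "bob": ["bi_analysts"], "carol": []}
for user, groups in membership.items():
    print(user, is_allowed(item, user, groups))
```

In the real system the membership lookup would be answered by Hadoop Group Mapping backed by LDAP, not by an in-memory dictionary.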

Permission Levels

Columns are classified into three sensitivity levels: P0 (most sensitive), P1, and P2, each with its own approval workflow.

Policy Types

Ranger supports two policy types:

Access: Grants direct read/write rights on columns.

Mask: Applies data masking for columns without explicit access, built on top of Access policies.

All policies configured in Ranger Admin are periodically pulled by Ranger plugins and enforced at runtime.

Policy Configuration

When a Hive table is created, its permission policies must be generated. Two approaches are considered:

Synchronous Creation: Embed policy initialization in the table‑creation workflow. This works for tables created via the data‑modeling platform, which supplies metadata, but not for tables created directly from the CLI.

Asynchronous Creation: Publish table‑creation events to a message queue (via the Metadata Center built on Apache Atlas). A listener consumes these events and creates the corresponding Ranger policies.

Modifications to column permission levels are only allowed through the Metadata Center, triggering policy updates via the same asynchronous mechanism.
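
As a rough illustration of the asynchronous path, the sketch below turns one table‑creation event into Ranger‑style policy payloads, one per sensitivity level. Everything about it is an assumption: the event shape, the service name "hive", the policy naming scheme, and the choice to grant only the table owner initially. The payload field names are modeled on Ranger's policy JSON and should be checked against the Ranger version in use; the sketch only builds dictionaries and does not call Ranger.

```python
from collections import defaultdict


def build_policies(event, service_name="hive"):
    """Build one access-policy payload per sensitivity level (P0/P1/P2)."""
    columns_by_level = defaultdict(list)
    for column, level in event["columns"].items():
        columns_by_level[level].append(column)

    policies = []
    for level in sorted(columns_by_level):
        policies.append({
            "service": service_name,  # hypothetical Ranger service name
            "name": f'{event["database"]}.{event["table"]}.{level}',
            "resources": {
                "database": {"values": [event["database"]]},
                "table": {"values": [event["table"]]},
                "column": {"values": sorted(columns_by_level[level])},
            },
            # Assumption: only the owner can read until tickets grant more.
            "policyItems": [{
                "users": [event["owner"]],
                "groups": [],
                "accesses": [{"type": "select", "isAllowed": True}],
            }],
        })
    return policies


# Hypothetical event, as a listener might receive it from the queue.
event = {
    "database": "dw",
    "table": "orders",
    "owner": "alice",
    "columns": {"phone": "P0", "user_id": "P1", "amount": "P2"},
}
for policy in build_policies(event):
    print(policy["name"], policy["resources"]["column"]["values"])
```

A change to a column's level in the Metadata Center could reuse the same function: rebuild the payloads for the table and update the affected policies.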

Overall architecture:

The big‑data permission system consists of:

Workflow State Machine: Handles ticket status transitions.

EventListener: Listens to metadata change events.

Deployment: Pushes policies to Ranger and integrates with Metabase.
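
The Workflow State Machine can be pictured as a small transition table. The states and actions below are hypothetical, since the article does not list the real ticket statuses; the point is only that illegal transitions are rejected in one place.

```python
from enum import Enum


class Status(Enum):
    SUBMITTED = "submitted"
    APPROVED = "approved"
    REJECTED = "rejected"
    DEPLOYED = "deployed"


# (current status, action) -> next status; anything else is illegal.
TRANSITIONS = {
    (Status.SUBMITTED, "approve"): Status.APPROVED,
    (Status.SUBMITTED, "reject"): Status.REJECTED,
    (Status.APPROVED, "deploy"): Status.DEPLOYED,
}


def advance(status, action):
    """Return the next ticket status or raise ValueError if not allowed."""
    try:
        return TRANSITIONS[(status, action)]
    except KeyError:
        raise ValueError(
            f"illegal transition: {status.name} on action {action!r}"
        ) from None


status = Status.SUBMITTED
for action in ("approve", "deploy"):
    status = advance(status, action)
    print(action, "->", status.name)
```

In this picture, the Deployment component would run when a ticket reaches the final state and push the resulting policy to Ranger or Metabase.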

Permission Integration with Metabase

Metabase, the company’s primary BI tool, has its own cumbersome permission model. The new system provides:

Report Development Permission: Users can sync accounts, creating a data source with their credentials; the data source is then used for report development.

Report Viewing Permission: Permissions are granted to Metabase groups. Syncing an account also creates a corresponding group and adds the user’s Metabase account to it. Report‑view requests generate tickets that, once approved, grant the group access.
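
The account‑to‑group arrangement can be modeled in memory as follows. This is a stand‑in for illustration only: it makes no Metabase API calls, and the class, the "user_" group naming scheme, and the report identifier are hypothetical.

```python
class MetabaseModel:
    """In-memory stand-in for the groups and report grants described above."""

    def __init__(self):
        self.groups = {}      # group name -> set of member accounts
        self.report_acl = {}  # report id -> set of group names

    def sync_account(self, account):
        """Create the account's own group (if missing) and add the account."""
        group = f"user_{account}"
        self.groups.setdefault(group, set()).add(account)
        return group

    def approve_view_ticket(self, group, report_id):
        """Grant the group view access once its ticket is approved."""
        if group not in self.groups:
            raise KeyError(f"unknown group: {group}")
        self.report_acl.setdefault(report_id, set()).add(group)

    def can_view(self, account, report_id):
        allowed_groups = self.report_acl.get(report_id, set())
        return any(account in self.groups[g] for g in allowed_groups)


model = MetabaseModel()
group = model.sync_account("alice")
print(model.can_view("alice", "report-1"))
model.approve_view_ticket(group, "report-1")
print(model.can_view("alice", "report-1"))
```

Because grants always target a group, the ticket workflow never has to edit per‑user permissions in Metabase.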

Example of a Metabase report permission ticket is shown below:

Conclusion

The article outlines the core design of BanYu’s big‑data permission system. Permissions for Presto, Hive, HDFS, and Metabase are managed through a single platform, and a ticket‑based workflow automates most of the work of granting access, which substantially lowers its cost.

Source: BanYu Technology Department
