Designing a Scalable Big Data Permission System: From Hive to Metabase
BanYu’s early data warehouse lacked any access controls, prompting the creation of a comprehensive big‑data permission system that integrates authentication and authorization across Hive, Presto, HDFS, and Metabase using LDAP, Ranger policies, workflow automation, and both synchronous and asynchronous policy initialization.
In BanYu’s early stage, the data warehouse operated without any permission checks or auditing, prioritizing efficiency over security.
Background
Data Access Methods
Data in the offline warehouse can be accessed via:
Hive: Hive CLI (deprecated) or Beeline CLI (HiveServer2 JDBC client).
Presto: Presto JDBC API.
Hadoop: HDFS CLI.
Flink: HDFS Client API.
Data Access Channels
The above methods are used through:
Client tools installed on cluster nodes.
Metabase (BI platform) with Hive or Presto data sources and its own authorization.
DolphinScheduler (offline development platform) using Hive data source and CLI‑based workflows.
Real‑time development platform integrated with Hive.
Design Goals
Balance control and efficiency, aiming to:
Tighten permissions for non‑data‑warehouse teams, especially Metabase users.
Gradually replace Hive CLI with Beeline CLI.
Unify permission operations across systems via a single big‑data permission platform.
Research
Permission control involves two aspects: authentication and authorization.
User Authentication
Supported authentication methods for key components:
HiveServer2: Kerberos, SASL, NOSASL, LDAP, PAM, Custom.
Presto: Kerberos, LDAP, Password File.
Hadoop components: Kerberos only.
To avoid adding new components, LDAP is chosen for user authentication on HiveServer2 and Presto, the two services above that support it.
Authorization
HiveServer2: SQL‑standard authorization with column‑level access control, implemented via the Apache Ranger plugin.
Presto and Hadoop: authorization is likewise enforced through Apache Ranger plugins.
System Design
The relationship between components and permission control is illustrated below:
Authentication and authorization are applied on HiveServer2 and Presto Coordinator, while only authorization is applied on HDFS NameNode. Ranger plugins rely on Hadoop Group Mapping, which in turn uses LDAP.
User Authentication Configuration
Example configuration for HiveServer2:
<property>
<name>hive.server2.authentication</name>
<value>LDAP</value>
</property>
<property>
<name>hive.server2.authentication.ldap.baseDN</name>
<value>ou=People,dc=ipalfish,dc=com</value>
</property>
<property>
<name>hive.server2.authentication.ldap.url</name>
<value>ldap://*****:389</value>
</property>
<property>
<name>hive.server2.authentication.ldap.userDNPattern</name>
<value>cn=%s,ou=bigdata_user,ou=People,dc=ipalfish,dc=com</value>
</property>
Authorization Details
User and Group
Introducing groups reduces the cost of granting a permission to N users from O(N) policy changes (one per user) to O(1) (one per group). In Ranger, a User can belong to multiple Groups; a policy passes if the User or any of its Groups is listed.
Group information is obtained via Hadoop’s Group Mapping mechanism, which can source data from LDAP.
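The user-or-group rule can be illustrated with a minimal Python sketch. The `Policy` class, the resource name, and the group memberships below are hypothetical stand-ins, not Ranger's actual data model; in the real system the memberships would come from Hadoop Group Mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Simplified, hypothetical stand-in for one Ranger policy item."""
    resource: str
    users: frozenset = frozenset()
    groups: frozenset = frozenset()


def is_allowed(policy, user, user_groups):
    """Pass if the user is listed directly or belongs to any listed group."""
    return user in policy.users or bool(policy.groups & set(user_groups))


# Hypothetical memberships, as Hadoop Group Mapping might resolve them from LDAP.
memberships = {"alice": {"bi_analysts"}, "bob": {"finance"}, "carol": set()}

policy = Policy(
    resource="dw.orders.amount",
    users=frozenset({"carol"}),
    groups=frozenset({"bi_analysts"}),
)

for name in ("alice", "bob", "carol"):
    print(name, is_allowed(policy, name, memberships[name]))
```

Adding a new analyst to `bi_analysts` in LDAP grants access without touching the policy, which is where the O(1) saving comes from.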
Permission Levels
Columns are classified into three sensitivity levels: P0 (most sensitive), P1, and P2, each with its own approval workflow.
Policy Types
Ranger supports two policy types:
Access: Grants direct read/write rights on columns.
Mask: Applies data masking for columns without explicit access, built on top of Access policies.
All policies configured in Ranger Admin are periodically pulled by Ranger plugins and enforced at runtime.
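As a rough illustration of the two policy types, the sketch below builds the JSON payloads that a deployment step might send to Ranger Admin. The field names (`policyType`, `policyItems`, `dataMaskPolicyItems`, `dataMaskInfo`) follow the Ranger policy model as commonly documented, but they should be verified against the deployed Ranger version. The service name `hive_prod`, the policy naming scheme, and the group names are assumptions made for the example; nothing is sent over the network.

```python
def _resources(db, table, column):
    # Hive resources are addressed by database, table, and column.
    return {
        "database": {"values": [db]},
        "table": {"values": [table]},
        "column": {"values": [column]},
    }


def build_access_policy(service, db, table, column, groups):
    """Access policy: grants SELECT on one column to the given groups."""
    return {
        "service": service,
        "name": f"access_{db}_{table}_{column}",  # hypothetical naming scheme
        "policyType": 0,  # 0 = access policy
        "resources": _resources(db, table, column),
        "policyItems": [{
            "groups": sorted(groups),
            "accesses": [{"type": "select", "isAllowed": True}],
        }],
    }


def build_mask_policy(service, db, table, column, groups, mask_type="MASK"):
    """Mask policy: the given groups read the column in masked form."""
    return {
        "service": service,
        "name": f"mask_{db}_{table}_{column}",  # hypothetical naming scheme
        "policyType": 1,  # 1 = data-masking policy
        "resources": _resources(db, table, column),
        "dataMaskPolicyItems": [{
            "groups": sorted(groups),
            "accesses": [{"type": "select", "isAllowed": True}],
            "dataMaskInfo": {"dataMaskType": mask_type},
        }],
    }


access = build_access_policy("hive_prod", "dw", "users", "phone", {"p0_approved"})
mask = build_mask_policy("hive_prod", "dw", "users", "phone", {"everyone_else"})
print(access["name"], mask["name"])
```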
Policy Configuration
When a Hive table is created, its permission policies must be generated. Two approaches are considered:
Synchronous Creation: Embed policy initialization in the table‑creation workflow. This works for tables created via the data‑modeling platform, which supplies metadata, but not for direct CLI creations.
Asynchronous Creation: Publish table‑creation events to a message queue (via the Metadata Center built on Apache Atlas). A listener consumes these events and creates the corresponding Ranger policies.
Modifications to column permission levels are only allowed through the Metadata Center, triggering policy updates via the same asynchronous mechanism.
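The asynchronous path can be sketched as a small consumer loop. This is a simplified model under stated assumptions: an in-process `queue.Queue` stands in for the real message queue, a dictionary stands in for Ranger, and the event types and fields (`TABLE_CREATED`, `COLUMN_LEVEL_CHANGED`, `columns`, `level`) are hypothetical, since the article does not describe the actual event schema.

```python
import queue


def handle_event(event, store):
    """Apply one metadata event to the policy store (a stand-in for Ranger)."""
    if event["type"] == "TABLE_CREATED":
        for col in event["columns"]:
            key = (event["db"], event["table"], col["name"])
            # setdefault keeps the handler idempotent if an event is redelivered.
            store.setdefault(key, {"level": col["level"]})
    elif event["type"] == "COLUMN_LEVEL_CHANGED":
        key = (event["db"], event["table"], event["column"])
        store[key] = {"level": event["level"]}


def drain(events, store):
    """Consume every queued event and return how many were handled."""
    handled = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return handled
        handle_event(event, store)
        handled += 1


events = queue.Queue()
events.put({
    "type": "TABLE_CREATED", "db": "dw", "table": "users",
    "columns": [{"name": "id", "level": "P2"}, {"name": "phone", "level": "P1"}],
})
events.put({
    "type": "COLUMN_LEVEL_CHANGED", "db": "dw", "table": "users",
    "column": "phone", "level": "P0",
})

policies = {}
print(drain(events, policies), policies)
```

Because table creation and level changes flow through the same handler, tables created directly from the CLI are covered as long as the metadata layer emits an event for them.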
Overall architecture:
The big‑data permission system consists of:
Workflow State Machine: Handles ticket status transitions.
EventListener: Listens to metadata change events.
Deployment: Pushes policies to Ranger and integrates with Metabase.
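A ticket state machine of the kind listed above might look like the following sketch. The states and actions are assumptions chosen for illustration; the article does not list the actual ticket statuses.

```python
from enum import Enum


class State(Enum):
    # Hypothetical ticket states.
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    DEPLOYED = "deployed"


# Every legal (state, action) pair; anything absent is an illegal transition.
TRANSITIONS = {
    (State.PENDING, "approve"): State.APPROVED,
    (State.PENDING, "reject"): State.REJECTED,
    (State.APPROVED, "deploy"): State.DEPLOYED,
}


def advance(state, action):
    """Return the next state, or raise ValueError for an illegal transition."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"illegal transition: {action!r} from {state.name}"
        ) from None


state = State.PENDING
for action in ("approve", "deploy"):
    state = advance(state, action)
print(state.name)
```

Keeping the transitions in one table makes it easy to audit which moves are allowed, and the deployment step only needs to react to tickets that reach the approved state.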
Permission Integration with Metabase
Metabase, the company’s primary BI tool, has its own cumbersome permission model. The new system provides:
Report Development Permission: Users can sync accounts, creating a data source with their credentials; the data source is then used for report development.
Report Viewing Permission: Permissions are granted to Metabase groups. Syncing an account also creates a corresponding group and adds the user’s Metabase account to it. Report‑view requests generate tickets that, once approved, grant the group access.
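The account-sync and ticket-approval flow can be modeled in memory as follows. This sketch does not call Metabase; the class, the `bigdata_<user>` group naming, and the report name are hypothetical and only illustrate how a per-user group ties an approved ticket to report access.

```python
from collections import defaultdict


class MetabaseModel:
    """In-memory stand-in for Metabase groups and report grants."""

    def __init__(self):
        self.group_members = defaultdict(set)  # group name -> user names
        self.group_reports = defaultdict(set)  # group name -> report names

    @staticmethod
    def group_for(user):
        # Hypothetical naming scheme for the group created at sync time.
        return f"bigdata_{user}"

    def sync_account(self, user):
        """Create the user's group (if needed) and add the user to it."""
        group = self.group_for(user)
        self.group_members[group].add(user)
        return group

    def approve_view_ticket(self, user, report):
        """Grant the user's group access once the ticket is approved."""
        group = self.group_for(user)
        if user not in self.group_members.get(group, set()):
            raise LookupError(f"{user} has not synced an account yet")
        self.group_reports[group].add(report)

    def can_view(self, user, report):
        return any(
            user in members and report in self.group_reports.get(group, set())
            for group, members in self.group_members.items()
        )


mb = MetabaseModel()
mb.sync_account("alice")
before = mb.can_view("alice", "weekly_revenue")
mb.approve_view_ticket("alice", "weekly_revenue")
after = mb.can_view("alice", "weekly_revenue")
print(before, after)
```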
Example of a Metabase report permission ticket is shown below:
Conclusion
This article outlined the core design of BanYu’s big‑data permission system. Permissions for Presto, Hive, HDFS, and Metabase are managed through a single platform, and a largely automated, ticket‑based workflow substantially reduces the cost of granting access.
Source: BanYu Technology Department