Design and Implementation of Banyu's Big Data Permission System

This article describes the background, design goals, authentication and authorization mechanisms, system architecture, policy configuration, and Metabase integration of Banyu's big data permission system, which secures Hive, Presto, HDFS and other data access components using Apache Ranger and LDAP.

IT Architects Alliance

In Banyu's early days, the data warehouse ran without permission checks or auditing so that teams could move quickly. As the business grew, data security became critical, which led to the development of a comprehensive big data permission system.

Background : Data in the offline warehouse can be accessed through Hive (CLI and Beeline), Presto, the Hadoop HDFS CLI, and the Flink HDFS client. Access channels include client tools installed on the cluster, Metabase (the BI platform, which has its own authorization), DolphinScheduler for offline development, and the real‑time development platforms.

Design Goals : Balance control and efficiency by tightening the permissions of non‑warehouse teams, gradually replacing Hive CLI with Beeline (Hive CLI bypasses HiveServer2, so checks enforced there do not apply to it), and unifying permission operations in a single platform.

Research :

Authentication: HiveServer2 supports Kerberos, SASL, LDAP, and other mechanisms; Presto supports Kerberos, LDAP, and password files; Hadoop components support only Kerberos. Because rolling out Kerberos carries a high operational cost, LDAP was chosen for user authentication on the components that support it.

Authorization: HiveServer2 uses SQL‑standard fine‑grained (column‑level) policies, implemented via plugins such as Apache Ranger (chosen over Apache Sentry). Presto and Hadoop also adopt Ranger for authorization.

System Design : User authentication and authorization are both applied on HiveServer2 and the Presto Coordinator, while the HDFS NameNode is subject to authorization only. The Ranger plugins resolve a user's groups through Hadoop Group Mapping, which in turn looks them up in LDAP.

User Authentication Configuration (HiveServer2 example) :

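<!-- Authenticate HiveServer2 clients against LDAP -->
<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>
<!-- Base DN under which user entries are located -->
<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>ou=People,dc=ipalfish,dc=com</value>
</property>
<!-- LDAP server URL (host masked in the source) -->
<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldap://*****:389</value>
</property>
<!-- Pattern for the user's DN; %s is replaced with the login name -->
<property>
  <name>hive.server2.authentication.ldap.userDNPattern</name>
  <value>cn=%s,ou=bigdata_user,ou=People,dc=ipalfish,dc=com</value>
</property>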
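
To make the role of userDNPattern concrete, the following Python sketch shows how a login name could be expanded into candidate bind DNs. It is an illustration only, not HiveServer2's actual code: the function name candidate_bind_dns and the user "alice" are hypothetical, and the sketch assumes the documented behavior that the property holds a colon-separated list of patterns in which %s stands for the username. It does not escape special characters in the username, which a real implementation would need to handle.

```python
def candidate_bind_dns(user_dn_pattern: str, username: str) -> list[str]:
    """Expand each colon-separated DN pattern by substituting %s with the username.

    Hypothetical helper for illustration; no DN escaping is performed.
    """
    patterns = [p.strip() for p in user_dn_pattern.split(":") if p.strip()]
    return [p.replace("%s", username) for p in patterns]


# Pattern taken from the configuration above; "alice" is a made-up login name.
PATTERN = "cn=%s,ou=bigdata_user,ou=People,dc=ipalfish,dc=com"

if __name__ == "__main__":
    for dn in candidate_bind_dns(PATTERN, "alice"):
        print(dn)
```

With a single pattern, as in the configuration above, exactly one DN is produced, and the LDAP bind with the supplied password decides whether the login succeeds.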

Authorization Concepts : Users and groups are the basic objects. Granting a permission to a group rather than to each of its N members turns N grants into one (O(N) → O(1)). Ranger stores two kinds of policies: Access policies, which grant access down to the column level, and Mask policies, which mask the data in columns a user is not fully authorized to see. An Access policy is required before any column can be read, while a Mask policy allows partial visibility by masking sensitive fields.
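
The interaction between groups, Access policies, and Mask policies can be sketched with a small toy model in Python. This is a simplification for illustration and not Ranger's evaluation engine: the column name, group names, users, and the "keep the last four characters" masking rule are all hypothetical.

```python
# Toy model only; every name and the masking rule below are hypothetical.
ACCESS = {"dw.users.phone": {"analysts", "warehouse"}}  # column -> groups allowed to read
MASK = {"dw.users.phone": {"analysts"}}                 # column -> groups that see masked data
GROUPS_OF = {
    "alice": {"analysts"},
    "bob": {"warehouse"},
    "carol": {"marketing"},
}


def mask_value(value: str, keep: int = 4) -> str:
    """Replace everything except the last `keep` characters with '*'."""
    if len(value) <= keep:
        return "*" * len(value)
    return "*" * (len(value) - keep) + value[-keep:]


def read_column(user: str, column: str, value: str) -> str:
    """Return the value as the user would see it, or raise if access is denied."""
    groups = GROUPS_OF.get(user, set())
    if not groups & ACCESS.get(column, set()):
        raise PermissionError(f"{user} has no access policy for {column}")
    if groups & MASK.get(column, set()):
        return mask_value(value)
    return value


if __name__ == "__main__":
    print(read_column("alice", "dw.users.phone", "13800001234"))  # masked view
    print(read_column("bob", "dw.users.phone", "13800001234"))    # full view
```

In this model, onboarding a new analyst is a single change to group membership (for example, adding an entry to GROUPS_OF); no policy has to be touched, which is the O(N) → O(1) effect described above.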

Policy Configuration : Access policies are defined per column, and Mask policies supplement them to hide data when a user lacks permission. The original article illustrates this with examples of permission request tickets and the corresponding Ranger Access and Mask policies.
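
As a rough illustration of what such policies look like, the Python sketch below builds the JSON bodies for one Access policy and one Mask policy. The field names follow the general shape of Ranger's policy JSON (policyType 0 for access, 1 for data masking) and should be verified against the deployed Ranger version. The service name hive_prod, the database, table, columns, groups, and policy names are hypothetical and do not come from Banyu's setup, and the helper functions are illustrative rather than part of any Ranger client library.

```python
import json


def access_policy(service, database, table, columns, groups, name):
    """Build an Access policy body granting SELECT on the given columns to groups."""
    return {
        "service": service,
        "name": name,
        "policyType": 0,  # 0 = access policy
        "isEnabled": True,
        "resources": {
            "database": {"values": [database]},
            "table": {"values": [table]},
            "column": {"values": list(columns)},
        },
        "policyItems": [
            {
                "groups": list(groups),
                "accesses": [{"type": "select", "isAllowed": True}],
            }
        ],
    }


def mask_policy(service, database, table, column, groups, name, mask_type="MASK"):
    """Build a Mask policy body for one column; mask_type names a Ranger mask type."""
    return {
        "service": service,
        "name": name,
        "policyType": 1,  # 1 = data-masking policy
        "isEnabled": True,
        "resources": {
            "database": {"values": [database]},
            "table": {"values": [table]},
            "column": {"values": [column]},
        },
        "dataMaskPolicyItems": [
            {
                "groups": list(groups),
                "accesses": [{"type": "select", "isAllowed": True}],
                "dataMaskInfo": {"dataMaskType": mask_type},
            }
        ],
    }


if __name__ == "__main__":
    # All names below are made up for the example.
    acc = access_policy("hive_prod", "dw", "users", ["user_id", "phone"],
                        ["analysts"], "dw.users-analysts-select")
    msk = mask_policy("hive_prod", "dw", "users", "phone",
                      ["analysts"], "dw.users.phone-analysts-mask",
                      mask_type="MASK_SHOW_LAST_4")
    print(json.dumps(acc, indent=2))
    print(json.dumps(msk, indent=2))
```

The sketch only constructs the payloads; sending them to Ranger (for example through its REST API) is left out because the endpoint, credentials, and error handling depend on the deployment.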

Permission Integration with Metabase : Metabase’s native permission model is cumbersome; the system integrates Metabase by synchronizing accounts, creating data sources tied to user credentials for report development, and managing group‑based report view permissions through automated workflow tickets.

Workflow and Components : The permission system consists of a workflow state machine (handling ticket status), an event listener (capturing metadata changes via Apache Atlas), and a deployment module (pushing policies to Ranger). Both synchronous (post‑table‑creation) and asynchronous (message‑queue‑driven) policy creation strategies are discussed.
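
A workflow state machine of this kind can be sketched in Python as follows. The states, event names, and the process helper are assumptions made for illustration; the source does not list the actual ticket states or how the deployment module is invoked. Here the deployment step is passed in as a callable that stands in for pushing policies to Ranger.

```python
from enum import Enum


class TicketState(Enum):
    # Hypothetical states; the real system may use different ones.
    SUBMITTED = "submitted"
    APPROVED = "approved"
    REJECTED = "rejected"
    DEPLOYED = "deployed"
    FAILED = "failed"


# (current state, event) -> next state
TRANSITIONS = {
    (TicketState.SUBMITTED, "approve"): TicketState.APPROVED,
    (TicketState.SUBMITTED, "reject"): TicketState.REJECTED,
    (TicketState.APPROVED, "deploy_ok"): TicketState.DEPLOYED,
    (TicketState.APPROVED, "deploy_error"): TicketState.FAILED,
    (TicketState.FAILED, "retry"): TicketState.APPROVED,
}


class Ticket:
    def __init__(self, ticket_id: str):
        self.ticket_id = ticket_id
        self.state = TicketState.SUBMITTED

    def fire(self, event: str) -> TicketState:
        """Apply an event, rejecting transitions the table does not allow."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} not allowed in state {self.state.name}")
        self.state = TRANSITIONS[key]
        return self.state


def process(ticket: Ticket, approved: bool, deploy) -> TicketState:
    """Approve or reject a ticket; on approval, run the deployment callable."""
    ticket.fire("approve" if approved else "reject")
    if ticket.state is TicketState.APPROVED:
        try:
            deploy(ticket)  # stand-in for pushing policies to Ranger
            ticket.fire("deploy_ok")
        except Exception:
            ticket.fire("deploy_error")
    return ticket.state


if __name__ == "__main__":
    print(process(Ticket("T-1"), approved=True, deploy=lambda t: None).name)
```

Keeping the allowed transitions in a single table makes it easy to reject invalid status changes and to retry a failed deployment without re-approving the ticket.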

Summary : The implemented big data permission system consolidates access control for Presto, Hive, HDFS, and Metabase, introduces a ticket‑based request flow that greatly reduces manual effort, and achieves a high degree of automation while maintaining fine‑grained security.
