Design and Implementation of BanYu's Big Data Access Control System
This article describes BanYu's evolution from an unsecured data warehouse to a comprehensive big-data access control system. It covers the background, data access methods, design goals, authentication and authorization mechanisms, policy configuration, Metabase integration, and the overall workflow that balances security with efficiency.
In BanYu's early days, the data warehouse operated without permission checks or auditing, giving users unrestricted access. As the business expanded, data security became critical, prompting the creation of a big-data permission system.
Background
The system needed to manage both user authentication and authorization across Hive, Presto, Hadoop, and Metabase.
Data Access Methods
Hive: CLI (deprecated) and Beeline (HiveServer2 JDBC client).
Presto: JDBC API.
Hadoop: HDFS CLI.
Flink: HDFS Client API.
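To make the Beeline access path concrete, the following Python sketch assembles the command a user would run against HiveServer2; the host, port, and credentials are placeholders, not BanYu's actual values.

```python
def build_beeline_command(host: str, port: int, database: str,
                          user: str, password: str) -> list:
    """Assemble the argv for a Beeline session against HiveServer2.

    With hive.server2.authentication=LDAP, the -n/-p credentials are
    the user's LDAP account rather than an OS account.
    """
    jdbc_url = "jdbc:hive2://{}:{}/{}".format(host, port, database)
    return ["beeline", "-u", jdbc_url, "-n", user, "-p", password]

# Hypothetical invocation; hostname and credentials are placeholders.
cmd = build_beeline_command("hs2.example.com", 10000, "default",
                            "alice", "secret")
print(" ".join(cmd))
```

Building the command as a list (rather than one string) avoids shell-quoting issues if it is later passed to a process runner.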
Data Access Channels
Client tools installed on cluster nodes.
Metabase (BI platform) with Hive and Presto data sources.
Offline development platform (DolphinScheduler) using Hive data sources and CLI scripts.
Real‑time development platform integrated with Hive.
Design Goals
Restrict permissions for users outside the data-warehouse team, especially Metabase users.
Gradually replace Hive CLI with Beeline.
Unify permission operations in a single platform.
Research on Authentication & Authorization
User Authentication
Supported authentication methods:
HiveServer2: Kerberos, SASL, NOSASL, LDAP, PAM, Custom.
Presto: Kerberos, LDAP, Password File.
Hadoop components: Kerberos only.
To avoid the high operational cost of Kerberos, LDAP was chosen as the unified authentication method.
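On the Presto side, LDAP password authentication is enabled on the coordinator. A minimal sketch of etc/password-authenticator.properties, with a placeholder LDAP server (the coordinator's config.properties must also set http-server.authentication.type=PASSWORD):

```properties
password-authenticator.name=ldap
ldap.url=ldaps://ldap.example.com:636
ldap.user-bind-pattern=cn=${USER},ou=bigdata_user,ou=People,dc=ipalfish,dc=com
```

The bind pattern mirrors the userDNPattern used for HiveServer2, so one LDAP directory serves both engines.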
Authorization
HiveServer2 supports SQL‑standard fine‑grained (column‑level) access via plugins such as Apache Ranger, which was selected for this project. Presto and Hadoop also use Ranger plugins for authorization.
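To make the column-level model concrete, here is a minimal, self-contained Python sketch of the kind of check a Ranger plugin performs. The policy structure and names are illustrative, not Ranger's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class AccessPolicy:
    """Illustrative stand-in for a Ranger access policy."""
    table: str
    columns: set   # columns this policy covers
    users: set     # users granted access
    actions: set = field(default_factory=lambda: {"select"})

def is_allowed(policies, user, table, columns, action="select"):
    """Every requested column must be covered by some matching policy."""
    for col in columns:
        if not any(p.table == table and user in p.users
                   and col in p.columns and action in p.actions
                   for p in policies):
            return False
    return True

policies = [AccessPolicy("orders", {"id", "amount"}, {"alice"})]
print(is_allowed(policies, "alice", "orders", {"id"}))     # True
print(is_allowed(policies, "alice", "orders", {"phone"}))  # False
```

The deny-by-default loop captures the key property: access to a table does not imply access to every column of it.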
System Design
The data flow diagram (omitted) shows that HiveServer2 and Presto Coordinator perform authentication and authorization, while HDFS NameNode only handles authorization. Ranger plugins rely on Hadoop Group Mapping, which in turn uses LDAP.
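The Hadoop Group Mapping mentioned above can be pointed at LDAP via core-site.xml; the values below are placeholders, shown only to illustrate the relevant keys:

```xml
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://ldap.example.com:389</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.base</name>
  <value>dc=ipalfish,dc=com</value>
</property>
```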
User Authentication Configuration Example
<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>ou=People,dc=ipalfish,dc=com</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldap://*****:389</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.userDNPattern</name>
  <value>cn=%s,ou=bigdata_user,ou=People,dc=ipalfish,dc=com</value>
</property>
Authorization Details
Ranger defines two policy types:
Access: Grants column‑level read/write rights.
Mask: Applies data masking for columns without explicit access, built on top of Access policies.
Policies are generated automatically when a table is created; the owner receives full access, and column‑level sensitivity levels (P0, P1, P2) determine approval workflow.
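A minimal sketch of that auto-generation step, assuming a hypothetical mapping from sensitivity level to approver (the exact workflow rules are illustrative, not BanYu's actual ones):

```python
# Hypothetical sensitivity-to-approver mapping; P0 is the strictest.
APPROVAL = {"P0": "director", "P1": "team_lead", "P2": "auto"}

def init_policies(table, owner, column_levels):
    """On table creation, grant the owner full access and record the
    approval path each column would need for other requesters."""
    owner_policy = {"table": table, "user": owner,
                    "columns": set(column_levels),
                    "actions": {"select", "update"}}
    approvals = {col: APPROVAL[lvl] for col, lvl in column_levels.items()}
    return owner_policy, approvals

policy, approvals = init_policies(
    "users", "bob", {"id": "P2", "email": "P1", "id_card": "P0"})
print(approvals["id_card"])  # director
```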
Policy Generation & Initialization
Two approaches were considered:
Synchronous creation: Embed policy initialization in the table-creation process (e.g., via a data-modeling platform).
Asynchronous creation: Publish table-creation events to a message queue via the metadata center (built on Apache Atlas) and let a background listener create policies.
The asynchronous method was chosen to handle multiple creation channels (modeling platform, CLI, etc.).
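The asynchronous path can be sketched with Python's stdlib queue standing in for the real message queue; the event fields are simplified placeholders for what the metadata center would publish.

```python
import queue

def policy_listener(events, created):
    """Drain table-creation events and create a default owner policy
    for each, regardless of which channel created the table."""
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            break
        created.append({"table": event["table"],
                        "owner": event["owner"],
                        "access": "full"})

events = queue.Queue()
# Events arrive from the modeling platform and CLI hooks alike.
events.put({"table": "orders", "owner": "alice", "channel": "modeling"})
events.put({"table": "logs", "owner": "bob", "channel": "cli"})

created = []
policy_listener(events, created)
print(len(created))  # 2
```

Because the listener never inspects the channel, every creation path gets the same policy initialization, which is the motivation given for choosing this design.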
Permission Integration with Metabase
Metabase report development requires a data source, so the system synchronizes user accounts to Metabase and creates corresponding groups. Report view permissions are granted to groups rather than individuals, and group membership changes go through an approval workflow.
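The group-based grant model can be sketched with a few dictionaries; group and report names here are hypothetical.

```python
def grant_report(group_reports, group, report):
    """Grant a report's view permission to a whole group."""
    group_reports.setdefault(group, set()).add(report)

def can_view(user_groups, group_reports, user, report):
    """A user can view a report iff one of their groups was granted it."""
    return any(report in group_reports.get(g, set())
               for g in user_groups.get(user, ()))

user_groups = {"alice": {"analytics"}}  # populated by the account sync
group_reports = {}
grant_report(group_reports, "analytics", "daily_revenue")
print(can_view(user_groups, group_reports, "alice", "daily_revenue"))  # True
```

Granting at the group level keeps the approval workflow about membership, not about individual report permissions.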
Summary
The BanYu big‑data permission system now covers Hive, Presto, HDFS, and Metabase, providing a workflow‑driven, automated authorization process that balances security with operational efficiency.