Design and Implementation of BanYu's Big Data Access Control System
This article describes BanYu's evolution from an unsecured data warehouse to a comprehensive big-data access control system. It covers the background, data access methods, design goals, authentication and authorization mechanisms, policy configuration, Metabase integration, and the overall workflow that balances security with efficiency.
In BanYu's early days, the data warehouse operated without permission checks or auditing, giving users unrestricted access. As the business expanded, data security became critical, prompting the creation of a big-data permission system.
Background
The system needed to manage both user authentication and authorization across Hive, Presto, Hadoop, and Metabase.
Data Access Methods
Hive: CLI (deprecated) and Beeline (HiveServer2 JDBC client).
Presto: JDBC API.
Hadoop: HDFS CLI.
Flink: HDFS Client API.
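To make the Beeline access path concrete, the following Python sketch assembles the command a user would run against HiveServer2; the host, port, and credentials are placeholders, not BanYu's actual values.

```python
def build_beeline_command(host: str, port: int, database: str,
                          user: str, password: str) -> list:
    """Assemble the argv for a Beeline session against HiveServer2.

    With hive.server2.authentication=LDAP, the -n/-p credentials are
    the user's LDAP account rather than an OS account.
    """
    jdbc_url = "jdbc:hive2://{}:{}/{}".format(host, port, database)
    return ["beeline", "-u", jdbc_url, "-n", user, "-p", password]

# Hypothetical invocation; hostname and credentials are placeholders.
cmd = build_beeline_command("hs2.example.com", 10000, "default",
                            "alice", "secret")
print(" ".join(cmd))
```

Building the command as a list (rather than one string) avoids shell-quoting issues if it is later passed to a process runner.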
Data Access Channels
Client tools installed on cluster nodes.
Metabase (BI platform) with Hive and Presto data sources.
Offline development platform (DolphinScheduler) using Hive data sources and CLI scripts.
Real‑time development platform integrated with Hive.
Design Goals
Restrict permissions for users outside the data-warehouse team, especially Metabase users.
Gradually replace Hive CLI with Beeline.
Unify permission operations in a single platform.
Research on Authentication & Authorization
User Authentication
Supported authentication methods:
HiveServer2: Kerberos, SASL, NOSASL, LDAP, PAM, Custom.
Presto: Kerberos, LDAP, Password File.
Hadoop components: Kerberos only.
To avoid the high operational cost of Kerberos, LDAP was chosen as the unified authentication method.
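On the Presto side, LDAP password authentication is enabled on the coordinator. A minimal sketch of etc/password-authenticator.properties, with a placeholder LDAP server (the coordinator's config.properties must also set http-server.authentication.type=PASSWORD):

```properties
password-authenticator.name=ldap
ldap.url=ldaps://ldap.example.com:636
ldap.user-bind-pattern=cn=${USER},ou=bigdata_user,ou=People,dc=ipalfish,dc=com
```

The bind pattern mirrors the userDNPattern used for HiveServer2, so one LDAP directory serves both engines.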
Authorization
HiveServer2 supports SQL‑standard fine‑grained (column‑level) access via plugins such as Apache Ranger, which was selected for this project. Presto and Hadoop also use Ranger plugins for authorization.
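To make the column-level model concrete, here is a minimal, self-contained Python sketch of the kind of check a Ranger plugin performs. The policy structure and names are illustrative, not Ranger's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class AccessPolicy:
    """Illustrative stand-in for a Ranger access policy."""
    table: str
    columns: set   # columns this policy covers
    users: set     # users granted access
    actions: set = field(default_factory=lambda: {"select"})

def is_allowed(policies, user, table, columns, action="select"):
    """Every requested column must be covered by some matching policy."""
    for col in columns:
        if not any(p.table == table and user in p.users
                   and col in p.columns and action in p.actions
                   for p in policies):
            return False
    return True

policies = [AccessPolicy("orders", {"id", "amount"}, {"alice"})]
print(is_allowed(policies, "alice", "orders", {"id"}))     # True
print(is_allowed(policies, "alice", "orders", {"phone"}))  # False
```

The deny-by-default loop captures the key property: access to a table does not imply access to every column of it.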
System Design
The data flow diagram (omitted) shows that HiveServer2 and Presto Coordinator perform authentication and authorization, while HDFS NameNode only handles authorization. Ranger plugins rely on Hadoop Group Mapping, which in turn uses LDAP.
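The Hadoop Group Mapping mentioned above can be pointed at LDAP via core-site.xml; the values below are placeholders, shown only to illustrate the relevant keys:

```xml
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://ldap.example.com:389</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.base</name>
  <value>dc=ipalfish,dc=com</value>
</property>
```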
User Authentication Configuration Example
<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>ou=People,dc=ipalfish,dc=com</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldap://*****:389</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.userDNPattern</name>
  <value>cn=%s,ou=bigdata_user,ou=People,dc=ipalfish,dc=com</value>
</property>
Authorization Details
Ranger defines two policy types:
Access: Grants column‑level read/write rights.
Mask: Applies data masking for columns without explicit access, built on top of Access policies.
Policies are generated automatically when a table is created; the owner receives full access, and column‑level sensitivity levels (P0, P1, P2) determine approval workflow.
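A minimal sketch of that auto-generation step, assuming a hypothetical mapping from sensitivity level to approver (the exact workflow rules are illustrative, not BanYu's actual ones):

```python
# Hypothetical sensitivity-to-approver mapping; P0 is the strictest.
APPROVAL = {"P0": "director", "P1": "team_lead", "P2": "auto"}

def init_policies(table, owner, column_levels):
    """On table creation, grant the owner full access and record the
    approval path each column would need for other requesters."""
    owner_policy = {"table": table, "user": owner,
                    "columns": set(column_levels),
                    "actions": {"select", "update"}}
    approvals = {col: APPROVAL[lvl] for col, lvl in column_levels.items()}
    return owner_policy, approvals

policy, approvals = init_policies(
    "users", "bob", {"id": "P2", "email": "P1", "id_card": "P0"})
print(approvals["id_card"])  # director
```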
Policy Generation & Initialization
Two approaches were considered:
Synchronous creation: Embed policy initialization in the table-creation process (e.g., via a data-modeling platform).
Asynchronous creation: Publish table-creation events to a message queue via the metadata center (built on Apache Atlas) and let a background listener create policies.
The asynchronous method was chosen to handle multiple creation channels (modeling platform, CLI, etc.).
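The asynchronous path can be sketched with Python's stdlib queue standing in for the real message queue; the event fields are simplified placeholders for what the metadata center would publish.

```python
import queue

def policy_listener(events, created):
    """Drain table-creation events and create a default owner policy
    for each, regardless of which channel created the table."""
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            break
        created.append({"table": event["table"],
                        "owner": event["owner"],
                        "access": "full"})

events = queue.Queue()
# Events arrive from the modeling platform and CLI hooks alike.
events.put({"table": "orders", "owner": "alice", "channel": "modeling"})
events.put({"table": "logs", "owner": "bob", "channel": "cli"})

created = []
policy_listener(events, created)
print(len(created))  # 2
```

Because the listener never inspects the channel, every creation path gets the same policy initialization, which is the motivation given for choosing this design.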
Permission Integration with Metabase
Metabase report development requires a data source, so the system synchronizes user accounts to Metabase and creates corresponding groups. Report view permissions are granted to groups rather than individuals, and group membership changes go through an approval workflow.
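The group-based grant model can be sketched with a few dictionaries; group and report names here are hypothetical.

```python
def grant_report(group_reports, group, report):
    """Grant a report's view permission to a whole group."""
    group_reports.setdefault(group, set()).add(report)

def can_view(user_groups, group_reports, user, report):
    """A user can view a report iff one of their groups was granted it."""
    return any(report in group_reports.get(g, set())
               for g in user_groups.get(user, ()))

user_groups = {"alice": {"analytics"}}  # populated by the account sync
group_reports = {}
grant_report(group_reports, "analytics", "daily_revenue")
print(can_view(user_groups, group_reports, "alice", "daily_revenue"))  # True
```

Granting at the group level keeps the approval workflow about membership, not about individual report permissions.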
Summary
The BanYu big‑data permission system now covers Hive, Presto, HDFS, and Metabase, providing a workflow‑driven, automated authorization process that balances security with operational efficiency.