
How Bilibili Secured Hadoop: Ranger‑Based HDFS and Hive Access Control Deep Dive

This article details Bilibili's implementation of Apache Ranger for fine‑grained access control across Hadoop, HDFS, Hive, Spark, and Presto, covering architecture, API redesign, admin optimizations, gray‑release strategies, permission pre‑checks, data masking, and future plans for incremental policy loading.

ITPUB

Jul 23, 2022

Background

As cloud computing and big‑data technologies mature, large and varied datasets generate significant economic and social value, but keeping that data secure remains a major challenge. Bilibili is committed to protecting users' private data.

Ranger Overview

2.1 User Authentication

By default, Hadoop performs no real authentication: it trusts the user name reported by the client, so any user can impersonate another. Hadoop has supported Kerberos authentication since 1.x, and all of Bilibili's data platform clients are Kerberos‑enabled.

2.2 Ranger Introduction

Kerberos only controls cluster login, not fine‑grained permissions, so Ranger is introduced for authorization. Ranger provides a centralized framework to enforce fine‑grained access control on Hadoop components such as HDFS, YARN, Hive, and HBase via policies configured through its console or REST API.

Bilibili uses a Ranger 1.2.0‑based deployment with two write nodes and two read nodes behind a load balancer for read‑write separation. When a user requests Hive table or HDFS path permissions, the request goes through the Shielder service, which calls Ranger Admin's REST API on the write side to create or update policies, persists them to the DB, and then performs a pre‑check on the read side before completing the workflow.

Read‑side Ranger Admin periodically loads all policies from the DB into memory; plugins poll the admin and receive the latest policies.

HDFS Path Authorization

3.1 Authorization API Refactor

The native Ranger API requires a full RangerPolicy, which is cumbersome for tools that only need to grant HDFS paths or tables. The API was simplified to accept four parameters:

service : Ranger service name

type : path or table

resources : Specific HDFS path or Hive table

access : Permission type, read or write

If type is table, the Hive Metastore client retrieves the table location to decide whether to create or update a policy.
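As a rough sketch of what such a wrapper might do for the path case, the function below expands the simplified parameters into a policy-shaped dictionary. The function name, the extra `user` argument, the policy naming scheme, and the mapping from read/write to HDFS access types are assumptions made for illustration; the field layout follows the general shape of a Ranger HDFS policy, and the table case (Metastore lookup) is left out.

```python
def build_path_policy(service, resource, access, user):
    """Expand a simplified grant request into a Ranger-style HDFS policy dict.

    Hypothetical helper: the name, the `user` argument, and the naming
    scheme are illustrative, not Bilibili's actual API.
    """
    if access not in ("read", "write"):
        raise ValueError(f"unsupported access: {access}")
    # Assumption: directory traversal needs "execute" alongside read or write.
    access_types = [access, "execute"]
    return {
        "service": service,
        "name": f"{resource}:{access}",
        "resources": {"path": {"values": [resource], "isRecursive": True}},
        "policyItems": [{
            "users": [user],
            "accesses": [{"type": t, "isAllowed": True} for t in access_types],
        }],
    }


policy = build_path_policy("hdfs_main", "/warehouse/db1/t1", "read", "alice")
```

The caller supplies only a handful of scalar values, and everything else in the policy body is derived in one place.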

3.2 Ranger Admin Refactor

The original admin loaded policies serially from the DB into a List, using a ListIterator that caused unpredictable load times when policies changed during reading. The new design loads policies in parallel into a Map, eliminating the need for ordered iteration and reducing load time from ~25 s to ~21 s for ~180 k policies, ~70 k items, and ~2 M accesses.

select obj from XXPolicyItemAccess obj, XXPolicyItem item
where obj.policyItemId = item.id
  and item.policyId in (select policy.id from XXPolicy policy where policy.service = :serviceId)

With the order by clause removed, as in the query above, load time improves by about 4 seconds.
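The idea of loading in parallel and keying by ID can be sketched as follows. This is a simplified illustration, not Ranger Admin's code: the three in-memory lists stand in for database queries, and all names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Stand-in rows; a real loader would run one DB query per table.
POLICIES = [{"id": 1, "name": "p1"}, {"id": 2, "name": "p2"}]
ITEMS = [{"id": 10, "policy_id": 1}, {"id": 11, "policy_id": 2}]
ACCESSES = [
    {"item_id": 10, "type": "read"},
    {"item_id": 11, "type": "write"},
    {"item_id": 10, "type": "execute"},
]


def load_policies():
    # Fetch the three tables concurrently instead of one after another.
    with ThreadPoolExecutor(max_workers=3) as pool:
        f_pol = pool.submit(lambda: list(POLICIES))
        f_item = pool.submit(lambda: list(ITEMS))
        f_acc = pool.submit(lambda: list(ACCESSES))
        policies, items, accesses = f_pol.result(), f_item.result(), f_acc.result()

    # Group by ID, so rows may arrive in any order and no sort is required.
    accesses_by_item = defaultdict(list)
    for row in accesses:
        accesses_by_item[row["item_id"]].append(row["type"])

    by_id = {p["id"]: {"name": p["name"], "items": []} for p in policies}
    for item in items:
        by_id[item["policy_id"]]["items"].append(
            {"id": item["id"], "accesses": sorted(accesses_by_item[item["id"]])}
        )
    return by_id


loaded = load_policies()
```

Because child rows are attached through dictionary lookups, the result does not depend on the order in which the database returns them.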

3.3 Gray‑Release Deployment

A gray‑release mode was added to the HDFS plugin, supporting three modes: ALWAYS_ALLOW, ALWAYS_DENY, and BY_HADOOP_ACL. Policies are first checked; if none match, the mode determines the outcome. Strict groups can be configured without restarting the NameNode, and after all groups are verified, the mode switches to ALWAYS_DENY.
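A minimal sketch of that decision order is shown below. The function and parameter names are hypothetical, and treating a strict group as "denied whenever no policy matches" is an assumed reading of the description, not a confirmed detail of the plugin.

```python
from enum import Enum


class FallbackMode(Enum):
    ALWAYS_ALLOW = "ALWAYS_ALLOW"
    ALWAYS_DENY = "ALWAYS_DENY"
    BY_HADOOP_ACL = "BY_HADOOP_ACL"


def authorize(policy_result, mode, hadoop_acl_allows,
              user_groups=(), strict_groups=()):
    """Decide access; policy_result is True/False if a policy matched, else None."""
    if policy_result is not None:
        return policy_result          # a matching Ranger policy always wins
    if set(user_groups) & set(strict_groups):
        return False                  # assumption: strict groups are denied when unmatched
    if mode is FallbackMode.ALWAYS_ALLOW:
        return True
    if mode is FallbackMode.ALWAYS_DENY:
        return False
    return hadoop_acl_allows          # BY_HADOOP_ACL: defer to native HDFS permissions
```

Under this reading, groups can be moved to strict handling one at a time while everyone else keeps the permissive fallback.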

3.4 Permission Pre‑Check

Because policy propagation can be delayed, a pre‑check API was added to the admin side. It accepts a user, database/table, and access type, builds a RangerAccessRequest, evaluates matching policies, and returns the result. A background thread periodically refreshes the service policies cache at the same frequency as plugin polling.
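The pre-check can be pictured roughly like this. The classes below are simplified stand-ins, not Ranger's `RangerAccessRequest` or policy engine, and the exact-match rule (no wildcards, groups, or deny items) is an assumption to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class TablePolicy:
    database: str
    table: str
    users: set
    accesses: set


@dataclass
class PreChecker:
    policies: list = field(default_factory=list)

    def refresh(self, fetch_policies):
        # In the article this runs on a background thread at the plugin poll interval.
        self.policies = list(fetch_policies())

    def is_allowed(self, user, database, table, access):
        # Allowed if any cached policy covers this user, table, and access type.
        return any(
            p.database == database and p.table == table
            and user in p.users and access in p.accesses
            for p in self.policies
        )


checker = PreChecker()
checker.refresh(lambda: [TablePolicy("db1", "t1", {"alice"}, {"select"})])
granted = checker.is_allowed("alice", "db1", "t1", "select")
```

A workflow could call `is_allowed` after creating a policy and only report success once the cached copy reflects the grant.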

Hive Table Authorization

4.1 Pain Points of HDFS Authorization

Issues include table owners lacking permission to drop their own tables (requiring parent path write permission) and view paths sharing the same HDFS location, causing unintended permission revocation. HDFS authorization is path‑level only, lacking fine‑grained control and data masking, prompting the move to Hive‑level authorization.

4.2 Hive Authorization Interface

Hive policies are translated into corresponding HDFS policies so that granting Hive permissions automatically grants the necessary HDFS permissions.
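One way to picture the translation, assuming a fixed mapping from Hive access types to HDFS access types (the mapping, function name, and example location are illustrative, not taken from the article):

```python
from urllib.parse import urlparse

# Assumed mapping from Hive access types to HDFS access types.
HIVE_TO_HDFS_ACCESS = {
    "select": ["read", "execute"],
    "update": ["write", "execute"],
}


def hdfs_grant_for(table_location, hive_access):
    """Return (path, hdfs_accesses) implied by a Hive grant on a table."""
    path = urlparse(table_location).path  # drop the hdfs://nameservice prefix
    return path, HIVE_TO_HDFS_ACCESS[hive_access]


path, accesses = hdfs_grant_for("hdfs://ns1/warehouse/db1.db/t1", "select")
```

The table location would come from the Hive Metastore; here it is passed in directly so the sketch stays self-contained.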

4.3 Hive Metastore Remote Authorization and Data Masking

All Ranger plugins load policies from the admin, which is acceptable for long‑running services (HiveServer2, Spark Thrift Server, Kyuubi) but costly for short‑lived jobs (Hive CLI, Spark SQL). To reduce load overhead, authorization and masking logic were moved into a Hive Metastore plugin, exposing two interfaces:

struct CheckPrivilegesBag {
  1: HiveOperationObjectType hiveOperationObjectType; // operation type
  2: list<HiveObjectPrivileges> inputPrivileges;   // input tables
  3: list<HiveObjectPrivileges> outputPrivileges;  // output tables
  4: HiveAuthzContextObject hiveAuthzContext;
  5: string user;
}

Row‑filter and column‑masking functions receive a HiveAuthzContextObject, a list of HiveObjectPrivileges, and the user:

list<HiveObjectPrivileges> apply_row_filter_and_column_masking(
    1: HiveAuthzContextObject hiveAuthzContextObject,
    2: list<HiveObjectPrivileges> objectPrivileges,
    3: string user) throws (1: MetaException o1)
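Once the Metastore returns mask and filter expressions for a table, the engine has to apply them to the scan. The sketch below shows the idea as a string rewrite; the function name and the template convention are hypothetical, while `mask_hash` is one of Hive's built-in masking UDFs.

```python
def rewrite_scan(table, columns, masks=None, row_filter=None):
    """Build a SELECT that applies column masks and a row filter to a table scan."""
    masks = masks or {}
    projected = [
        f"{masks[c].format(col=c)} AS {c}" if c in masks else c
        for c in columns
    ]
    sql = f"SELECT {', '.join(projected)} FROM {table}"
    if row_filter:
        sql += f" WHERE {row_filter}"
    return sql


sql = rewrite_scan(
    "db1.users",
    ["id", "phone"],
    masks={"phone": "mask_hash({col})"},
    row_filter="region = 'cn'",
)
```

A real engine would rewrite the logical plan rather than SQL text, but the effect on the query is the same: masked columns are replaced by expressions and the filter is added as a predicate.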

4.4 Spark Ranger

Spark Ranger enforces authorization and masking by injecting rules and strategies into the query plan. To align with Hive Metastore, Spark operation and privilege objects are converted to Hive equivalents. Successful authorizations are cached to avoid repeated checks, improving performance for queries with many tables.
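The caching behavior can be sketched as below. The class and its `remote_check` callback are hypothetical stand-ins for the call to the Hive Metastore; only successful checks are cached, matching the description above.

```python
class CachingAuthorizer:
    def __init__(self, remote_check):
        self._remote_check = remote_check  # stand-in for the Metastore call
        self._allowed = set()
        self.remote_calls = 0

    def check(self, user, table, access):
        key = (user, table, access)
        if key in self._allowed:
            return True                    # served from cache, no remote call
        self.remote_calls += 1
        ok = self._remote_check(user, table, access)
        if ok:
            self._allowed.add(key)         # cache successes only; denials are rechecked
        return ok


authorizer = CachingAuthorizer(lambda user, table, access: table != "db1.secret")
first = authorizer.check("alice", "db1.t1", "select")
second = authorizer.check("alice", "db1.t1", "select")
```

Leaving denials uncached means a newly granted permission takes effect on the next check; how cached successes are expired is not covered by the article and is omitted here.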

4.5 Presto Ranger

Presto’s native Ranger plugin periodically pulls Presto‑specific policies. Bilibili modified it to pull Hive policies instead, reducing operational overhead. The adapted plugin now supports table/column permissions, column masking, and row filtering.

Future Plans

5.1 Incremental Load Policy

Even after optimizations, full policy loading takes >25 s and will grow with more policies. Ranger 2.0 introduces incremental policy loading; Bilibili plans to adopt it to bring load time down to under 1 s.
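The general idea behind incremental loading is to apply only the changes made since the last known version instead of rebuilding the whole policy set. The sketch below illustrates that idea with a hypothetical delta format; it is not Ranger's actual delta structure or API.

```python
def apply_deltas(policies, current_version, deltas):
    """Apply change records newer than current_version to a policy map, in place."""
    for d in sorted(deltas, key=lambda d: d["version"]):
        if d["version"] <= current_version:
            continue                      # already applied
        if d["op"] == "delete":
            policies.pop(d["id"], None)
        else:                             # "create" or "update"
            policies[d["id"]] = d["policy"]
        current_version = d["version"]
    return current_version


policies = {1: {"name": "p1"}}
version = apply_deltas(policies, 5, [
    {"version": 7, "op": "delete", "id": 1},
    {"version": 6, "op": "create", "id": 2, "policy": {"name": "p2"}},
    {"version": 4, "op": "delete", "id": 2},
])
```

The cost of a refresh then scales with the number of changes rather than with the total number of policies.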

5.2 Merging HDFS and Hive Policies

Currently, Hive and HDFS maintain separate policies, causing duplicate effort and issues like table owners being unable to drop tables. The next step is to merge Hive policies into the HDFS plugin, storing table location data and checking Hive permissions first before falling back to HDFS policies.
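A possible shape for the merged check is sketched below, assuming the plugin holds a map from table location to table name. The function, its parameters, and the two callbacks are hypothetical; they stand in for the Hive and HDFS policy evaluations.

```python
def authorize_path(user, path, access, table_locations, hive_allows, hdfs_allows):
    """Check the Hive policy of the table owning `path`, then fall back to HDFS."""
    # Try the longest location first so nested locations resolve to the closest table.
    for location, table in sorted(table_locations.items(), key=lambda kv: -len(kv[0])):
        if path == location or path.startswith(location.rstrip("/") + "/"):
            if hive_allows(user, table, access):
                return True
            break                         # table found but not allowed by Hive
    return hdfs_allows(user, path, access)


locations = {"/warehouse/db1.db/t1": "db1.t1"}
hive = lambda user, table, access: (user, table, access) == ("alice", "db1.t1", "read")
hdfs = lambda user, path, access: path.startswith("/tmp/")
```

With this ordering, a Hive grant alone is enough to reach the table's files, and paths that belong to no table are still governed by HDFS policies.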

5.3 Moving HDFS Authorization to NNProxy

With over 20 NameNode groups each contacting Ranger Admin, load on the admin is high. Bilibili intends to shift HDFS authorization logic to an NNProxy layer, reducing NameNode processing time and admin pressure.

References

https://blog.cloudera.com/an-introduction-to-ranger-rms/

https://issues.apache.org/jira/browse/RANGER-2341

https://mp.weixin.qq.com/s?__biz=MzIxMTE0ODU5NQ==&mid=2650247544&idx=1&sn=192ae24e3114502180a3b861e5f12a5c

https://mp.weixin.qq.com/s?__biz=MzAxOTY5MDMxNA==&mid=2455762102&idx=1&sn=37281abfcecd4f247fb291bb8c3de8
