
Didi's Big Data Security Permission System: User Authentication and Column-Level Authorization

Didi’s big‑data platform secures user access with a custom password‑based Hadoop authentication mechanism, with passwords stored on the NameNode and managed through its Shumeng user‑management service, while enforcing column‑level, role‑based permissions via an Apache Ranger‑powered system that classifies data, generates policies, and now governs millions of assets.


Didi treats data as a critical asset and has built data warehouse, data analysis, data mining, and data science capabilities that support rapid business growth. Balancing easy data access with strong security is a major challenge for its big data platform.

The article first describes Didi’s self‑developed account/password authentication mechanism. Since vanilla Hadoop lacks authentication, anyone could set export HADOOP_USER_NAME=Anyone to impersonate any user. To close this hole, Didi added a password field to Hadoop’s IpcConnectionContext.proto and validates passwords in the NameNode. Clients (Hadoop, Spark, Hive) set the password via export HADOOP_USER_PASSWORD=123456 or System.setProperty("HADOOP_USER_PASSWORD","123456"). For Beeline/JDBC access through HiveServer2, the password is supplied as beeline -u jdbc:hive2://xxxxxx -n test -p 123456.

Passwords are stored in a local file on the NameNode and periodically refreshed into memory.
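The server-side flow described above can be sketched as follows. This is an illustrative Python sketch, not Didi's actual NameNode code: the class name, file format (user:hash per line), and hashing scheme are all assumptions made for the example.

```python
import hashlib
import secrets

# Illustrative sketch (assumed design, not Didi's code): the NameNode keeps
# an in-memory map of user -> password hash, periodically refreshed from a
# local file, and checks the password carried in the RPC connection context
# before accepting the caller's claimed identity.
class PasswordAuthenticator:
    def __init__(self):
        self._passwords = {}  # user -> salted hash, refreshed from disk

    def refresh(self, lines):
        """Reload user:hash entries (e.g. read periodically from the local file)."""
        self._passwords = dict(line.split(":", 1) for line in lines)

    @staticmethod
    def _hash(password, salt="namenode-salt"):  # hypothetical hashing scheme
        return hashlib.sha256((salt + password).encode()).hexdigest()

    def authenticate(self, user, password):
        """Reject the connection unless the password matches the stored hash."""
        stored = self._passwords.get(user)
        return stored is not None and secrets.compare_digest(
            stored, self._hash(password)
        )

# A client that only sets HADOOP_USER_NAME, without the matching
# HADOOP_USER_PASSWORD, is now rejected:
auth = PasswordAuthenticator()
auth.refresh([f"test:{PasswordAuthenticator._hash('123456')}"])
print(auth.authenticate("test", "123456"))  # True
print(auth.authenticate("Anyone", "guess"))  # False
```

The key point the sketch captures is that identity assertion alone (the old HADOOP_USER_NAME behavior) no longer suffices; the server holds the credential store and makes the accept/reject decision.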

User management is handled by Didi’s “Shumeng” platform, which maintains multi‑tenant user data assets, including password information, and provides password generation, maintenance, and synchronization to the NameNode and Hive server.

Moving to authorization, Didi evolved in 2018 from a basic table‑level model (Hive SQL Standard‑based authorization plus HDFS UGO) to a column‑level permission system built on Apache Ranger. Data are classified into four sensitivity levels—public (C1), internal (C2), secret (C3), and confidential (C4). A role‑based access control (RBAC) model is used, with field‑level policies such as db.db.table.$column.c4. Role packages for different levels reduce the barrier to low‑sensitivity data while protecting high‑sensitivity assets.
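The RBAC model with sensitivity levels might look like the following. This is a minimal sketch under assumed semantics (each role package grants read access up to a maximum level); the policy-name parsing mirrors the article's db.db.table.$column.c4 convention, but the class and method names are hypothetical.

```python
# Illustrative RBAC sketch (assumed semantics): columns carry a sensitivity
# level C1..C4, and each role package grants access up to a maximum level.
LEVELS = {"c1": 1, "c2": 2, "c3": 3, "c4": 4}  # public .. confidential

class RoleBasedAuthorizer:
    def __init__(self):
        self.role_max_level = {}  # role -> highest level it may read
        self.user_roles = {}      # user -> set of granted roles

    def grant_role(self, user, role, max_level):
        self.role_max_level[role] = LEVELS[max_level]
        self.user_roles.setdefault(user, set()).add(role)

    def can_read_column(self, user, policy_name):
        # Policies are named like "db.db.table.$column.c4"; the sensitivity
        # level is the last dot-separated component.
        level = LEVELS[policy_name.rsplit(".", 1)[1]]
        return any(
            self.role_max_level[r] >= level
            for r in self.user_roles.get(user, ())
        )

authz = RoleBasedAuthorizer()
authz.grant_role("alice", "analyst_c2", "c2")  # low-sensitivity role package
authz.grant_role("bob", "risk_c4", "c4")       # confidential role package
print(authz.can_read_column("alice", "dw.dw.orders.$city.c2"))   # True
print(authz.can_read_column("alice", "dw.dw.orders.$phone.c4"))  # False
```

The "role package" idea from the article shows up here as a single grant that covers everything at or below a level, so ordinary users get broad access to C1/C2 data without per-column requests.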

The permission system comprises several modules:

- A data labeling service that tags data via real‑time Kafka DDL events and a safety algorithm, publishing results to Didi's internal message queue (DDMQ).
- A metadata management service that subscribes to DDMQ to keep column classifications up to date and offers a unified metadata query interface.
- A manual correction service and a Presto‑based table‑sampling service.
- A visual permission‑application platform for Hive tables and HDFS paths.
- An independent SQL authentication service (RESTful API) that stores policies in MySQL.
- A policy‑generation module that creates Ranger policies from classification data and pushes them via Ranger Admin APIs.
- The engine layer: community Ranger Admin, a self‑developed Ranger Metastore (centralized real‑time authorization via an extended Hive Metastore Thrift interface), a self‑developed Ranger Plugin for engine‑side authorization checks, and a "big account" mechanism that bypasses HDFS UGO checks once metadata authorization succeeds.
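The policy-generation step can be illustrated as below: turn a column-classification record into a Ranger-style policy body and hand it to a pusher. This is a hedged sketch; the field names follow Ranger's policy model only loosely, the function names are hypothetical, and the real system would POST the body to Ranger Admin's REST API rather than use a stub.

```python
# Hedged sketch of policy generation (assumed shapes, not Didi's code):
# build a Ranger-style column policy from classification metadata.
def build_column_policy(db, table, column, level, roles):
    return {
        "service": "hive",
        "name": f"{db}.{table}.${column}.{level}",  # simplified naming
        "resources": {
            "database": {"values": [db]},
            "table": {"values": [table]},
            "column": {"values": [column]},
        },
        "policyItems": [{
            "roles": sorted(roles),
            "accesses": [{"type": "select", "isAllowed": True}],
        }],
    }

def push_policy(policy, send=lambda body: body):
    # In production this would call a Ranger Admin API endpoint;
    # here `send` is a stub so the sketch stays self-contained.
    return send(policy)

policy = push_policy(
    build_column_policy("dw", "orders", "phone", "c4", {"risk_c4"})
)
print(policy["name"])  # dw.orders.$phone.c4
```

Driving policies from the classification pipeline, rather than from manual entry, is what lets the system scale to the millions of policies mentioned below.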

After two years of operation, the system enforces millions of security policies and has significantly strengthened the security of Didi’s big data platform.

The article concludes with a recruitment notice for the offline engine & HBase team, inviting experts in HDFS, YARN, Spark, and HBase to join by emailing [email protected].

Tags: big data, Authentication, Data Security, authorization, column level, ranger
Written by Didi Tech, the official Didi technology account.