How Youzan Built a Scalable Big Data Security Framework for Privacy Protection
This article details Youzan's end‑to‑end big data security architecture, covering data lifecycle protection, classification, access control, auditing, backup, privacy safeguards, sensitive data detection, masking strategies, and compliance processes to ensure secure and compliant data handling across the platform.
Background
Data volume doubles roughly every two years, making big‑data security a critical requirement. Security must protect data throughout its entire lifecycle and ensure compliance with relevant regulations.
Definition and Goals of Big Data Security
Based on the classic CIA triad (confidentiality, integrity, availability), big‑data security aims to protect data during production, usage, storage, transmission, disclosure, and destruction while meeting compliance obligations.
Meet basic data‑security needs, including protection of sensitive data and regulatory compliance.
Cover all data stages and application scenarios, not limited to a single platform.
Support data classification, role‑based permissions, and lifecycle management.
Provide systematic compliance handling across the whole process.
Overall Architecture
The architecture consists of three layers from bottom to top: data platform security, data management security, and privacy‑protection security. Compliance handling permeates the entire stack.
Big Data Platform Security
Boundary Security
Only authorized users can access the big‑data cluster, ensuring secure ingress and egress of data. This includes identity authentication, network isolation, and interface authorization.
Identity Authentication : All data development is routed through platform‑level tools such as the Data Development Platform (DP) or real‑time computation platforms. Users must authenticate to these tools before accessing underlying components.
Network Isolation : Clusters are isolated at the network layer across environments and data centers to prevent cross‑environment data leakage.
Interface Authorization : Both internal‑to‑internal and internal‑to‑external API calls undergo policy checks to block unauthorized usage, eavesdropping, or bypass attacks; a minimal check sketch follows this list.
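To make the policy-check idea concrete, here is a minimal allow‑list sketch, assuming a simple in‑memory policy table; the caller and interface names are hypothetical, not Youzan's actual gateway.

```python
# Minimal sketch of an interface-authorization check. Every internal or
# external API call is validated against an explicit allow-list before it
# reaches the cluster; anything without a policy entry is rejected.
# (Caller/interface names below are hypothetical.)

ALLOWED_CALLS = {
    # (caller_service, target_interface) -> environments it may touch
    ("dp-platform", "hive-metastore"): {"prod", "staging"},
    ("report-service", "presto-gateway"): {"prod"},
}

def authorize_call(caller: str, interface: str, env: str) -> bool:
    """Deny by default: only explicitly allow-listed pairs pass."""
    allowed_envs = ALLOWED_CALLS.get((caller, interface))
    return allowed_envs is not None and env in allowed_envs

assert authorize_call("dp-platform", "hive-metastore", "prod")
assert not authorize_call("unknown-svc", "hive-metastore", "prod")
```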
Access and Permission Control
Data Access Permission : Permission checks are delegated to the open‑source Apache Ranger component. Users request access via the DP platform, and Ranger evaluates policies based on data classification and user roles; a policy sketch follows this list.
Permission Auditing : All permission requests and approvals are logged for audit purposes. Expired permissions are periodically cleaned up by automated tasks.
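As an illustration of how a DP approval could translate into a Ranger grant, the sketch below creates an access policy through Ranger's public REST API (POST /service/public/v2/api/policy). The endpoint and JSON shape follow Apache Ranger's documented v2 API; the service name, user, URL, and credentials are placeholders.

```python
# Sketch: grant a user SELECT on a Hive table via Apache Ranger's REST API.
import requests

policy = {
    "service": "hive_prod",               # hypothetical Ranger service name
    "name": "dp-grant-ads.user_profile",  # one policy per approved request
    "resources": {
        "database": {"values": ["ads"]},
        "table":    {"values": ["user_profile"]},
        "column":   {"values": ["*"]},
    },
    "policyItems": [{
        "users": ["analyst_zhang"],       # requester approved on DP
        "accesses": [{"type": "select", "isAllowed": True}],
    }],
}

resp = requests.post(
    "http://ranger.example.com:6080/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "admin-password"),     # placeholder credentials
)
resp.raise_for_status()
```

Expired grants can then be revoked by deleting the corresponding policy, which is what the periodic cleanup tasks mentioned above would do.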
Auditing and Backup
Data Auditing : Audit logs are collected daily via T+1 offline jobs and made searchable for administrators to review and troubleshoot.
Backup and Recovery : Backup processes are tied to data‑lifecycle definitions. Data is periodically backed up to a cold‑storage cluster, with backup frequency and scope aligned to the data’s lifecycle stage to reduce storage costs; a scheduling sketch follows this list.
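A minimal sketch of what lifecycle-driven backup planning could look like; the stage names, frequencies, and targets below are assumptions for illustration, since the article only states that frequency and scope follow the data's lifecycle stage.

```python
# Sketch: map lifecycle stage -> backup frequency and target cluster
# (stage names and numbers are assumed, not Youzan's actual policy).
BACKUP_POLICY = {
    "hot":     {"frequency_days": 1,  "target": "warm-cluster"},
    "warm":    {"frequency_days": 7,  "target": "cold-cluster"},
    "cold":    {"frequency_days": 30, "target": "cold-cluster"},
    "expired": None,  # eligible for cleanup, not backup
}

def plan_backup(table: str, stage: str) -> str:
    policy = BACKUP_POLICY.get(stage)
    if policy is None:
        return f"skip {table}: stage '{stage}' is not backed up"
    return (f"backup {table} to {policy['target']} "
            f"every {policy['frequency_days']} day(s)")

print(plan_backup("dw.orders", "hot"))       # daily backup to warm-cluster
print(plan_backup("tmp.scratch", "expired"))  # skipped
```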
Privacy Protection and Personal Data Security
Data Security Capabilities
The platform provides functions such as sensitive data masking, classification & grading, metadata management, storage encryption, and data lineage tracing.
Data Classification and Grading
Data is divided into three categories—company data, business data, and customer data—each further split into four security levels. Classification is applied at the source (e.g., MySQL tables) and propagated through data lineage.
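The sketch below models the three-category, four-level grid and the strictest-wins inheritance that lineage propagation implies; the level ordering and helper names are assumptions.

```python
# Sketch: 3 categories x 4 levels, with sensitivity inherited along lineage.
from dataclasses import dataclass

CATEGORIES = {"company", "business", "customer"}
LEVELS = (1, 2, 3, 4)  # assumed: 4 is the most sensitive

@dataclass
class Tag:
    category: str
    level: int

def propagate(upstream_tags: list[Tag]) -> Tag:
    """A derived table inherits the strictest tag among its sources."""
    return max(upstream_tags, key=lambda t: t.level)

# A wide table joining a level-4 customer table with a level-2 business
# table inherits level 4.
inherited = propagate([Tag("customer", 4), Tag("business", 2)])
assert inherited.level == 4
```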
Sensitive Data Identification
Eight personal sensitive data types are targeted: address, QQ number, WeChat ID, email, phone number, name, ID card, and bank card.
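For illustration, here are common regex heuristics for several of these types; they are not Youzan's production rules, and in practice name and address detection needs dictionaries or NLP rather than a single pattern, with final decisions made by voting over many samples.

```python
# Illustrative detection rules for some of the eight sensitive types.
import re

RULES = {
    "phone":     re.compile(r"^1[3-9]\d{9}$"),            # mainland mobile
    "id_card":   re.compile(r"^\d{17}[\dXx]$"),           # 18-digit ID card
    "bank_card": re.compile(r"^\d{13,19}$"),              # plus a Luhn check
    "email":     re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "qq":        re.compile(r"^[1-9]\d{4,10}$"),
    "wechat":    re.compile(r"^[a-zA-Z][\w-]{5,19}$"),
}

def detect(value: str) -> list[str]:
    """Return every type whose pattern matches; ambiguity (e.g. a phone
    number also matching the QQ pattern) is resolved by sample-level voting."""
    return [t for t, pattern in RULES.items() if pattern.match(value)]

assert "phone" in detect("13800138000")
```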
Optimization of Identification Process
Reduced full‑sample identification time to 1–2 hours.
Supported incremental sampling with minute‑level response.
Integrated with the data‑asset platform for custom sensitivity levels.
Improved rule accuracy for sensitive categories.
Key Steps
Engine Selection : Use Presto for large tables and Hive for smaller ones to avoid resource contention; a sampling sketch follows this list.
Sample Table Optimization : Sample only new tables or those updated within a day.
Partition Handling : Apply different sampling strategies for partitioned vs. non‑partitioned tables.
Field Filtering : Sample only string/numeric fields, exclude time fields and obvious non‑sensitive identifiers.
Sample Richness : Ensure sufficient random samples, enforce length limits for strings, and filter out null values.
Incremental Detection : Trigger detection on data changes via DP to achieve near‑real‑time protection.
Lineage‑Based Inheritance : Propagate sensitivity levels through data‑lineage relationships.
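Putting several of these steps together, a sampling query might be assembled as below; the thresholds, type filter, and helper names are assumptions for illustration (Hive's rand() is used; Presto's equivalent is random()).

```python
# Sketch: build a sampling query that keeps only string/numeric fields,
# caps string length, filters nulls, and samples randomly.
SAMPLE_ROWS = 1000
MAX_STR_LEN = 64
LARGE_TABLE_ROWS = 100_000_000   # assumed engine-selection cutoff

def pick_engine(row_count: int) -> str:
    return "presto" if row_count >= LARGE_TABLE_ROWS else "hive"

def build_sample_sql(table: str, columns: dict[str, str],
                     partition_clause: str = "") -> str:
    # keep only string/numeric columns, dropping time fields up front
    kept = [c for c, t in columns.items()
            if t in ("string", "bigint", "int", "double")]
    selects = [f"substr({c}, 1, {MAX_STR_LEN}) AS {c}"
               if columns[c] == "string" else c for c in kept]
    conditions = [f"{c} IS NOT NULL" for c in kept]
    if partition_clause:                      # sample the newest partition only
        conditions.append(partition_clause)
    return (f"SELECT {', '.join(selects)} FROM {table} "
            f"WHERE {' AND '.join(conditions)} "
            f"ORDER BY rand() LIMIT {SAMPLE_ROWS}")

engine = pick_engine(2_000_000_000)           # -> "presto"
sql = build_sample_sql(
    "dw.orders",
    {"receiver_addr": "string", "amount": "double", "created_at": "timestamp"},
    partition_clause="dt = '2020-06-01'",
)
```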
Data Masking
Identified sensitive fields are masked using Ranger’s mask policies: each sensitivity level maps to a distinct masking rule, applied via HTTP calls to Ranger’s REST API (see the sketch below).
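The sketch below shows what such an HTTP call could look like. policyType 1 and mask types such as MASK_SHOW_LAST_4 are real Apache Ranger concepts; the level-to-mask mapping, service name, URL, and credentials are assumptions.

```python
# Sketch: attach a mask policy to an identified sensitive column via
# Ranger's public REST API.
import requests

LEVEL_TO_MASK = {                 # assumed sensitivity-level mapping
    4: "MASK",                    # full redaction
    3: "MASK_SHOW_LAST_4",        # e.g. phone / bank-card tails
    2: "MASK_HASH",
}

def apply_mask_policy(db: str, table: str, column: str, level: int) -> None:
    policy = {
        "service": "hive_prod",
        "name": f"mask-{db}.{table}.{column}",
        "policyType": 1,          # 1 = masking policy in Ranger
        "resources": {
            "database": {"values": [db]},
            "table":    {"values": [table]},
            "column":   {"values": [column]},
        },
        "dataMaskPolicyItems": [{
            "groups": ["analysts"],
            "accesses": [{"type": "select", "isAllowed": True}],
            "dataMaskInfo": {"dataMaskType": LEVEL_TO_MASK[level]},
        }],
    }
    resp = requests.post(
        "http://ranger.example.com:6080/service/public/v2/api/policy",
        json=policy, auth=("admin", "admin-password"),
    )
    resp.raise_for_status()

apply_mask_policy("ads", "user_profile", "phone", 3)
```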
Compliance Handling
Data Export Control
Export processes are divided into internal calls (managed within DP) and external data provision to merchants. Both require approval based on data volume and sensitivity.
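A minimal sketch of how approval could be routed on those two dimensions; the tiers and thresholds are invented for illustration, since the article only states that volume and sensitivity gate approval.

```python
# Sketch: route an export request to approvers by volume and sensitivity.
def approval_chain(row_count: int, max_sensitivity_level: int) -> list[str]:
    chain = ["data-owner"]                       # every export needs an owner
    if max_sensitivity_level >= 3 or row_count > 100_000:
        chain.append("security-team")
    if max_sensitivity_level == 4 or row_count > 1_000_000:
        chain.append("compliance-officer")
    return chain

assert approval_chain(500, 1) == ["data-owner"]
assert "compliance-officer" in approval_chain(5_000_000, 2)
```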
Other Compliance Measures
Additional controls include data‑leak emergency response plans, security standards for third‑party services, and continuous monitoring to ensure that every data movement adheres to defined security policies.
Conclusion and Outlook
The current security framework addresses data‑lifecycle protection and compliance, but gaps remain—such as lack of proactive monitoring and predictive risk detection. Future work will focus on enhancing audit completeness, real‑time monitoring, and automated risk prediction.