Simplify Large‑Scale Database Security: Discover, Classify & Encrypt Sensitive Data
This article outlines Vivo’s comprehensive approach to large‑scale database security, covering recent data‑leak incidents, legal compliance, the current challenges of sensitive data governance, a field‑level discovery method, classification standards, automated scoring, and multiple encryption solutions including Proxy, MyBatis, and ShardingSphere.
Background
Recent high‑profile data‑leak incidents (e.g., Facebook, T‑Mobile, Amazon) have highlighted severe risks when databases are not properly secured. Non‑compliance with regulations such as China’s Personal Information Protection Law, Data Security Law, Cybersecurity Law, GDPR, India’s PDPB, and the US Data Breach Notification Law can lead to fines, forced shutdowns, and reputational damage.
Data‑Security Laws & Compliance
Domestic: Personal Information Protection Law, Data Security Law, Cybersecurity Law.
Overseas: GDPR (EU), PDPB (India), DBNL (US).
Sensitive Data Governance Challenges
Very tight remediation timeline (≈2.5 months) for a large‑scale system.
Over 300 application modules across the internet team are affected.
Regulatory requirements may evolve, requiring continuous updates.
The core problem is identifying what constitutes sensitive data, discovering it, and establishing a long‑term security management regime.
Sensitive Data Field Discovery
Definition
Sensitive data includes user personal information (names, IDs, contact details) and device‑related data (IMEI, operation logs). Derived data, such as content‑platform posts, user profiles, and financial records, is also considered sensitive.
Data‑Level Classification
Vivo follows the Personal Information Protection Law and Data Security Law to assign a risk level to each data field based on impact on business operations and user privacy. High‑risk fields require masking or encryption.
Current Issues & Improvements
New fields often lack an initial classification and are only labeled after use.
The field‑recognition engine mainly supports user‑type data; coverage for non‑user data is weak.
Legacy data still requires manual labeling, leading to high workload.
No quantitative evaluation metrics, causing quality uncertainty.
Vivo built an automated sensitive‑field discovery pipeline that scans MySQL, TiDB, Elasticsearch, and MongoDB instances, assigns a classification score, and incorporates human correction feedback to continuously improve accuracy.
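As a rough illustration of how such a pipeline might score a single column, the sketch below combines a column‑name match with the share of sampled values that match a pattern, and lets a human correction override the result. The rule set, the 0.4/0.6 weights, and the names `RULES`, `FieldScore`, and `score_field` are all hypothetical; the article does not describe vivo's actual rules or scoring.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical rules: (label, column-name pattern, value pattern).
# These are illustrative only, not vivo's actual rule set.
RULES = [
    ("phone", re.compile(r"phone|mobile", re.I), re.compile(r"^1\d{10}$")),
    ("email", re.compile(r"mail", re.I), re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    ("imei", re.compile(r"imei", re.I), re.compile(r"^\d{15}$")),
]


@dataclass
class FieldScore:
    column: str
    label: Optional[str]
    score: float


def score_field(column, samples, overrides=None):
    """Score one column: 0.4 for a name match plus 0.6 x share of matching samples."""
    overrides = overrides or {}
    if column in overrides:
        # A human correction always wins over the automatic rules.
        label = overrides[column]
        return FieldScore(column, label, 1.0 if label else 0.0)

    values = [str(v) for v in samples if v is not None]
    best = FieldScore(column, None, 0.0)
    for label, name_re, value_re in RULES:
        name_hit = 0.4 if name_re.search(column) else 0.0
        if values:
            matched = sum(bool(value_re.match(v)) for v in values)
            value_hit = 0.6 * matched / len(values)
        else:
            value_hit = 0.0
        score = round(name_hit + value_hit, 2)
        if score > best.score:
            best = FieldScore(column, label, score)
    return best
```

In a real scan the samples would come from a bounded query against each MySQL, TiDB, Elasticsearch, or MongoDB instance, and the overrides would be loaded from the stored human corrections.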
Key metrics:
Coverage: Classified data volume / total data volume.
Accuracy: (Correctly classified user data × 0.1 + Correctly classified non‑user data × 0.9) / (Sampled user data × 0.1 + Sampled non‑user data × 0.9). The 1:9 weighting reflects the typical proportion of user vs. non‑user records.
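The two metrics can be written as small functions. This is a sketch under one stated assumption: correctly classified non‑user samples are counted in the numerator with the 0.9 weight, so that a fully correct sample scores 1.0. The function and parameter names are illustrative.

```python
def coverage(classified_volume, total_volume):
    """Share of the total data volume that has been classified."""
    if total_volume <= 0:
        raise ValueError("total_volume must be positive")
    return classified_volume / total_volume


def weighted_accuracy(correct_user, sampled_user,
                      correct_non_user, sampled_non_user,
                      w_user=0.1, w_non_user=0.9):
    """Sampled accuracy with user and non-user samples weighted 1:9.

    Assumption: correct non-user samples enter the numerator with the
    same 0.9 weight that non-user samples carry in the denominator.
    """
    denominator = sampled_user * w_user + sampled_non_user * w_non_user
    if denominator <= 0:
        raise ValueError("at least one sample is required")
    numerator = correct_user * w_user + correct_non_user * w_non_user
    return numerator / denominator
```

For example, with 100 sampled rows of each kind, 90 correct user rows and 80 correct non‑user rows give (9 + 72) / (10 + 90) = 0.81.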
Data Encryption & Decryption
Solution Options
Custom MyBatis plugin: High integration cost; requires code changes; Java‑only.
ShardingSphere (vivo‑JDBC): Integration cost comparable to the MyBatis plugin, with the same Java‑only limitation.
Database‑level transparent proxy: Low integration cost; works with any MySQL‑compatible database; language‑agnostic; supports client‑side decryption.
The proxy approach offers the best balance of cost, compatibility, and operational simplicity.
Proxy Encryption Mechanics
The proxy intercepts SQL statements and result sets, applying predefined masking rules. Three field types are defined:
Logical field: The name used by applications.
Plain field: Stores raw data.
Cipher field: Stores encrypted data.
During a write, the proxy encrypts the plain value into the cipher field; during a read, it decrypts the cipher field back to the logical field, ensuring transparent operation.
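The write and read paths described above can be sketched as two row‑level rewrites. The sketch makes several assumptions: the plain column shares the logical field's name (as `article` and `article_cipher` do in the legacy‑data example), rules are a simple mapping from logical field to cipher column, and the cipher is a toy XOR keystream that only stands in for a real algorithm such as AES. `KEY`, `RULES`, `rewrite_write`, and `rewrite_read` are hypothetical names, not the proxy's actual interface.

```python
import base64
import hashlib

KEY = b"demo-key"  # hypothetical key; a real deployment would use a key service

# Hypothetical masking rule: logical field -> cipher column.
# Assumption: the plain column has the same name as the logical field.
RULES = {"phone": "phone_cipher"}


def _keystream(length):
    """Derive `length` pseudo-random bytes from KEY (toy construction, NOT secure)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(KEY + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def encrypt(plain):
    data = plain.encode("utf-8")
    mixed = bytes(a ^ b for a, b in zip(data, _keystream(len(data))))
    return base64.b64encode(mixed).decode("ascii")


def decrypt(cipher):
    data = base64.b64decode(cipher)
    mixed = bytes(a ^ b for a, b in zip(data, _keystream(len(data))))
    return mixed.decode("utf-8")


def rewrite_write(row):
    """Write path: add the cipher column for every configured logical field."""
    out = dict(row)
    for logical, cipher_col in RULES.items():
        if row.get(logical) is not None:
            out[cipher_col] = encrypt(str(row[logical]))
    return out


def rewrite_read(stored):
    """Read path: decrypt cipher columns back into the logical field and hide them."""
    cipher_cols = set(RULES.values())
    out = {k: v for k, v in stored.items() if k not in cipher_cols}
    for logical, cipher_col in RULES.items():
        if stored.get(cipher_col) is not None:
            out[logical] = decrypt(stored[cipher_col])
    return out
```

An application that writes `{"id": 1, "phone": "13800138000"}` and later reads the row back sees the same dictionary, while the stored row additionally carries `phone_cipher`.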
Proxy Advantages & Limitations
Integrated into the MySQL architecture, making encryption transparent to upstream systems.
Compatible with any MySQL‑protocol database, eliminating integration barriers.
Language‑agnostic, supporting all client languages.
Cannot support column‑level calculations or comparisons on encrypted data.
Handling Existing Data
Client‑Side Encryption for Legacy Data
Read configuration to identify fields needing cleaning.
Query rows where cipher columns are NULL, e.g.,
SELECT article, dealer FROM test.shop WHERE article_cipher IS NULL OR dealer_cipher IS NULL LIMIT 10;
Lock the row and fetch plain values, e.g.,
SELECT article, dealer, price FROM test.shop WHERE article = 1 AND dealer = 'A' FOR UPDATE;
Update the cipher columns via the proxy, e.g.,
UPDATE test.shop SET article = 1, dealer = 'A', price = 3.53 WHERE article = 1 AND dealer = 'A';
Commit and repeat until no rows remain.
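The cleaning loop above can be sketched end to end. To stay self‑contained, the sketch uses an in‑memory SQLite database in place of MySQL, so the `FOR UPDATE` row lock is omitted (SQLite has no such clause), and it writes the cipher columns directly instead of routing the UPDATE through the proxy. `encrypt` is a Base64 placeholder rather than a real cipher, and `backfill` and `make_demo_db` are hypothetical helper names.

```python
import base64
import sqlite3


def encrypt(value):
    # Placeholder only: Base64 is an encoding, not encryption.
    return base64.b64encode(str(value).encode("utf-8")).decode("ascii")


def make_demo_db(row_count):
    """Create an in-memory shop table whose cipher columns are still NULL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE shop (article INTEGER NOT NULL, dealer TEXT NOT NULL, "
        "price REAL, article_cipher TEXT, dealer_cipher TEXT, "
        "PRIMARY KEY (article, dealer))"
    )
    conn.executemany(
        "INSERT INTO shop (article, dealer, price) VALUES (?, ?, ?)",
        [(i, "A", 3.53) for i in range(row_count)],
    )
    conn.commit()
    return conn


def backfill(conn, batch_size=10):
    """Fill cipher columns batch by batch; return the number of rows cleaned."""
    total = 0
    while True:
        rows = conn.execute(
            "SELECT article, dealer FROM shop "
            "WHERE article_cipher IS NULL OR dealer_cipher IS NULL LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return total
        updated = 0
        for article, dealer in rows:
            cur = conn.execute(
                "UPDATE shop SET article_cipher = ?, dealer_cipher = ? "
                "WHERE article = ? AND dealer = ?",
                (encrypt(article), encrypt(dealer), article, dealer),
            )
            updated += cur.rowcount
        conn.commit()  # one commit per batch keeps transactions short
        if updated == 0:
            raise RuntimeError("no progress; aborting to avoid an endless loop")
        total += updated
```

Committing after each small batch keeps lock time short, which matters when the table is serving live traffic during the cleanup.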
gh‑ost Tool for Online DDL & Encryption
gh‑ost creates a shadow table, copies existing rows into it in batches while replaying ongoing changes from the binlog, and then swaps the tables atomically. By adding cipher columns to the shadow table, encryption can be performed during the migration, eliminating a separate post‑migration step.
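The idea of encrypting while rows are copied into the shadow table can be illustrated with a simplified simulation. This is not gh‑ost itself: it uses SQLite, copies rows in chunks, and renames the tables at the end, but it does not model binlog replay of concurrent writes. The table names follow gh‑ost's `_<table>_gho` and `_<table>_del` convention; `encrypt` is a Base64 placeholder and `migrate_with_cipher` is a hypothetical name.

```python
import base64
import sqlite3


def encrypt(value):
    # Placeholder only: Base64 is an encoding, not encryption.
    return base64.b64encode(str(value).encode("utf-8")).decode("ascii")


def migrate_with_cipher(conn, chunk_size=2):
    """Copy shop into a shadow table with cipher columns, then swap the tables."""
    conn.execute(
        "CREATE TABLE _shop_gho (article INTEGER NOT NULL, dealer TEXT NOT NULL, "
        "price REAL, article_cipher TEXT, dealer_cipher TEXT, "
        "PRIMARY KEY (article, dealer))"
    )
    last_rowid = 0
    while True:
        # Chunked copy ordered by rowid, so each pass resumes where the last ended.
        rows = conn.execute(
            "SELECT rowid, article, dealer, price FROM shop "
            "WHERE rowid > ? ORDER BY rowid LIMIT ?",
            (last_rowid, chunk_size),
        ).fetchall()
        if not rows:
            break
        conn.executemany(
            "INSERT INTO _shop_gho VALUES (?, ?, ?, ?, ?)",
            [(a, d, p, encrypt(a), encrypt(d)) for _, a, d, p in rows],
        )
        last_rowid = rows[-1][0]
    # Cut-over: the old table is kept under a new name, the shadow takes its place.
    conn.execute("ALTER TABLE shop RENAME TO _shop_del")
    conn.execute("ALTER TABLE _shop_gho RENAME TO shop")
    conn.commit()
```

After the swap, applications keep querying `shop`, and every copied row already has its cipher columns populated.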
Data‑Link Up‑ and Down‑Stream Adaptation
Business Integration
Non‑intrusive changes at the computation layer.
Configure masking rules.
Enable per‑user switches.
Automatic transparent encryption.
Real‑Time Data Capture
Binlog‑based capture pipelines must decrypt cipher fields before forwarding to downstream systems (e.g., Elasticsearch, Kafka). The proxy’s decryption logic is reused at the capture layer to rewrite rows back to plain values.
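A capture‑layer transform of this kind might look like the sketch below, which takes one captured row as a dictionary, decrypts each cipher column into its logical field, and drops the cipher columns before the row is forwarded. The mapping `CIPHER_TO_LOGICAL`, the Base64 placeholder `decrypt`, and the function `to_downstream` are hypothetical; the article does not show the actual capture code or event format.

```python
import base64

# Hypothetical mapping from cipher column to the logical field seen downstream.
CIPHER_TO_LOGICAL = {
    "article_cipher": "article",
    "dealer_cipher": "dealer",
}


def decrypt(cipher):
    # Placeholder only: stands in for the proxy's real decryption routine.
    return base64.b64decode(cipher).decode("utf-8")


def to_downstream(row):
    """Rewrite one captured row so that downstream systems only see plain values."""
    out = {}
    for column, value in row.items():
        logical = CIPHER_TO_LOGICAL.get(column)
        if logical is None:
            # Ordinary column: keep it unless a decrypted value is already in place.
            out.setdefault(column, value)
        elif value is not None:
            # Cipher column: the decrypted value replaces any plain placeholder.
            out[logical] = decrypt(value)
    return out
```

The result can then be serialized and sent to Elasticsearch or Kafka without the consumers needing to know that the source columns are encrypted.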
Summary & Outlook
Summary
Vivo has built an end‑to‑end solution for sensitive data governance: automated field discovery across MySQL, TiDB, Elasticsearch, and MongoDB; real‑time classification with ≥85 % accuracy; and transparent encryption via a MySQL proxy that supports both new and legacy data.
Outlook
Challenges
Current masking works only for MySQL‑protocol databases.
Encrypted columns cannot be used in calculations or comparisons, limiting SQL compatibility.
Focus remains on storage‑layer encryption.
Future Plans
Improve SQL compatibility to handle complex queries and writes.
Extend encryption beyond the storage layer to cover multi‑source data pipelines, achieving a unified encryption framework across the entire data ecosystem.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
