Simplify Large‑Scale Database Security: Discover, Classify & Encrypt Sensitive Data

This article outlines Vivo’s comprehensive approach to large‑scale database security, covering recent data‑leak incidents, legal compliance, the current challenges of sensitive data governance, a field‑level discovery method, classification standards, automated scoring, and multiple encryption solutions including Proxy, MyBatis, and ShardingSphere.

dbaplus Community

Background

Recent high‑profile data‑leak incidents (e.g., Facebook, T‑Mobile, Amazon) have highlighted severe risks when databases are not properly secured. Non‑compliance with regulations such as China’s Personal Information Protection Law, Data Security Law, Cybersecurity Law, GDPR, India’s PDPB, and the US Data Breach Notification Law can lead to fines, forced shutdowns, and reputational damage.

Data‑Security Laws & Compliance

Domestic: Personal Information Protection Law, Data Security Law, Cybersecurity Law.

Overseas: GDPR (EU), PDPB (India), DBNL (US).

Sensitive Data Governance Challenges

Very tight remediation timeline (≈2.5 months) for a large‑scale system.

Over 300 application modules across the internet team are affected.

Regulatory requirements may evolve, requiring continuous updates.

The core problem is identifying what constitutes sensitive data, discovering it, and establishing a long‑term security management regime.

Sensitive Data Field Discovery

Definition

Sensitive data includes user personal information (names, IDs, contact details) and device‑related data (IMEI, operation logs). Derived data such as content‑platform posts, user profiles, and financial records are also considered sensitive.

Sensitive data categories

Data‑Level Classification

Vivo follows the Personal Information Protection Law and Data Security Law to assign a risk level to each data field based on impact on business operations and user privacy. High‑risk fields require masking or encryption.

Data classification diagram

Current Issues & Improvements

New fields often lack an initial classification and are only labeled after use.

Field‑recognition engine mainly supports user‑type data; coverage for non‑user data is weak.

Legacy data still requires manual labeling, leading to high workload.

No quantitative evaluation metrics, causing quality uncertainty.

Vivo built an automated sensitive‑field discovery pipeline that scans MySQL, TiDB, Elasticsearch, and MongoDB instances, assigns a classification score, and incorporates human correction feedback to continuously improve accuracy.
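The article does not describe how the classification score is computed. As a minimal sketch only, one common approach combines evidence from the column name with evidence from sampled values. The rule table, the regular expressions, and the 0.4/0.6 weights below are all illustrative assumptions, not Vivo's actual rules.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical rules: category -> (column-name pattern, value pattern).
RULES = {
    "phone": (re.compile(r"phone|mobile|tel", re.I), re.compile(r"^1\d{10}$")),
    "email": (re.compile(r"e?mail", re.I), re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    "imei": (re.compile(r"imei", re.I), re.compile(r"^\d{15}$")),
}

def score_field(column: str, samples: List[str]) -> Tuple[Optional[str], float]:
    """Return the best-matching category and a score in [0, 1]."""
    best, best_score = None, 0.0
    for category, (name_re, value_re) in RULES.items():
        # Name evidence: does the column name look like this category?
        name_hit = 1.0 if name_re.search(column) else 0.0
        # Value evidence: fraction of sampled values that match the pattern.
        value_hit = (
            sum(1 for s in samples if value_re.match(s)) / len(samples)
            if samples else 0.0
        )
        # Assumed weighting: name evidence 0.4, value evidence 0.6.
        score = 0.4 * name_hit + 0.6 * value_hit
        if score > best_score:
            best, best_score = category, score
    return best, best_score
```

In a pipeline like the one described, fields whose score falls in an uncertain middle band would be natural candidates for human review, and the corrections could feed back into the rule table.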

Key metrics:

Coverage: Classified data volume / total data volume.

Accuracy: (Correctly classified user data × 0.1) / (Sampled user data × 0.1 + Sampled non‑user data × 0.9). The 1:9 weighting reflects the typical proportion of user vs. non‑user records.
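The two metrics can be sketched as small functions. With the default arguments, `weighted_accuracy` reproduces the accuracy formula exactly as stated above, where only correctly classified user data appears in the numerator. The optional `correct_non_user` parameter is an assumption added here, not something the article states: it adds the symmetric non‑user term, without which the score cannot reach 1 when non‑user samples are present. All function and parameter names are hypothetical.

```python
def coverage(classified: int, total: int) -> float:
    """Classified data volume divided by total data volume."""
    return classified / total if total else 0.0

def weighted_accuracy(
    correct_user: int,
    sampled_user: int,
    sampled_non_user: int,
    correct_non_user: int = 0,  # assumption: not part of the stated formula
    w_user: float = 0.1,
    w_non_user: float = 0.9,
) -> float:
    """Weighted accuracy over a sample, using the 1:9 user/non-user weights."""
    denom = sampled_user * w_user + sampled_non_user * w_non_user
    if denom == 0:
        return 0.0
    numer = correct_user * w_user + correct_non_user * w_non_user
    return numer / denom
```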

Metrics diagram

Data Encryption & Decryption

Solution Options

Custom MyBatis plugin: High integration cost; requires code changes; Java‑only.

ShardingSphere (vivo‑JDBC): Similar cost and language limitation.

Database‑level transparent proxy: Low integration cost; works with any MySQL‑compatible database; language‑agnostic; supports client‑side decryption.

The proxy approach offers the best balance of cost, compatibility, and operational simplicity.

Solution comparison

Proxy Encryption Mechanics

The proxy intercepts SQL statements and result sets, applying predefined masking rules. Three field types are defined:

Logical field: The name used by applications.

Plain field: Stores raw data.

Cipher field: Stores encrypted data.

During a write, the proxy encrypts the plain value into the cipher field; during a read, it decrypts the cipher field back to the logical field, ensuring transparent operation.
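The field mapping can be illustrated with a toy sketch that operates on rows as dictionaries instead of parsing SQL. The rule table, the function names, and the `keep_plain` switch are hypothetical, and base64 stands in for a real cipher purely so the example runs with the standard library. It is not encryption and is not the proxy's actual algorithm.

```python
import base64
from typing import Any, Dict

# Hypothetical rule table: logical column -> (plain column, cipher column).
RULES = {"phone": ("phone", "phone_cipher")}

def toy_encrypt(value: str) -> str:
    # Stand-in for a real cipher such as AES; base64 is NOT encryption.
    return base64.b64encode(value.encode("utf-8")).decode("ascii")

def toy_decrypt(token: str) -> str:
    return base64.b64decode(token.encode("ascii")).decode("utf-8")

def rewrite_write(row: Dict[str, Any], keep_plain: bool = False) -> Dict[str, Any]:
    """Rewrite an incoming row: logical columns are stored in cipher columns."""
    out = {}
    for col, value in row.items():
        if col in RULES:
            plain_col, cipher_col = RULES[col]
            out[cipher_col] = toy_encrypt(value) if value is not None else None
            if keep_plain:
                # Assumed transitional mode: also keep the raw value.
                out[plain_col] = value
        else:
            out[col] = value
    return out

def rewrite_read(stored: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite a stored row: cipher columns come back as logical columns."""
    cipher_to_logical = {cipher: logical for logical, (_, cipher) in RULES.items()}
    out = {}
    for col, value in stored.items():
        if col in cipher_to_logical:
            out[cipher_to_logical[col]] = (
                toy_decrypt(value) if value is not None else None
            )
        else:
            out[col] = value
    return out
```

The application only ever sees the logical name `phone`; the `phone_cipher` column exists solely on the storage side of the rewrite.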

Proxy encryption flow

Proxy Advantages & Limitations

Integrated into the MySQL architecture, making encryption transparent to upstream systems.

Compatible with any MySQL‑protocol database, eliminating integration barriers.

Language‑agnostic, supporting all client languages.

Cannot support column‑level calculations or comparisons on encrypted data.

Handling Existing Data

Client‑Side Encryption for Legacy Data

Read configuration to identify fields needing cleaning.

Query rows where cipher columns are NULL, e.g.,

SELECT article, dealer FROM test.shop WHERE article_cipher IS NULL OR dealer_cipher IS NULL LIMIT 10;

Lock the row and fetch plain values, e.g.,

SELECT article, dealer, price FROM test.shop WHERE article = 1 AND dealer = 'A' FOR UPDATE;

Update the cipher columns via the proxy, e.g.,

UPDATE test.shop SET article = 1, dealer = 'A', price = 3.53 WHERE article = 1 AND dealer = 'A';

Commit and repeat until no rows remain.
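The batch loop in the steps above can be sketched as follows. This is an illustration under several assumptions: it uses the standard library's `sqlite3` as a stand‑in for a MySQL connection, so there is no `FOR UPDATE` row lock; it fills only a single `dealer_cipher` column; and it encodes the value in the client with a base64 placeholder, whereas in the described setup the proxy fills the cipher column when the `UPDATE` passes through it. The function names are hypothetical.

```python
import base64
import sqlite3

def toy_encrypt(value: str) -> str:
    # Stand-in for the proxy's cipher; base64 is NOT encryption.
    return base64.b64encode(value.encode("utf-8")).decode("ascii")

def backfill(conn: sqlite3.Connection, batch_size: int = 10) -> int:
    """Fill dealer_cipher where it is NULL, committing once per batch."""
    total = 0
    while True:
        # Step 2: find a batch of rows that still lack a cipher value.
        rows = conn.execute(
            "SELECT article, dealer FROM shop "
            "WHERE dealer_cipher IS NULL AND dealer IS NOT NULL LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return total
        # Steps 3-4: rewrite each row so its cipher column gets populated.
        for article, dealer in rows:
            conn.execute(
                "UPDATE shop SET dealer_cipher = ? "
                "WHERE article = ? AND dealer = ?",
                (toy_encrypt(dealer), article, dealer),
            )
        # Step 5: commit the batch, then loop until nothing remains.
        conn.commit()
        total += len(rows)
```

Small batches keep each transaction short, which limits lock time on a live table; the right batch size depends on the workload.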

Legacy data encryption workflow

gh‑ost Tool for Online DDL & Encryption

gh‑ost creates a shadow ("ghost") table, copies existing rows into it in chunks while replaying ongoing changes from the binlog, and then swaps the tables atomically. By adding cipher columns to the shadow table, encryption can be performed during the migration itself, eliminating a separate post‑migration backfill step.

gh‑ost online DDL diagram

Data‑Link Up‑ and Down‑Stream Adaptation

Business Integration

Non‑destructive computation layer changes.

Configure masking rules.

Enable per‑user switches.

Automatic transparent encryption.

Upstream integration diagram

Real‑Time Data Capture

Binlog‑based capture pipelines must decrypt cipher fields before forwarding to downstream systems (e.g., Elasticsearch, Kafka). The proxy’s decryption logic is reused at the capture layer to rewrite rows back to plain values.
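As a rough sketch of that capture‑layer step, the function below takes a row‑change event and returns a copy in which cipher columns are replaced by their logical names and plain values. The event shape (a dict with an `after` row image), the column mapping, and the function names are assumptions for illustration; base64 again stands in for the real cipher and is not encryption.

```python
import base64
from typing import Any, Dict

# Hypothetical mapping: cipher column -> logical column name.
CIPHER_TO_LOGICAL = {"phone_cipher": "phone"}

def toy_decrypt(token: str) -> str:
    # Stand-in for the proxy's real cipher; base64 is NOT encryption.
    return base64.b64decode(token.encode("ascii")).decode("utf-8")

def decrypt_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the event whose row image holds plain values."""
    row = {}
    for col, value in event["after"].items():
        logical = CIPHER_TO_LOGICAL.get(col)
        if logical is not None:
            row[logical] = toy_decrypt(value) if value is not None else None
        else:
            row[col] = value
    # Build a new event so the original is left unchanged.
    return {**event, "after": row}
```

Because the rewrite happens before forwarding, downstream consumers see the same logical column names that applications see through the proxy.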

Real‑time capture decryption flow

Summary & Outlook

Summary

Vivo has built an end‑to‑end solution for sensitive data governance: automated field discovery across MySQL, TiDB, Elasticsearch, and MongoDB; real‑time classification with ≥85 % accuracy; and transparent encryption via a MySQL proxy that supports both new and legacy data.

Solution overview

Outlook

Challenges

Current masking works only for MySQL‑protocol databases.

Encrypted columns cannot be used in calculations or comparisons, limiting SQL compatibility.

Focus remains on storage‑layer encryption.

Future Plans

Improve SQL compatibility to handle complex queries and writes.

Extend encryption beyond the storage layer to cover multi‑source data pipelines, achieving a unified encryption framework across the entire data ecosystem.
