Sensitive Field Identification Using Wide & Deep and TextCNN Models
This article presents a machine‑learning approach for detecting sensitive data fields in a data warehouse by combining a Wide & Deep network with a TextCNN architecture, detailing data exploration, model design, training strategies, performance results, and deployment workflow.
Data governance aims to ensure high‑quality data throughout its lifecycle, with security being crucial for protecting sensitive and personal information; applying machine‑learning algorithms to identify sensitive fields can greatly improve both accuracy and efficiency.
The proposed solution implements a sensitive‑field identification algorithm based on a Wide & Deep network and TextCNN, structured into three parts: exploratory data analysis, a description of the two base models, and the final combined model.
Exploratory analysis of all tables and columns in the source layer of the data warehouse shows that only about 2% of fields are sensitive, mainly names, ID numbers, phone numbers, bank card numbers, and email addresses, revealing a severe class‑imbalance problem. Feature engineering derives additional attributes such as db_name_len, and the distributions of field length and column types are examined, as illustrated in the following figures:
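The kind of derived feature mentioned above (e.g. db_name_len) can be sketched as follows. The field schema, the helper name, and the type list are illustrative assumptions, not the article's exact implementation:

```python
# Sketch of derived features for one warehouse column; the dict keys
# (db_name, table_name, col_name, col_type) are assumed, not confirmed.

def derive_features(field):
    """Compute simple length/type features for one column's metadata."""
    return {
        "db_name_len": len(field["db_name"]),
        "table_name_len": len(field["table_name"]),
        "col_name_len": len(field["col_name"]),
        "is_string_type": int(field["col_type"].lower() in ("varchar", "char", "string")),
    }

example = {"db_name": "crm_prod", "table_name": "customer_info",
           "col_name": "mobile_phone", "col_type": "VARCHAR"}
features = derive_features(example)
```

Length features like these feed the "wide" side of the model, while the raw text of names and comments feeds the CNN branch described later.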
The Wide & Deep network, originally proposed by Google for recommendation systems, combines a linear "wide" part that handles numeric and one‑hot encoded categorical features with a deep part that learns embeddings for categorical features; the two parts are merged to predict the probability of a field being sensitive.
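The two-branch structure can be sketched as a minimal forward pass in numpy. All dimensions and the single-categorical-feature setup are toy assumptions for illustration; the article's actual feature set is richer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not the article's settings).
N_WIDE = 6           # numeric + one-hot features fed to the wide (linear) part
VOCAB, EMB = 10, 4   # categorical vocabulary and embedding size (deep part)
HIDDEN = 8

# Parameters (randomly initialized; training is omitted in this sketch).
w_wide = rng.normal(size=N_WIDE)
emb_table = rng.normal(size=(VOCAB, EMB))
W1, b1 = rng.normal(size=(EMB, HIDDEN)), np.zeros(HIDDEN)
W2, b2 = rng.normal(size=HIDDEN), 0.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def wide_deep_forward(x_wide, cat_id):
    """Wide part: linear over raw features. Deep part: embedding -> ReLU MLP.
    The two logits are summed and squashed to a sensitivity probability."""
    wide_logit = x_wide @ w_wide
    h = np.maximum(emb_table[cat_id] @ W1 + b1, 0.0)  # ReLU hidden layer
    deep_logit = h @ W2 + b2
    return sigmoid(wide_logit + deep_logit)

p = wide_deep_forward(rng.normal(size=N_WIDE), cat_id=3)
```

The key design point is the merge: memorization from the linear part and generalization from the embedding part contribute to a single logit before the sigmoid.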
TextCNN, introduced by Kim, applies convolutional neural networks to text classification; it processes sliding windows of word (or character) embeddings, applies convolution kernels, and aggregates the results to produce a classification score.
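Kim's convolution-then-max-pool pipeline can be sketched in numpy as below. Sequence length, kernel widths, and filter counts are toy assumptions, not the article's hyper-parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

SEQ_LEN, EMB = 12, 8       # toy sizes
KERNEL_SIZES = (2, 3, 4)   # sliding-window widths over the sequence
N_FILTERS = 3              # filters per kernel size

# One bank of convolution filters per kernel size.
filters = {k: rng.normal(size=(N_FILTERS, k, EMB)) for k in KERNEL_SIZES}
w_out = rng.normal(size=N_FILTERS * len(KERNEL_SIZES))

def textcnn_forward(embedded):
    """Kim-style TextCNN: convolve each filter over the embedded sequence,
    max-pool over time, concatenate, and apply a linear classifier."""
    pooled = []
    for k, bank in filters.items():
        conv = np.array([
            [np.sum(embedded[t:t + k] * f) for t in range(SEQ_LEN - k + 1)]
            for f in bank
        ])                               # shape: (N_FILTERS, positions)
        pooled.append(conv.max(axis=1))  # max-over-time pooling
    feat = np.concatenate(pooled)
    return feat, float(feat @ w_out)

feat, logit = textcnn_forward(rng.normal(size=(SEQ_LEN, EMB)))
```

Max-over-time pooling is what makes the classifier insensitive to where in the field name or comment the discriminative n-gram appears.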
To adapt these models for sensitive‑field detection, the Wide & Deep architecture is modified so that textual features, after embedding, are processed by a CNN branch while other categorical features continue through fully‑connected layers; the revised architecture is shown below:
For TextCNN, the original English‑text design is altered to handle Chinese characters directly without tokenization, using character‑level embeddings and fixed‑length concatenation of multiple text features.
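Character-level handling of Chinese with fixed-length concatenation might look like the sketch below. The vocabulary, the two example text features, and the length limits are illustrative assumptions:

```python
# Sketch: each Chinese character maps directly to an embedding index
# (no word segmentation), and multiple text features are padded or
# truncated to fixed lengths and concatenated into one sequence.

PAD = 0
vocab = {ch: i + 1 for i, ch in enumerate("用户手机号身份证")}  # char -> index

def encode(text, max_len):
    """Map characters to indices, truncate to max_len, pad with PAD."""
    ids = [vocab.get(ch, PAD) for ch in text][:max_len]
    return ids + [PAD] * (max_len - len(ids))

# Two hypothetical text features (e.g. a column comment and a table
# comment) at fixed lengths, concatenated for the CNN branch.
col_comment = encode("用户手机号", max_len=8)   # "user mobile number"
tbl_comment = encode("身份证", max_len=4)       # "ID card"
sequence = col_comment + tbl_comment
```

Because the concatenated sequence always has the same total length, every sample presents the CNN with a fixed-shape input regardless of how long each comment actually is.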
During training, the class imbalance is addressed by oversampling/undersampling and cost‑sensitive learning. Data are split 70% training / 30% testing, and the training set is further split 80%/20% into training and validation subsets. Hyper‑parameters include dropout 0.5 for all parts, embedding dimension 128, the Adam optimizer (lr=0.001), and batch size 128.
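One common form of cost-sensitive learning is to up-weight the positive term of the binary cross-entropy by the inverse class frequency. The article does not state its exact weighting, so the formula and weight below are an assumption:

```python
import numpy as np

def weighted_bce(y_true, p_pred, pos_weight):
    """Binary cross-entropy with an up-weighted positive (sensitive) class."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), 1e-7, 1 - 1e-7)
    return float(np.mean(-(pos_weight * y * np.log(p)
                           + (1 - y) * np.log(1 - p))))

# With ~2% sensitive fields, a positive weight near 98/2 = 49 roughly
# balances the two classes' contribution to the loss (assumed value).
loss_plain = weighted_bce([1, 0, 0, 0], [0.1, 0.1, 0.1, 0.1], pos_weight=1.0)
loss_weighted = weighted_bce([1, 0, 0, 0], [0.1, 0.1, 0.1, 0.1], pos_weight=49.0)
```

Under-predicting the rare sensitive class becomes far more expensive, which pushes the model away from the trivial "everything is non-sensitive" solution.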
Evaluation shows an overall accuracy of about 93% on the test set, though some classes suffer due to mislabeled samples (e.g., fields labeled as "address" that do not contain address data).
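A per-class breakdown makes weaknesses like the mislabeled "address" samples visible where an overall accuracy figure hides them. The sketch below (class names and labels are made up for illustration) shows one way to compute per-class recall:

```python
import numpy as np

def per_class_recall(y_true, y_pred, classes):
    """Fraction of each class's true samples that were predicted correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    out = {}
    for c in classes:
        mask = y_true == c
        out[c] = float(np.mean(y_pred[mask] == c)) if mask.any() else 0.0
    return out

# Toy labels: one true "address" field is missed, mirroring how label
# noise in that class would drag its recall down.
y_true = ["name", "name", "address", "address", "phone", "none"]
y_pred = ["name", "name", "none",    "address", "phone", "none"]
recall = per_class_recall(y_true, y_pred, ["name", "address", "phone"])
```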
The end‑to‑end deployment workflow is illustrated in the following diagram:
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.