Sensitive Field Identification Using Wide & Deep and TextCNN Models
This article presents a machine‑learning approach for detecting sensitive data fields in a data warehouse by combining a Wide & Deep network with a TextCNN architecture, detailing data exploration, model design, training strategies, performance results, and deployment workflow.
Data governance aims to ensure high‑quality data throughout its lifecycle, with security being crucial for protecting sensitive and personal information; applying machine‑learning algorithms to identify sensitive fields can greatly improve both accuracy and efficiency.
The proposed solution implements a sensitive‑field identification algorithm based on a Wide & Deep network and TextCNN, structured into three parts: exploratory data analysis, a description of the two base models, and the final combined model.
Exploratory analysis of all tables and columns in the source layer of the data warehouse shows that only about 2% of fields are sensitive, mainly names, ID numbers, phone numbers, bank card numbers, and email addresses, revealing a severe class‑imbalance problem. Feature engineering derives additional attributes such as db_name_len, and the distributions of field length and column types are examined, as illustrated in the following figures:
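The kind of derived feature mentioned above (e.g. db_name_len) can be sketched as follows. The field schema, the helper name, and the type list are illustrative assumptions, not the article's exact implementation:

```python
# Sketch of derived features for one warehouse column; the dict keys
# (db_name, table_name, col_name, col_type) are assumed, not confirmed.

def derive_features(field):
    """Compute simple length/type features for one column's metadata."""
    return {
        "db_name_len": len(field["db_name"]),
        "table_name_len": len(field["table_name"]),
        "col_name_len": len(field["col_name"]),
        "is_string_type": int(field["col_type"].lower() in ("varchar", "char", "string")),
    }

example = {"db_name": "crm_prod", "table_name": "customer_info",
           "col_name": "mobile_phone", "col_type": "VARCHAR"}
features = derive_features(example)
```

Length features like these feed the "wide" side of the model, while the raw text of names and comments feeds the CNN branch described later.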
The Wide & Deep network, originally proposed by Google for recommendation systems, combines a linear "wide" part that handles numeric and one‑hot encoded categorical features with a deep part that learns embeddings for categorical features; the two parts are merged to predict the probability of a field being sensitive.
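The two-branch structure can be sketched as a minimal forward pass in numpy. All dimensions and the single-categorical-feature setup are toy assumptions for illustration; the article's actual feature set is richer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not the article's settings).
N_WIDE = 6           # numeric + one-hot features fed to the wide (linear) part
VOCAB, EMB = 10, 4   # categorical vocabulary and embedding size (deep part)
HIDDEN = 8

# Parameters (randomly initialized; training is omitted in this sketch).
w_wide = rng.normal(size=N_WIDE)
emb_table = rng.normal(size=(VOCAB, EMB))
W1, b1 = rng.normal(size=(EMB, HIDDEN)), np.zeros(HIDDEN)
W2, b2 = rng.normal(size=HIDDEN), 0.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def wide_deep_forward(x_wide, cat_id):
    """Wide part: linear over raw features. Deep part: embedding -> ReLU MLP.
    The two logits are summed and squashed to a sensitivity probability."""
    wide_logit = x_wide @ w_wide
    h = np.maximum(emb_table[cat_id] @ W1 + b1, 0.0)  # ReLU hidden layer
    deep_logit = h @ W2 + b2
    return sigmoid(wide_logit + deep_logit)

p = wide_deep_forward(rng.normal(size=N_WIDE), cat_id=3)
```

The key design point is the merge: memorization from the linear part and generalization from the embedding part contribute to a single logit before the sigmoid.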
TextCNN, introduced by Kim, applies convolutional neural networks to text classification; it processes sliding windows of word (or character) embeddings, applies convolution kernels, and aggregates the results to produce a classification score.
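Kim's convolution-then-max-pool pipeline can be sketched in numpy as below. Sequence length, kernel widths, and filter counts are toy assumptions, not the article's hyper-parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

SEQ_LEN, EMB = 12, 8       # toy sizes
KERNEL_SIZES = (2, 3, 4)   # sliding-window widths over the sequence
N_FILTERS = 3              # filters per kernel size

# One bank of convolution filters per kernel size.
filters = {k: rng.normal(size=(N_FILTERS, k, EMB)) for k in KERNEL_SIZES}
w_out = rng.normal(size=N_FILTERS * len(KERNEL_SIZES))

def textcnn_forward(embedded):
    """Kim-style TextCNN: convolve each filter over the embedded sequence,
    max-pool over time, concatenate, and apply a linear classifier."""
    pooled = []
    for k, bank in filters.items():
        conv = np.array([
            [np.sum(embedded[t:t + k] * f) for t in range(SEQ_LEN - k + 1)]
            for f in bank
        ])                               # shape: (N_FILTERS, positions)
        pooled.append(conv.max(axis=1))  # max-over-time pooling
    feat = np.concatenate(pooled)
    return feat, float(feat @ w_out)

feat, logit = textcnn_forward(rng.normal(size=(SEQ_LEN, EMB)))
```

Max-over-time pooling is what makes the classifier insensitive to where in the field name or comment the discriminative n-gram appears.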
To adapt these models for sensitive‑field detection, the Wide & Deep architecture is modified so that textual features, after embedding, are processed by a CNN branch while other categorical features continue through fully‑connected layers; the revised architecture is shown below:
For TextCNN, the original English‑text design is altered to handle Chinese characters directly without tokenization, using character‑level embeddings and fixed‑length concatenation of multiple text features.
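Character-level handling of Chinese with fixed-length concatenation might look like the sketch below. The vocabulary, the two example text features, and the length limits are illustrative assumptions:

```python
# Sketch: each Chinese character maps directly to an embedding index
# (no word segmentation), and multiple text features are padded or
# truncated to fixed lengths and concatenated into one sequence.

PAD = 0
vocab = {ch: i + 1 for i, ch in enumerate("用户手机号身份证")}  # char -> index

def encode(text, max_len):
    """Map characters to indices, truncate to max_len, pad with PAD."""
    ids = [vocab.get(ch, PAD) for ch in text][:max_len]
    return ids + [PAD] * (max_len - len(ids))

# Two hypothetical text features (e.g. a column comment and a table
# comment) at fixed lengths, concatenated for the CNN branch.
col_comment = encode("用户手机号", max_len=8)   # "user mobile number"
tbl_comment = encode("身份证", max_len=4)       # "ID card"
sequence = col_comment + tbl_comment
```

Because the concatenated sequence always has the same total length, every sample presents the CNN with a fixed-shape input regardless of how long each comment actually is.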
During training, the class imbalance is addressed by oversampling/undersampling and cost‑sensitive learning. Data are split 70% training / 30% testing, and the training set is further split 80%/20% into training and validation subsets. Hyper‑parameters include dropout 0.5 for all parts, embedding dimension 128, the Adam optimizer (lr=0.001), and batch size 128.
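One common form of cost-sensitive learning is to up-weight the positive term of the binary cross-entropy by the inverse class frequency. The article does not state its exact weighting, so the formula and weight below are an assumption:

```python
import numpy as np

def weighted_bce(y_true, p_pred, pos_weight):
    """Binary cross-entropy with an up-weighted positive (sensitive) class."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), 1e-7, 1 - 1e-7)
    return float(np.mean(-(pos_weight * y * np.log(p)
                           + (1 - y) * np.log(1 - p))))

# With ~2% sensitive fields, a positive weight near 98/2 = 49 roughly
# balances the two classes' contribution to the loss (assumed value).
loss_plain = weighted_bce([1, 0, 0, 0], [0.1, 0.1, 0.1, 0.1], pos_weight=1.0)
loss_weighted = weighted_bce([1, 0, 0, 0], [0.1, 0.1, 0.1, 0.1], pos_weight=49.0)
```

Under-predicting the rare sensitive class becomes far more expensive, which pushes the model away from the trivial "everything is non-sensitive" solution.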
Evaluation shows an overall accuracy of about 93% on the test set, though some classes suffer due to mislabeled samples (e.g., fields labeled as "address" that do not contain address data).
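A per-class breakdown makes weaknesses like the mislabeled "address" samples visible where an overall accuracy figure hides them. The sketch below (class names and labels are made up for illustration) shows one way to compute per-class recall:

```python
import numpy as np

def per_class_recall(y_true, y_pred, classes):
    """Fraction of each class's true samples that were predicted correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    out = {}
    for c in classes:
        mask = y_true == c
        out[c] = float(np.mean(y_pred[mask] == c)) if mask.any() else 0.0
    return out

# Toy labels: one true "address" field is missed, mirroring how label
# noise in that class would drag its recall down.
y_true = ["name", "name", "address", "address", "phone", "none"]
y_pred = ["name", "name", "none",    "address", "phone", "none"]
recall = per_class_recall(y_true, y_pred, ["name", "address", "phone"])
```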
The end‑to‑end deployment workflow is illustrated in the following diagram:
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.