De‑identification in Federated Learning: Using xID Technology to Protect Sample Intersection Information

This article explains federated learning and its horizontal and vertical variants, describes the privacy risk that sample intersection creates in financial scenarios, reviews common de‑identification methods, and introduces the xID technique, whose generation and mapping services protect intersecting sample IDs while still enabling collaborative AI modeling.

DataFunTalk

Aug 24, 2021

Since China’s Cybersecurity Law took effect in 2017, a series of regulations and standards, including the Personal Financial Information Protection Specification, the Personal Information De‑identification Guidelines, and the Data Security Law, have strengthened personal data protection. A side effect is that data stays locked in silos, which makes it harder to extract value from it.

Federated Learning (FL) is a distributed machine‑learning approach that lets multiple parties jointly train models without sharing raw data, which addresses both the privacy and the silo problem. FL originated at Google in 2016 for on‑device model updates and has since evolved into three main categories: Horizontal FL (parties share the same features but hold different users), Vertical FL (parties share users but hold different features), and Federated Transfer Learning (parties overlap little in either users or features).
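The three categories can be told apart by checking which axis two datasets overlap on. The snippet below is a minimal illustration with invented user IDs and feature names; real datasets overlap only partially, so the clean split shown here is an idealization.

```python
# Toy datasets keyed by user ID; every ID and feature name here is invented.
bank_a = {"u1": {"income": 52, "balance": 10}, "u2": {"income": 87, "balance": 35}}
bank_b = {"u3": {"income": 40, "balance": 5}, "u4": {"income": 66, "balance": 12}}
retailer = {"u1": {"monthly_orders": 3}, "u2": {"monthly_orders": 7}, "u5": {"monthly_orders": 1}}
clinic = {"u8": {"visits": 2}}


def feature_names(dataset):
    """Collect every feature name that appears in any row."""
    names = set()
    for row in dataset.values():
        names.update(row)
    return names


def fl_setting(left, right):
    """Classify a pair of datasets by what they have in common."""
    shared_users = left.keys() & right.keys()
    shared_features = feature_names(left) & feature_names(right)
    if shared_features and not shared_users:
        return "horizontal"  # same columns, different rows
    if shared_users and not shared_features:
        return "vertical"    # same rows, different columns
    if not shared_users and not shared_features:
        return "transfer"    # neither axis overlaps
    return "mixed"


print(fl_setting(bank_a, bank_b))
print(fl_setting(bank_a, retailer))
print(fl_setting(bank_a, clinic))
```

In the vertical case the shared users (here `u1` and `u2`) are exactly the sample intersection that the rest of the article is concerned with protecting.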

In vertical FL, especially in finance, institutions align samples using Private Set Intersection (PSI). While PSI hides non‑intersection IDs, intersecting IDs remain visible to both parties, creating competitive risk (e.g., one bank identifying a shared customer and offering better terms).
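To make the leak concrete, here is a toy simulation of one well-known PSI construction, Diffie-Hellman-style double blinding. It is a sketch under stated assumptions, not the protocol used by any particular FL framework: both parties run in one process, the modulus `P` is far too small and unstructured for real use, and the function names are invented for this example.

```python
import hashlib
import secrets

# Toy modulus: 2**127 - 1 is prime, but NOT a safe choice for production PSI.
P = 2**127 - 1


def hash_to_group(identifier):
    """Map an ID string to a non-zero integer modulo P."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % P or 1


def blind(values, secret):
    """Raise every value to a party's secret exponent."""
    return [pow(v, secret, P) for v in values]


def dh_psi(ids_a, ids_b):
    """Return the IDs from ids_a that party A learns are shared with B."""
    key_a = secrets.randbelow(P - 3) + 2  # A's secret exponent
    key_b = secrets.randbelow(P - 3) + 2  # B's secret exponent

    # A -> B: H(x)^a for each of A's IDs, in A's order.
    a_once = blind([hash_to_group(x) for x in ids_a], key_a)
    # B -> A: (H(x)^a)^b in the same order, plus H(y)^b for B's own IDs.
    a_twice = blind(a_once, key_b)
    b_once = blind([hash_to_group(y) for y in ids_b], key_b)
    # A finishes B's values: (H(y)^b)^a equals (H(x)^a)^b whenever x == y.
    b_twice = set(blind(b_once, key_a))

    return [x for x, tag in zip(ids_a, a_twice) if tag in b_twice]


shared = dh_psi(["alice", "bob", "carol"], ["carol", "dave", "alice"])
print(shared)
```

B's non-shared IDs stay hidden behind B's secret exponent, but A ends up holding the shared IDs in plaintext. That residual knowledge is the exposure described above.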

The article then reviews the de‑identification techniques described in Chinese standards: statistical methods, cryptographic encryption, suppression, pseudonymization, generalization, randomization, and data synthesis. For financial data, generalization is the most applicable, because statistical and randomization methods may degrade predictive accuracy.
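As a small illustration of generalization, the sketch below replaces exact values with coarser bands. The band width, cut points, and field names are arbitrary choices made for this example and do not come from any standard.

```python
from bisect import bisect_right


def generalize_age(age, width=10):
    """Replace an exact age with a fixed-width band, e.g. 37 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_amount(amount, cut_points=(5_000, 10_000, 20_000)):
    """Replace an exact amount with the bracket it falls into."""
    idx = bisect_right(cut_points, amount)
    if idx == 0:
        return f"<{cut_points[0]}"
    if idx == len(cut_points):
        return f">={cut_points[-1]}"
    return f"[{cut_points[idx - 1]}, {cut_points[idx]})"


def generalize_record(record):
    """Return a copy of the record with quasi-identifiers coarsened."""
    return {
        "age_band": generalize_age(record["age"]),
        "income_band": generalize_amount(record["monthly_income"]),
    }


print(generalize_record({"age": 37, "monthly_income": 7_500}))
```

The banded values keep their ordering, so a model can still learn from them, while several individuals now share each band.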

To overcome intersection leakage, the xID de‑identification solution, co‑developed by the Shanghai Data Exchange and the Ministry of Public Security’s Third Research Institute, irreversibly maps direct identifiers (e.g., ID numbers, phone numbers) to institution‑specific xID labels, so the same person receives a different label at each institution. The process consists of two services:

Generation Service: raw IDs are processed via the xID‑SDK and cloud or edge services to produce an institution‑specific xID label.

Mapping Service: an xID label from Institution A is transformed into the corresponding label of Institution B, enabling secure intersection matching.

Typical usage: Institution A sends its xID‑label(A) to Institution B; B uses the Mapping Service to convert it to xID‑label(B) and looks that label up in its own user base. A successful match returns the desired user profile without either party learning the other's raw identifiers.
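The article does not describe how xID labels are computed internally, so the following is only a hypothetical stand-in that mimics the externally visible behavior of the two services. It assumes a central service that holds per-institution HMAC keys and remembers which internal token each issued label came from; the class and method names are invented and are not the xID‑SDK API.

```python
import hashlib
import hmac
import secrets


class ToyLabelService:
    """Hypothetical stand-in for the generation and mapping services."""

    def __init__(self):
        self._master_key = secrets.token_bytes(32)
        self._institution_keys = {}
        self._root_by_label = {}  # (institution, label) -> internal root token

    def register(self, institution):
        self._institution_keys[institution] = secrets.token_bytes(32)

    def _label(self, institution, root):
        key = self._institution_keys[institution]
        return hmac.new(key, root, hashlib.sha256).hexdigest()

    def generate(self, institution, raw_id):
        """Generation: raw ID -> institution-specific, one-way label."""
        root = hmac.new(self._master_key, raw_id.encode("utf-8"), hashlib.sha256).digest()
        label = self._label(institution, root)
        self._root_by_label[(institution, label)] = root
        return label

    def map_label(self, source, target, label):
        """Mapping: a label issued to `source` -> the matching `target` label."""
        root = self._root_by_label[(source, label)]
        return self._label(target, root)


service = ToyLabelService()
service.register("A")
service.register("B")

label_a = service.generate("A", "raw-id-001")
label_b = service.generate("B", "raw-id-001")

# Institution B's user base is keyed by its own labels only.
users_at_b = {label_b: {"segment": "gold"}}

# A sends label_a; B maps it to its own label space and looks it up.
profile = users_at_b.get(service.map_label("A", "B", label_a))
print(label_a != label_b, profile)
```

Because the two labels for the same person differ, neither institution can join its data with the other's without going through the mapping step, which is the property the typical-usage flow relies on.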

Protecting sample intersection with xID involves three steps: (1) each institution replaces direct identifiers with its own xID labels (xID‑A or xID‑B) using the SDK; (2) each institution transmits its labels to a trusted third‑party server, which converts them to a common xID‑C and computes the intersection there; (3) the server returns the mapping for the intersecting xID‑C labels to each institution, which then replaces its local xID‑A/B with xID‑C and re‑orders its data so that nothing can be inferred from the original ordering.
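The three steps can be sketched end to end as follows. This is a toy simulation, not the xID implementation: the labels are plain HMACs, the server is a local object that sees raw IDs when issuing labels (a simplification standing in for the SDK and generation service), and sorting by xID‑C is an assumed way to re‑order rows so that both sides stay aligned.

```python
import hashlib
import hmac
import secrets


def toy_label(key, raw_id):
    """Keyed one-way label; a stand-in for whatever xID really computes."""
    return hmac.new(key, raw_id.encode("utf-8"), hashlib.sha256).hexdigest()


class ToyTrustedServer:
    """Hypothetical third party that can convert xID-A/xID-B to xID-C."""

    def __init__(self):
        self._keys = {name: secrets.token_bytes(32) for name in ("A", "B", "C")}
        self._to_common = {}  # (institution, label) -> xID-C

    def issue(self, institution, raw_id):
        # Step 1: hand out an institution-specific label.
        label = toy_label(self._keys[institution], raw_id)
        self._to_common[(institution, label)] = toy_label(self._keys["C"], raw_id)
        return label

    def intersect(self, labels_a, labels_b):
        # Step 2: convert both label sets to xID-C and intersect them.
        map_a = {lab: self._to_common[("A", lab)] for lab in labels_a}
        map_b = {lab: self._to_common[("B", lab)] for lab in labels_b}
        shared = set(map_a.values()) & set(map_b.values())
        # Step 3: return only the mappings for intersecting samples.
        keep_a = {lab: c for lab, c in map_a.items() if c in shared}
        keep_b = {lab: c for lab, c in map_b.items() if c in shared}
        return keep_a, keep_b


def realign(rows, mapping):
    """Swap local labels for xID-C and sort by xID-C to hide the old order."""
    relabeled = [(mapping[lab], feats) for lab, feats in rows.items() if lab in mapping]
    return sorted(relabeled, key=lambda pair: pair[0])


server = ToyTrustedServer()
rows_a = {server.issue("A", uid): {"balance": bal}
          for uid, bal in [("id-1", 10), ("id-2", 35), ("id-3", 7)]}
rows_b = {server.issue("B", uid): {"orders": n}
          for uid, n in [("id-3", 4), ("id-9", 1), ("id-1", 2)]}

keep_a, keep_b = server.intersect(list(rows_a), list(rows_b))
aligned_a = realign(rows_a, keep_a)
aligned_b = realign(rows_b, keep_b)
print(len(aligned_a), [c for c, _ in aligned_a] == [c for c, _ in aligned_b])
```

After re-alignment both institutions hold the same xID‑C sequence for the shared samples, so vertical FL training can proceed row by row, while the rows no longer carry either institution's own labels or original ordering.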

Future work includes adding a feature to the xID‑SDK that evaluates whether sample features meet the Level‑3 personal information identification standard before processing.

Finally, the article thanks the audience and invites readers to join the DataFunTalk community for further AI and big‑data discussions.
