Introduction to Federated Learning: Concepts, Key Technologies, and the Dianshi Federated Learning Platform
This article introduces the concept of federated learning, outlines its industry opportunities and challenges, explains the evolution of data‑sharing technologies, details core techniques such as MPC, TEE, and differential privacy, and presents the architecture and capabilities of the Dianshi federated learning platform.
Reading Guide: With the rapid development of information technology, society has entered the big-data era. IDC predicts that global data will reach 163 ZB by 2025, roughly ten times today's volume, yet 98% of enterprise data remains locked in isolated silos, preventing its value from being fully realized. As data volumes grow, enabling data to flow securely between organizations becomes especially critical.
Today's presentation will cover three points:
Federated learning concept
Key technologies of federated learning
Dianshi federated learning platform
Speaker: Peng Shengbo, Senior R&D Engineer, Baidu Dianshi (Security Department)
Editor: Sun Yinggang, Harbin Institute of Technology
Produced by: DataFun
01
Federated Learning Concept
First, we share the concept of federated learning.
1. Industry Opportunities and Challenges
In recent years, data security and privacy protection have become focal points for governments, companies, and the public. Regulations are continuously improving, enforcement is strengthening, and typical cases are increasing.
The big‑data era brings several opportunities and challenges:
Rapid growth of the big‑data industry, surpassing a trillion‑yuan scale.
Illegal data trading and privacy leaks.
Fast development of privacy‑protection technologies.
Emergence of international policies and regulations.
Domestic policy releases.
Companies establishing data‑security standards.
In a typical marketing scenario, multiple parties need to combine data for modeling. First‑party data are the company's own consumer records; second‑party data come directly from audiences (e.g., ad platform click‑through data); third‑party data are provided by external platforms to enrich the first‑party data. Because third‑party data reside in different enterprises, sharing them is difficult.
2. Evolution of Data‑Sharing Technologies
Data‑sharing technologies can be divided into three generations:
First generation: Encrypted USB drives used by designated personnel, often in banks or financial institutions, with data desensitization, anonymization, or hashing.
Second generation: Secure sandbox isolation. Data providers place data in a sandbox; requesters apply for access and use the data for modeling within the sandbox.
Third generation: Decentralized “federated computation,” the current mainstream approach.
3. Federated Learning Scenarios Overview
Two classic examples: the "millionaires' problem" posed by Prof. Andrew Yao in 1982, in which two wealthy individuals determine who is richer without revealing their exact fortunes (the problem that motivated secure multi-party computation), and private set intersection (PSI), in which two phones find their common contacts without exposing any other entries.
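The PSI idea can be sketched with a deliberately naive hash-and-compare exchange. This is illustrative only: the contact names are hypothetical, and real PSI protocols use cryptographic tools such as oblivious PRFs to avoid the leakage noted in the comments.

```python
import hashlib

def h(x: str) -> str:
    # Hash each contact identifier before sending it to the other party.
    return hashlib.sha256(x.encode()).hexdigest()

def naive_psi(my_contacts, their_hashes):
    """Intersect my plaintext contacts with the other phone's hashes.

    This naive variant leaks hashes of non-common contacts, which can be
    brute-forced over small identifier spaces (e.g., phone numbers);
    production PSI protocols are designed to avoid exactly this leakage.
    """
    return {c for c in my_contacts if h(c) in their_hashes}

alice = {"carol", "dave", "erin"}
bob = {"dave", "erin", "frank"}
common = naive_psi(alice, {h(c) for c in bob})
print(sorted(common))  # ['dave', 'erin']
```

Each side learns only the intersection (plus, in this toy version, the hashes), which is the behavior the two-phone contact example describes.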
02
Key Technologies of Federated Learning
The Dianshi platform’s key technologies include:
Secure Multi-Party Computation (MPC)
Trusted Execution Environment (TEE)
Differential Privacy (DP)
Federated Learning (FL)
1. Secure Multi-Party Computation (MPC)
MPC supports secure addition and multiplication over secret-shared data. For addition, each party splits its value into random additive shares, sends one share to the other party, sums the shares it holds locally, and the partial sums are then combined to reveal only the total. Multiplication relies on pre-generated Beaver triples (a, b, c) with c = a·b: the parties open their inputs masked by the triple shares, compute local products on those shares, and a final reconstruction step yields the product without exposing either input.
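The sharing, addition, and Beaver-triple steps above can be sketched for two parties as follows. This is a single-process simulation under simplifying assumptions (the triple comes from an implicit trusted dealer, and "opening" a value is just a function call), not a networked implementation.

```python
import random

P = 2**61 - 1  # all arithmetic is done modulo a prime

def share(x):
    """Split secret x into two additive shares that sum to x mod P."""
    r = random.randrange(P)
    return r, (x - r) % P

def reveal(s0, s1):
    """Recombine two shares to recover the secret."""
    return (s0 + s1) % P

# Secure addition: each party simply adds the shares it holds locally.
x0, x1 = share(12)
y0, y1 = share(30)
assert reveal((x0 + y0) % P, (x1 + y1) % P) == 42

# Secure multiplication with a pre-generated Beaver triple (a, b, c), c = a*b.
a, b = random.randrange(P), random.randrange(P)
c = (a * b) % P
a0, a1 = share(a); b0, b1 = share(b); c0, c1 = share(c)

x0, x1 = share(6)
y0, y1 = share(7)
# Both parties jointly open the masked values d = x - a and e = y - b;
# these reveal nothing about x or y because a and b are uniformly random.
d = reveal((x0 - a0) % P, (x1 - a1) % P)
e = reveal((y0 - b0) % P, (y1 - b1) % P)
# Each party computes a local share of x*y; the public term d*e is added once.
z0 = (c0 + d * b0 + e * a0 + d * e) % P
z1 = (c1 + d * b1 + e * a1) % P
assert reveal(z0, z1) == 6 * 7  # reconstruction yields the product
```

The multiplication identity being used is x·y = (d + a)(e + b) = d·e + d·b + e·a + c, which each party can evaluate on its own shares of a, b, and c.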
2. Trusted Execution Environment (TEE)
TEE is a hardware-based solution that provides a secure enclave for applications. It offers two assurance levels: resistance to software-only attacks, and resistance to combined software-and-hardware attacks. Its advantages include support for complex algorithms, high computational efficiency (roughly 3-4x overhead versus plaintext, far lower than MPC), and strong resistance to malicious attacks.
Typical TEE use cases include protecting sensitive data such as passwords, facial-recognition templates, and mobile-payment credentials. Major TEE implementations include Intel SGX, Arm TrustZone, Hygon SEV, RISC-V Keystone, and Baidu's open-source MesaTEE.
3. Differential Privacy (DP)
DP adds controlled noise to query results to protect individual records. It can be interactive (noise added to each query answer) or non-interactive (publishing a noisy dataset once). A typical example: given a hospital's outpatient records, DP prevents an attacker from inferring a specific patient's diagnosis even when armed with auxiliary knowledge.
Key DP concept: adjacent datasets differ in at most one record. An algorithm is differentially private if its output distributions on any pair of adjacent datasets are nearly indistinguishable, so the presence or absence of a single individual cannot be inferred from the output.
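The standard mechanism for the hospital example is Laplace noise. The sketch below uses hypothetical outpatient records; a counting query has sensitivity 1 (adjacent datasets change the count by at most 1), so Laplace noise with scale 1/epsilon gives epsilon-differential privacy.

```python
import random

def laplace_noise(scale: float) -> float:
    # Laplace(0, scale) sampled as the difference of two exponentials.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon: float) -> float:
    """Answer a counting query with epsilon-differential privacy.

    A count has sensitivity 1: adding or removing one record changes it
    by at most 1, so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical outpatient records: (patient, diagnosis).
patients = [("alice", "flu"), ("bob", "cold"), ("carol", "flu")]
noisy = private_count(patients, lambda r: r[1] == "flu", epsilon=0.5)
# The true count is 2; each released value fluctuates around it, so an
# attacker cannot tell whether any one patient's record is present.
```

Smaller epsilon means more noise and stronger privacy; the interactive mode described above corresponds to calling `private_count` per query, the non-interactive mode to noising the dataset once before publication.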
4. Federated Learning (FL)
FL is a mainstream privacy-preserving technique comprising three settings: horizontal FL, vertical FL, and federated transfer learning. Horizontal FL addresses sample scarcity by aggregating model updates from many edge devices that share the same feature space (e.g., input-method keyboards on phones). Vertical FL addresses feature scarcity by jointly modeling data held by different companies that serve a common user base but hold different attributes.
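A minimal horizontal-FL sketch follows, using a hypothetical one-parameter linear model and made-up client datasets with FedAvg-style size-weighted averaging. Real systems add secure aggregation, client sampling, and far larger models; this only illustrates the "train locally, share parameters" loop.

```python
def local_sgd(w, data, lr=0.01, epochs=1):
    """One client's local training for a 1-D linear model y = w * x.

    Only the updated parameter leaves the device; the raw (x, y) pairs
    stay local, which is the core privacy property of horizontal FL.
    """
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # d/dw of the squared error
            w -= lr * grad
    return w

def fed_avg_round(global_w, client_datasets):
    """One FedAvg round: local training, then a size-weighted average."""
    updates = [(local_sgd(global_w, d), len(d)) for d in client_datasets]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

# Three hypothetical clients whose private data all follow y = 3 * x.
clients = [
    [(1, 3), (2, 6)],
    [(3, 9)],
    [(1, 3), (4, 12)],
]
w = 0.0
for _ in range(50):
    w = fed_avg_round(w, clients)
# After enough rounds, w converges to the underlying slope, 3.
```

Vertical FL differs in that parties hold different features for the same users, so the joint model is split across parties rather than replicated on each.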
03
Dianshi Federated Learning Platform
1. Platform Overview
The platform lowers the entry barrier for MPC, TEE, DP, and FL by providing a distributed architecture that supports data verification, joint analysis, joint modeling, and data desensitization.
Supports multiple participants, reducing the difficulty of using advanced privacy technologies.
Optimized engineering enables concurrent computation; the platform can handle 1 billion × 1 billion private set intersections within an hour, and supports federated training on tens of millions of features.
Offers a DSL‑based task configuration that is programmable and extensible.
Supports both private‑cloud and SaaS deployment, allowing flexible on‑premise installation.
2. Platform Design
The architecture is organized into layers, including:
Runtime layer: Supports physical machines, Docker, Kubernetes, and Slurm.
Communication layer: Uses TLS/HTTPS to secure inter‑party and coordinator‑party communications.
Scheduling layer: Coordinates computation tasks among participants.
Component layer: Plug‑in architecture for algorithms; leverages Spark and Calcite for heterogeneous data access.
3. Computing Contracts (DSL)
The DSL workflow includes writing, authorizing, and executing contracts. Data sets are imported during DSL authoring, and authorization mechanisms allow flexible data access.
4. Algorithm Support
The platform supports a wide range of algorithms, including data verification, federated learning, MPC, and feature engineering, applicable to scenarios such as blacklist sharing, multi‑head loan verification, and governmental queries.
5. Extensibility
The distributed architecture reduces ciphertext computation and communication overhead, enabling scalable privacy‑preserving set intersection and other tasks.
6. Flexible Deployment
The platform can be deployed as SaaS or on‑premises. Coordinators run in public clouds (e.g., Baidu Cloud), while compute nodes are privately deployed but maintain secure network connections to the coordinator.
7. Certifications and Industry Standards
Over recent years, the Dianshi team has earned multiple industry certifications, contributed to dozens of national and international standards, and released several open‑source reference implementations.
Thank you for listening.