Introduction to Federated Learning: Concepts, Key Technologies, and the Dianshi Federated Learning Platform
This article introduces the concept of federated learning, outlines its industry opportunities and challenges, explains the evolution of data‑sharing technologies, details core techniques such as MPC, TEE, and differential privacy, and presents the architecture and capabilities of the Dianshi federated learning platform.
Reading Guide: With the rapid development of information technology, society has entered the big-data era. IDC predicts that global data will reach 163 ZB by 2025, roughly ten times today's volume, yet 98% of enterprise data remains locked in isolated silos, preventing its value from being fully realized. As data volumes grow, enabling data to flow securely between organizations becomes especially critical.
Today's presentation will cover three points:
Federated learning concept
Key technologies of federated learning
Dianshi federated learning platform
Speaker: Peng Shengbo, Senior R&D Engineer, Baidu Dianshi (Security Department)
Editor: Sun Yinggang, Harbin Institute of Technology
Produced by: DataFun
01
Federated Learning Concept
First, we share the concept of federated learning.
1. Industry Opportunities and Challenges
In recent years, data security and privacy protection have become focal points for governments, companies, and the public. Regulations are continuously improving, enforcement is strengthening, and typical cases are increasing.
The big‑data era brings several opportunities and challenges:
Rapid growth of the big‑data industry, surpassing a trillion‑yuan scale.
Illegal data trading and privacy leaks.
Fast development of privacy‑protection technologies.
Emergence of international policies and regulations.
Domestic policy releases.
Companies establishing data‑security standards.
In a typical marketing scenario, multiple parties need to combine data for modeling. First‑party data are the company's own consumer records; second‑party data come directly from audiences (e.g., ad platform click‑through data); third‑party data are provided by external platforms to enrich the first‑party data. Because third‑party data reside in different enterprises, sharing them is difficult.
2. Evolution of Data‑Sharing Technologies
Data‑sharing technologies can be divided into three generations:
First generation: Encrypted USB drives used by designated personnel, often in banks or financial institutions, with data desensitization, anonymization, or hashing.
Second generation: Secure sandbox isolation. Data providers place data in a sandbox; requesters apply for access and use the data for modeling within the sandbox.
Third generation: Decentralized “federated computation,” the current mainstream approach.
3. Federated Learning Scenarios Overview
Two classic examples: the "millionaires' problem" posed by Prof. Andrew Yao in 1982, in which two wealthy individuals determine who is richer without revealing their exact fortunes (the problem that motivated secure multi-party computation), and private set intersection (PSI), in which two phones find their common contacts without exposing any other entries.
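The PSI idea can be sketched with a deliberately naive hash-and-compare exchange. This is illustrative only: the contact names are hypothetical, and real PSI protocols use cryptographic tools such as oblivious PRFs to avoid the leakage noted in the comments.

```python
import hashlib

def h(x: str) -> str:
    # Hash each contact identifier before sending it to the other party.
    return hashlib.sha256(x.encode()).hexdigest()

def naive_psi(my_contacts, their_hashes):
    """Intersect my plaintext contacts with the other phone's hashes.

    This naive variant leaks hashes of non-common contacts, which can be
    brute-forced over small identifier spaces (e.g., phone numbers);
    production PSI protocols are designed to avoid exactly this leakage.
    """
    return {c for c in my_contacts if h(c) in their_hashes}

alice = {"carol", "dave", "erin"}
bob = {"dave", "erin", "frank"}
common = naive_psi(alice, {h(c) for c in bob})
print(sorted(common))  # ['dave', 'erin']
```

Each side learns only the intersection (plus, in this toy version, the hashes), which is the behavior the two-phone contact example describes.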
02
Key Technologies of Federated Learning
The Dianshi platform’s key technologies include:
Secure Multi-Party Computation (MPC)
Trusted Execution Environment (TEE)
Differential Privacy (DP)
Federated Learning (FL)
1. Secure Multi-Party Computation (MPC)
MPC supports secure addition and multiplication over secret-shared data. For addition, each party splits its value into random additive shares, sends one share to the other party, sums the shares it holds locally, and the partial sums are then combined to reveal only the total. Multiplication relies on pre-generated Beaver triples (a, b, c) with c = a·b: the parties open their inputs masked by the triple shares, compute local products on those shares, and a final reconstruction step yields the product without exposing either input.
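The sharing, addition, and Beaver-triple steps above can be sketched for two parties as follows. This is a single-process simulation under simplifying assumptions (the triple comes from an implicit trusted dealer, and "opening" a value is just a function call), not a networked implementation.

```python
import random

P = 2**61 - 1  # all arithmetic is done modulo a prime

def share(x):
    """Split secret x into two additive shares that sum to x mod P."""
    r = random.randrange(P)
    return r, (x - r) % P

def reveal(s0, s1):
    """Recombine two shares to recover the secret."""
    return (s0 + s1) % P

# Secure addition: each party simply adds the shares it holds locally.
x0, x1 = share(12)
y0, y1 = share(30)
assert reveal((x0 + y0) % P, (x1 + y1) % P) == 42

# Secure multiplication with a pre-generated Beaver triple (a, b, c), c = a*b.
a, b = random.randrange(P), random.randrange(P)
c = (a * b) % P
a0, a1 = share(a); b0, b1 = share(b); c0, c1 = share(c)

x0, x1 = share(6)
y0, y1 = share(7)
# Both parties jointly open the masked values d = x - a and e = y - b;
# these reveal nothing about x or y because a and b are uniformly random.
d = reveal((x0 - a0) % P, (x1 - a1) % P)
e = reveal((y0 - b0) % P, (y1 - b1) % P)
# Each party computes a local share of x*y; the public term d*e is added once.
z0 = (c0 + d * b0 + e * a0 + d * e) % P
z1 = (c1 + d * b1 + e * a1) % P
assert reveal(z0, z1) == 6 * 7  # reconstruction yields the product
```

The multiplication identity being used is x·y = (d + a)(e + b) = d·e + d·b + e·a + c, which each party can evaluate on its own shares of a, b, and c.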
2. Trusted Execution Environment (TEE)
TEE is a hardware-based solution that provides a secure enclave for applications. It offers two assurance levels: resistance to software-only attacks, and resistance to combined software-and-hardware attacks. Its advantages include support for complex algorithms, high computational efficiency (roughly 3-4x overhead versus plaintext, far lower than MPC), and strong resistance to malicious attacks.
Typical TEE use cases include protecting sensitive data such as passwords, facial-recognition templates, and mobile-payment credentials. Major TEE implementations include Intel SGX, Arm TrustZone, Hygon SEV, RISC-V Keystone, and Baidu's open-source MesaTEE.
3. Differential Privacy (DP)
DP adds controlled noise to query results to protect individual records. It can be interactive (noise added to each query answer) or non-interactive (publishing a noisy dataset once). A typical example: given a hospital's outpatient records, DP prevents an attacker from inferring a specific patient's diagnosis even when armed with auxiliary knowledge.
Key DP concept: adjacent datasets differ in at most one record. An algorithm is differentially private if its output distributions on any pair of adjacent datasets are nearly indistinguishable, so the presence or absence of a single individual cannot be inferred from the output.
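The standard mechanism for the hospital example is Laplace noise. The sketch below uses hypothetical outpatient records; a counting query has sensitivity 1 (adjacent datasets change the count by at most 1), so Laplace noise with scale 1/epsilon gives epsilon-differential privacy.

```python
import random

def laplace_noise(scale: float) -> float:
    # Laplace(0, scale) sampled as the difference of two exponentials.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon: float) -> float:
    """Answer a counting query with epsilon-differential privacy.

    A count has sensitivity 1: adding or removing one record changes it
    by at most 1, so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical outpatient records: (patient, diagnosis).
patients = [("alice", "flu"), ("bob", "cold"), ("carol", "flu")]
noisy = private_count(patients, lambda r: r[1] == "flu", epsilon=0.5)
# The true count is 2; each released value fluctuates around it, so an
# attacker cannot tell whether any one patient's record is present.
```

Smaller epsilon means more noise and stronger privacy; the interactive mode described above corresponds to calling `private_count` per query, the non-interactive mode to noising the dataset once before publication.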
4. Federated Learning (FL)
FL is a mainstream privacy-preserving technique comprising three settings: horizontal FL, vertical FL, and federated transfer learning. Horizontal FL addresses sample scarcity by aggregating model updates from many edge devices that share the same feature space (e.g., input-method keyboards on phones). Vertical FL addresses feature scarcity by jointly modeling data held by different companies that serve a common user base but hold different attributes.
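A minimal horizontal-FL sketch follows, using a hypothetical one-parameter linear model and made-up client datasets with FedAvg-style size-weighted averaging. Real systems add secure aggregation, client sampling, and far larger models; this only illustrates the "train locally, share parameters" loop.

```python
def local_sgd(w, data, lr=0.01, epochs=1):
    """One client's local training for a 1-D linear model y = w * x.

    Only the updated parameter leaves the device; the raw (x, y) pairs
    stay local, which is the core privacy property of horizontal FL.
    """
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # d/dw of the squared error
            w -= lr * grad
    return w

def fed_avg_round(global_w, client_datasets):
    """One FedAvg round: local training, then a size-weighted average."""
    updates = [(local_sgd(global_w, d), len(d)) for d in client_datasets]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

# Three hypothetical clients whose private data all follow y = 3 * x.
clients = [
    [(1, 3), (2, 6)],
    [(3, 9)],
    [(1, 3), (4, 12)],
]
w = 0.0
for _ in range(50):
    w = fed_avg_round(w, clients)
# After enough rounds, w converges to the underlying slope, 3.
```

Vertical FL differs in that parties hold different features for the same users, so the joint model is split across parties rather than replicated on each.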
03
Dianshi Federated Learning Platform
1. Platform Overview
The platform lowers the entry barrier for MPC, TEE, DP, and FL by providing a distributed architecture that supports data verification, joint analysis, joint modeling, and data desensitization.
Supports multiple participants, reducing the difficulty of using advanced privacy technologies.
Optimized engineering enables concurrent computation; the platform can handle 1 billion × 1 billion private set intersections within an hour, and supports federated training on tens of millions of features.
Offers a DSL‑based task configuration that is programmable and extensible.
Supports both private‑cloud and SaaS deployment, allowing flexible on‑premise installation.
2. Platform Design
The architecture is organized into layers, including:
Runtime layer: Supports physical machines, Docker, Kubernetes, and Slurm.
Communication layer: Uses TLS/HTTPS to secure inter‑party and coordinator‑party communications.
Scheduling layer: Coordinates computation tasks among participants.
Component layer: Plug‑in architecture for algorithms; leverages Spark and Calcite for heterogeneous data access.
3. Computing Contracts (DSL)
The DSL workflow includes writing, authorizing, and executing contracts. Data sets are imported during DSL authoring, and authorization mechanisms allow flexible data access.
4. Algorithm Support
The platform supports a wide range of algorithms, including data verification, federated learning, MPC, and feature engineering, applicable to scenarios such as blacklist sharing, multi‑head loan verification, and governmental queries.
5. Extensibility
The distributed architecture reduces ciphertext computation and communication overhead, enabling scalable privacy‑preserving set intersection and other tasks.
6. Flexible Deployment
The platform can be deployed as SaaS or on‑premises. Coordinators run in public clouds (e.g., Baidu Cloud), while compute nodes are privately deployed but maintain secure network connections to the coordinator.
7. Certifications and Industry Standards
Over recent years, the Dianshi team has earned multiple industry certifications, contributed to dozens of national and international standards, and released several open‑source reference implementations.
Thank you for listening.