Nimbus: Secure and Efficient Two‑Party Inference for Transformers
The paper introduces Nimbus, a two‑party privacy‑preserving inference framework for Transformer models that leverages a client‑side outer‑product linear‑layer protocol and distribution‑aware polynomial approximations for non‑linear layers, achieving up to five‑fold speedups with negligible accuracy loss.
NeurIPS 2024 featured the paper "Nimbus: Secure and Efficient Two‑Party Inference for Transformers" by the Ant Group MiSuan (secure computation) team and Shanghai Jiao Tong University, which addresses the privacy challenges of Machine Learning as a Service (MLaaS) inference for large Transformer models.
Background. In the MLaaS setting, a model owner (server) holds a private neural‑network model while a client provides input data. Existing two‑party secure inference schemes combine homomorphic encryption (HE) and multi‑party computation (MPC), but they incur high communication and computation costs, especially for the massive matrix multiplications and non‑linear activations in Transformers.
Linear‑layer protocol. Nimbus replaces the traditional server‑side inner‑product (SIP) protocol with a client‑side outer‑product approach. The static model weights are encrypted once in an offline phase and stored on the client, so during inference the client can multiply its secret‑shared activations with the encrypted weights locally, eliminating input‑ciphertext communication. The outer‑product formulation also lowers the computational complexity of the homomorphic multiplication from quadratic to linear. Additionally, Nimbus compresses output ciphertexts via a "right‑shift" operation, achieving near‑100% ciphertext utilization.
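The outer‑product reformulation can be illustrated in plaintext. The sketch below (using NumPy over a 32‑bit ring; the ring size and sharing scheme are illustrative assumptions, and real Nimbus operates on HE ciphertexts rather than plain arrays) shows how X·W decomposes into a sum of column‑by‑row outer products, and how additive secret shares of the activation recombine to the true product:

```python
import numpy as np

RING = 1 << 32  # shares live in a 32-bit ring (illustrative choice)

def share(x, rng):
    """Additively secret-share an integer matrix over the ring."""
    r = rng.integers(0, RING, size=x.shape, dtype=np.uint64)
    return r, (x - r) % RING

def outer_product_matmul(X, W):
    """Compute X @ W as a sum of column-by-row outer products.

    This is the access pattern the outer-product protocol exploits:
    each column of the activation X multiplies one full row of the
    (pre-encrypted) weight matrix W, so each weight row is touched
    once instead of being re-read for every inner product.
    """
    m, k = X.shape
    acc = np.zeros((m, W.shape[1]), dtype=np.uint64)
    for i in range(k):
        acc = (acc + np.outer(X[:, i], W[i, :])) % RING
    return acc

rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(4, 6), dtype=np.uint64)   # activation
W = rng.integers(0, 100, size=(6, 5), dtype=np.uint64)   # weights

X0, X1 = share(X, rng)  # activation secret-shared between the parties
# Each party multiplies its share by the weights; the results
# recombine additively to the true product modulo the ring.
Y = (outer_product_matmul(X0, W) + outer_product_matmul(X1, W)) % RING
assert np.array_equal(Y, (X @ W) % RING)
```

Because the multiplication is linear in the shares, each party can process its share independently and the protocol only needs to recombine the outputs.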
Non‑linear‑layer acceleration. For activation functions such as GELU and exponential, Nimbus introduces a distribution‑aware piecewise‑polynomial approximation. It exploits the empirical input distribution of Transformer activations (e.g., 80 % of exponential inputs lie in [‑5, 0]) to allocate fewer, lower‑degree polynomial segments where data are scarce, thereby reducing both the number of polynomial terms and the required comparison rounds. The framework also fuses ring‑upgrading with truncation to eliminate extra communication.
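The distribution‑aware idea can be sketched with a plain least‑squares fit: spend fine, low‑degree segments where softmax inputs concentrate (near 0) and cover the sparse tail with a single coarse piece. The breakpoints, degree, and ranges below are illustrative assumptions, not the paper's exact segmentation:

```python
import numpy as np

# Illustrative piecewise fit of exp(x): most exponential inputs lie in
# [-5, 0], so two segments cover that dense region and one coarse
# segment handles the sparse tail below -5.
breaks = [-16.0, -5.0, -2.0, 0.0]
degree = 3

# Fit one degree-3 polynomial per segment by least squares.
coeffs = []
for lo, hi in zip(breaks[:-1], breaks[1:]):
    xs = np.linspace(lo, hi, 200)
    coeffs.append(np.polyfit(xs, np.exp(xs), degree))

def approx_exp(x):
    """Evaluate the piecewise polynomial; each comparison against a
    breakpoint corresponds to one secure-comparison round in MPC,
    so fewer segments means fewer rounds."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for (lo, hi), c in zip(zip(breaks[:-1], breaks[1:]), coeffs):
        mask = (x >= lo) & (x <= hi)
        out[mask] = np.polyval(c, x[mask])
    return out

xs = np.linspace(-5.0, 0.0, 1000)
err = np.max(np.abs(approx_exp(xs) - np.exp(xs)))
# err stays small on the dense region even with only three segments
```

In the secure setting, each additional segment costs comparison rounds and each additional polynomial term costs multiplications, so concentrating segments where inputs actually fall buys accuracy where it matters.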
Experimental results. Benchmarks on BERT‑base (input length 128) show that Nimbus achieves up to a 5× end‑to‑end speedup over BumbleBee under LAN and 3× under WAN, with linear‑layer gains of ~10× and non‑linear‑layer gains of ~3–4×. Accuracy tests on the GLUE benchmark show an average loss of only 0.57% without fine‑tuning and 0.07% after fine‑tuning.
Conclusion. Nimbus provides an efficient, privacy‑preserving two‑party inference solution for Transformers by redesigning linear‑layer multiplication and applying distribution‑aware polynomial approximations to non‑linear layers, substantially reducing communication and computation while maintaining model accuracy.