
Nimbus: Secure and Efficient Two‑Party Inference for Transformers

The paper introduces Nimbus, a two‑party privacy‑preserving inference framework for Transformer models that leverages a client‑side outer‑product linear‑layer protocol and distribution‑aware polynomial approximations for non‑linear layers, achieving up to five‑fold speedups with negligible accuracy loss.

AntTech

NeurIPS 2024 featured the paper "Nimbus: Secure and Efficient Two‑Party Inference for Transformers" by Ant Group's Mìsuàn (secure computation) team and Shanghai Jiao Tong University, which addresses the privacy challenges of Machine Learning as a Service (MLaaS) inference for large Transformer models.

Background. In the MLaaS setting, a model owner (server) holds a private neural‑network model while a client provides input data. Existing two‑party secure inference schemes combine homomorphic encryption (HE) and multi‑party computation (MPC), but they incur high communication and computation costs, especially for the massive matrix multiplications and non‑linear activations in Transformers.

Linear‑layer protocol. Nimbus replaces the traditional server‑side inner‑product (SIP) protocol with a client‑side outer‑product approach. The server encrypts its static model parameters once, and the client stores these ciphertexts locally; the client can then multiply its secret‑shared activations with the encrypted parameters entirely on its own side, eliminating input communication during online inference. The outer‑product formulation also lowers the computational complexity of the homomorphic multiplication from quadratic to linear. Additionally, Nimbus compresses output ciphertexts via a "right‑shift" operation, achieving near‑100% ciphertext utilization.
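The structural idea behind the outer‑product protocol can be seen in plaintext: a matrix product X·W is a sum of outer products, so each (encrypted) weight row is only ever scaled by one share value rather than being combined element‑by‑element. The sketch below uses plain NumPy arrays as a stand‑in for ciphertexts; the function name and shapes are illustrative, not the paper's actual API.

```python
import numpy as np

def outer_product_matmul(x_share, w_enc):
    """Compute x_share @ w_enc as sum_k outer(x_share[:, k], w_enc[k, :]).

    In the protocol, w_enc's rows would be ciphertexts held by the client;
    each loop iteration performs one scalar-times-ciphertext-row product,
    so the per-row homomorphic work scales linearly with the row count.
    """
    m, k = x_share.shape
    _, n = w_enc.shape
    acc = np.zeros((m, n), dtype=x_share.dtype)
    for i in range(k):
        acc += np.outer(x_share[:, i], w_enc[i, :])
    return acc

# Sanity check against the ordinary inner-product formulation.
rng = np.random.default_rng(0)
X = rng.integers(-8, 8, size=(4, 6))
W = rng.integers(-8, 8, size=(6, 5))
assert np.array_equal(outer_product_matmul(X, W), X @ W)
```

Because the encrypted weights never leave the client after the one‑time setup, the per‑inference communication for linear layers is limited to the (compressed) output ciphertexts.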

Non‑linear‑layer acceleration. For activation functions such as GELU and exponential, Nimbus introduces a distribution‑aware piecewise‑polynomial approximation. It exploits the empirical input distribution of Transformer activations (e.g., 80 % of exponential inputs lie in [‑5, 0]) to allocate fewer, lower‑degree polynomial segments where data are scarce, thereby reducing both the number of polynomial terms and the required comparison rounds. The framework also fuses ring‑upgrading with truncation to eliminate extra communication.
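The distribution‑aware idea can be sketched numerically: fit low‑degree pieces where activations concentrate and spend only one crude piece on the sparsely populated tail. The breakpoints and degrees below are hypothetical choices for illustration; the paper derives its segmentation from the empirical activation distribution, and the fitting here is a simple least‑squares stand‑in for its approximation procedure.

```python
import numpy as np

# Hypothetical segmentation for exp(x), x <= 0: one linear piece on the
# rarely-hit tail, two cubic pieces on the dense region [-5, 0].
segments = [(-16.0, -5.0, 1),
            (-5.0, -2.5, 3),
            (-2.5, 0.0, 3)]

# Fit each piece by least squares on a dense grid of its interval.
coeffs = [np.polyfit(np.linspace(lo, hi, 200), np.exp(np.linspace(lo, hi, 200)), deg)
          for lo, hi, deg in segments]

def piecewise_exp(x):
    """Evaluate the piecewise approximation; fewer pieces -> fewer secure
    comparisons to select the active segment under MPC."""
    for (lo, hi, _), c in zip(segments, coeffs):
        if lo <= x <= hi:
            return np.polyval(c, x)
    return 0.0  # below the tail, exp(x) is effectively zero

# Error stays small exactly where softmax inputs actually concentrate.
grid = np.linspace(-5.0, 0.0, 101)
max_err = max(abs(piecewise_exp(x) - np.exp(x)) for x in grid)
assert max_err < 0.05
```

Under MPC, each additional segment costs a secure comparison to select it, so concentrating segments where inputs are dense cuts both the polynomial evaluation work and the comparison rounds, as the paper observes.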

Experimental results. Benchmarks on BERT‑base (input length 128) show that Nimbus achieves up to a 5× end‑to‑end speedup over BumbleBee on LAN and 3× on WAN, with linear‑layer gains of roughly 10× and non‑linear gains of roughly 3‑4×. Accuracy tests on the GLUE benchmark show an average loss of only 0.57% without fine‑tuning and 0.07% after fine‑tuning.

Conclusion. Nimbus provides an efficient, privacy‑preserving two‑party inference solution for Transformers by redesigning linear‑layer multiplication and applying distribution‑aware polynomial approximations to non‑linear layers, substantially reducing communication and computation while maintaining model accuracy.

performance optimization, machine learning, Transformer, secure multi-party computation, homomorphic encryption, privacy-preserving inference