Nimbus: Secure and Efficient Two‑Party Inference for Transformers
The paper introduces Nimbus, a two‑party privacy‑preserving inference framework for Transformer models that leverages a client‑side outer‑product linear‑layer protocol and distribution‑aware polynomial approximations for non‑linear layers, achieving up to five‑fold speedups with negligible accuracy loss.
NeurIPS 2024 featured the paper "Nimbus: Secure and Efficient Two‑Party Inference for Transformers" by the Ant Group MiSuan (secure computation) team and Shanghai Jiao Tong University, which addresses the privacy challenges of Machine Learning as a Service (MLaaS) inference for large Transformer models.
Background. In the MLaaS setting, a model owner (server) holds a private neural‑network model while a client provides input data. Existing two‑party secure inference schemes combine homomorphic encryption (HE) and multi‑party computation (MPC), but they incur high communication and computation costs, especially for the massive matrix multiplications and non‑linear activations in Transformers.
Linear‑layer protocol. Nimbus replaces the traditional server‑side inner‑product (SIP) protocol with a client‑side outer‑product approach. The static model weights are encrypted once in an offline phase and stored on the client, so during inference the client can multiply its secret‑shared activations with the encrypted weights locally, eliminating input‑ciphertext communication. The outer‑product formulation also lowers the computational complexity of the homomorphic multiplication from quadratic to linear. Additionally, Nimbus compresses output ciphertexts via a "right‑shift" operation, achieving near‑100% ciphertext utilization.
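The outer‑product reformulation can be illustrated in plaintext. The sketch below (using NumPy over a 32‑bit ring; the ring size and sharing scheme are illustrative assumptions, and real Nimbus operates on HE ciphertexts rather than plain arrays) shows how X·W decomposes into a sum of column‑by‑row outer products, and how additive secret shares of the activation recombine to the true product:

```python
import numpy as np

RING = 1 << 32  # shares live in a 32-bit ring (illustrative choice)

def share(x, rng):
    """Additively secret-share an integer matrix over the ring."""
    r = rng.integers(0, RING, size=x.shape, dtype=np.uint64)
    return r, (x - r) % RING

def outer_product_matmul(X, W):
    """Compute X @ W as a sum of column-by-row outer products.

    This is the access pattern the outer-product protocol exploits:
    each column of the activation X multiplies one full row of the
    (pre-encrypted) weight matrix W, so each weight row is touched
    once instead of being re-read for every inner product.
    """
    m, k = X.shape
    acc = np.zeros((m, W.shape[1]), dtype=np.uint64)
    for i in range(k):
        acc = (acc + np.outer(X[:, i], W[i, :])) % RING
    return acc

rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(4, 6), dtype=np.uint64)   # activation
W = rng.integers(0, 100, size=(6, 5), dtype=np.uint64)   # weights

X0, X1 = share(X, rng)  # activation secret-shared between the parties
# Each party multiplies its share by the weights; the results
# recombine additively to the true product modulo the ring.
Y = (outer_product_matmul(X0, W) + outer_product_matmul(X1, W)) % RING
assert np.array_equal(Y, (X @ W) % RING)
```

Because the multiplication is linear in the shares, each party can process its share independently and the protocol only needs to recombine the outputs.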
Non‑linear‑layer acceleration. For activation functions such as GELU and exponential, Nimbus introduces a distribution‑aware piecewise‑polynomial approximation. It exploits the empirical input distribution of Transformer activations (e.g., 80 % of exponential inputs lie in [‑5, 0]) to allocate fewer, lower‑degree polynomial segments where data are scarce, thereby reducing both the number of polynomial terms and the required comparison rounds. The framework also fuses ring‑upgrading with truncation to eliminate extra communication.
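The distribution‑aware idea can be sketched with a plain least‑squares fit: spend fine, low‑degree segments where softmax inputs concentrate (near 0) and cover the sparse tail with a single coarse piece. The breakpoints, degree, and ranges below are illustrative assumptions, not the paper's exact segmentation:

```python
import numpy as np

# Illustrative piecewise fit of exp(x): most exponential inputs lie in
# [-5, 0], so two segments cover that dense region and one coarse
# segment handles the sparse tail below -5.
breaks = [-16.0, -5.0, -2.0, 0.0]
degree = 3

# Fit one degree-3 polynomial per segment by least squares.
coeffs = []
for lo, hi in zip(breaks[:-1], breaks[1:]):
    xs = np.linspace(lo, hi, 200)
    coeffs.append(np.polyfit(xs, np.exp(xs), degree))

def approx_exp(x):
    """Evaluate the piecewise polynomial; each comparison against a
    breakpoint corresponds to one secure-comparison round in MPC,
    so fewer segments means fewer rounds."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for (lo, hi), c in zip(zip(breaks[:-1], breaks[1:]), coeffs):
        mask = (x >= lo) & (x <= hi)
        out[mask] = np.polyval(c, x[mask])
    return out

xs = np.linspace(-5.0, 0.0, 1000)
err = np.max(np.abs(approx_exp(xs) - np.exp(xs)))
# err stays small on the dense region even with only three segments
```

In the secure setting, each additional segment costs comparison rounds and each additional polynomial term costs multiplications, so concentrating segments where inputs actually fall buys accuracy where it matters.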
Experimental results. Benchmarks on BERT‑base (input length 128) show that Nimbus achieves up to a 5× end‑to‑end speedup over BumbleBee under LAN and 3× under WAN, with linear‑layer gains of ~10× and non‑linear‑layer gains of ~3–4×. Accuracy tests on the GLUE benchmark show an average loss of only 0.57% without fine‑tuning and 0.07% after fine‑tuning.
Conclusion. Nimbus provides an efficient, privacy‑preserving two‑party inference solution for Transformers by redesigning linear‑layer multiplication and applying distribution‑aware polynomial approximations to non‑linear layers, substantially reducing communication and computation while maintaining model accuracy.