How Tequila’s 1.58‑Bit Quantization Overcomes the Dead‑Zone Trap in LLMs
Tequila introduces a 1.58‑bit ternary quantization method for large language models that escapes the dead‑zone trap by repurposing zero‑valued weights as dynamic bias terms that can be precomputed offline, achieving near‑full‑precision accuracy, faster convergence, and up to three‑fold CPU inference speedups.
Tequila: 1.58‑Bit Ternary Quantization for Large Language Models
Recent advances in large language model (LLM) quantization have highlighted 1.58‑bit ternary quantization, used in methods such as BitNet. Tencent’s new algorithm, Tequila, addresses the “dead‑zone trap” during quantization‑aware training (QAT) and achieves state‑of‑the‑art performance.
In ternary quantization, weights are constrained to {-1, 0, +1}, turning matrix multiplication into additions and subtractions and greatly reducing computational cost, which is advantageous for edge and low‑power AI deployment.
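To make this concrete, here is a minimal sketch of absmean‑style ternary quantization and a matrix‑vector product that uses only additions and subtractions. The function names and the 0.5·scale threshold are illustrative choices in the spirit of BitNet‑style methods, not Tequila's exact implementation.

```python
import numpy as np

def ternarize(w: np.ndarray, threshold: float = 0.5):
    """Map full-precision weights to {-1, 0, +1} with a per-tensor scale."""
    scale = np.mean(np.abs(w))           # per-tensor absmean scale
    q = np.zeros_like(w, dtype=np.int8)
    q[w > threshold * scale] = 1         # large positive -> +1
    q[w < -threshold * scale] = -1       # large negative -> -1
    return q, scale

def ternary_matvec(q: np.ndarray, scale: float, x: np.ndarray):
    """Matrix-vector product with ternary weights: no weight multiplies."""
    out = np.zeros(q.shape[0])
    for i in range(q.shape[0]):
        # Each output element is a sum of +x_j and -x_j terms only.
        out[i] = x[q[i] == 1].sum() - x[q[i] == -1].sum()
    return scale * out

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
x = rng.normal(size=8)
q, s = ternarize(w)
approx = ternary_matvec(q, s, x)         # approximates w @ x
```

The inner loop makes the cost structure explicit: the only per‑weight operation is selecting whether an input is added, subtracted, or skipped, which is what enables fast low‑power inference.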
This aggressive compression, however, introduces significant information loss and training overhead, often degrading accuracy. A key issue is the “dead zone”: weights quantized to zero receive no useful gradient signal during training with the straight‑through estimator (STE), so they remain permanently inactive.
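The dead zone can be seen in a few lines. In the sketch below (threshold 0.5 is illustrative), any latent weight inside the zero band quantizes to 0, so its forward contribution, and hence the loss, is locally flat in that weight: the true gradient is zero and the weight gets no signal to escape.

```python
import numpy as np

def quantize(w):
    """Ternary quantizer with an illustrative zero band of |w| <= 0.5."""
    return np.where(w > 0.5, 1.0, np.where(w < -0.5, -1.0, 0.0))

x = 2.0
for w in (0.1, 0.3, 0.49):           # three latent weights in the zero band
    contribution = quantize(w) * x   # forward contribution is 0 for all three
    assert float(contribution) == 0.0
```

Moving the latent weight anywhere within the band leaves the output unchanged, which is exactly why STE‑trained zero weights tend to stay stuck.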
Tequila solves this by repurposing dead‑zone weights as dynamic bias terms, providing continuous signals that restore gradient flow and model capacity with negligible inference overhead.
Key Innovations
Minima Reactivation: Reactivates zero weights as -0 and +0, forming a four‑value representation {-1, -0, +0, +1} while preserving ternary computation efficiency.
Dynamic Offline Bias: Replaces the non‑differentiable STE with a smooth, differentiable quantization function, injecting a bias λᵢ for dead‑zone weights; because the bias can be precomputed offline, inference cost stays near zero.
Retention of Input Information: Reactivated weights still participate in ternary matrix multiplication, preserving essential input signals and delivering richer gradient information.
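The mechanism above can be sketched in a simplified form: weights trapped in the zero band contribute a learned per‑weight bias instead of vanishing. The names (`lam`, `dead`) and the exact bias form are assumptions for illustration, not Tequila's actual API; the point is that the bias term is independent of the input and can therefore be folded into the layer's bias vector offline.

```python
import numpy as np

def quantize(w, t=0.5):
    """Illustrative ternary quantizer with zero band |w| <= t."""
    return np.where(w > t, 1.0, np.where(w < -t, -1.0, 0.0))

def forward_with_reactivation(w, x, lam):
    """Hypothetical sketch: dead-zone weights act as bias terms."""
    q = quantize(w)
    dead = (q == 0.0)                 # weights trapped in the dead zone
    ternary_part = q @ x              # additions/subtractions only
    # The bias does not depend on x, so it can be precomputed offline
    # and added at near-zero inference cost.
    bias = (dead * lam).sum(axis=1)
    return ternary_part + bias

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
x = rng.normal(size=8)
lam = rng.normal(scale=0.01, size=(4, 8))   # learned offsets (illustrative)
y = forward_with_reactivation(w, x, lam)
```

Because `lam` is trainable, dead‑zone weights receive gradient through the bias term during QAT, while at inference time the precomputed bias leaves the ternary fast path untouched.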
These designs enable Tequila to deliver five major advantages: increased model capacity, dead‑zone‑free optimization, stable training, plug‑and‑play integration with existing ternary pipelines, and almost zero additional inference cost.
Experimental Results
Compared with standard QAT and other ternary methods, Tequila achieves state‑of‑the‑art performance when trained on a 10‑billion‑token dataset, improving benchmark scores by roughly 3% and accelerating CPU inference by 2–3×. Loss curves show markedly faster convergence, confirming the effectiveness of dead‑zone reactivation.
Conclusion
Tequila opens a new direction for efficient model compression by eliminating the dead‑zone trap with adaptive dynamic bias, achieving near‑full‑precision accuracy while retaining the computational benefits of ternary quantization. The code and paper are available at:
GitHub: https://github.com/Tencent/AngelSlim
Paper: https://arxiv.org/abs/2509.23809
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.
