How Tequila’s 1.58‑Bit Quantization Overcomes the Dead‑Zone Trap in LLMs

Tequila introduces a 1.58‑bit ternary quantization method for large language models that escapes the dead‑zone trap by repurposing zero‑valued weights as dynamic bias terms computed offline, achieving near‑full‑precision accuracy, faster convergence, and up to three‑fold CPU inference speedups.

Tencent Technical Engineering

Tequila: 1.58‑Bit Ternary Quantization for Large Language Models

Recent advances in large language model (LLM) quantization have highlighted 1.58‑bit ternary quantization, used in methods such as BitNet. Tencent’s new algorithm, Tequila, addresses the “dead‑zone trap” during quantization‑aware training (QAT) and achieves state‑of‑the‑art performance.

In ternary quantization, weights are constrained to {-1, 0, +1}, which turns matrix multiplication into additions and subtractions and greatly reduces computational cost, an advantage for edge and low‑power AI deployment.
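As a concrete illustration, the following NumPy sketch shows absmean‑style ternary quantization in the spirit of BitNet b1.58; the threshold and scaling choices here are illustrative assumptions, not Tequila's exact recipe.

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Quantize a weight matrix to {-1, 0, +1} with a per-tensor scale.

    Uses absmean scaling: weights are divided by their mean absolute
    value and rounded to the nearest ternary level. Entries whose
    magnitude falls below half the scale land in the zero bucket --
    the "dead zone" discussed below.
    """
    scale = np.abs(w).mean() + eps           # per-tensor absmean scale
    q = np.clip(np.round(w / scale), -1, 1)  # codes in {-1, 0, +1}
    return q.astype(np.int8), scale

def ternary_matmul(x: np.ndarray, q: np.ndarray, scale: float) -> np.ndarray:
    """Multiplication-free matmul: add where q = +1, subtract where q = -1."""
    pos = x @ (q == 1)   # sum inputs aligned with +1 weights
    neg = x @ (q == -1)  # sum inputs aligned with -1 weights
    return scale * (pos - neg)

w = np.random.randn(4, 4)
q, s = ternary_quantize(w)
x = np.random.randn(2, 4)
assert np.allclose(ternary_matmul(x, q, s), x @ (q * s))
```

Because every quantized weight is -1, 0, or +1, the inner product needs no multiplications: inputs are simply added or subtracted according to each weight's sign.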

The aggressive compression, however, introduces significant information loss and training overhead, often leading to accuracy degradation. A key issue is the "dead zone": zero‑valued weights receive no useful gradient signal during training with the straight‑through estimator (STE), causing them to remain inactive.
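To make the trap concrete, here is a minimal PyTorch sketch (illustrative, not Tequila's code) of the standard ternary QAT setup with a straight‑through estimator; weights that round to zero contribute nothing to the forward pass, which is the dead zone described above.

```python
import torch

def ternary_ste(w: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Ternary quantizer trained with a straight-through estimator.

    Forward: round-and-clip w/scale to {-1, 0, +1}, then rescale.
    Backward: the detach trick routes gradients around the quantizer
    as if it were the identity, because round() has zero derivative
    almost everywhere.
    """
    q = torch.clamp(torch.round(w / scale), -1.0, 1.0) * scale
    return w + (q - w).detach()

w = torch.randn(256, 256, requires_grad=True)
scale = w.abs().mean().detach()   # absmean scale, as in BitNet-style QAT
w_q = ternary_ste(w, scale)

dead = torch.round(w / scale).clamp(-1.0, 1.0) == 0
print(f"{dead.float().mean().item():.1%} of weights sit in the dead zone")
print(bool((w_q[dead] == 0).all()))  # their forward contribution is zero
```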

Tequila solves this by repurposing dead‑zone weights as dynamic bias terms, providing continuous signals that restore gradient flow and model capacity with negligible inference overhead.

Key Innovations

Minima Reactivation: Reactivates zero weights as -0 and +0, forming a four‑value representation {-1, -0, +0, +1} while preserving ternary computation efficiency.

Dynamic Offline Bias: Replaces the non‑differentiable quantizer (and its STE workaround) with a smooth, differentiable quantization function, injecting a bias λᵢ for each dead‑zone weight; because the biases can be pre‑computed offline, inference cost stays near zero (see the sketch after this list).

Retention of Input Information: Reactivated weights participate in ternary matrix multiplication, preserving essential input signals and delivering richer gradient information.

These designs enable Tequila to deliver five major advantages: increased model capacity, dead‑zone‑free optimization, stable training, plug‑and‑play integration with existing ternary pipelines, and almost zero additional inference cost.
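The schematic sketch below illustrates how we read the dynamic‑bias idea; the class name TequilaLikeLinear and the separate learnable parameter lam are our own illustrative assumptions, and the paper's exact formulation of λᵢ may differ.

```python
import torch
import torch.nn as nn

class TequilaLikeLinear(nn.Module):
    """Schematic ternary linear layer with dead-zone reactivation.

    Active weights take part in the usual ternary matmul. Dead-zone
    weights (those rounding to 0) are repurposed: each contributes a
    learnable bias lambda_i instead, keeping a gradient path alive.
    The summed lambdas per output channel form a constant bias that
    can be pre-computed offline after training.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.lam = nn.Parameter(torch.zeros(out_features, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.weight.abs().mean().detach()
        q = torch.clamp(torch.round(self.weight / scale), -1.0, 1.0)
        # STE: forward uses the ternary weights, backward acts as identity.
        w_q = self.weight + (q * scale - self.weight).detach()
        dead = (q == 0).float()
        # Dead-zone weights contribute a data-independent bias per output
        # channel; it still receives gradients during QAT and can later be
        # folded into an ordinary bias vector.
        bias = (self.lam * dead).sum(dim=1)
        return x @ w_q.t() + bias

layer = TequilaLikeLinear(64, 32)
out = layer(torch.randn(8, 64))
print(out.shape)  # torch.Size([8, 32])
```

Because the λ contribution is independent of the input, it collapses after training into an ordinary per‑channel bias vector that can be computed once offline, consistent with the near‑zero inference overhead described above.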

Experimental Results

Compared with standard QAT and other ternary methods, Tequila achieves SOTA performance on a 10‑billion‑token dataset, improving benchmark scores by roughly 3% and accelerating CPU inference by 2–3×. Loss curves show significantly faster convergence, confirming the effectiveness of dead‑zone reactivation.

Conclusion

Tequila opens a new direction for efficient model compression by eliminating the dead‑zone trap with adaptive dynamic bias, achieving near‑full‑precision accuracy while retaining the computational benefits of ternary quantization. The code and paper are available at:

GitHub: https://github.com/Tencent/AngelSlim

Paper: https://arxiv.org/abs/2509.23809

Tags: model compression · AI inference · dynamic bias · LLM quantization · ternary quantization