Machine Learning Algorithms & Natural Language Processing
Mar 3, 2026 · Artificial Intelligence

Can ROM‑Based LLM Accelerators Reach 20,000 tokens/s and End the GPU Era?

The article analyzes the ROMA and TOM architectures, which embed large-language-model weights in on-chip ROM and SRAM to reach inference speeds of up to 20,000 tokens/s. It compares them with GPU-based and Taalas solutions and discusses their implications for edge AI, embodied intelligence, extreme environments, and privacy.

AI accelerator · Edge computing · LLM
Tencent Technical Engineering
Oct 10, 2025 · Artificial Intelligence

How Tequila’s 1.58‑Bit Quantization Overcomes the Dead‑Zone Trap in LLMs

Tequila introduces a novel 1.58‑bit ternary quantization scheme for large language models that tackles the dead‑zone trap: zero‑valued weights are reactivated through dynamic, offline‑computed bias offsets, yielding near‑full‑precision accuracy, faster convergence, and up to three‑fold CPU inference speedups.

AI inference · LLM quantization · dynamic bias
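Tequila's offset mechanism is detailed in the linked article; as a rough illustration of the dead‑zone problem it addresses, here is a minimal plain ternary quantizer (a sketch under assumed threshold conventions, not Tequila's actual method). Weights whose magnitude falls below the threshold snap to zero and stop contributing to the layer's output, which is the "dead zone" the dynamic bias offsets are meant to reactivate.

```python
import numpy as np

def ternary_quantize(w, thresh_ratio=0.7):
    """Plain ternary quantization to {-1, 0, +1} times a scalar scale.

    thresh_ratio is a hypothetical tuning knob, not a Tequila parameter.
    Weights below the threshold collapse to zero -- the "dead zone":
    they no longer contribute to the quantized layer's output.
    """
    delta = thresh_ratio * np.mean(np.abs(w))        # dead-zone threshold
    q = np.where(np.abs(w) > delta, np.sign(w), 0.0)  # ternary codes
    # One scalar scale over the surviving (nonzero) weights.
    scale = np.abs(w[q != 0]).mean() if np.any(q != 0) else 0.0
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000)            # stand-in for one weight tensor
q, scale = ternary_quantize(w)
dead = np.mean(q == 0)               # fraction of weights in the dead zone
```

With Gaussian weights a sizeable fraction lands in the dead zone; the article's claim is that recovering that lost signal as bias offsets is what closes the gap to full precision.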