Machine Learning Algorithms & Natural Language Processing
Mar 3, 2026 · Artificial Intelligence

Can ROM‑Based LLM Accelerators Reach 20,000 tokens/s and End the GPU Era?

The article analyzes the ROMA and TOM architectures, which embed large-language-model weights in on-chip ROM and SRAM to reach inference speeds of up to 20,000 tokens/s. It compares them with GPU-based and Taalas solutions and discusses their implications for edge AI, embodied intelligence, extreme environments, and privacy.

AI accelerator · Edge computing · LLM
Tencent Technical Engineering
Oct 10, 2025 · Artificial Intelligence

How Tequila’s 1.58‑Bit Quantization Overcomes the Dead‑Zone Trap in LLMs

Tequila introduces a novel 1.58‑bit ternary quantization scheme for large language models that tackles the dead‑zone trap: zero‑valued weights are reactivated through dynamic, offline‑computed bias offsets, yielding near‑full‑precision accuracy, faster convergence, and up to three‑fold CPU inference speedups.

AI inference · LLM quantization · dynamic bias
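Tequila's offset mechanism is detailed in the linked article; as a rough illustration of the dead‑zone problem it addresses, here is a minimal plain ternary quantizer (a sketch under assumed threshold conventions, not Tequila's actual method). Weights whose magnitude falls below the threshold snap to zero and stop contributing to the layer's output, which is the "dead zone" the dynamic bias offsets are meant to reactivate.

```python
import numpy as np

def ternary_quantize(w, thresh_ratio=0.7):
    """Plain ternary quantization to {-1, 0, +1} times a scalar scale.

    thresh_ratio is a hypothetical tuning knob, not a Tequila parameter.
    Weights below the threshold collapse to zero -- the "dead zone":
    they no longer contribute to the quantized layer's output.
    """
    delta = thresh_ratio * np.mean(np.abs(w))        # dead-zone threshold
    q = np.where(np.abs(w) > delta, np.sign(w), 0.0)  # ternary codes
    # One scalar scale over the surviving (nonzero) weights.
    scale = np.abs(w[q != 0]).mean() if np.any(q != 0) else 0.0
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000)            # stand-in for one weight tensor
q, scale = ternary_quantize(w)
dead = np.mean(q == 0)               # fraction of weights in the dead zone
```

With Gaussian weights a sizeable fraction lands in the dead zone; the article's claim is that recovering that lost signal as bias offsets is what closes the gap to full precision.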