Machine Heart
Apr 1, 2026 · Artificial Intelligence

TurboQuant’s Alleged Misconduct: Google’s Reply Sparks Bigger Controversy

The TurboQuant paper on LLM quantization has ignited a heated debate over alleged academic misconduct: the authors’ OpenReview rebuttal has drawn criticism for downplaying prior work and misrepresenting benchmarks, prompting broader concerns about research integrity in AI.

AI research integrity · LLM quantization · RaBitQ
9 min read
Old Meng AI Explorer
Dec 29, 2025 · Artificial Intelligence

Run 100B LLMs on a Laptop: How BitNet’s 1‑bit Quantization Makes It Possible

BitNet’s 1‑bit quantization shrinks model size and compute needs roughly tenfold, enabling ordinary CPUs and low‑power ARM devices to run 2B–100B‑parameter language models locally with acceptable speed, low power consumption, and near‑original quality, while offering simple installation and optional GPU acceleration.
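
To make the mechanism concrete, here is a minimal sketch of low‑bit weight quantization in the spirit the article describes: weights snap to the ternary values {−1, 0, +1} with a single absmean scale, as in the published BitNet b1.58 recipe. The function names and the NumPy implementation are illustrative assumptions, not Microsoft’s optimized kernels.

    import numpy as np

    def ternary_quantize(w: np.ndarray):
        """Quantize weights to {-1, 0, +1} with a per-tensor absmean scale
        (a sketch of the BitNet b1.58-style scheme, not the real kernel)."""
        scale = np.mean(np.abs(w)) + 1e-8        # absmean scale
        q = np.clip(np.round(w / scale), -1, 1)  # ternary weights
        return q.astype(np.int8), scale

    def ternary_matmul(x: np.ndarray, q: np.ndarray, scale: float):
        # Multiply-free in principle: entries of q are -1/0/+1, so the
        # product reduces to additions and subtractions plus one rescale.
        return (x @ q) * scale

    w = np.random.randn(256, 256).astype(np.float32)
    x = np.random.randn(1, 256).astype(np.float32)
    q, s = ternary_quantize(w)
    print("mean abs error:", np.abs(x @ w - ternary_matmul(x, q, s)).mean())

Packing each ternary value into about two bits is where the roughly tenfold storage saving over FP16 comes from.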

BitNet · CPU inference · LLM quantization
10 min read
Old Meng AI Explorer
Dec 25, 2025 · Artificial Intelligence

Run 100B LLM on a Laptop: BitNet’s 1‑Bit Quantization Enables CPU‑Only AI

BitNet, Microsoft’s open‑source 1‑bit quantization framework, shrinks model size up to tenfold and lets ordinary CPUs (including i7 laptops and ARM tablets) run 2B–100B‑parameter language models at usable speeds while cutting power consumption dramatically, offering a practical, GPU‑free path to local AI.
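
A quick back‑of‑the‑envelope check of the headline claim (illustrative arithmetic only, ignoring activations, KV cache, and packing overhead):

    def model_gb(params_billion: float, bits_per_weight: float) -> float:
        """Approximate weight storage in GB, ignoring runtime overhead."""
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for bits in (16, 8, 1.58):
        print(f"100B parameters @ {bits:>5} bits ≈ {model_gb(100, bits):6.1f} GB")

    # FP16 needs ~200 GB for the weights alone; at 1.58 bits the same
    # model fits in ~20 GB, within reach of a well-equipped laptop.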

BitNet · CPU inference · LLM quantization
9 min read
Tencent Technical Engineering
Oct 10, 2025 · Artificial Intelligence

How Tequila’s 1.58‑Bit Quantization Overcomes the Dead‑Zone Trap in LLMs

Tequila introduces a novel 1.58‑bit ternary quantization for large language models that tackles the dead‑zone trap by repurposing dead‑zone (zero‑valued) weights as dynamic biases with offsets computed offline, achieving near‑full‑precision performance, faster convergence, and up to threefold CPU inference speedups.
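
To illustrate the dead‑zone trap the summary refers to: a standard ternary quantizer snaps small weights to zero, where they stop contributing to the output and receive no useful training signal. The toy sketch below shows how large that dead zone can be, plus one crude way to fold the discarded weights into an offline‑computed bias; it is a simplification for intuition, not Tequila’s actual algorithm.

    import numpy as np

    np.random.seed(0)
    w = np.random.randn(4096) * 0.05       # toy weight vector
    t = 0.7 * np.mean(np.abs(w))           # common ternary threshold

    # Standard ternary quantizer: everything inside (-t, t) becomes 0.
    q = np.where(w > t, 1, np.where(w < -t, -1, 0)).astype(np.int8)
    dead = q == 0
    print(f"dead-zone fraction: {dead.mean():.1%}")  # often a large share

    # Crude reactivation (illustrative, NOT Tequila's method): fold the
    # dead-zone weights into an additive bias via an offline input statistic.
    x = np.random.randn(4096)
    bias = w[dead].sum() * x.mean()
    approx = x @ (q * np.mean(np.abs(w))) + bias
    print(f"reference {x @ w:+.3f} vs ternary+bias {approx:+.3f}")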

AI inference · LLM quantization · dynamic bias
9 min read
Architect
Mar 5, 2025 · Artificial Intelligence

How Does Quantization Shrink LLMs? A Deep Dive into GPTQ, GGUF, and Techniques

This article explains why large language models need quantization, covering the core concepts, classification schemes, symmetric and asymmetric methods, and the handling of outliers; it then compares post‑training quantization (PTQ) with quantization‑aware training (QAT) and details popular techniques such as GPTQ, GGUF, and BitNet.
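
As a taste of the symmetric/asymmetric distinction the article covers, here is a minimal int8 sketch of both schemes (illustrative helper names, not any particular library’s API):

    import numpy as np

    def quantize_symmetric(x: np.ndarray):
        # Symmetric: zero maps to 0; one scale covers [-max|x|, +max|x|].
        scale = np.max(np.abs(x)) / 127
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    def quantize_asymmetric(x: np.ndarray):
        # Asymmetric: a zero-point shifts the grid to cover [min, max]
        # exactly, wasting no levels on values the tensor never takes.
        lo, hi = x.min(), x.max()
        scale = (hi - lo) / 255
        zero_point = np.round(-lo / scale)
        q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
        return q, scale, zero_point

    x = np.random.rand(1024).astype(np.float32)  # skewed, all-positive tensor
    q_s, s_s = quantize_symmetric(x)
    q_a, s_a, zp = quantize_asymmetric(x)
    print("symmetric error: ", np.abs(x - q_s * s_s).mean())
    print("asymmetric error:", np.abs(x - (q_a.astype(np.float32) - zp) * s_a).mean())

On an all‑positive tensor the asymmetric grid wastes no levels below zero, roughly halving the rounding error.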

AI hardware · GGUF · GPTQ
25 min read