Tagged articles

Quantization-Aware Training

3 articles

Jun 9, 2026 · Artificial Intelligence

Google Pushes Full Throttle: Run Gemma 4 Large Models Locally with MTP Acceleration

Google’s Gemma 4 QAT release shrinks models enough to run 26B‑parameter MoE inference on a 16 GB MacBook and ships mobile‑optimized versions under 1 GB. It preserves quality through Quantization‑Aware Training and comes with a full toolchain for local deployment.

Gemma 4 · Local LLM Deployment · MTP

10 min read

Architect

Mar 5, 2025 · Artificial Intelligence

How Does Quantization Shrink LLMs? A Deep Dive into GPTQ, GGUF, and Techniques

This article explains why large language models need quantization, covers the core concepts, classification schemes, symmetric and asymmetric methods, and outlier handling, compares post‑training quantization (PTQ) with quantization‑aware training (QAT), and details popular techniques such as GPTQ, GGUF, and BitNet.

AI hardware · GGUF · GPTQ

25 min read

Baidu Geek Talk

Jun 26, 2023 · Artificial Intelligence

INT8 Quantization for Baidu Search Semantic Models (ERNIE)

Baidu applied large‑scale INT8 quantization to its ERNIE search semantic models, achieving over 25% inference speedup with less than 1% degradation in relevance metrics. It did so by selectively quantizing the less‑sensitive fully‑connected layers and combining automated calibration, hyper‑parameter tuning, and techniques such as QAT and SmoothQuant. The work also paves the way for lower‑bit quantization and token pruning.

ERNIE · INT8 Quantization · Quantization-Aware Training

15 min read