Tagged articles
4 articles
Page 1 of 1
AI Engineer Programming
AI Engineer Programming
Apr 25, 2026 · Artificial Intelligence

Quantization Across Signal Processing, AI Inference, and RAG Vector Search

This article explains how quantization—originating from signal processing—reduces precision to save resources, details its application to neural network weights and activations via PTQ, QAT, GPTQ, AWQ, and SmoothQuant, and shows how vector quantization enables fast, memory‑efficient retrieval in large‑scale RAG systems.

AWQGPTQLLM
0 likes · 19 min read
Quantization Across Signal Processing, AI Inference, and RAG Vector Search
AI Algorithm Path
AI Algorithm Path
Apr 22, 2025 · Artificial Intelligence

Understanding LLM Quantization: GPTQ, QAT, AWQ, GGUF, and GGML Explained

The article walks through the fundamentals of large‑language‑model quantization, presenting a concrete int8 example, detailed explanations of GPTQ, GGUF/GGML, QAT, and AWQ methods, and provides step‑by‑step code snippets, formulas, calibration procedures, and performance observations for each technique.

AWQGGMLGGUF
0 likes · 15 min read
Understanding LLM Quantization: GPTQ, QAT, AWQ, GGUF, and GGML Explained
Architect
Architect
Mar 5, 2025 · Artificial Intelligence

How Does Quantization Shrink LLMs? A Deep Dive into GPTQ, GGUF, and Techniques

This article explains why large language models need quantization, describes the core concepts, classification schemes, symmetric and asymmetric methods, handling of outliers, and compares post‑training quantization (PTQ) with quantization‑aware training (QAT), while detailing popular techniques such as GPTQ, GGUF, and BitNet.

AI hardwareGGUFGPTQ
0 likes · 25 min read
How Does Quantization Shrink LLMs? A Deep Dive into GPTQ, GGUF, and Techniques
DeWu Technology
DeWu Technology
Jul 5, 2023 · Artificial Intelligence

Fine-tuning Large Language Models with LoRA/QLoRA and Deploying via GPTQ Quantization on KubeAI

The article explains how LoRA and its 4‑bit QLoRA extension dramatically reduce trainable parameters and GPU memory for fine‑tuning large language models, while GPTQ post‑training quantization compresses weights for cheap inference, and shows how KubeAI integrates these techniques into a one‑click workflow for 7 B, 13 B, and 33 B models from data upload to API deployment.

GPTQKubeAILarge Language Models
0 likes · 13 min read
Fine-tuning Large Language Models with LoRA/QLoRA and Deploying via GPTQ Quantization on KubeAI