Tagged articles

GPTQ

6 articles
Raymond Ops
Jun 27, 2026 · Artificial Intelligence

vLLM Quantized Inference: Loading AWQ/GPTQ Models and Optimizing GPU Memory

This article provides a step‑by‑step guide on using vLLM to load AWQ and GPTQ quantized large language models, covering environment setup, calibration data preparation, model quantization, deployment scripts, performance benchmarking, accuracy checks, best‑practice recommendations, and troubleshooting tips for GPU memory optimization.

AWQ · GPTQ · GPU memory optimization
45 min read
MaGe Linux Operations
Jun 17, 2026 · Artificial Intelligence

Model Quantization: INT8, INT4, and AWQ/GPTQ – Choosing the Right Compression for Production

This article explains how INT8, INT4, bitsandbytes, GPTQ, and AWQ quantization methods can dramatically cut memory usage, boost inference speed, and lower costs for large language models, while detailing their trade‑offs, practical workflows, benchmark results, and common pitfalls to help engineers decide which technique best fits their production scenario.

AWQ · GPTQ · INT4
22 min read
AI Engineer Programming
Apr 25, 2026 · Artificial Intelligence

Quantization Across Signal Processing, AI Inference, and RAG Vector Search

This article explains how quantization—originating from signal processing—reduces precision to save resources, details its application to neural network weights and activations via PTQ, QAT, GPTQ, AWQ, and SmoothQuant, and shows how vector quantization enables fast, memory‑efficient retrieval in large‑scale RAG systems.

AWQ · GPTQ · LLM
19 min read
AI Algorithm Path
Apr 22, 2025 · Artificial Intelligence

Understanding LLM Quantization: GPTQ, QAT, AWQ, GGUF, and GGML Explained

The article walks through the fundamentals of large‑language‑model quantization, presents a concrete int8 example, explains the GPTQ, GGUF/GGML, QAT, and AWQ methods in detail, and provides step‑by‑step code snippets, formulas, calibration procedures, and performance observations for each technique.

AWQ · GGML · GGUF
15 min read
Architect
Mar 5, 2025 · Artificial Intelligence

How Does Quantization Shrink LLMs? A Deep Dive into GPTQ, GGUF, and Techniques

This article explains why large language models need quantization, describes the core concepts, classification schemes, symmetric and asymmetric methods, and outlier handling, and compares post‑training quantization (PTQ) with quantization‑aware training (QAT), while detailing popular techniques such as GPTQ, GGUF, and BitNet.

AI hardware · GGUF · GPTQ
25 min read
DeWu Technology
Jul 5, 2023 · Artificial Intelligence

Fine-tuning Large Language Models with LoRA/QLoRA and Deploying via GPTQ Quantization on KubeAI

The article explains how LoRA and its 4‑bit QLoRA extension dramatically reduce trainable parameters and GPU memory for fine‑tuning large language models, while GPTQ post‑training quantization compresses weights for cheap inference, and shows how KubeAI integrates these techniques into a one‑click workflow for 7B, 13B, and 33B models from data upload to API deployment.

GPTQ · KubeAI · Large Language Models
13 min read