Tagged articles
5 articles
Old Zhang's AI Learning
Feb 16, 2026 · Artificial Intelligence

A New Extreme Quantization Tool for Large Models: AngelSlim’s 2‑Bit Compression

AngelSlim is a full‑stack large‑model compression suite that uses quantization‑aware training to shrink a 1.8B LLM to 2‑bit precision with less than 4% accuracy loss. It supports a wide range of models as well as speculative decoding, and the article provides end‑to‑end deployment instructions for MacBook M4 and server environments.

AngelSlim · GGUF · Large Language Models
0 likes · 13 min read
AI Algorithm Path
Aug 23, 2025 · Artificial Intelligence

Understanding QAT: Quantization‑Aware Training with PyTorch

This article explains the principles of model quantization, compares post‑training quantization (PTQ) and quantization‑aware training (QAT), details the QAT workflow in PyTorch—including fake quantization, gradient handling, and code examples—and offers practical tips for achieving high‑accuracy int8/int4 models.

Fake Quantization · PyTorch · QAT
0 likes · 15 min read
AI Algorithm Path
Apr 22, 2025 · Artificial Intelligence

Understanding LLM Quantization: GPTQ, QAT, AWQ, GGUF, and GGML Explained

The article walks through the fundamentals of large‑language‑model quantization. It presents a concrete int8 example, explains the GPTQ, GGUF/GGML, QAT, and AWQ methods in detail, and provides step‑by‑step code snippets, formulas, calibration procedures, and performance observations for each technique.

AWQ · GGML · GGUF
0 likes · 15 min read
Meituan Technology Team
Sep 22, 2022 · Artificial Intelligence

Quantization Deployment Scheme for YOLOv6: Methods, Optimizations, and Performance Evaluation

The paper proposes a full quantization pipeline for YOLOv6 that combines a re‑parameterization optimizer, partial PTQ, channel‑wise distillation, graph‑scale merging, and GPU‑offloaded preprocessing. With it, an INT8 model retains ~42% mAP while delivering a throughput increase of over 200% and a 40% QPS gain versus FP16.

Channel Distillation · Model Deployment · PTQ
0 likes · 16 min read
JD Tech Talk
Jan 6, 2021 · Backend Development

JDDLB Architecture and QAT SSL/TLS Hardware Acceleration Optimization

This article details the overall architecture of JD.com Data Science's JDDLB load balancer, its high‑performance and high‑availability features, and presents a comprehensive performance comparison of SSL/TLS offloading using Intel QAT acceleration cards, including async processing, user‑space driver zero‑copy implementation, crash analysis, and process‑level engine scheduling.

Hardware offload · Nginx · QAT
0 likes · 13 min read