A New Extreme Quantization Tool for Large Models: AngelSlim’s 2‑Bit Compression

AngelSlim is a full‑stack large‑model compression suite. It uses quantization‑aware training to shrink a 1.8B LLM to 2‑bit precision with less than 4% accuracy loss, supports a wide range of models as well as speculative decoding, and ships end‑to‑end deployment instructions for the MacBook M4 and for server environments.


Why Edge‑Device LLMs Are Stuck

Running large language models on edge devices is limited by GPU memory; a full‑size 671B model requires hundreds of gigabytes, and even a 4‑bit quantized 7B model needs 6‑8 GB of VRAM.
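
As a rough rule of thumb, weight memory is parameters × bits per weight ÷ 8, plus runtime overhead for the KV cache and activations. A back‑of‑envelope sketch in Python (the 20% overhead factor is an assumption, not a measured figure):

def weight_memory_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    # Weights take params * bits / 8 bytes; the overhead factor crudely
    # accounts for KV cache, activations and runtime buffers (assumed ~20%).
    bytes_total = params_billions * 1e9 * bits_per_weight / 8 * overhead
    return bytes_total / 1024**3

print(f"{weight_memory_gb(671, 8):.0f} GB")   # 671B at 8-bit: hundreds of GB
print(f"{weight_memory_gb(7, 4):.1f} GB")     # 4-bit 7B: a few GB for weights alone
print(f"{weight_memory_gb(1.8, 2):.2f} GB")   # 2-bit 1.8B: roughly half a GB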

Two traditional workarounds exist: using tiny sub‑billion‑parameter models, which lack reasoning ability, or applying aggressive post‑training quantization (PTQ) down to 2‑bit, which typically destroys accuracy.

AngelSlim’s Third Path: QAT‑Based 2‑Bit Compression

Tencent’s AngelSlim team proposes quantization‑aware training (QAT) to keep a 1.8B model usable at 2‑bit precision. QAT learns quantization error during training, limiting accuracy loss.
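
The details of AngelSlim’s QAT recipe are in the technical report linked at the end of this article. Purely to illustrate the idea, a fake‑quantization forward pass with a straight‑through estimator might look like the sketch below (a generic PyTorch illustration, not AngelSlim’s implementation):

import torch

def fake_quant_2bit(w: torch.Tensor) -> torch.Tensor:
    # Simulate 2-bit symmetric quantization (4 levels: -2, -1, 0, 1 with a
    # per-row scale) in the forward pass, while the straight-through estimator
    # lets gradients reach the full-precision weights. Generic sketch only.
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 2
    q = torch.clamp(torch.round(w / scale), -2, 1)
    w_q = q * scale
    return w + (w_q - w).detach()   # forward: w_q, backward: gradient of w

# During QAT every linear layer sees fake-quantized weights:
x = torch.randn(4, 16)
w = torch.randn(8, 16, requires_grad=True)
y = torch.nn.functional.linear(x, fake_quant_2bit(w))
y.sum().backward()                  # gradients flow back to the fp32 weights

Training against this simulated error is what lets the model adapt to the 2‑bit grid instead of being quantized blindly after the fact.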

AngelSlim Feature Set

1. Quantization – supports a full family of algorithms: FP8 (static/dynamic), INT8, INT4 (GPTQ/AWQ/GPTQ‑AQ), NVFP4, plus proprietary Tequila (ternary) and Sherry (1.25‑bit).

2. Speculative Decoding – integrates Eagle3 and SpecExit, delivering a 1.4‑1.9× speedup and complementing vLLM’s native speculative decoding (a conceptual draft‑and‑verify sketch follows this list).

3. Diffusion Model Quantization – provides FP8 quantization and cache acceleration for FLUX, Hunyuan‑Image/Video/3D, and other generative models.
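
Eagle3 and SpecExit have their own mechanics, but the draft‑and‑verify idea behind speculative decoding can be sketched as follows. The two callables are hypothetical placeholders for the small draft model and the large target model, and the acceptance rule is a greedy simplification, not the actual Eagle3/SpecExit algorithm:

def speculative_decode_step(draft_next_token, target_verify, prefix, k=4):
    # draft_next_token(seq): next token from the cheap draft model.
    # target_verify(prefix, draft): for each i, the target model's greedy
    # token given prefix + draft[:i], all obtained in a single forward pass.
    # 1. The draft model proposes k candidate tokens autoregressively.
    draft, seq = [], list(prefix)
    for _ in range(k):
        t = draft_next_token(seq)
        draft.append(t)
        seq.append(t)
    # 2. The target model checks all k positions at once.
    verified = target_verify(prefix, draft)
    # 3. Accept draft tokens until the first disagreement; keep the target's
    #    own token at the mismatch, so each step emits at least one token.
    out = list(prefix)
    for d, v in zip(draft, verified):
        if d == v:
            out.append(d)
        else:
            out.append(v)
            break
    return out

The speedup comes from verifying several cheap draft tokens with one pass of the large model instead of one pass per generated token.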

Supported Model Lineup

LLM: Hunyuan series, Qwen3/2.5, DeepSeek V3/R1, GLM‑4.6

VLM: Hunyuan‑VL, HunyuanOCR, Qwen3‑VL, Qwen2.5‑VL

Diffusion: Hunyuan‑Image/Video/3D, FLUX, Wan, SDXL

Audio: Qwen3‑Omni, Qwen2‑Audio, Fun‑CosyVoice3

HY‑1.8B‑2Bit: The 2‑Bit Edge Champion

Released on 2026‑02‑09, this model is built from Hunyuan‑1.8B‑Instruct using QAT.

Weights quantized to 2‑bit precision.

Accuracy drop versus full‑precision: only 3.97%.

Gap to INT4 version: 0.13% while using half the storage.

Outperforms a full‑precision 0.5B model by 16%.

These numbers show a 2‑bit 1.8B model matches a 4‑bit counterpart with roughly half the footprint and surpasses a dense 0.5B model across metrics.

Dual‑CoT Reasoning

The model inherits Hunyuan‑1.8B‑Instruct’s full‑strength (“full‑blood”) reasoning mode and adds a Dual‑CoT strategy:

Short CoT: fast answers for simple queries, low latency.

Long CoT: deep reasoning for complex questions, higher accuracy.

This design balances user experience and compute resources on edge devices.
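
On device, the application decides per query how much reasoning to pay for; the /no_think prefix shown in the llama.cpp examples later in this article switches reasoning off entirely. A minimal routing sketch (the decision logic is left to the caller and is purely illustrative):

def build_prompt(query: str, needs_reasoning: bool) -> str:
    # Prefix with /no_think when deep reasoning is not needed; deciding
    # needs_reasoning (query length, topic, a classifier...) is up to the app.
    return query if needs_reasoning else "/no_think" + query

print(build_prompt("What is 2 + 2?", needs_reasoning=False))
print(build_prompt("Prove that sqrt(2) is irrational.", needs_reasoning=True))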

Running on a MacBook M4

HY‑1.8B‑2Bit is distributed in GGUF format and can be executed with llama.cpp. It currently requires CPUs with the SME2 instruction set (Apple M4, vivo x300, other Arm SME2 CPUs). M1/M2/M3 are not yet supported.

# 1. Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

# 2. Switch to SME2 branch
git fetch origin pull/19357/head:pr-19357-sme2-int2
git checkout pr-19357-sme2-int2

# 3. Build (enable KleidiAI, disable Metal)
mkdir build && cd build
cmake -DGGML_CPU_KLEIDIAI=ON -DGGML_METAL=OFF -DGGML_BLAS=OFF ..
make -j8

# 4. Download GGUF model from HuggingFace
# https://huggingface.co/AngelSlim/HY-1.8B-2Bit-GGUF

# 5. Quantize if needed (fp16 → q2_0)
./bin/llama-quantize hunyuan-fp16-qdq.gguf hunyuan-q2_0.gguf q2_0c

Inference command examples:

export GGML_KLEIDIAI_SME=1

# With reasoning (short CoT); the prompt "写一副春联" asks for a Spring Festival couplet
./bin/llama-cli -m hunyuan-q2_0.gguf -p "写一副春联" -t 1 --seed 4568 -n 32

# Without reasoning (faster): the /no_think prefix skips the thinking chain
./bin/llama-cli -m hunyuan-q2_0.gguf -p "/no_think写一副春联" -t 1 --seed 4568 -n 32

Key parameters: -t 1 (number of CPU threads), -n 32 (number of tokens to generate), and the /no_think prefix to skip the CoT chain.

Server‑Side Deployment

Quantized models can be served with vLLM or SGLang:

# Install AngelSlim and run a quantization recipe (FP8 static for Qwen3-1.7B shown here)
pip install angelslim
python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml

# Serve the quantized model with vLLM or SGLang
bash scripts/deploy/run_vllm.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -p 1 -g 0.8 --max-model-len 4096
bash scripts/deploy/run_sglang.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -g 0.8

Both expose an OpenAI‑compatible API; AngelSlim’s advantage is broader algorithm support (including 2‑bit and 1.25‑bit) and built‑in calibration data.
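
Once either server is up, any OpenAI‑compatible client can call it. For example, with the official openai Python package (base_url matches the --port 8080 above; the model name is a placeholder and must match whatever name the server registered, typically the model path):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen3-1.7B-FP8-Static",   # placeholder; use your served model name
    messages=[{"role": "user", "content": "Explain 2-bit quantization briefly."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)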

Proprietary Quantization Algorithms

Tequila: ternary quantization with weights limited to {-1, 0, +1} for extreme compression.

Sherry: 1.25‑bit quantization, even more aggressive than 2‑bit while remaining hardware‑friendly.

1.25‑bit means each weight is represented by an average of 1.25 bits, approaching the theoretical limit of model compression.
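
Tequila and Sherry are AngelSlim’s own algorithms and their details live in the technical report. As a generic illustration of what ternary weight quantization looks like, here is an absmean‑scaled sketch (not Tequila itself):

import torch

def ternary_quantize(w: torch.Tensor):
    # Generic ternary sketch: map each weight to {-1, 0, +1} times a single
    # per-tensor scale chosen as the mean absolute weight.
    scale = w.abs().mean().clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -1, 1)
    return q.to(torch.int8), scale       # ternary codes + one fp scale

w = torch.randn(256, 256)
q, s = ternary_quantize(w)
w_hat = q.float() * s                    # dequantized approximation of w
print((w - w_hat).abs().mean())          # reconstruction error of the sketch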

Pros and Cons

✅ Full suite of quantization methods (FP8 to 1.25‑bit) in a single framework.

✅ HY‑1.8B‑2Bit loses only ~4% accuracy, outperforming a dense 0.5B model.

✅ Supports GGUF export and llama.cpp inference on MacBook M4.

✅ Server‑side deployment via vLLM/SGLang is production‑ready.

✅ Wide model coverage (DeepSeek, Qwen, Hunyuan, FLUX, etc.).

✅ Dual‑CoT design is practical for edge scenarios.

⚠️ 2‑bit GGUF works only on SME2 devices; M1/M2/M3 are unsupported.

⚠️ 1.8B model’s intrinsic capability ceiling limits very complex tasks.

⚠️ Documentation is primarily English; Chinese docs are still in progress.

⚠️ Tequila and Sherry are experimental.

Target Audience

Edge‑AI developers (M4 devices preferred).

Teams needing large‑scale quantized model deployment.

Researchers interested in quantization techniques.

Broader Implication

AngelSlim exemplifies a trend: the future of large models is not only bigger but also smaller, faster, and more efficient.

When a dense 0.5B model and a 2‑bit 1.8B model occupy roughly the same storage, the latter is ~22% stronger on mathematics and ~21% stronger on programming tasks: a decisive advantage for compressing a larger model over training a smaller one.

Technical report: https://huggingface.co/AngelSlim/HY-1.8B-2Bit/blob/main/AngelSlim_Technical_Report.pdf

Project links:

GitHub: https://github.com/Tencent/AngelSlim

HY‑1.8B‑2Bit weights: https://huggingface.co/AngelSlim/HY-1.8B-2Bit

GGUF version: https://huggingface.co/AngelSlim/HY-1.8B-2Bit-GGUF

Documentation: https://angelslim.readthedocs.io/

Tags: model compression, quantization, large language models, speculative decoding, QAT, GGUF, AngelSlim
Written by Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
