Microsoft Research's Open‑Source Native 1‑Bit LLM BitNet b1.58 2B4T: Design, Performance, and Deployment
Microsoft Research released BitNet b1.58 2B4T, the first open‑source native 1‑bit large language model at the 2‑billion‑parameter scale. With an effective precision of 1.58 bits and a memory footprint of only 0.4 GB, it matches comparable full‑precision models on standard benchmarks while enabling efficient CPU and GPU inference for edge‑AI applications.
BitNet b1.58 2B4T is the first open‑source native 1‑bit large language model (LLM) from Microsoft Research, featuring 2 billion parameters, an effective 1.58‑bit precision (each weight restricted to {-1, 0, +1}), and a total size of only 0.4 GB.
The model’s novelty lies in three aspects: (1) b1.58 quantization, which restricts each weight to one of three discrete values; (2) a compact 0.4 GB memory footprint for 2 B parameters, a direct consequence of the 1.58‑bit weights; and (3) a CPU‑optimized inference framework, bitnet.cpp.
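The ternary weight constraint can be sketched with the absmean quantization rule described in the BitNet b1.58 papers: scale the weight tensor by the mean of its absolute values, then round and clip to {-1, 0, +1}. The function name and NumPy setup below are illustrative, not the model's actual implementation:

```python
import numpy as np

def absmean_ternary_quantize(W, eps=1e-6):
    """Quantize a weight matrix to {-1, 0, +1} via absmean scaling
    (a sketch of the b1.58 scheme; names here are hypothetical)."""
    gamma = np.abs(W).mean()                        # per-tensor scale
    Wq = np.clip(np.round(W / (gamma + eps)), -1, 1)
    # Store ternary weights as int8 plus one floating-point scale.
    return Wq.astype(np.int8), gamma

W = np.random.randn(4, 4).astype(np.float32)
Wq, gamma = absmean_ternary_quantize(W)
```

Because each weight carries log2(3) ≈ 1.58 bits of information, 2 B such weights fit in roughly 0.4 GB, which is where the model's footprint comes from.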
BitNet b1.58 2B4T was trained on a 4‑trillion‑token corpus that includes large web crawls (e.g., DCLM, FineWeb‑EDU) and synthetic math data to boost reasoning abilities. After pre‑training, the model underwent supervised fine‑tuning (SFT) with various instruction‑following and dialogue datasets, followed by Direct Preference Optimization (DPO) to further align its behavior with human preferences.
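The DPO stage mentioned above optimizes a simple pairwise objective: push the policy to prefer the chosen response over the rejected one by more than a frozen reference model does. A minimal sketch of the standard DPO loss for a single preference pair (function name and inputs are illustrative):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair,
    given summed log-probabilities of the chosen/rejected responses
    under the policy (pi_*) and the frozen reference model (ref_*)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Loss shrinks as the policy favors the chosen response more than the reference does.
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
                ref_chosen=-11.0, ref_rejected=-12.0)
```

Unlike RLHF with PPO, this needs no separate reward model, which keeps the alignment stage simple.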
Extensive benchmark evaluation—covering language understanding, world knowledge, reading comprehension, mathematics, coding, and instruction‑following—shows that BitNet b1.58 2B4T matches or exceeds full‑precision models of similar scale while using far less memory (0.4 GB) and energy, with an average decoding latency of 29 ms per token.
For inference, Microsoft open‑sourced a dedicated library supporting both GPU and CPU, notably the bitnet.cpp C++ implementation that provides optimized kernels for standard CPU architectures, enabling efficient execution of the mixed‑precision (1.58‑bit weight, 8‑bit activation) model.
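The mixed‑precision scheme (1.58‑bit weights, 8‑bit activations) can be sketched as: quantize activations per row to int8 with absmax scaling, accumulate the matmul in integers, then dequantize. This is a NumPy illustration of the arithmetic, not the optimized bitnet.cpp kernels, and the function names are assumptions:

```python
import numpy as np

def quantize_activations_int8(x):
    # Per-row absmax quantization of activations to 8-bit integers.
    scale = 127.0 / np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-6)
    return np.clip(np.round(x * scale), -128, 127).astype(np.int8), scale

def w158a8_matmul(x, Wq, w_scale):
    """Mixed-precision matmul: Wq holds ternary weights {-1,0,+1} in int8,
    activations are quantized to int8 on the fly (W1.58A8 sketch)."""
    xq, x_scale = quantize_activations_int8(x)
    acc = xq.astype(np.int32) @ Wq.astype(np.int32).T  # integer accumulate
    return acc / x_scale * w_scale                     # dequantize back to float

x = np.random.randn(2, 8).astype(np.float32)
Wq = np.random.choice([-1, 0, 1], size=(4, 8)).astype(np.int8)
y = w158a8_matmul(x, Wq, w_scale=0.05)
```

Because the ternary weights reduce the inner product to additions and subtractions, real kernels can skip multiplications entirely, which is the main source of the CPU speedups the library targets.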
Live demos (CPU and GPU A100) show the model generating coherent text, solving math problems at ~27 tokens/s, and writing code, confirming its practical usability despite its extreme compression.
In summary, BitNet b1.58 2B4T demonstrates that native 1‑bit quantization can deliver full‑scale LLM performance with a fraction of the memory and compute cost, opening new possibilities for edge‑AI deployment, even though challenges remain for non‑English tasks.
Authors: Furu Wei (Distinguished Scientist, Microsoft Research), Shuming Ma (Microsoft Research Asia), and Hongyu Wang (PhD student at the Chinese Academy of Sciences, intern at MSRA).
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.