Tag

Model Quantization


Architect
Apr 21, 2025 · Artificial Intelligence

Microsoft Research Releases BitNet b1.58 2B4T: A 1‑Bit Native Large Language Model with Ultra‑Low Memory and Energy Consumption

Microsoft Research introduced BitNet b1.58 2B4T, a native 1‑bit large language model with 2 billion parameters trained on 4 trillion tokens, requiring only 0.4 GB of non‑embedding memory, 0.028 J of decoding energy, and 29 ms of CPU latency while matching the performance of comparable full‑precision models.
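
The "1.58‑bit" in the model name refers to ternary weights, each taking one of three values {-1, 0, +1}. As a rough illustration (not the article's own code), the BitNet line of work quantizes a weight tensor by scaling with its mean absolute value and rounding to the nearest ternary level:

```python
import numpy as np

def absmean_ternary_quantize(w):
    # Sketch of ternary (1.58-bit) weight quantization in the spirit of
    # BitNet b1.58: scale by the mean absolute value of the tensor,
    # then round each weight to the nearest value in {-1, 0, +1}.
    scale = np.mean(np.abs(w)) + 1e-8  # epsilon avoids division by zero
    w_q = np.clip(np.round(w / scale), -1, 1)
    return w_q.astype(np.int8), scale

w = np.array([0.42, -0.05, -1.3, 0.9])
w_q, scale = absmean_ternary_quantize(w)
# w_q contains only -1, 0, +1; w_q * scale approximates w
```

Because every weight fits in under two bits and matrix multiplies reduce to additions and sign flips, memory and energy drop sharply relative to FP16 inference.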

1-bit LLM · AI research · BitNet
7 min read
Baidu Tech Salon
Nov 10, 2023 · Artificial Intelligence

Baidu Search Deep Learning Model Architecture and Optimization Practices

Baidu's Search Architecture team details how its deep‑learning models have evolved to deliver direct answer results via semantic embeddings, describes a massive online inference pipeline that rewrites queries, ranks relevance, and classifies types, and outlines optimization techniques—including data I/O, CPU/GPU balancing, pruning, quantization, and distillation—to achieve high‑throughput, low‑latency search.
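
Of the optimization techniques listed, magnitude pruning is the simplest to sketch: weights with the smallest absolute values are zeroed until a target sparsity is reached, so the remaining nonzero weights can be stored and multiplied more cheaply. A minimal illustration (my own, not Baidu's code):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    # Illustrative magnitude pruning: zero out the smallest-|w| entries
    # until at least the requested fraction of weights is removed.
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

w = np.arange(-5.0, 5.0)          # [-5, -4, ..., 4], 10 values
p = magnitude_prune(w, sparsity=0.3)
# the three entries with smallest magnitude (-1, 0, 1) are zeroed
```

In production pipelines, pruning is typically followed by fine-tuning to recover accuracy, and is combined with quantization and distillation as the summary describes.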

Baidu · GPU optimization · Inference System
13 min read
High Availability Architecture
Jun 15, 2023 · Artificial Intelligence

InferX Inference Framework: Challenges, Architecture, Optimizations, and Triton Integration

The article presents the background, challenges, and objectives of Bilibili's AI services, introduces the self‑developed InferX inference framework with its quantization and sparsity optimizations, details OCR‑specific enhancements, and describes how integrating InferX with Nvidia Triton dramatically improves throughput, latency, and GPU utilization.

AI optimization · CUDA · Inference
10 min read
DataFunSummit
Nov 20, 2022 · Artificial Intelligence

NLP Technology Applications and Research in Voice Assistants

This article presents an in‑depth overview of NLP techniques used in voice assistants, covering the end‑to‑end conversational AI pipeline, intent and slot modeling, multi‑turn dialog management, model deployment pipelines, quantization methods, and self‑learning strategies for continuous improvement.

Model Quantization · NLP · Voice Assistant
30 min read
DataFunTalk
Dec 19, 2019 · Artificial Intelligence

Model Quantization in Neural Networks: Challenges, Solutions, and Future Directions

This article reviews neural‑network model quantization, explaining why quantization is needed, detailing forward‑ and backward‑propagation issues, presenting three main mitigation strategies, discussing subsequent pruning and performance‑recovery techniques, and outlining future research avenues in efficient machine learning.
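
The core idea the article surveys is mapping floating‑point tensors onto a small integer grid. A minimal sketch of symmetric per‑tensor INT8 quantization (a generic illustration, not the article's specific method) shows both the mapping and the bounded reconstruction error:

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor INT8 quantization: map the range
    # [-max|x|, +max|x|] linearly onto integers in [-127, 127].
    scale = np.max(np.abs(x)) / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original floats.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(1000).astype(np.float32)
q, s = quantize_int8(x)
x_hat = dequantize(q, s)
# round-to-nearest bounds the error by half a quantization step
assert np.max(np.abs(x - x_hat)) <= s / 2 + 1e-6
```

The rounding in the forward pass has zero gradient almost everywhere, which is exactly the backward‑propagation issue the article discusses; the usual workaround is a straight‑through estimator during training.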

Model Quantization · Efficient Machine Learning · Hardware Acceleration
27 min read