Tagged articles
34 articles
Liangxu Linux
May 12, 2026 · Artificial Intelligence

How to Deploy Trained Neural Networks on Arduino and Raspberry Pi

Deploying large AI models to tiny embedded devices like Arduino and Raspberry Pi requires aggressive model slimming through quantization, pruning, and distillation, careful selection of a runtime such as TensorFlow Lite, and attention to power, latency, and debugging challenges to achieve real‑time inference.

Arduino · Embedded AI · Model Pruning
7 min read
Machine Learning Algorithms & Natural Language Processing
May 9, 2026 · Artificial Intelligence

Why OpenClaw Is So Expensive and How QuantClaw Cuts Cost by 21% While Boosting Speed 15%

OpenClaw’s high token consumption drives steep costs, but the QuantClaw plug‑in dynamically routes tasks to 4‑bit, 8‑bit or 16‑bit model instances based on a systematic quantization study, achieving up to 21% cost reduction, 15% latency improvement, and even modest accuracy gains.

AI agents · Cost reduction · Dynamic Precision Routing
9 min read
Machine Heart
May 9, 2026 · Artificial Intelligence

Can QuantClaw Cut OpenClaw Costs by 21% and Speed Up Inference by 15%?

QuantClaw, an open‑source plug‑in for the OpenClaw AI agent framework, uses a systematic quantization study to dynamically route tasks to appropriate model precisions, achieving up to 21% cost reduction, 8‑15% latency improvement, and even higher task scores across diverse workloads.

AI agents · Cost Optimization · Model Quantization
8 min read
DaTaobao Tech
Apr 22, 2026 · Artificial Intelligence

How MNN‑Sana‑Edit‑V2 Brings Comic‑Style Image Editing to Your Phone in 15 Seconds

MNN‑Sana‑Edit‑V2, a collaborative effort between Taobao’s Meta team and Hangzhou University, combines a frozen Qwen3‑0.6B LLM, Learnable Query, Connector, Linear DiT and Deep Compression Autoencoder with 4/8‑bit quantization to run fully on mobile devices, delivering 512×512 comic‑style conversions in about 15 seconds—2.5× faster than cloud alternatives—while providing open‑source code, detailed training stages, and extensive performance benchmarks.

Mobile AI · Model Quantization · diffusion
13 min read
Black & White Path
Apr 8, 2026 · Artificial Intelligence

Run Massive AI Models on a Single PC: The 1‑Bit LLM Revolution

Microsoft’s open‑source bitnet.cpp moves 100‑billion‑parameter LLM inference from GPU‑only setups to ordinary CPUs by replacing floating‑point matrix multiplication with integer add‑subtract operations, cutting energy use by 82%, memory by 90%, and delivering up to a 6× speedup on x86/ARM hardware.

1-bit LLM · BitNet · CPU inference
7 min read
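The add‑subtract idea mentioned in the bitnet.cpp summary above can be illustrated with a minimal sketch. It assumes ternary weights in {-1, 0, +1} (the BitNet b1.58 scheme) and activations that have already been quantized to integers; the function name `ternary_matvec` is hypothetical and this is not the bitnet.cpp kernel, only an illustration of why no multiplications are needed.

```python
from typing import List


def ternary_matvec(weights: List[List[int]], x: List[int]) -> List[int]:
    """Multiply a ternary weight matrix by an integer vector using only add/subtract.

    Assumption: every weight is -1, 0, or +1, so each "multiplication"
    reduces to adding, subtracting, or skipping the activation.
    """
    out = []
    for row in weights:
        if len(row) != len(x):
            raise ValueError("row length must match input length")
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract the activation
            elif w != 0:
                raise ValueError("weights must be -1, 0, or +1")
            # 0 weight: contributes nothing, so it is skipped
        out.append(acc)
    return out


if __name__ == "__main__":
    W = [[1, 0, -1],
         [-1, -1, 1]]
    print(ternary_matvec(W, [2, 3, 5]))
```

A production kernel would pack the weights into a few bits each and vectorize the loop; the sketch only shows the arithmetic.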
DataFunTalk
Apr 7, 2026 · Artificial Intelligence

How a Champion Quantized a 150 GB Multimodal Model in Just 4 Hours

In a four‑hour competition, algorithm engineer Zhang Zhen from a Chinese EV company detailed his end‑to‑end workflow for quantizing the massive Qwen3‑Next‑80B model, covering sensitive‑layer analysis, iterative smoothing, fallback strategies, and parallel "horse‑race" debugging that led his team to win the GeekDay challenge.

Iterative Smooth · Model Quantization · large language models
9 min read
Old Zhang's AI Learning
Mar 16, 2026 · Artificial Intelligence

Testing Claude‑Opus‑4.6 Distilled Qwen3.5 9B Model Locally via LM Studio and Claude Code

The article evaluates the GGUF‑quantized Claude‑Opus‑4.6 distilled Qwen3.5 9B model on a 16 GB Mac Mini M4 using LM Studio, detailing model sizes, performance metrics, deployment steps, API integration with Claude Code, and concluding that while the 9B version is usable, its capabilities remain limited compared to larger models.

Claude Opus · GGUF · LM Studio
12 min read
AI Engineering
Mar 11, 2026 · Artificial Intelligence

Run Claude Code Locally with Qwen 3.5 to Skip Anthropic API Costs

This guide shows how to replace Anthropic's API by running a local Qwen 3.5 model with llama.cpp, configuring Claude Code via ANTHROPIC_BASE_URL, and includes hardware checks, build steps, model download, server launch, speed‑fix tips, and usage instructions for secure, cost‑free development.

Anthropic API · Claude Code · GPU Acceleration
8 min read
MaGe Linux Operations
Mar 10, 2026 · Artificial Intelligence

Why Your LLM Service Hits CUDA OOM and How to Diagnose GPU Memory Issues

This guide explains the five common sources of GPU memory consumption in large‑model inference services, provides a step‑by‑step diagnosis workflow (from static usage and KV‑cache analysis to concurrency and K8s scheduling), and offers concrete command‑line checks, scripts, configuration examples, and actionable remediation and monitoring recommendations.

GPU Memory · KV cache · LLM OOM
28 min read
Baidu Intelligent Cloud Tech Hub
Mar 6, 2026 · Artificial Intelligence

How Baidu’s End‑to‑End Quantization Stack Supercharges Large‑Model Inference on Kunlun XPU

Baidu Baige built a full‑stack quantization pipeline that integrates model‑level, framework‑level, and hardware‑level optimizations on the Kunlun XPU platform, enabling FP16/BF16 large models to be compressed to 25‑50% of their original size while boosting inference speed by 30‑50% and dramatically reducing memory consumption for enterprise deployments.

AI inference · Hardware acceleration · INT4
16 min read
Huolala Tech
Mar 6, 2026 · Artificial Intelligence

How Huolala’s Dolphin Platform Cuts Large‑Model Inference Costs by Up to 60%

The article details how Huolala’s Dolphin platform engineers large‑model inference for high‑concurrency, long‑context, low‑latency production workloads, achieving 50‑60% GPU cost reduction through systematic resource allocation, model quantization, PD‑separation, speculative sampling, and kernel‑level optimizations while maintaining service stability.

GPU utilization · Model Quantization · Performance Evaluation
18 min read
Fun with Large Models
Jan 18, 2026 · Artificial Intelligence

Step‑by‑Step Guide to Deploying Large Language Models Locally with vLLM and Ollama

This article walks through two mainstream local deployment solutions, high‑performance vLLM for production Linux servers and lightweight Ollama for personal Windows machines, covering environment setup, model download, server launch, API testing, key configuration parameters, and the quantization technique that makes Ollama models compact.

GPU Optimization · Model Quantization · Ollama
18 min read
Data Party THU
Nov 21, 2025 · Artificial Intelligence

Unlocking 2025 Multi-Agent AI: Core Tech, Frameworks, and Emerging Trends

This article analyzes the technical foundations, development frameworks, real‑time inference optimizations, typical industry deployments, and future research directions of multi‑agent systems in 2025, highlighting protocols like FIPA‑ACL and MCP, tools such as LangGraph and ADP3.0, and edge‑computing breakthroughs.

AI Architecture · Model Quantization · distributed computing
16 min read
Huawei Cloud Developer Alliance
Oct 13, 2025 · Artificial Intelligence

How Quantization, Pruning, and Distillation Shrink AI Models for Edge Devices

This article explains the principles, key methods, and practical effects of model quantization, pruning, and knowledge distillation, comparing their advantages and disadvantages, and showing how combining these techniques enables compact, high‑performance AI models on resource‑constrained devices.

Model Pruning · Model Quantization · edge AI
7 min read
JD Cloud Developers
Sep 11, 2025 · Artificial Intelligence

How to Seamlessly Migrate AI Workloads from Nvidia GPUs to Domestic Accelerators

This article explains why migrating AI applications from Nvidia GPUs to domestic Chinese accelerators is urgent, outlines the technical challenges, and presents JD Cloud's JoyScale zero‑perception migration stack with hardware, software, model, and inference optimizations for real‑world scenarios.

AI migration · JoyScale · Model Quantization
10 min read
Data Thinking Notes
Jul 6, 2025 · Artificial Intelligence

How Quantization Shrinks Giant AI Models for Edge Devices

This article explains why quantizing massive AI models is essential for deploying them on resource‑constrained devices, outlines core quantization concepts, techniques, and methods, compares their pros and cons, and presents practical application scenarios such as smartphones, autonomous driving, IoT, and edge computing.

AI deployment · Model Quantization · Performance Optimization
9 min read
Instant Consumer Technology Team
Jul 3, 2025 · Artificial Intelligence

Why Buying an AI Appliance Is a Strategic Pitfall for Enterprises

Enterprises rushing to purchase DeepSeek AI appliances and smart‑agent platforms often face hidden technical, data, and organizational challenges that turn promised "plug‑and‑play" solutions into costly missteps, highlighting the need for realistic strategy, robust data governance, and continuous capability building.

AI capability building · AI deployment · Data Governance
28 min read
Architect
Apr 21, 2025 · Artificial Intelligence

Microsoft Research Releases BitNet b1.58 2B4T: A 1‑Bit Native Large Language Model with Ultra‑Low Memory and Energy Consumption

Microsoft Research introduced BitNet b1.58 2B4T, a native 1‑bit large language model with 2 billion parameters trained on 4 trillion tokens, achieving only 0.4 GB non‑embedding memory, 0.028 J decoding energy, and 29 ms CPU latency while matching full‑precision performance.

1-bit LLM · AI research · BitNet
7 min read
DaTaobao Tech
Apr 21, 2025 · Artificial Intelligence

How MNN LLM Delivers Fast, Stable On‑Device LLM Inference for Android, iOS, and Desktop

Facing DeepSeek R1 server instability, the open‑source MNN LLM framework offers local, mobile‑friendly deployment with model quantization and hardware‑specific optimizations, dramatically improving inference speed, stability, and download reliability across Android, iOS, and desktop platforms while supporting multimodal inputs.

Android · LLM · MNN
11 min read
Architect's Alchemy Furnace
Mar 31, 2025 · Artificial Intelligence

Which Model Quantization Wins? Deep Dive into q4_0, q5_K_M, and q8_0

An in‑depth technical analysis compares popular model quantization schemes—q4_0, q5_K_M, and q8_0—detailing their precision trade‑offs, memory savings, inference speed, hardware compatibility, and ideal use‑cases, complemented by performance benchmarks on Llama‑3‑8B and practical selection guidelines.

AI Optimization · LLM Performance · Model Quantization
7 min read
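The precision trade‑off between 4‑bit and 8‑bit schemes discussed in the summary above can be sketched with a simplified block‑wise symmetric quantizer. This is an assumption‑laden illustration: it stores one scale per block of 32 values, loosely in the spirit of q8_0, and it is not the GGML/llama.cpp implementation of q4_0, q5_K_M, or q8_0. The names `quantize_block` and `dequantize_block` are hypothetical.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # assumed block length: one scale shared by 32 weights


def quantize_block(values: List[float], bits: int = 8) -> Tuple[float, List[int]]:
    """Symmetric quantization of one block: returns (scale, integer codes)."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8-bit, 7 for 4-bit
    amax = max(abs(v) for v in values)            # largest magnitude in the block
    scale = amax / qmax if amax > 0 else 1.0      # avoid division by zero
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale: float, codes: List[int]) -> List[float]:
    """Reconstruct approximate floats from the scale and integer codes."""
    return [scale * c for c in codes]


def max_abs_error(values: List[float], bits: int) -> float:
    """Worst-case reconstruction error for one block at the given bit width."""
    scale, codes = quantize_block(values, bits)
    restored = dequantize_block(scale, codes)
    return max(abs(a - b) for a, b in zip(values, restored))


if __name__ == "__main__":
    block = [i / 10 for i in range(-16, 16)]      # 32 sample weights
    for bits in (8, 4):
        print(f"{bits}-bit max abs error: {max_abs_error(block, bits):.4f}")
```

Fewer bits mean a coarser grid within each block, so the 4‑bit error is larger; real formats differ in how they choose scales and offsets, which is where the schemes compared in the article diverge.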
AI Algorithm Path
Mar 10, 2025 · Artificial Intelligence

How Much GPU Memory Does an LLM Service Really Need?

This article explains a simple formula for estimating the GPU VRAM required to serve large language models, demonstrates the calculation with a 7‑billion‑parameter example, clarifies why a 20% safety buffer is needed, and offers practical strategies such as quantization, CPU offload, and multi‑GPU parallelism to reduce memory usage.

Deployment · GPU Memory · LLM
6 min read
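As a rough illustration of the kind of estimate the summary above describes, the sketch below multiplies parameter count by bytes per parameter and adds a 20% buffer. The exact formula used in the article is not reproduced here; this is a common rule of thumb, and the function name `estimate_vram_gb` and its parameters are hypothetical. It counts weights only and ignores KV cache growth with long contexts.

```python
def estimate_vram_gb(params_billions: float,
                     bits_per_param: int = 16,
                     overhead: float = 0.20) -> float:
    """Rule-of-thumb VRAM estimate in GB for serving a model.

    Assumption: memory is dominated by the weights, plus a fractional
    buffer (default 20%) for activations, cache, and runtime overhead.
    """
    bytes_per_param = bits_per_param / 8
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    weights_gb = params_billions * bytes_per_param
    return weights_gb * (1 + overhead)


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"7B model at {bits}-bit: {estimate_vram_gb(7, bits):.1f} GB")
```

For a 7‑billion‑parameter model this gives about 16.8 GB at 16‑bit, 8.4 GB at 8‑bit, and 4.2 GB at 4‑bit, which shows why quantization is the first lever for fitting a model on a smaller GPU.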
Baobao Algorithm Notes
Jul 31, 2024 · Artificial Intelligence

What Makes Mistral’s 7B, Mixtral, and Large 2 Models Stand Out? A Deep Technical Dive

This article compiles key technical details of the Mistral model family—including Mistral 7B, Mixtral 8×7B, Mixtral 8×22B, Mistral Nemo, and Mistral Large 2—covering their architectural innovations such as sliding‑window attention, grouped‑query attention, mixture‑of‑experts design, scaling parameters, performance benchmarks, quantization requirements, and practical deployment commands.

Grouped Query Attention · Mistral · Mixtral
17 min read
Baidu Geek Talk
Dec 6, 2023 · Industry Insights

From MLOps to LMOps: Challenges and Solutions for Large‑Model Operations

This article reviews the evolution from MLOps to LMOps, outlines the core concepts, challenges, and key technologies such as large‑model inference optimization, prompt engineering, and context‑length extension, and offers a forward‑looking perspective on the future of AI operations.

AI Operations · LMOps · MLOps
23 min read
Baidu Tech Salon
Nov 10, 2023 · Artificial Intelligence

Baidu Search Deep Learning Model Architecture and Optimization Practices

Baidu's Search Architecture team details how its deep‑learning models have evolved to deliver direct answer results via semantic embeddings, describes a massive online inference pipeline that rewrites queries, ranks relevance, and classifies types, and outlines optimization techniques—including data I/O, CPU/GPU balancing, pruning, quantization, and distillation—to achieve high‑throughput, low‑latency search.

Baidu · GPU Optimization · Inference System
13 min read
High Availability Architecture
Jun 15, 2023 · Artificial Intelligence

InferX Inference Framework: Challenges, Architecture, Optimizations, and Triton Integration

The article presents the background, challenges, and objectives of Bilibili's AI services, introduces the self‑developed InferX inference framework with its quantization and sparsity optimizations, details OCR‑specific enhancements, and describes how integrating InferX with Nvidia Triton dramatically improves throughput, latency, and GPU utilization.

AI Optimization · CUDA · Inference
10 min read
21CTO
Mar 31, 2023 · Artificial Intelligence

How ColossalChat Replicates ChatGPT with a Complete Open‑Source RLHF Pipeline

ColossalChat, an open‑source project built on LLaMA, offers a full RLHF pipeline—including supervised fine‑tuning, reward‑model training, and reinforcement learning—enabling low‑cost, bilingual ChatGPT‑like models with 4‑bit quantized inference, detailed code, dataset, and performance optimizations.

AI Infrastructure · ColossalAI · Model Quantization
12 min read
DataFunSummit
Nov 20, 2022 · Artificial Intelligence

NLP Technology Applications and Research in Voice Assistants

This article presents an in‑depth overview of NLP techniques used in voice assistants, covering the end‑to‑end conversational AI pipeline, intent and slot modeling, multi‑turn dialog management, model deployment pipelines, quantization methods, and self‑learning strategies for continuous improvement.

Conversational AI · Model Quantization · NLP
30 min read
Meituan Technology Team
Nov 17, 2022 · Artificial Intelligence

Overview of Recent Meituan Visual Intelligence Research Papers on Content Production, Distribution, and Model Quantization

Meituan’s Visual Intelligence team recently published eight top‑conference papers that advance weakly supervised segmentation, future‑aware captioning, panoptic narrative grounding, video‑text retrieval, open‑vocabulary detection, counterfactual image‑text matching, zero‑shot video classification, and efficient Vision‑Transformer quantization, all directly boosting real‑world content creation, distribution, and model efficiency.

AI research · Image Captioning · Model Quantization
19 min read
Code DAO
May 5, 2022 · Artificial Intelligence

Optimizing Machine Learning Models for Edge Devices with TensorFlow Lite

This article explains how to convert a TensorFlow image‑classification model to TensorFlow Lite, apply different quantization techniques, benchmark the resulting models on a Raspberry Pi 4, and compare latency, size, and accuracy to demonstrate the trade‑offs of edge AI deployment.

EfficientNet · Model Quantization · Python
16 min read
DataFunTalk
Dec 19, 2019 · Artificial Intelligence

Model Quantization in Neural Networks: Challenges, Solutions, and Future Directions

This article reviews neural‑network model quantization, explaining why quantization is needed, detailing forward‑ and backward‑propagation issues, presenting three main mitigation strategies, discussing subsequent pruning, performance‑recovery techniques, and outlining future research avenues in efficient machine learning.

Model Quantization · Neural Networks · efficient machine learning
27 min read
Alibaba Cloud Developer
Jun 11, 2019 · Artificial Intelligence

How ACE Powers Edge AI: A Heterogeneous Compute Engine for Real‑Time Inference

This article explains the design of ACE (AI Labs Compute Engine), a heterogeneous edge compute platform that combines model quantization, GPU/DSP/VPU acceleration, cloud‑edge model management, and custom algorithm integration to enable low‑latency AI services such as gesture, pet, and pen‑tip detection on resource‑constrained devices.

AI inference · Edge Computing · Embedded AI
13 min read