Tagged articles

34 articles

May 12, 2026 · Artificial Intelligence

How to Deploy Trained Neural Networks on Arduino and Raspberry Pi

Deploying large AI models to tiny embedded devices like Arduino and Raspberry Pi requires aggressive model slimming through quantization, pruning, and distillation, careful selection of a runtime such as TensorFlow Lite, and attention to power, latency, and debugging challenges to achieve real‑time inference.

Arduino · Embedded AI · Model Pruning

0 likes · 7 min read

Machine Learning Algorithms & Natural Language Processing

May 9, 2026 · Artificial Intelligence

Why OpenClaw Is So Expensive and How QuantClaw Cuts Cost by 21% While Boosting Speed 15%

OpenClaw’s high token consumption drives steep costs, but the QuantClaw plug‑in dynamically routes tasks to 4‑bit, 8‑bit or 16‑bit model instances based on a systematic quantization study, achieving up to 21% cost reduction, 15% latency improvement, and even modest accuracy gains.

AI agents · Cost reduction · Dynamic Precision Routing

0 likes · 9 min read

Machine Heart

May 9, 2026 · Artificial Intelligence

Can QuantClaw Cut OpenClaw Costs by 21% and Speed Up Inference by 15%?

QuantClaw, an open‑source plug‑in for the OpenClaw AI agent framework, uses a systematic quantization study to dynamically route tasks to appropriate model precisions, achieving up to 21% cost reduction, 8‑15% latency improvement, and even higher task scores across diverse workloads.

AI agents · Cost Optimization · Model Quantization

0 likes · 8 min read

DaTaobao Tech

Apr 22, 2026 · Artificial Intelligence

How MNN‑Sana‑Edit‑V2 Brings Comic‑Style Image Editing to Your Phone in 15 seconds

MNN‑Sana‑Edit‑V2, a collaboration between Taobao’s Meta team and Hangzhou University, combines a frozen Qwen3‑0.6B LLM, Learnable Query, Connector, Linear DiT, and Deep Compression Autoencoder with 4/8‑bit quantization to run fully on mobile devices. It delivers 512×512 comic‑style conversions in about 15 seconds, 2.5× faster than cloud alternatives, and the article provides open‑source code, detailed training stages, and extensive performance benchmarks.

Image Generation · Mobile AI · Model Quantization

0 likes · 13 min read

Black & White Path

Apr 8, 2026 · Artificial Intelligence

Run Massive AI Models on a Single PC: The 1‑Bit LLM Revolution

Microsoft’s open‑source bitnet.cpp moves 100‑billion‑parameter LLM inference from GPU‑only hardware to ordinary CPUs by replacing floating‑point matrix multiplication with integer addition and subtraction, cutting energy use by 82% and memory by 90% and delivering up to a 6× speedup on x86/ARM hardware.

1-bit LLM · BitNet · CPU inference

0 likes · 7 min read

DataFunTalk

Apr 7, 2026 · Artificial Intelligence

How a Champion Quantized a 150 GB Multimodal Model in Just 4 Hours

In a four‑hour competition, algorithm engineer Zhang Zhen from a Chinese EV company detailed his end‑to‑end workflow for quantizing the massive Qwen3‑Next‑80B model, covering sensitive‑layer analysis, iterative smoothing, fallback strategies, and parallel "horse‑race" debugging that led his team to win the GeekDay challenge.

Iterative Smooth · Large Language Models · Model Quantization

0 likes · 9 min read

Lao Guo's Learning Space

Apr 4, 2026 · Artificial Intelligence

Which Mac Studio Config Can Run the Largest AI Models? A One-Table Guide

The article explains how the unified memory architecture and high memory bandwidth of Apple’s updated 2025 Mac Studio determine the size of AI models it can run, compares M4 Max and M3 Ultra configurations, maps memory capacity to model parameter counts, and recommends setups for various use cases.

Large Language Models · M3 Ultra · M4 Max

0 likes · 8 min read

Old Zhang's AI Learning

Mar 16, 2026 · Artificial Intelligence

Testing Claude‑Opus‑4.6 Distilled Qwen3.5 9B Model Locally via LM Studio and Claude Code

The article evaluates the GGUF‑quantized Claude‑Opus‑4.6 distilled Qwen3.5 9B model on a 16 GB Mac Mini M4 using LM Studio, detailing model sizes, performance metrics, deployment steps, API integration with Claude Code, and concluding that while the 9B version is usable, its capabilities remain limited compared to larger models.

Claude Opus · GGUF · LM Studio

0 likes · 12 min read

macrozheng

Mar 16, 2026 · Artificial Intelligence

How LLMFit Automates Hardware Compatibility Checks for Local Large‑Model Deployment

LLMFit, a Rust‑based terminal tool, automatically detects system hardware, recommends optimal quantization levels, and scores models across multiple dimensions, enabling developers to quickly identify and run large language models that suit their machines without trial‑and‑error.

CLI tool · LLM · Model Quantization

0 likes · 5 min read

AI Engineering

Mar 11, 2026 · Artificial Intelligence

Run Claude Code Locally with Qwen 3.5 to Skip Anthropic API Costs

This guide shows how to replace Anthropic's API by running a local Qwen 3.5 model with llama.cpp, configuring Claude Code via ANTHROPIC_BASE_URL, and includes hardware checks, build steps, model download, server launch, speed‑fix tips, and usage instructions for secure, cost‑free development.

Anthropic API · Claude Code · GPU Acceleration

0 likes · 8 min read

MaGe Linux Operations

Mar 10, 2026 · Artificial Intelligence

Why Your LLM Service Hits CUDA OOM and How to Diagnose GPU Memory Issues

This guide explains the five common sources of GPU memory consumption in large‑model inference services, walks through a step‑by‑step diagnosis workflow (from static usage and KV‑Cache analysis to concurrency and K8s scheduling), and offers concrete command‑line checks, scripts, configuration examples, and actionable remediation and monitoring recommendations.

GPU Memory · KV cache · LLM OOM

0 likes · 28 min read

Baidu Intelligent Cloud Tech Hub

Mar 6, 2026 · Artificial Intelligence

How Baidu’s End‑to‑End Quantization Stack Supercharges Large‑Model Inference on Kunlun XPU

Baidu Baige built a full‑stack quantization pipeline that integrates model‑level, framework‑level, and hardware‑level optimizations on the Kunlun XPU platform, enabling FP16/BF16 large models to be compressed to 25‑50% of their original size while boosting inference speed by 30‑50% and dramatically reducing memory consumption for enterprise deployments.

AI inference · Hardware acceleration · INT4

0 likes · 16 min read

Huolala Tech

Mar 6, 2026 · Artificial Intelligence

How Huolala’s Dolphin Platform Cuts Large‑Model Inference Costs by Up to 60%

The article details how Huolala’s Dolphin platform engineers large‑model inference for high‑concurrency, long‑context, low‑latency production workloads, achieving 50‑60% GPU cost reduction through systematic resource allocation, model quantization, PD‑separation, speculative sampling, and kernel‑level optimizations while maintaining service stability.

GPU utilization · Model Quantization · Performance Evaluation

0 likes · 18 min read

Fun with Large Models

Jan 18, 2026 · Artificial Intelligence

Step‑by‑Step Guide to Deploying Large Language Models Locally with VLLM and Ollama

This article walks through two mainstream local deployment solutions—high‑performance VLLM for production Linux servers and lightweight Ollama for personal Windows machines—covering environment setup, model download, server launch, API testing, key configuration parameters, and the quantization technique that makes Ollama models compact.

GPU Optimization · Large Language Models · Model Quantization

0 likes · 18 min read

Data Party THU

Nov 21, 2025 · Artificial Intelligence

Unlocking 2025 Multi-Agent AI: Core Tech, Frameworks, and Emerging Trends

This article analyzes the technical foundations, development frameworks, real‑time inference optimizations, typical industry deployments, and future research directions of multi‑agent systems in 2025, highlighting protocols like FIPA‑ACL and MCP, tools such as LangGraph and ADP3.0, and edge‑computing breakthroughs.

AI Architecture · Model Quantization · distributed computing

0 likes · 16 min read

Huawei Cloud Developer Alliance

Oct 13, 2025 · Artificial Intelligence

How Quantization, Pruning, and Distillation Shrink AI Models for Edge Devices

This article explains the principles, key methods, and practical effects of model quantization, pruning, and knowledge distillation, comparing their advantages and disadvantages, and showing how combining these techniques enables compact, high‑performance AI models on resource‑constrained devices.

Model Pruning · Model Quantization · edge AI

0 likes · 7 min read

JD Cloud Developers

Sep 11, 2025 · Artificial Intelligence

How to Seamlessly Migrate AI Workloads from Nvidia GPUs to Domestic Accelerators

This article explains why migrating AI applications from Nvidia GPUs to domestic Chinese accelerators is urgent, outlines the technical challenges, and presents JD Cloud's JoyScale zero‑perception migration stack with hardware, software, model, and inference optimizations for real‑world scenarios.

AI migration · JoyScale · Model Quantization

0 likes · 10 min read

Data Thinking Notes

Jul 6, 2025 · Artificial Intelligence

How Quantization Shrinks Giant AI Models for Edge Devices

This article explains why quantizing massive AI models is essential for deploying them on resource‑constrained devices, outlines core quantization concepts, techniques, and methods, compares their pros and cons, and presents practical application scenarios such as smartphones, autonomous driving, IoT, and edge computing.

AI deployment · Large Language Models · Model Quantization

0 likes · 9 min read

Instant Consumer Technology Team

Jul 3, 2025 · Artificial Intelligence

Why Buying an AI Appliance Is a Strategic Pitfall for Enterprises

Enterprises rushing to purchase DeepSeek AI appliances and smart‑agent platforms often face hidden technical, data, and organizational challenges that turn promised "plug‑and‑play" solutions into costly missteps, highlighting the need for realistic strategy, robust data governance, and continuous capability building.

AI capability building · AI deployment · Data Governance

0 likes · 28 min read

JavaEdge

Jun 27, 2025 · Artificial Intelligence

Why Inference Engines Are Essential for Deploying Large Language Models in Production

The article explains what inference engines are, why they are needed beyond raw Python scripts, and outlines best practices such as model quantization, batching, and parallelism, while comparing popular open‑source and commercial options for production AI workloads.

AI deployment · Batching · Inference Engine

0 likes · 14 min read

Architect

Apr 21, 2025 · Artificial Intelligence

Microsoft Research Releases BitNet b1.58 2B4T: A 1‑Bit Native Large Language Model with Ultra‑Low Memory and Energy Consumption

Microsoft Research introduced BitNet b1.58 2B4T, a native 1‑bit large language model with 2 billion parameters trained on 4 trillion tokens, which uses only 0.4 GB of non‑embedding memory, 0.028 J of decoding energy, and 29 ms of CPU latency while matching full‑precision performance.

1-bit LLM · AI research · BitNet

0 likes · 7 min read

DaTaobao Tech

Apr 21, 2025 · Artificial Intelligence

How MNN LLM Delivers Fast, Stable On‑Device LLM Inference for Android, iOS, and Desktop

Facing DeepSeek R1 server instability, the open‑source MNN LLM framework offers local, mobile‑friendly deployment with model quantization and hardware‑specific optimizations, dramatically improving inference speed, stability, and download reliability across Android, iOS, and desktop platforms while supporting multimodal inputs.

Android · LLM · MNN

0 likes · 11 min read

Architect's Alchemy Furnace

Mar 31, 2025 · Artificial Intelligence

Which Model Quantization Wins? Deep Dive into q4_0, q5_K_M, and q8_0

An in‑depth technical analysis compares popular model quantization schemes—q4_0, q5_K_M, and q8_0—detailing their precision trade‑offs, memory savings, inference speed, hardware compatibility, and ideal use‑cases, complemented by performance benchmarks on Llama‑3‑8B and practical selection guidelines.

AI Optimization · LLM Performance · Model Quantization

0 likes · 7 min read

AI Algorithm Path

Mar 10, 2025 · Artificial Intelligence

How Much GPU Memory Does an LLM Service Really Need?

This article explains a simple formula for estimating the GPU VRAM required to serve large language models, demonstrates the calculation with a 7‑billion‑parameter example, clarifies why a 20% safety buffer is needed, and offers practical strategies such as quantization, CPU offload, and multi‑GPU parallelism to reduce memory usage.

Deployment · GPU Memory · LLM

0 likes · 6 min read

Baobao Algorithm Notes

Jul 31, 2024 · Artificial Intelligence

What Makes Mistral’s 7B, Mixtral, and Large 2 Models Stand Out? A Deep Technical Dive

This article compiles key technical details of the Mistral model family—including Mistral 7B, Mixtral 8×7B, Mixtral 8×22B, Mistral Nemo, and Mistral Large 2—covering their architectural innovations such as sliding‑window attention, grouped‑query attention, mixture‑of‑experts design, scaling parameters, performance benchmarks, quantization requirements, and practical deployment commands.

Grouped Query Attention · Mistral · Mixtral

0 likes · 17 min read

Baidu Geek Talk

Dec 6, 2023 · Industry Insights

From MLOps to LMOps: Challenges and Solutions for Large‑Model Operations

This article reviews the evolution from MLOps to LMOps, outlines the core concepts, challenges, and key technologies such as large‑model inference optimization, prompt engineering, and context‑length extension, and offers a forward‑looking perspective on the future of AI operations.

AI Operations · LMOps · MLOps

0 likes · 23 min read

Baidu Tech Salon

Nov 10, 2023 · Artificial Intelligence

Baidu Search Deep Learning Model Architecture and Optimization Practices

Baidu's Search Architecture team details how its deep‑learning models have evolved to deliver direct answer results via semantic embeddings, describes a massive online inference pipeline that rewrites queries, ranks relevance, and classifies types, and outlines optimization techniques—including data I/O, CPU/GPU balancing, pruning, quantization, and distillation—to achieve high‑throughput, low‑latency search.

Baidu · GPU Optimization · Inference System

0 likes · 13 min read

High Availability Architecture

Jun 15, 2023 · Artificial Intelligence

InferX Inference Framework: Challenges, Architecture, Optimizations, and Triton Integration

The article presents the background, challenges, and objectives of Bilibili's AI services, introduces the self‑developed InferX inference framework with its quantization and sparsity optimizations, details OCR‑specific enhancements, and describes how integrating InferX with Nvidia Triton dramatically improves throughput, latency, and GPU utilization.

AI Optimization · CUDA · Inference

0 likes · 10 min read

21CTO

Mar 31, 2023 · Artificial Intelligence

How ColossalChat Replicates ChatGPT with a Complete Open‑Source RLHF Pipeline

ColossalChat, an open‑source project built on LLaMA, offers a full RLHF pipeline—including supervised fine‑tuning, reward‑model training, and reinforcement learning—enabling low‑cost, bilingual ChatGPT‑like models with 4‑bit quantized inference, detailed code, dataset, and performance optimizations.

AI Infrastructure · ColossalAI · Model Quantization

0 likes · 12 min read

DataFunSummit

Nov 20, 2022 · Artificial Intelligence

NLP Technology Applications and Research in Voice Assistants

This article presents an in‑depth overview of NLP techniques used in voice assistants, covering the end‑to‑end conversational AI pipeline, intent and slot modeling, multi‑turn dialog management, model deployment pipelines, quantization methods, and self‑learning strategies for continuous improvement.

Conversational AI · Model Quantization · NLP

0 likes · 30 min read

Meituan Technology Team

Nov 17, 2022 · Artificial Intelligence

Overview of Recent Meituan Visual Intelligence Research Papers on Content Production, Distribution, and Model Quantization

Meituan’s Visual Intelligence team recently published eight top‑conference papers that advance weakly supervised segmentation, future‑aware captioning, panoptic narrative grounding, video‑text retrieval, open‑vocabulary detection, counterfactual image‑text matching, zero‑shot video classification, and efficient Vision‑Transformer quantization, all directly boosting real‑world content creation, distribution, and model efficiency.

AI research · Image Captioning · Model Quantization

0 likes · 19 min read

Code DAO

May 5, 2022 · Artificial Intelligence

Optimizing Machine Learning Models for Edge Devices with TensorFlow Lite

This article explains how to convert a TensorFlow image‑classification model to TensorFlow Lite, apply different quantization techniques, benchmark the resulting models on a Raspberry Pi 4, and compare latency, size, and accuracy to demonstrate the trade‑offs of edge AI deployment.

EfficientNet · Model Quantization · Python

0 likes · 16 min read

DataFunTalk

Dec 19, 2019 · Artificial Intelligence

Model Quantization in Neural Networks: Challenges, Solutions, and Future Directions

This article reviews neural‑network model quantization, explaining why quantization is needed, detailing forward‑ and backward‑propagation issues, presenting three main mitigation strategies, discussing subsequent pruning, performance‑recovery techniques, and outlining future research avenues in efficient machine learning.

Model Quantization · Neural Networks · efficient machine learning

0 likes · 27 min read

Alibaba Cloud Developer

Jun 11, 2019 · Artificial Intelligence

How ACE Powers Edge AI: A Heterogeneous Compute Engine for Real‑Time Inference

This article explains the design of ACE (AI Labs Compute Engine), a heterogeneous edge compute platform that combines model quantization, GPU/DSP/VPU acceleration, cloud‑edge model management, and custom algorithm integration to enable low‑latency AI services such as gesture, pet, and pen‑tip detection on resource‑constrained devices.

AI inference · Edge Computing · Embedded AI

0 likes · 13 min read
