Tag: LLM deployment


Alibaba Cloud Infrastructure
Apr 30, 2025 · Cloud Native

Deploying Qwen3-8B Large Language Model on Alibaba Cloud ACK with ACS GPU Acceleration

This guide explains how to prepare, deploy, and verify the Qwen3‑8B large language model on an Alibaba Cloud Container Service for Kubernetes (ACK) cluster using ACS GPU resources, covering prerequisites, model download, storage setup, Kubernetes manifests, and testing the inference service.
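A minimal manifest sketch for the kind of deployment the article describes, assuming a vLLM-style serving image and an ACS GPU compute class; the image name, labels, and resource values below are illustrative assumptions, not the article's actual manifest:

```yaml
# Illustrative sketch only: image, labels, and GPU request are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3-8b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen3-8b
  template:
    metadata:
      labels:
        app: qwen3-8b
        alibabacloud.com/compute-class: gpu   # target ACS GPU capacity (assumed label)
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest        # assumed serving image
        args: ["--model", "/models/Qwen3-8B"] # assumed model mount path
        resources:
          limits:
            nvidia.com/gpu: "1"               # one GPU per replica
```

In practice the model weights would be mounted from the storage set up earlier in the guide (e.g. a PVC), and the article's own manifest should be followed for the exact labels and resource classes.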

ACS, ACK, GPU
8 min read
Architecture Digest
Feb 6, 2025 · Artificial Intelligence

Deploying DeepSeek R1 671B Model Locally with Ollama and Dynamic Quantization

This guide explains how to deploy the full 671B DeepSeek R1 model on local hardware using Ollama, leveraging dynamic quantization to shrink model size, detailing hardware requirements, step‑by‑step installation, configuration, performance observations, and practical recommendations.
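The abstract notes that dynamic quantization shrinks the 671B model enough to run locally. A back-of-envelope size estimate illustrates why; the bit widths here are illustrative assumptions (dynamic quantization mixes precisions per layer), not the article's measured figures:

```python
# Rough on-disk size estimate for a quantized model.
# Assumption: an effective average of ~1.58 bits/weight, a common
# dynamic-quantization target; the article's exact figures may differ.

def quantized_size_gb(n_params: float, avg_bits: float) -> float:
    """Approximate model size in decimal gigabytes."""
    return n_params * avg_bits / 8 / 1e9

full_fp16 = quantized_size_gb(671e9, 16)    # FP16 baseline: ~1342 GB
dyn_quant = quantized_size_gb(671e9, 1.58)  # ~1.58-bit average: ~133 GB
```

The roughly 10x reduction is what moves the model from datacenter-only territory into reach of a well-equipped workstation, at some cost in output quality.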

DeepSeek, Dynamic Quantization, GPU
12 min read
DataFunTalk
Jan 4, 2024 · Artificial Intelligence

Using OpenLLM to Quickly Build and Deploy Large Language Model Applications

This presentation explains how OpenLLM, an open‑source LLM framework, together with BentoML, addresses the challenges of deploying large language models by offering model switching, memory optimizations, multi‑GPU support, observability, and easy containerized deployment for production AI applications.
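OpenLLM exposes served models over an OpenAI-compatible HTTP API, so clients talk to it like any chat-completion endpoint. A sketch of building such a request; the port, endpoint path, and model name are assumptions for illustration:

```python
# Sketch: calling an OpenLLM server via its OpenAI-compatible API.
# Assumptions: server at localhost:3000 (port is an assumption) and an
# illustrative model name; adapt both to your actual deployment.
import json

def chat_payload(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat-completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = json.dumps(chat_payload("my-llm", "Summarize BentoML in one line."))
# POST `body` to http://localhost:3000/v1/chat/completions with any HTTP
# client, or point an OpenAI SDK client at that base URL.
```

Because the wire format is the standard chat-completion schema, swapping models behind the server does not require client changes, which is part of the model-switching story the talk describes.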

AI optimization, BentoML, LLM deployment
18 min read