Collection size
99 articles
Page 5 of 5
James' Growth Diary
James' Growth Diary
May 14, 2026 · Artificial Intelligence

LLM Semantic Routing Explained: Model‑Based Intent Classification and Three Keyword‑Matching Pitfalls

This article breaks down LLM semantic routing as a classifier, compares keyword, embedding, and LLM‑based routes, provides full TypeScript implementations, introduces hybrid routing for speed and accuracy, and covers production‑grade observability and dynamic configuration to avoid common pitfalls.

Hybrid RoutingLLMLangChain
0 likes · 33 min read
LLM Semantic Routing Explained: Model‑Based Intent Classification and Three Keyword‑Matching Pitfalls
Old Zhang's AI Learning
Old Zhang's AI Learning
May 16, 2026 · Artificial Intelligence

vLLM 0.21.0 Arrives: Speculative Decoding Now Supports Reasoning Models

The vLLM 0.21.0 release brings five major updates—including Transformers v4 deprecation, a C++20 build requirement, KV offload with hybrid memory, speculative decoding that respects thinking budgets, and a Blackwell token‑speed backend—while offering detailed upgrade guidance for different user groups.

C++20KV cacheTransformers
0 likes · 12 min read
vLLM 0.21.0 Arrives: Speculative Decoding Now Supports Reasoning Models
Old Zhang's AI Learning
Old Zhang's AI Learning
Mar 18, 2026 · Artificial Intelligence

Running Claude‑Opus‑4.6‑Distilled Qwen3.5 27B on a Single RTX 4090 with llama.cpp: 46 tokens/s Performance

The article details a hands‑on test of the Claude‑Opus‑4.6‑distilled Qwen3.5 27B model running on a single RTX 4090 via llama.cpp, showing a steady 46 tokens per second generation speed, a 64K context window, and a step‑by‑step Docker‑based setup while comparing it to GLM‑4.7‑Flash‑AWQ‑4bit and discussing llama.cpp’s limitations for multi‑GPU inference.

Claude OpusDockerLLM inference
0 likes · 5 min read
Running Claude‑Opus‑4.6‑Distilled Qwen3.5 27B on a Single RTX 4090 with llama.cpp: 46 tokens/s Performance
James' Growth Diary
James' Growth Diary
Apr 13, 2026 · Frontend Development

Local Inference & Edge AI: Why Front‑End AI Is the Next Battlefield

Edge AI runs AI models directly in browsers or devices, offering zero latency, zero API cost, and full privacy, and the article explains the three technical breakthroughs that make it possible, compares WebLLM, Transformers.js and Ollama, and provides a hybrid architecture with concrete engineering challenges and solutions that can cut total AI costs by 40‑55% for typical front‑end applications.

FrontendOllamaTransformers.js
0 likes · 20 min read
Local Inference & Edge AI: Why Front‑End AI Is the Next Battlefield
Old Meng AI Explorer
Old Meng AI Explorer
Nov 24, 2025 · Artificial Intelligence

How ktransformers Lets Your Laptop Run 13B LLMs Without a GPU

ktransformers is an open‑source AI model optimization framework that dramatically reduces memory usage and speeds up loading and inference, enabling ordinary laptops— even without a GPU— to run 7B‑13B large language models for coding, content creation, and academic assistance.

KTransformersLLM optimizationOpen-source
0 likes · 10 min read
How ktransformers Lets Your Laptop Run 13B LLMs Without a GPU
Volcano Engine Developer Services
Volcano Engine Developer Services
Feb 10, 2025 · Artificial Intelligence

How to Quickly Deploy DeepSeek‑R1‑Distill on Volcengine Cloud: Three Practical Methods

This article explains how to deploy DeepSeek's open‑source large language models—especially DeepSeek‑R1‑Distill—on Volcengine Cloud using three approaches: a containerized VKE solution, a serverless veFaaS setup, and a one‑click Terraform script, complete with step‑by‑step instructions, code snippets, and configuration tips.

DeepSeekTerraformVolcengine
0 likes · 18 min read
How to Quickly Deploy DeepSeek‑R1‑Distill on Volcengine Cloud: Three Practical Methods
Architect's Alchemy Furnace
Architect's Alchemy Furnace
May 7, 2025 · Artificial Intelligence

Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama

This article provides a comprehensive comparison of seven popular large‑language‑model inference engines—Transformers, vLLM, Llama.cpp, SGLang, MLX, Ollama and others—detailing their core features, performance characteristics, hardware compatibility, concurrency support, and ideal use‑cases, plus practical installation guidance for Xinference.

LLMMLXSGLang
0 likes · 17 min read
Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama
Volcano Engine Developer Services
Volcano Engine Developer Services
Jun 30, 2023 · Cloud Native

Deploy Langchain‑ChatGLM on Volcengine VKE: A Step‑by‑Step Cloud‑Native Guide

This tutorial walks you through preparing a VKE cluster, pulling the Langchain‑ChatGLM container image, creating the necessary Deployment and Service resources, and adding a local knowledge base, enabling you to run a Langchain‑based ChatGLM service with GPU support on Volcengine’s cloud‑native platform.

AI deploymentChatGLMGPU
0 likes · 6 min read
Deploy Langchain‑ChatGLM on Volcengine VKE: A Step‑by‑Step Cloud‑Native Guide
DataFunSummit
DataFunSummit
Dec 28, 2024 · Artificial Intelligence

Memory Optimization for Large Model Inference: Virtual Tensor and LayerKV Techniques

This talk presents the Ant Group team's recent work on large‑model inference memory optimization, covering GPU memory challenges, virtual memory management (VMM), the Virtual Tensor framework, LayerKV techniques, performance comparisons with Page Attention and FlashAttention, and extensive experimental results demonstrating reduced latency and higher QPS.

GPUVirtual Memoryattention
0 likes · 25 min read
Memory Optimization for Large Model Inference: Virtual Tensor and LayerKV Techniques
JD Tech Talk
JD Tech Talk
Feb 10, 2025 · Artificial Intelligence

Deploy DeepSeek on JD Cloud GPU and Chat with It via Ollama & Chatbox

This guide walks you through preparing a JD Cloud GPU instance, installing NVIDIA drivers, deploying Ollama, running the DeepSeek LLM (including model download and execution), configuring the Chatbox graphical client for interactive queries, and optionally feeding local documents into AnythingLLM for a private knowledge base.

AnythingLLMChatboxDeepSeek
0 likes · 17 min read
Deploy DeepSeek on JD Cloud GPU and Chat with It via Ollama & Chatbox
Raymond Ops
Raymond Ops
Dec 16, 2025 · Artificial Intelligence

Master Multi‑GPU Load Balancing for OLLAMA: From Setup to Production

This guide walks you through configuring OLLAMA for multi‑GPU load balancing, covering hardware checks, CUDA and Docker setup, native and containerized deployment methods, core parameter tuning, advanced sharding, dynamic monitoring, troubleshooting, production best practices, and a real‑world RTX 4090 case study.

AI inferenceCUDAGPU
0 likes · 15 min read
Master Multi‑GPU Load Balancing for OLLAMA: From Setup to Production
360 Smart Cloud
360 Smart Cloud
Mar 4, 2021 · Artificial Intelligence

Optimizing BERT Online Service Deployment at 360 Search

This article describes the challenges of deploying a large BERT model as an online service for 360 Search and details engineering optimizations—including framework selection, model quantization, knowledge distillation, stream scheduling, caching, and dynamic sequence handling—that dramatically improve latency, throughput, and resource utilization.

BERTFP16 quantizationGPU optimization
0 likes · 12 min read
Optimizing BERT Online Service Deployment at 360 Search
Alibaba Cloud Native
Alibaba Cloud Native
Feb 18, 2025 · Cloud Native

Deploy DeepSeek‑R1 on Alibaba Cloud ACK One Using ACS GPU in Minutes

This guide shows how to overcome on‑premise compute limits by registering a local Kubernetes cluster to Alibaba Cloud ACK One, provisioning ACS GPU resources, and deploying the DeepSeek‑R1 inference model with the vLLM framework through a series of concrete commands and YAML configurations.

ACK OneACS GPUDeepSeek
0 likes · 15 min read
Deploy DeepSeek‑R1 on Alibaba Cloud ACK One Using ACS GPU in Minutes
Alibaba Cloud Developer
Alibaba Cloud Developer
Dec 24, 2025 · Artificial Intelligence

Boosting LLM Inference: RoleBasedGroup & Mooncake for Stable, High‑Performance Service

Large language model inference faces memory pressure, but by externalizing KVCache with Mooncake and orchestrating roles via the Kubernetes‑native RoleBasedGroup (RBG), developers can achieve stable, high‑throughput, cost‑effective serving with seamless in‑place upgrades and topology‑aware performance.

AI InfrastructureKVCacheLLM inference
0 likes · 21 min read
Boosting LLM Inference: RoleBasedGroup & Mooncake for Stable, High‑Performance Service
Baidu Intelligent Cloud Tech Hub
Baidu Intelligent Cloud Tech Hub
Dec 10, 2025 · Artificial Intelligence

Accelerate LLM Deployment on Baidu Kunlun XPU with the Open‑Source vLLM‑Kunlun Plugin

The vLLM‑Kunlun Plugin, built on the vLLM hardware‑plugin RFC, lets developers deploy any major large language model on Baidu's Kunlun XPU instantly without modifying vLLM core code, dramatically shortening migration time, providing high‑performance fusion operators, and offering open‑source tools for precision verification and profiling.

KunlunLLMOpen-source
0 likes · 8 min read
Accelerate LLM Deployment on Baidu Kunlun XPU with the Open‑Source vLLM‑Kunlun Plugin
Baobao Algorithm Notes
Baobao Algorithm Notes
Oct 15, 2023 · Artificial Intelligence

Run a 70B FP16 Model on a Single 16 GB GPU with PyTorch Meta Device

This article explains how to overcome GPU memory limits by using PyTorch 1.9's meta device to create an empty model, load large‑scale model weights layer‑by‑layer, move each part to a 16 GB GPU for inference, and release memory, enabling a 70B FP16 model to run on a single consumer‑grade GPU.

GPU memory optimizationPyTorchmeta device
0 likes · 12 min read
Run a 70B FP16 Model on a Single 16 GB GPU with PyTorch Meta Device
Architect's Alchemy Furnace
Architect's Alchemy Furnace
Mar 27, 2025 · Artificial Intelligence

Xinference vs Ollama: Which Open‑Source LLM Engine Fits Your Needs?

This article provides a comprehensive side‑by‑side comparison of the open‑source LLM serving tools Xinference and Ollama, examining their core goals, architecture, model support, deployment options, performance, ecosystem integration, typical use cases, future roadmap, and guidance on selecting the right solution for enterprise or personal projects.

ComparisonLLMOpen-source
0 likes · 7 min read
Xinference vs Ollama: Which Open‑Source LLM Engine Fits Your Needs?