99 articles · Page 4 of 5
DataFunSummit
Dec 14, 2020 · Artificial Intelligence

LightSeq: High‑Performance Open‑Source Inference Engine for Transformers, GPT and Other NLP Models

This article introduces LightSeq, an open‑source, GPU‑accelerated inference engine that speeds up inference for Transformer‑based models such as BERT and GPT by up to 14× over TensorFlow, supports multiple decoding strategies, integrates with major deep‑learning frameworks, and provides detailed performance benchmarks and technical optimizations.

GPU · Inference · LightSeq
0 likes · 15 min read

Alibaba Cloud Native
Jan 17, 2024 · Artificial Intelligence

Boost LLM Inference with TensorRT‑LLM on Alibaba Cloud ACK: A Step‑by‑Step Guide

This article explains how TensorRT‑LLM accelerates large language model inference by applying quantization, in‑flight batching, advanced attention variants, and graph rewriting, and walks through a complete deployment on Alibaba Cloud Container Service for Kubernetes (ACK) with environment setup, model compilation, benchmarking, and performance comparison.
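
For a feel of what the article builds toward, here is a minimal sketch of TensorRT‑LLM's high‑level Python LLM API; the model name is illustrative, and the full ACK walkthrough layers engine compilation options, quantization, serving, and benchmarking on top of this.

```python
# Minimal sketch of TensorRT-LLM's high-level LLM API (model name is
# illustrative; the article's ACK deployment adds compilation flags,
# quantization, and serving on top of this).
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # builds or loads a TensorRT engine
outputs = llm.generate(
    ["What does in-flight batching do for LLM serving?"],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```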

Cloud Native AI · In‑Flight Batching · LLM inference
0 likes · 13 min read

Alibaba Cloud Infrastructure
Mar 17, 2025 · Cloud Native

Boost LLM Inference with ACK Gateway AI Extension: A Step‑by‑Step Guide

This guide demonstrates how to deploy the QwQ‑32B large language model on an Alibaba Cloud ACK cluster, configure OSS storage, enable the ACK Gateway with AI Extension, set up InferencePool and InferenceModel resources, and benchmark intelligent routing versus standard gateway routing, revealing latency and throughput improvements.

ACK Gateway · AI Extension · Kubernetes
0 likes · 16 min read

Alibaba Cloud Infrastructure
Feb 13, 2025 · Cloud Computing

Deploy DeepSeek‑R1 LLM on Alibaba Cloud ACK One with ACS GPU in Minutes

This guide walks you through deploying the DeepSeek‑R1 large‑language‑model inference service on Alibaba Cloud ACK One registered clusters using ACS GPU compute, covering model preparation, OSS storage setup, PersistentVolume configuration, arena‑based service deployment, and verification steps with concrete commands and parameters.

ACK One · ACS GPU · DeepSeek
0 likes · 14 min read

Old Zhang's AI Learning
Mar 18, 2026 · Artificial Intelligence

Running Claude‑Opus‑4.6‑Distilled Qwen3.5 27B on a Single RTX 4090 with llama.cpp: 46 tokens/s Performance

The article details a hands‑on test of the Claude‑Opus‑4.6‑distilled Qwen3.5 27B model running on a single RTX 4090 via llama.cpp, showing a steady 46 tokens per second generation speed, a 64K context window, and a step‑by‑step Docker‑based setup while comparing it to GLM‑4.7‑Flash‑AWQ‑4bit and discussing llama.cpp’s limitations for multi‑GPU inference.
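
The article drives llama.cpp through its Docker image; as a rough equivalent, a sketch using the llama‑cpp‑python bindings to the same engine (the GGUF file name and parameter values are illustrative) looks like this:

```python
# Rough sketch with the llama-cpp-python bindings (GGUF path is illustrative;
# the article uses llama.cpp's Docker image instead).
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.5-27b-instruct-q4_k_m.gguf",
    n_ctx=65536,      # the 64K context window mentioned above
    n_gpu_layers=-1,  # offload every layer to the single RTX 4090
)
out = llm("Write one sentence about GPU inference.", max_tokens=64)
print(out["choices"][0]["text"])
```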

Claude Opus · Docker · LLM inference
0 likes · 5 min read

Old Meng AI Explorer
Nov 24, 2025 · Artificial Intelligence

How ktransformers Lets Your Laptop Run 13B LLMs Without a GPU

ktransformers is an open‑source AI model optimization framework that dramatically reduces memory usage and speeds up loading and inference, enabling ordinary laptops, even without a GPU, to run 7B–13B large language models for coding, content creation, and academic assistance.

KTransformers · LLM optimization · Python
0 likes · 10 min read

Volcano Engine Developer Services
Feb 10, 2025 · Artificial Intelligence

How to Quickly Deploy DeepSeek‑R1‑Distill on Volcengine Cloud: Three Practical Methods

This article explains how to deploy DeepSeek's open‑source large language models—especially DeepSeek‑R1‑Distill—on Volcengine Cloud using three approaches: a containerized VKE solution, a serverless veFaaS setup, and a one‑click Terraform script, complete with step‑by‑step instructions, code snippets, and configuration tips.

DeepSeek · Terraform · Volcengine
0 likes · 18 min read

Architect's Alchemy Furnace
May 7, 2025 · Artificial Intelligence

Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama

This article provides a comprehensive comparison of seven popular large‑language‑model inference engines—Transformers, vLLM, Llama.cpp, SGLang, MLX, Ollama, and Xinference—detailing their core features, performance characteristics, hardware compatibility, concurrency support, and ideal use cases, plus practical installation guidance for Xinference.

Inference · LLM · MLX
0 likes · 17 min read

Volcano Engine Developer Services
Jun 30, 2023 · Cloud Native

Deploy Langchain‑ChatGLM on Volcengine VKE: A Step‑by‑Step Cloud‑Native Guide

This tutorial walks you through preparing a VKE cluster, pulling the Langchain‑ChatGLM container image, creating the necessary Deployment and Service resources, and adding a local knowledge base, enabling you to run a Langchain‑based ChatGLM service with GPU support on Volcengine’s cloud‑native platform.

AI Deployment · ChatGLM · GPU
0 likes · 6 min read

DataFunSummit
Dec 28, 2024 · Artificial Intelligence

Memory Optimization for Large Model Inference: Virtual Tensor and LayerKV Techniques

This talk presents the Ant Group team's recent work on large‑model inference memory optimization, covering GPU memory challenges, virtual memory management (VMM), the Virtual Tensor framework, LayerKV techniques, performance comparisons with PagedAttention and FlashAttention, and extensive experimental results demonstrating reduced latency and higher QPS.

Attention · GPU · Performance
0 likes · 25 min read

JD Tech Talk
Feb 10, 2025 · Artificial Intelligence

Deploy DeepSeek on JD Cloud GPU and Chat with It via Ollama & Chatbox

This guide walks you through preparing a JD Cloud GPU instance, installing NVIDIA drivers, deploying Ollama, running the DeepSeek LLM (including model download and execution), configuring the Chatbox graphical client for interactive queries, and optionally feeding local documents into AnythingLLM for a private knowledge base.
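
Chatbox talks to the same local Ollama server that a script can reach directly; as a hedged sketch (the model tag is illustrative), the official ollama Python client makes the equivalent call:

```python
# Sketch of querying the local Ollama server that Chatbox also talks to
# (model tag is illustrative).
import ollama

reply = ollama.chat(
    model="deepseek-r1:7b",
    messages=[{"role": "user", "content": "What is a KV cache?"}],
)
print(reply["message"]["content"])
```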

AnythingLLM · Chatbox · DeepSeek
0 likes · 17 min read

Raymond Ops
Dec 16, 2025 · Artificial Intelligence

Master Multi‑GPU Load Balancing for OLLAMA: From Setup to Production

This guide walks you through configuring OLLAMA for multi‑GPU load balancing, covering hardware checks, CUDA and Docker setup, native and containerized deployment methods, core parameter tuning, advanced sharding, dynamic monitoring, troubleshooting, production best practices, and a real‑world RTX 4090 case study.
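
Much of that tuning happens through environment variables on the Ollama server; a minimal sketch of launching it with two GPUs visible (values are illustrative; the variable names follow Ollama's documented settings):

```python
# Sketch: launch the Ollama server with multi-GPU scheduling hints
# (values are illustrative; names follow Ollama's documented settings).
import os
import subprocess

env = dict(
    os.environ,
    CUDA_VISIBLE_DEVICES="0,1",  # expose both GPUs to the server
    OLLAMA_SCHED_SPREAD="1",     # spread a single model across all visible GPUs
    OLLAMA_NUM_PARALLEL="4",     # concurrent requests served per loaded model
)
subprocess.run(["ollama", "serve"], env=env)
```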

AI inference · CUDA · GPU
0 likes · 15 min read

360 Smart Cloud
Mar 4, 2021 · Artificial Intelligence

Optimizing BERT Online Service Deployment at 360 Search

This article describes the challenges of deploying a large BERT model as an online service for 360 Search and details engineering optimizations—including framework selection, model quantization, knowledge distillation, stream scheduling, caching, and dynamic sequence handling—that dramatically improve latency, throughput, and resource utilization.
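
Of the optimizations listed, FP16 quantization is the simplest to picture; a minimal PyTorch sketch with a Hugging Face BERT (the article's production stack and exact model differ) is:

```python
# Sketch of the FP16 conversion idea using PyTorch + Hugging Face
# (the article's production framework and model differ).
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-chinese").eval().cuda()
model = model.half()  # FP16 weights: roughly half the memory, faster GPU math
input_ids = torch.randint(0, 21128, (1, 128), device="cuda")
with torch.no_grad():
    hidden = model(input_ids=input_ids).last_hidden_state
print(hidden.dtype)  # torch.float16
```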

BERT · FP16 quantization · GPU Optimization
0 likes · 12 min read

Alibaba Cloud Native
Feb 18, 2025 · Cloud Native

Deploy DeepSeek‑R1 on Alibaba Cloud ACK One Using ACS GPU in Minutes

This guide shows how to overcome on‑premises compute limits by registering a local Kubernetes cluster with Alibaba Cloud ACK One, provisioning ACS GPU resources, and deploying the DeepSeek‑R1 inference model with the vLLM framework through a series of concrete commands and YAML configurations.
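
Stripped of the ACK One and ACS plumbing, the serving core is vLLM; a minimal offline sketch (model name and sampling parameters are illustrative) looks like:

```python
# Sketch of the vLLM serving core, minus the ACK One/ACS plumbing
# (model name and sampling parameters are illustrative).
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", dtype="float16")
params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```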

ACK One · ACS GPU · DeepSeek
0 likes · 15 min read

Alibaba Cloud Developer
Dec 24, 2025 · Artificial Intelligence

Boosting LLM Inference: RoleBasedGroup & Mooncake for Stable, High‑Performance Service

Large language model inference faces memory pressure, but by externalizing KVCache with Mooncake and orchestrating roles via the Kubernetes‑native RoleBasedGroup (RBG), developers can achieve stable, high‑throughput, cost‑effective serving with seamless in‑place upgrades and topology‑aware performance.

AI infrastructure · KVCache · Kubernetes
0 likes · 21 min read

Baidu Intelligent Cloud Tech Hub
Dec 10, 2025 · Artificial Intelligence

Accelerate LLM Deployment on Baidu Kunlun XPU with the Open‑Source vLLM‑Kunlun Plugin

The vLLM‑Kunlun Plugin, built on the vLLM hardware‑plugin RFC, lets developers deploy major large language models on Baidu's Kunlun XPU without modifying vLLM core code, dramatically shortening migration time, providing high‑performance fused operators, and offering open‑source tools for precision verification and profiling.

Inference · Kunlun · LLM
0 likes · 8 min read

Baobao Algorithm Notes
Oct 15, 2023 · Artificial Intelligence

Run a 70B FP16 Model on a Single 16 GB GPU with PyTorch Meta Device

This article explains how to overcome GPU memory limits by using PyTorch 1.9's meta device to create an empty model, load large‑scale model weights layer‑by‑layer, move each part to a 16 GB GPU for inference, and release memory, enabling a 70B FP16 model to run on a single consumer‑grade GPU.
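
As a minimal sketch of that loop, assuming hypothetical layer shapes and per‑layer checkpoint paths rather than the article's exact code:

```python
# Sketch of the meta-device streaming loop (layer shapes and checkpoint
# paths are hypothetical, not the article's exact code).
import torch
from torch import nn

# 1) Declare layers on the meta device: shape/dtype only, no memory allocated,
#    so even a 70B-parameter model "fits" at this stage.
meta_layers = [nn.Linear(8192, 8192, device="meta") for _ in range(80)]

def run_layerwise(x, weight_paths):
    """Stream the model through a single 16 GB GPU one layer at a time."""
    for meta_layer, path in zip(meta_layers, weight_paths):
        # 2) Materialize just this layer on the GPU and load its FP16 weights.
        layer = nn.Linear(meta_layer.in_features, meta_layer.out_features,
                          device="cuda", dtype=torch.float16)
        layer.load_state_dict(torch.load(path, map_location="cuda"))
        with torch.no_grad():
            x = layer(x)              # 3) forward pass for this layer only
        del layer                     # 4) release GPU memory before the next layer
        torch.cuda.empty_cache()
    return x
```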

GPU memory optimization · PyTorch · meta device
0 likes · 12 min read

Architect's Alchemy Furnace
Mar 27, 2025 · Artificial Intelligence

Xinference vs Ollama: Which Open‑Source LLM Engine Fits Your Needs?

This article provides a comprehensive side‑by‑side comparison of the open‑source LLM serving tools Xinference and Ollama, examining their core goals, architecture, model support, deployment options, performance, ecosystem integration, typical use cases, future roadmap, and guidance on selecting the right solution for enterprise or personal projects.

Comparison · LLM · Local Deployment
0 likes · 7 min read