Tagged articles

model serving

38 articles · Page 1 of 1

Machine Learning Algorithms & Natural Language Processing

Apr 21, 2026 · Artificial Intelligence

Can Linear Attention Complete Prefill-as-a-Service for Cross‑Datacenter Heterogeneous PD Separation?

The article analyzes why the massive KVCache bandwidth required by heterogeneous prefill/decode (PD) separation cannot be solved at the system level, proposes a Prefill‑as‑a‑Service architecture that leverages linear‑attention models to cut KVCache generation, and validates the design with a 1‑trillion‑parameter Kimi Linear deployment that achieves 54% higher throughput and 64% lower P90 TTFT across a 100 Gbps inter‑datacenter link.

Heterogeneous PD · Inference Optimization · KVCache

0 likes · 7 min read

58 Tech

Jan 6, 2026 · Artificial Intelligence

How vLLM 0.8.4 Implements Multi‑LoRA for Efficient Large‑Model Inference

This article provides a step‑by‑step technical walkthrough of vLLM 0.8.4 on a single GPU, detailing the platform’s startup, model loading, Multi‑LoRA deployment, internal ZMQ communication, request scheduling, and inference execution, while exposing key source‑code snippets and architectural diagrams.

GPU inference · LoRA adapters · Multi-LoRA

0 likes · 35 min read

Data Party THU

Sep 30, 2025 · Backend Development

Ray Serve vs Celery: Which Is Best for GPU‑Intensive Parallel Workloads?

This article compares Ray Serve and Celery, explaining their design philosophies, scaling models, GPU‑aware scheduling, operational trade‑offs, and real‑world case studies to help engineers choose the right tool for high‑throughput online inference or large‑scale batch processing.

GPU · Ray Serve · Task Queue

0 likes · 9 min read

Code Wrench

Sep 22, 2025 · Artificial Intelligence

Build a Private ChatGPT on Your Laptop with Ollama, DeepSeek‑R1 and Go MCP

This guide walks you through installing Ollama, pulling the open‑source DeepSeek‑R1:1.5B model, wrapping it with a Go‑based Model Context Protocol (MCP) server, creating a client example, and enhancing the experience with Open‑WebUI while offering performance‑tuning tips.

DeepSeek · Go · MCP

0 likes · 9 min read

IT Services Circle

Sep 16, 2025 · Artificial Intelligence

Why TensorFlow Is Dying and What the New AI Open‑Source Landscape Looks Like

An in‑depth analysis reveals TensorFlow’s rapid decline, the rise of PyTorch, and how Ant Group’s OpenRank‑driven “Large Model Open‑Source Ecosystem Panorama 2.0” maps shifting trends, from short‑term hype projects to performance‑focused AI infrastructure, highlighting the emerging US‑China dominance in AI open‑source development.

AI Ecosystem · AI open-source · OpenRank

0 likes · 15 min read

Architect's Alchemy Furnace

Mar 27, 2025 · Artificial Intelligence

Xinference vs Ollama: Which Open‑Source LLM Engine Fits Your Needs?

This article provides a comprehensive side‑by‑side comparison of the open‑source LLM serving tools Xinference and Ollama, examining their core goals, architecture, model support, deployment options, performance, ecosystem integration, typical use cases, future roadmap, and guidance on selecting the right solution for enterprise or personal projects.

Comparison · LLM · Open-source

0 likes · 7 min read

Baidu Geek Talk

Feb 10, 2025 · Artificial Intelligence

How Baidu Cloud Slashes Inference Costs: DeepSeek Model Optimizations Unveiled

Baidu Cloud's Qianfan platform launched DeepSeek‑R1 and DeepSeek‑V3 with ultra‑low inference pricing, leveraging advanced engine performance tweaks, a split Prefill/Decode architecture, and comprehensive security measures that together boost throughput, cut costs, and ensure enterprise‑grade reliability.

AI inference · Baidu Cloud · Large Language Models

0 likes · 5 min read

Alibaba Cloud Infrastructure

Feb 8, 2025 · Artificial Intelligence

Deploying a Production‑Ready DeepSeek‑R1 Inference Service on Alibaba Cloud ACK with KServe

This guide explains how to deploy a production‑ready DeepSeek‑R1 inference service on Alibaba Cloud ACK using KServe, covering model preparation, storage configuration, service deployment, observability, autoscaling, model acceleration, gray‑release and GPU‑shared inference.

DeepSeek · GPU · KServe

0 likes · 13 min read

Sohu Tech Products

Oct 18, 2024 · Artificial Intelligence

Optimizing AI Inference with TorchServe: Tackling GPU Bottlenecks & Kubernetes

This article details a comprehensive engineering practice for optimizing AI inference services at ZhiZhuan, covering background analysis, selection of TorchServe over alternatives, GPU/CPU performance tuning, custom handlers, Torch‑TRT integration, and deployment on Kubernetes, with measured improvements in throughput and resource utilization.

AI inference · GPU Optimization · Kubernetes

0 likes · 16 min read

iQIYI Technical Product Team

Oct 10, 2024 · Artificial Intelligence

Online Deep Learning (ODL) for Real‑Time Advertising Effectiveness: Challenges and Solutions

iQIYI’s minute‑level online deep‑learning framework overcomes stability, timeliness, compatibility, delayed feedback, catastrophic forgetting, and i.i.d. constraints through high‑availability pipelines, TensorFlow Example serialization, rapid P2P model distribution, flexible scheduling, disaster‑recovery rollbacks, PU‑loss adjustment, and knowledge‑distillation, delivering a 6.2% revenue boost.

Advertising · CTR Prediction · Deep Learning

0 likes · 9 min read

DataFunSummit

Feb 11, 2024 · Artificial Intelligence

GPU-Accelerated Model Service and Optimization Practices at Xiaohongshu

This article details Xiaohongshu's end‑to‑end GPU‑based transformation of its recommendation and search models, covering background, model characteristics, training and inference frameworks, system‑level and GPU‑level optimizations, compilation tricks, hardware upgrades, and future directions for large‑scale machine‑learning infrastructure.

GPU · Optimization · Training

0 likes · 18 min read

Meituan Technology Team

Jan 25, 2024 · Artificial Intelligence

Design and Implementation of a Distributed Causal Forest Framework on Meituan's Fulfillment Platform

Meituan’s Fulfillment Platform team built a high‑performance distributed causal‑forest framework—named Causal On Spark—that trains hundreds of trees on hundreds of millions of samples within minutes using MapReduce‑based histogram splitting, extensive memory optimizations, Parquet model serving, and novel distributed evaluation metrics, enabling scalable causal inference for pricing, subsidies, and marketing.

Spark · causal forest · causal inference

0 likes · 23 min read

Architecture & Thinking

Jan 14, 2024 · Artificial Intelligence

How Baidu Scales Content Understanding to Trillions of Pages with AI Engineering

This article explains how Baidu processes internet‑scale content by applying deep AI‑driven understanding, detailing cost‑optimization, efficiency improvements, model‑service frameworks, resource‑scheduling systems, and batch‑compute platforms that together enable trillion‑level indexing and feature extraction.

AI Engineering · Batch Computing · HTAP storage

0 likes · 16 min read

DataFunTalk

Dec 1, 2023 · Artificial Intelligence

GPU‑Driven Model Service and Optimization Practices in Xiaohongshu's Search Scenario

This article details Xiaohongshu's end‑to‑end GPU‑centric transformation for search‑related machine‑learning models, covering model characteristics, training and inference frameworks, system‑level GPU and CPU optimizations, multi‑card and compilation techniques, and future directions for scaling large sparse and dense models.

GPU Optimization · Training · Xiaohongshu

0 likes · 16 min read

Alibaba Cloud Native

Nov 22, 2023 · Cloud Native

Build a Sidecarless AI Application with Alibaba Cloud Service Mesh ASM – Step‑by‑Step Guide

This guide walks you through creating a sidecarless AI demo on Alibaba Cloud Service Mesh ASM, covering environment setup, multi‑model serving with KServe, PVC storage, InferenceService configuration, business service deployment, gateway and waypoint creation, traffic routing rules, and OIDC single sign‑on integration.

AI · ASM · KServe

0 likes · 28 min read

Baidu Geek Talk

Nov 20, 2023 · Operations

How Baidu Scales Content Understanding to Trillion‑Scale: Architecture, Optimization, and Scheduling Insights

This article details Baidu Search's engineering practice for trillion‑scale content understanding, covering cost and efficiency challenges, model‑service framework, batch‑compute platform, resource‑scheduling system, HTAP storage design, and concrete optimization techniques such as multi‑process Python serving, dynamic batching, and two‑stage scheduling.

Baidu · Big Data · HTAP

0 likes · 18 min read

DataFunSummit

Oct 2, 2023 · Artificial Intelligence

WeChat NLP Algorithm Microservice Governance: Challenges and Solutions

This article examines the governance of WeChat's NLP algorithm microservices, outlining the management, performance, and scheduling challenges they face and presenting solutions such as automated CI/CD pipelines, dynamic scaling, DAG‑based service composition, a custom tracing system, the PyInter interpreter, and an improved load‑balancing algorithm.

CI/CD · Microservices · NLP

0 likes · 12 min read

Alibaba Cloud Native

Jun 23, 2023 · Cloud Native

Accelerating LLM Inference on Alibaba Cloud with KServe and Fluid

This guide explains how to deploy large language models on Alibaba Cloud's ACK using KServe for serverless inference and how to integrate Fluid for distributed data caching to cut cold‑start latency, and it provides step‑by‑step commands, performance benchmarks, and practical tips for production‑grade AI model serving.

Cloud Native · Fluid · KServe

0 likes · 22 min read

Huolala Tech

Mar 23, 2023 · Cloud Native

How Huolala Built a Cloud‑Native One‑Stop AI Platform on Kubernetes

Huolala’s Big Data Intelligent Platform team describes how they built a cloud‑native, one‑stop AI solution on Kubernetes, integrating Flink‑based feature engineering, a multi‑tenant Zeppelin notebook, GPU‑aware training, and a unified model‑serving platform, while addressing resource isolation, storage persistence, and cross‑cloud deployment.

AI platform · Cloud Native · GPU scheduling

0 likes · 17 min read

Xiaohongshu Tech REDtech

Mar 21, 2023 · Artificial Intelligence

From Daily to Minute-Level Updates: Real-Time Recommendation System Enhancements at Xiaohongshu

Xiaohongshu transformed its recommendation pipeline from daily to minute‑level updates by redesigning recall, ranking and feature‑joining components, deploying a base‑plus‑incremental training scheme, migrating Spark to Flink, rewriting services in C++, and optimizing RocksDB, which yielded over 10% longer dwell time, 15% more interactions and roughly 50% higher new‑note efficiency.

Real-time Training · large-scale systems · model serving

0 likes · 20 min read

Smart Era Software Development

Mar 16, 2023 · Artificial Intelligence

10 Essential Elements of Machine Learning System Architecture

The article outlines ten core components—data and feature pipelines, feature store, training and retraining pipelines, metadata store, serving infrastructure, production monitoring, reusable ML pipelines, workflow orchestration, CI/CT/CD, and end‑to‑end quality control—that together form a scalable, reliable architecture for modern machine‑learning systems.

MLOps · feature engineering · machine learning

0 likes · 7 min read

JD Tech Talk

Nov 24, 2022 · Artificial Intelligence

Design and Implementation of an Online Inference Service for Risk‑Control Algorithms

This article describes the architecture, key features, dynamic deployment, performance optimizations, and real‑world results of a high‑throughput online inference platform that serves deep‑learning models for JD.com’s risk‑control decision engine, achieving a nearly hundredfold latency improvement.

AI · Microservices · Performance Optimization

0 likes · 11 min read

Snowball Engineer Team

Apr 11, 2022 · Artificial Intelligence

Design and Implementation of Snowball's Model Feature Management Platform

The article presents Snowball's model feature platform, detailing its motivation, architecture, feature lifecycle management, online engine design, optimization techniques, and the resulting improvements in feature iteration speed, reuse, and system stability for recommendation and search services.

Feature Management · feature engineering · machine learning

0 likes · 16 min read

YunZhu Net Technology Team

Oct 22, 2021 · Artificial Intelligence

Deep Learning Overview and Introduction to the Lightweight Distributed Inference Engine Avior

This article reviews deep learning and AI frameworks, highlights challenges of online model serving, and presents Avior—a lightweight, distributed inference engine designed for high‑performance AI services, detailing its architecture, layer design, benchmark results, and future development plans.

AI frameworks · Avior · Deep Learning

0 likes · 8 min read

Baidu Geek Talk

Aug 16, 2021 · Artificial Intelligence

Integrating Paddle Serving with Kong Security Gateway for AI Model Deployment

The article demonstrates how to integrate Paddle Serving’s new security‑gateway feature with the open‑source Kong API gateway and its Konga UI, using Docker‑Compose to create a secure, HTTPS‑encrypted, header‑authenticated AI model serving endpoint that hides internal services while supporting high‑concurrency inference.

AI · API Gateway · Docker

0 likes · 9 min read

DataFunTalk

Mar 16, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's large‑scale multimodal content understanding platform, covering its background, data and model heterogeneity challenges, the end‑to‑end workflow, GPU‑heterogeneous cluster design, resource scheduling, performance optimization for distributed training and online inference, and comprehensive monitoring to ensure stable, low‑latency AI services.

AI Infrastructure · GPU clustering · Multimodal AI

0 likes · 17 min read

DataFunSummit

Mar 9, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's multimodal content understanding platform, covering its massive data challenges, heterogeneous model support, standardized pipelines, platformization, workflow architecture, GPU heterogeneous cluster management, resource scheduling, performance optimization, and full‑stack monitoring to achieve stable, low‑latency AI services at scale.

GPU Cluster · Multimodal AI · Real-time inference

0 likes · 18 min read

360 Smart Cloud

Mar 4, 2021 · Artificial Intelligence

Optimizing BERT Online Service Deployment at 360 Search

This article describes the challenges of deploying a large BERT model as an online service for 360 Search and details engineering optimizations—including framework selection, model quantization, knowledge distillation, stream scheduling, caching, and dynamic sequence handling—that dramatically improve latency, throughput, and resource utilization.

BERT · FP16 quantization · GPU Optimization

0 likes · 12 min read

DataFunTalk

Mar 1, 2021 · Artificial Intelligence

Online Learning and Real‑Time Model Updating in JD Retail Search Using Flink

The article describes JD's end‑to‑end online learning pipeline for retail search, covering the background, system architecture, real‑time feature collection, sample stitching, Flink‑based incremental training, parameter updates, and full‑link monitoring to achieve low‑latency, high‑accuracy model serving.

Flink · feature engineering · model serving

0 likes · 9 min read

JD Tech Talk

Dec 18, 2020 · Artificial Intelligence

Model Online Inference System: Architecture, Components, and Deployment Strategies

This article examines the challenges of moving machine‑learning models from offline training to online serving and proposes a modular architecture (model gateway, data source gateway, business service center, monitoring, and RPC components) that enables rapid model deployment, version management, traffic mirroring, gray‑release, and real‑time monitoring.

machine learning · model serving · monitoring

0 likes · 10 min read

JD Tech Talk

Sep 17, 2020 · Artificial Intelligence

Design and Implementation of a High‑Availability Distributed Machine Learning Model Online Inference System

This article presents a comprehensive technical solution for a distributed online inference system that packages machine‑learning models in Docker containers, orchestrates them with Kubernetes for fault‑tolerant, elastic scaling, and integrates model repositories, image registries, monitoring, and automated model selection to streamline deployment, updates, and resource management.

AI · Docker · Kubernetes

0 likes · 15 min read

DataFunTalk

Aug 27, 2020 · Artificial Intelligence

Model Serving in Real-Time: Insights from Alibaba’s User Interest Center

This article explains Alibaba’s User Interest Center approach to real‑time model serving, detailing how it separates offline sequence modeling from lightweight online inference, uses an online interest‑embedding store, and dramatically reduces latency for recommendation models such as DIEN and MIMN.

Alibaba · Embedding · Real-time inference

0 likes · 8 min read

Meituan Technology Team

Jul 16, 2020 · Artificial Intelligence

Augur: An Online Model Inference Framework and Poker Platform for Meituan Search

Meituan’s AI‑driven search combines the Augur online inference framework—offering stateless, distributed feature operators, transformers, and a DSL for rapid, high‑throughput model scoring—with the Poker platform for model training, versioning, and experimentation, together accelerating iteration, improving performance, and enabling advanced model‑as‑feature ensembles.

AI platform · Search Engine · feature engineering

0 likes · 26 min read

Youzan Coder

Jun 17, 2020 · Artificial Intelligence

Sunfish: An Integrated AI Platform for Model Training and Online Service Deployment at Youzan

Sunfish is Youzan’s integrated AI platform that unifies visual drag‑and‑drop model training, notebook‑based algorithm development, automated model management and one‑click publishing with a low‑latency, high‑availability “small‑box” inference service, enabling end‑to‑end deep‑learning workflows from data exploration to online recommendation and risk‑control deployment.

AI platform · MLOps · Model Training

0 likes · 17 min read

58 Tech

Mar 27, 2020 · Artificial Intelligence

dl_inference: Open‑Source General Deep Learning Inference Service

dl_inference is an open‑source inference platform that simplifies deployment of TensorFlow and PyTorch models in production, offering unified gRPC access, load‑balanced multi‑node serving, GPU/CPU options, customizable pre‑ and post‑processing, and extensible architecture for future AI workloads.

AI inference · Deep Learning · Open-source

0 likes · 11 min read

DataFunTalk

Mar 12, 2020 · Artificial Intelligence

Model Evolution and Optimization for Recommendation Systems in a Mid‑size E‑commerce App

This article describes the end‑to‑end recommendation pipeline of the Shengqian Kuaibao app, covering business background, data collection, model training and evaluation, the evolution from FM to DeepFM, DIN, DCN, xDeepFM, ESMM and custom networks, as well as serving strategies and practical lessons learned.

CTR Prediction · Deep Learning · feature engineering

0 likes · 28 min read

DataFunTalk

Aug 14, 2018 · Artificial Intelligence

Machine Learning and Deep Learning Engineering Practices at Ping An Life

The article summarizes senior AI expert Wu Jianjun’s presentation on machine‑learning and deep‑learning engineering at Ping An Life, detailing the company’s big‑data platform, data processing pipelines, model training frameworks, distributed computing strategies, and production model‑serving architecture for financial applications.

Deep Learning · Distributed Computing · model serving

0 likes · 15 min read

Ctrip Technology

Jun 11, 2018 · Artificial Intelligence

Ctrip Model Engine Platform: An Integrated End‑to‑End Service for Real‑Time AI Model Deployment

The article introduces Ctrip’s Model Engine Platform, a comprehensive system that streamlines feature preparation, engineering, model management, and product orchestration to enable fast, reliable real‑time AI model serving across various business scenarios, while addressing common challenges such as manual data handling, offline‑only prediction, and long development cycles.

AI platform · Ctrip · Feature Management

0 likes · 15 min read
