Tagged articles
37 articles
Machine Learning Algorithms & Natural Language Processing
Apr 21, 2026 · Artificial Intelligence

Can Linear Attention Complete Prefill-as-a-Service for Cross‑Datacenter Heterogeneous PD Separation?

The article analyzes why the massive KVCache bandwidth required by heterogeneous prefill/decode (PD) separation cannot be solved at the system level, proposes a Prefill‑as‑a‑Service architecture that uses linear‑attention models to reduce KVCache generation, and validates the design with a 1‑trillion‑parameter Kimi Linear deployment that achieves 54% higher throughput and 64% lower P90 TTFT across a 100 Gbps inter‑datacenter link.

Heterogeneous PD · Inference Optimization · KVCache
0 likes · 7 min read
58 Tech
Jan 6, 2026 · Artificial Intelligence

How vLLM 0.8.4 Implements Multi‑LoRA for Efficient Large‑Model Inference

This article provides a step‑by‑step technical walkthrough of vLLM 0.8.4 on a single GPU, detailing the platform’s startup, model loading, Multi‑LoRA deployment, internal ZMQ communication, request scheduling, and inference execution, while exposing key source‑code snippets and architectural diagrams.

GPU inference · LoRA adapters · Model Serving
0 likes · 35 min read
Data Party THU
Sep 30, 2025 · Backend Development

Ray Serve vs Celery: Which Is Best for GPU‑Intensive Parallel Workloads?

This article compares Ray Serve and Celery, explaining their design philosophies, scaling models, GPU‑aware scheduling, operational trade‑offs, and real‑world case studies to help engineers choose the right tool for high‑throughput online inference or large‑scale batch processing.

Distributed Systems · GPU · Model Serving
0 likes · 9 min read
Code Wrench
Sep 22, 2025 · Artificial Intelligence

Build a Private ChatGPT on Your Laptop with Ollama, DeepSeek‑R1 and Go MCP

This guide walks you through installing Ollama, pulling the open‑source DeepSeek‑R1:1.5B model, wrapping it with a Go‑based Model Context Protocol (MCP) server, creating a client example, and enhancing the experience with Open‑WebUI while offering performance‑tuning tips.

DeepSeek · Go · Local AI
0 likes · 9 min read
IT Services Circle
Sep 16, 2025 · Artificial Intelligence

Why TensorFlow Is Dying and What the New AI Open‑Source Landscape Looks Like

An in‑depth analysis reveals TensorFlow’s rapid decline, the rise of PyTorch, and how Ant Group’s OpenRank‑driven “Large Model Open‑Source Ecosystem Panorama 2.0” maps shifting trends, from short‑term hype projects to performance‑focused AI infrastructure, highlighting the emerging US‑China dominance in AI open‑source development.

AI ecosystem · AI open-source · Model Serving
0 likes · 15 min read
Architect's Alchemy Furnace
Mar 27, 2025 · Artificial Intelligence

Xinference vs Ollama: Which Open‑Source LLM Engine Fits Your Needs?

This article provides a comprehensive side‑by‑side comparison of the open‑source LLM serving tools Xinference and Ollama, examining their core goals, architecture, model support, deployment options, performance, ecosystem integration, typical use cases, future roadmap, and guidance on selecting the right solution for enterprise or personal projects.

Comparison · LLM · Model Serving
0 likes · 7 min read
Baidu Geek Talk
Feb 10, 2025 · Artificial Intelligence

How Baidu Cloud Slashes Inference Costs: DeepSeek Model Optimizations Unveiled

Baidu Cloud's Qianfan platform launched DeepSeek‑R1 and DeepSeek‑V3 with ultra‑low inference pricing, leveraging advanced engine performance tweaks, a split Prefill/Decode architecture, and comprehensive security measures that together boost throughput, cut costs, and ensure enterprise‑grade reliability.

AI inference · Baidu Cloud · Large Language Models
0 likes · 5 min read
Sohu Tech Products
Oct 18, 2024 · Artificial Intelligence

Optimizing AI Inference with TorchServe: Tackling GPU Bottlenecks & Kubernetes

This article details a comprehensive engineering practice for optimizing AI inference services at ZhiZhuan, covering background analysis, selection of TorchServe over alternatives, GPU/CPU performance tuning, custom handlers, Torch‑TRT integration, and deployment on Kubernetes, with measured improvements in throughput and resource utilization.

AI inference · GPU Optimization · Kubernetes
0 likes · 16 min read
iQIYI Technical Product Team
Oct 10, 2024 · Artificial Intelligence

Online Deep Learning (ODL) for Real‑Time Advertising Effectiveness: Challenges and Solutions

iQIYI’s minute‑level online deep‑learning framework overcomes stability, timeliness, compatibility, delayed feedback, catastrophic forgetting, and i.i.d. constraints through high‑availability pipelines, TensorFlow Example serialization, rapid P2P model distribution, flexible scheduling, disaster‑recovery rollbacks, PU‑loss adjustment, and knowledge‑distillation, delivering a 6.2% revenue boost.

Advertising · CTR prediction · Deep Learning
0 likes · 9 min read
DataFunSummit
Feb 11, 2024 · Artificial Intelligence

GPU-Accelerated Model Service and Optimization Practices at Xiaohongshu

This article details Xiaohongshu's end‑to‑end GPU‑based transformation of its recommendation and search models, covering background, model characteristics, training and inference frameworks, system‑level and GPU‑level optimizations, compilation tricks, hardware upgrades, and future directions for large‑scale machine‑learning infrastructure.

GPU · Model Serving · Training
0 likes · 18 min read
Meituan Technology Team
Jan 25, 2024 · Artificial Intelligence

Design and Implementation of a Distributed Causal Forest Framework on Meituan's Fulfillment Platform

Meituan’s Fulfillment Platform team built a high‑performance distributed causal‑forest framework—named Causal On Spark—that trains hundreds of trees on hundreds of millions of samples within minutes using MapReduce‑based histogram splitting, extensive memory optimizations, Parquet model serving, and novel distributed evaluation metrics, enabling scalable causal inference for pricing, subsidies, and marketing.

Model Serving · Spark · causal forest
0 likes · 23 min read
Architecture & Thinking
Jan 14, 2024 · Artificial Intelligence

How Baidu Scales Content Understanding to Trillions of Pages with AI Engineering

This article explains how Baidu processes internet‑scale content by applying deep AI‑driven understanding, detailing cost‑optimization, efficiency improvements, model‑service frameworks, resource‑scheduling systems, and batch‑compute platforms that together enable trillion‑level indexing and feature extraction.

AI Engineering · Batch Computing · HTAP storage
0 likes · 16 min read
DataFunTalk
Dec 1, 2023 · Artificial Intelligence

GPU‑Driven Model Service and Optimization Practices in Xiaohongshu's Search Scenario

This article details Xiaohongshu's end‑to‑end GPU‑centric transformation for search‑related machine‑learning models, covering model characteristics, training and inference frameworks, system‑level GPU and CPU optimizations, multi‑card and compilation techniques, and future directions for scaling large sparse and dense models.

GPU Optimization · Inference · Model Serving
0 likes · 16 min read
Alibaba Cloud Native
Nov 22, 2023 · Cloud Native

Build a Sidecarless AI Application with Alibaba Cloud Service Mesh ASM – Step‑by‑Step Guide

This guide walks you through creating a sidecarless AI demo on Alibaba Cloud Service Mesh ASM, covering environment setup, multi‑model serving with KServe, PVC storage, InferenceService configuration, business service deployment, gateway and waypoint creation, traffic routing rules, and OIDC single sign‑on integration.

AI · ASM · KServe
0 likes · 28 min read
Baidu Geek Talk
Nov 20, 2023 · Operations

How Baidu Scales Content Understanding to Trillion‑Scale: Architecture, Optimization, and Scheduling Insights

This article details Baidu Search's engineering practice for trillion‑scale content understanding, covering cost and efficiency challenges, model‑service framework, batch‑compute platform, resource‑scheduling system, HTAP storage design, and concrete optimization techniques such as multi‑process Python serving, dynamic batching, and two‑stage scheduling.

Baidu · Big Data · HTAP
0 likes · 18 min read
DataFunSummit
Oct 2, 2023 · Artificial Intelligence

WeChat NLP Algorithm Microservice Governance: Challenges and Solutions

This article examines the governance of WeChat's NLP algorithm microservices, outlining the management, performance, and scheduling challenges they face and presenting solutions such as automated CI/CD pipelines, dynamic scaling, DAG‑based service composition, a custom tracing system, the PyInter interpreter, and an improved load‑balancing algorithm.

Microservices · Model Serving · NLP
0 likes · 12 min read
Alibaba Cloud Native
Jun 23, 2023 · Cloud Native

Accelerating LLM Inference on Alibaba Cloud with KServe and Fluid

This guide explains how to deploy large language models on Alibaba Cloud's ACK using KServe for serverless inference, integrates Fluid for distributed data caching to cut cold‑start latency, provides step‑by‑step commands, performance benchmarks, and practical tips for production‑grade AI model serving.

Cloud Native · Fluid · KServe
0 likes · 22 min read
Huolala Tech
Mar 23, 2023 · Cloud Native

How Huolala Built a Cloud‑Native One‑Stop AI Platform on Kubernetes

Huolala’s Big Data Intelligent Platform team describes how they built a cloud‑native, one‑stop AI solution on Kubernetes, integrating Flink‑based feature engineering, a multi‑tenant Zeppelin notebook, GPU‑aware training, and a unified model‑serving platform, while addressing resource isolation, storage persistence, and cross‑cloud deployment.

AI Platform · Cloud Native · GPU scheduling
0 likes · 17 min read
Xiaohongshu Tech REDtech
Mar 21, 2023 · Artificial Intelligence

From Daily to Minute-Level Updates: Real-Time Recommendation System Enhancements at Xiaohongshu

Xiaohongshu transformed its recommendation pipeline from daily to minute‑level updates by redesigning recall, ranking and feature‑joining components, deploying a base‑plus‑incremental training scheme, migrating Spark to Flink, rewriting services in C++, and optimizing RocksDB, which yielded over 10% longer dwell time, 15% more interactions and roughly 50% higher new‑note efficiency.

Model Serving · Real-time Training · large-scale systems
0 likes · 20 min read
JD Tech Talk
Nov 24, 2022 · Artificial Intelligence

Design and Implementation of an Online Inference Service for Risk‑Control Algorithms

This article describes the architecture, key features, dynamic deployment, performance optimizations, and real‑world results of a high‑throughput online inference platform that serves deep‑learning models for JD.com’s risk‑control decision engine, achieving near‑hundred‑fold latency improvements.

AI · Microservices · Model Serving
0 likes · 11 min read
Snowball Engineer Team
Apr 11, 2022 · Artificial Intelligence

Design and Implementation of Snowball's Model Feature Management Platform

The article presents Snowball's model feature platform, detailing its motivation, architecture, feature lifecycle management, online engine design, optimization techniques, and the resulting improvements in feature iteration speed, reuse, and system stability for recommendation and search services.

Feature Management · Model Serving · feature engineering
0 likes · 16 min read
YunZhu Net Technology Team
Oct 22, 2021 · Artificial Intelligence

Deep Learning Overview and Introduction to the Lightweight Distributed Inference Engine Avior

This article reviews deep learning and AI frameworks, highlights challenges of online model serving, and presents Avior—a lightweight, distributed inference engine designed for high‑performance AI services, detailing its architecture, layer design, benchmark results, and future development plans.

AI frameworks · Avior · Deep Learning
0 likes · 8 min read
Baidu Geek Talk
Aug 16, 2021 · Artificial Intelligence

Integrating Paddle Serving with Kong Security Gateway for AI Model Deployment

The article demonstrates how to integrate Paddle Serving’s new security‑gateway feature with the open‑source Kong API gateway and its Konga UI, using Docker‑Compose to create a secure, HTTPS‑encrypted, header‑authenticated AI model serving endpoint that hides internal services while supporting high‑concurrency inference.

AI · Docker · Kong
0 likes · 9 min read
DataFunTalk
Mar 16, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's large‑scale multimodal content understanding platform, covering its background, data and model heterogeneity challenges, the end‑to‑end workflow, GPU‑heterogeneous cluster design, resource scheduling, performance optimization for distributed training and online inference, and comprehensive monitoring to ensure stable, low‑latency AI services.

AI Infrastructure · Distributed Training · GPU clustering
0 likes · 17 min read
DataFunSummit
Mar 9, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's multimodal content understanding platform, covering its massive data challenges, heterogeneous model support, standardized pipelines, platformization, workflow architecture, GPU heterogeneous cluster management, resource scheduling, performance optimization, and full‑stack monitoring to achieve stable, low‑latency AI services at scale.

Distributed Training · GPU cluster · Model Serving
0 likes · 18 min read
360 Smart Cloud
Mar 4, 2021 · Artificial Intelligence

Optimizing BERT Online Service Deployment at 360 Search

This article describes the challenges of deploying a large BERT model as an online service for 360 Search and details engineering optimizations—including framework selection, model quantization, knowledge distillation, stream scheduling, caching, and dynamic sequence handling—that dramatically improve latency, throughput, and resource utilization.

BERT · FP16 quantization · GPU Optimization
0 likes · 12 min read
DataFunTalk
Mar 1, 2021 · Artificial Intelligence

Online Learning and Real‑Time Model Updating in JD Retail Search Using Flink

The article describes JD's end‑to‑end online learning pipeline for retail search, covering the background, system architecture, real‑time feature collection, sample stitching, Flink‑based incremental training, parameter updates, and full‑link monitoring to achieve low‑latency, high‑accuracy model serving.

Flink · Model Serving · Online Learning
0 likes · 9 min read
JD Tech Talk
Dec 18, 2020 · Artificial Intelligence

Model Online Inference System: Architecture, Components, and Deployment Strategies

This article examines the challenges of moving machine‑learning models from offline training to online serving, proposes a modular architecture—including model gateway, data source gateway, business service center, monitoring, and RPC components—to enable rapid model deployment, version management, traffic mirroring, gray‑release, and real‑time monitoring.

Model Serving · machine learning · monitoring
0 likes · 10 min read
JD Tech Talk
Sep 17, 2020 · Artificial Intelligence

Design and Implementation of a High‑Availability Distributed Machine Learning Model Online Inference System

This article presents a comprehensive technical solution for a distributed online inference system that packages machine‑learning models in Docker containers, orchestrates them with Kubernetes for fault‑tolerant, elastic scaling, and integrates model repositories, image registries, monitoring, and automated model selection to streamline deployment, updates, and resource management.

AI · Docker · Kubernetes
0 likes · 15 min read
DataFunTalk
Aug 27, 2020 · Artificial Intelligence

Model Serving in Real-Time: Insights from Alibaba’s User Interest Center

This article explains Alibaba’s User Interest Center approach to real‑time model serving, detailing how it separates offline sequence modeling from lightweight online inference, uses an online interest‑embedding store, and dramatically reduces latency for recommendation models such as DIEN and MIMN.

Alibaba · Embedding · Model Serving
0 likes · 8 min read
Meituan Technology Team
Jul 16, 2020 · Artificial Intelligence

Augur: An Online Model Inference Framework and Poker Platform for Meituan Search

Meituan’s AI‑driven search combines the Augur online inference framework—offering stateless, distributed feature operators, transformers, and a DSL for rapid, high‑throughput model scoring—with the Poker platform for model training, versioning, and experimentation, together accelerating iteration, improving performance, and enabling advanced model‑as‑feature ensembles.

AI Platform · Model Serving · feature engineering
0 likes · 26 min read
Youzan Coder
Jun 17, 2020 · Artificial Intelligence

Sunfish: An Integrated AI Platform for Model Training and Online Service Deployment at Youzan

Sunfish is Youzan’s integrated AI platform that unifies visual drag‑and‑drop model training, notebook‑based algorithm development, automated model management and one‑click publishing with a low‑latency, high‑availability “small‑box” inference service, enabling end‑to‑end deep‑learning workflows from data exploration to online recommendation and risk‑control deployment.

AI Platform · MLOps · Model Serving
0 likes · 17 min read
58 Tech
Mar 27, 2020 · Artificial Intelligence

dl_inference: Open‑Source General Deep Learning Inference Service

dl_inference is an open‑source inference platform that simplifies deployment of TensorFlow and PyTorch models in production, offering unified gRPC access, load‑balanced multi‑node serving, GPU/CPU options, customizable pre‑ and post‑processing, and extensible architecture for future AI workloads.

AI inference · Deep Learning · Model Serving
0 likes · 11 min read
DataFunTalk
Mar 12, 2020 · Artificial Intelligence

Model Evolution and Optimization for Recommendation Systems in a Mid‑size E‑commerce App

This article describes the end‑to‑end recommendation pipeline of the Shengqian Kuaibao (money‑saving deals) app, covering business background, data collection, model training and evaluation, the evolution from FM to DeepFM, DIN, DCN, xDeepFM, ESMM and custom networks, as well as serving strategies and practical lessons learned.

CTR prediction · Deep Learning · Model Serving
0 likes · 28 min read
DataFunTalk
Aug 14, 2018 · Artificial Intelligence

Machine Learning and Deep Learning Engineering Practices at Ping An Life

The article summarizes senior AI expert Wu Jianjun’s presentation on machine‑learning and deep‑learning engineering at Ping An Life, detailing the company’s big‑data platform, data processing pipelines, model training frameworks, distributed computing strategies, and production model‑serving architecture for financial applications.

Deep Learning · Model Serving · distributed computing
0 likes · 15 min read
Ctrip Technology
Jun 11, 2018 · Artificial Intelligence

Ctrip Model Engine Platform: An Integrated End‑to‑End Service for Real‑Time AI Model Deployment

The article introduces Ctrip’s Model Engine Platform, a comprehensive system that streamlines feature preparation, engineering, model management, and product orchestration to enable fast, reliable real‑time AI model serving across various business scenarios, while addressing common challenges such as manual data handling, offline‑only prediction, and long development cycles.

AI Platform · Ctrip · Feature Management
0 likes · 15 min read