Tag

model serving

0 views across the articles under this tag.

Alibaba Cloud Infrastructure
Feb 8, 2025 · Artificial Intelligence

Deploying a Production‑Ready DeepSeek‑R1 Inference Service on Alibaba Cloud ACK with KServe

This guide explains how to deploy a production‑ready DeepSeek‑R1 inference service on Alibaba Cloud ACK using KServe, covering model preparation, storage configuration, service deployment, observability, autoscaling, model acceleration, gray release, and GPU‑shared inference.

Alibaba Cloud · DeepSeek · GPU
0 likes · 13 min read
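As a hedged sketch of the client side of such a deployment: vLLM‑backed KServe runtimes typically expose an OpenAI‑compatible completion API, so a caller might build a request like the one below. The host and model name are hypothetical placeholders, not values from the article.

```python
import json
import urllib.request

def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 256) -> urllib.request.Request:
    """Build an OpenAI-compatible completion request for a KServe-hosted LLM.

    The /v1/completions path and field names follow the OpenAI wire format
    commonly exposed by vLLM-style runtimes; adjust them to your runtime.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical service host and model name for illustration only.
req = build_completion_request(
    "http://deepseek-r1.default.example.com",
    model="deepseek-r1",
    prompt="Explain KV-cache in one sentence.",
)
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) is left out so the sketch stays self-contained.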
iQIYI Technical Product Team
Oct 10, 2024 · Artificial Intelligence

Online Deep Learning (ODL) for Real‑Time Advertising Effectiveness: Challenges and Solutions

iQIYI’s minute‑level online deep‑learning framework overcomes stability, timeliness, compatibility, delayed feedback, catastrophic forgetting, and i.i.d. constraints through high‑availability pipelines, TensorFlow Example serialization, rapid P2P model distribution, flexible scheduling, disaster‑recovery rollbacks, PU‑loss adjustment, and knowledge distillation, delivering a 6.2% revenue boost.

CTR prediction · advertising · deep learning
0 likes · 9 min read
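The PU‑loss idea for delayed feedback can be illustrated with a generic non‑negative positive‑unlabeled risk estimator, a sketch of the general technique rather than iQIYI's exact formulation:

```python
import numpy as np

def nn_pu_risk(p_pos: np.ndarray, p_unl: np.ndarray, prior: float) -> float:
    """Non-negative PU risk estimate with a log loss.

    p_pos: model scores in (0, 1) for samples labeled positive.
    p_unl: scores for unlabeled samples (a mix of negatives and hidden
           positives, e.g. conversions whose feedback has not arrived yet).
    prior: assumed class prior pi = P(y = 1), treated as known here.
    """
    eps = 1e-7
    loss_pos = -np.log(p_pos + eps)             # positives scored as positive
    loss_pos_as_neg = -np.log(1 - p_pos + eps)  # positives scored as negative
    loss_unl_as_neg = -np.log(1 - p_unl + eps)  # unlabeled scored as negative
    # Correct the unlabeled risk by subtracting the positive mass hidden in
    # it; the max(0, .) clamp keeps the estimate non-negative.
    neg_risk = max(0.0, loss_unl_as_neg.mean() - prior * loss_pos_as_neg.mean())
    return prior * loss_pos.mean() + neg_risk

risk = nn_pu_risk(np.array([0.9, 0.8]), np.array([0.3, 0.1, 0.2]), prior=0.1)
```

The clamp matters under delayed feedback: with many hidden positives sitting in the unlabeled stream, the naive corrected risk can go negative and destabilize training.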
DataFunSummit
Feb 11, 2024 · Artificial Intelligence

GPU-Accelerated Model Service and Optimization Practices at Xiaohongshu

This article details Xiaohongshu's end‑to‑end GPU‑based transformation of its recommendation and search models, covering background, model characteristics, training and inference frameworks, system‑level and GPU‑level optimizations, compilation tricks, hardware upgrades, and future directions for large‑scale machine‑learning infrastructure.

GPU · Inference · Optimization
0 likes · 18 min read
Architecture & Thinking
Jan 14, 2024 · Artificial Intelligence

How Baidu Scales Content Understanding to Trillions of Pages with AI Engineering

This article explains how Baidu processes internet‑scale content by applying deep AI‑driven understanding, detailing cost‑optimization, efficiency improvements, model‑service frameworks, resource‑scheduling systems, and batch‑compute platforms that together enable trillion‑level indexing and feature extraction.

AI Engineering · HTAP storage · Resource Scheduling
0 likes · 16 min read
DataFunTalk
Dec 1, 2023 · Artificial Intelligence

GPU‑Driven Model Service and Optimization Practices in Xiaohongshu's Search Scenario

This article details Xiaohongshu's end‑to‑end GPU‑centric transformation for search‑related machine‑learning models, covering model characteristics, training and inference frameworks, system‑level GPU and CPU optimizations, multi‑card and compilation techniques, and future directions for scaling large sparse and dense models.

GPU optimization · Inference · Training
0 likes · 16 min read
DataFunTalk
Oct 19, 2023 · Artificial Intelligence

Multimodal Large Model Platform: History, Architecture, and Practice by Nine Chapters Cloud Extreme DataCanvas

This article presents Nine Chapters Cloud Extreme DataCanvas's insights and practices on multimodal large model platforms, covering their historical development, platform components such as AI Foundation Software and Prompt Manager, practical implementations like memory-augmented models and ETL pipelines, and future prospects for enterprise knowledge bases and agents.

AI Platform · Knowledge Base · Large Models
0 likes · 13 min read
DataFunSummit
Oct 2, 2023 · Artificial Intelligence

WeChat NLP Algorithm Microservice Governance: Challenges and Solutions

This article examines the governance of WeChat's NLP algorithm microservices, outlining the management, performance, and scheduling challenges they face and presenting solutions such as automated CI/CD pipelines, dynamic scaling, DAG‑based service composition, a custom tracing system, the PyInter interpreter, and an improved load‑balancing algorithm.

CI/CD · Load Balancing · Microservices
0 likes · 12 min read
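The flavor of load‑balancing improvement the article discusses can be sketched with the classic power‑of‑two‑choices policy, shown here as an illustrative baseline rather than WeChat's actual algorithm:

```python
import random

def pick_backend(loads: dict[str, int], rng: random.Random) -> str:
    """Power-of-two-choices: sample two backends, route to the less loaded.

    Sampling just two candidates avoids a global scan yet sharply reduces
    the chance of piling requests onto an already-hot worker.
    """
    a, b = rng.sample(list(loads), 2)
    return a if loads[a] <= loads[b] else b

rng = random.Random(0)
loads = {"worker-1": 5, "worker-2": 0, "worker-3": 9}
picks = [pick_backend(loads, rng) for _ in range(50)]
```

With static loads as above, the most loaded worker can never win a pairwise comparison, so it is never picked.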
Xiaohongshu Tech REDtech
Mar 21, 2023 · Artificial Intelligence

From Daily to Minute-Level Updates: Real-Time Recommendation System Enhancements at Xiaohongshu

Xiaohongshu transformed its recommendation pipeline from daily to minute‑level updates by redesigning recall, ranking and feature‑joining components, deploying a base‑plus‑incremental training scheme, migrating Spark to Flink, rewriting services in C++, and optimizing RocksDB, which yielded over 10% longer dwell time, 15% more interactions and roughly 50% higher new‑note efficiency.

Vector Search · large-scale systems · machine learning
0 likes · 20 min read
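The base‑plus‑incremental scheme can be sketched in miniature: the full model is published rarely, while minute‑level updates ship only the parameters that changed since the last base. The key names below are hypothetical.

```python
def incremental_update(base: dict, current: dict, tol: float = 0.0) -> dict:
    """Collect only the parameters that changed since the published base.

    A minimal sketch of base-plus-incremental publishing: new or changed
    entries (e.g. embedding rows) go into the diff; unchanged ones do not.
    """
    diff = {}
    for key, value in current.items():
        old = base.get(key)
        if old is None or any(abs(a - b) > tol for a, b in zip(old, value)):
            diff[key] = value
    return diff

base = {"note:1": [0.1, 0.2], "note:2": [0.3, 0.4]}
current = {"note:1": [0.1, 0.2], "note:2": [0.35, 0.4], "note:3": [0.5, 0.6]}
delta = incremental_update(base, current)
```

Shipping only `delta` keeps minute‑level pushes small even when the full embedding table is large.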
JD Tech Talk
Nov 24, 2022 · Artificial Intelligence

Design and Implementation of an Online Inference Service for Risk‑Control Algorithms

This article describes the architecture, key features, dynamic deployment, performance optimizations, and real‑world results of a high‑throughput online inference platform that serves deep‑learning models for JD.com’s risk‑control decision engine, achieving near‑hundred‑fold latency improvements.

AI · Microservices · Online Inference
0 likes · 11 min read
Snowball Engineer Team
Apr 11, 2022 · Artificial Intelligence

Design and Implementation of Snowball's Model Feature Management Platform

The article presents Snowball's model feature platform, detailing its motivation, architecture, feature lifecycle management, online engine design, optimization techniques, and the resulting improvements in feature iteration speed, reuse, and system stability for recommendation and search services.

Feature Engineering · feature management · machine learning
0 likes · 16 min read
YunZhu Net Technology Team
Oct 22, 2021 · Artificial Intelligence

Deep Learning Overview and Introduction to the Lightweight Distributed Inference Engine Avior

This article reviews deep learning and AI frameworks, highlights challenges of online model serving, and presents Avior—a lightweight, distributed inference engine designed for high‑performance AI services, detailing its architecture, layer design, benchmark results, and future development plans.

AI frameworks · Avior · Distributed Inference
0 likes · 8 min read
iQIYI Technical Product Team
Sep 24, 2021 · Artificial Intelligence

Memory Leak Diagnosis and Fixes for TensorFlow Serving in iQIYI’s Deep Learning Platform

The iQIYI deep‑learning platform identified two TensorFlow Serving memory‑leak problems—a string‑accumulating executor map caused by unordered input maps and an uncontrolled gRPC thread surge under heavy load—and submitted upstream patches that sort inputs and cap thread counts, eliminating OOM crashes and stabilizing production.

AI infrastructure · Memory Leak · Performance Optimization
0 likes · 10 min read
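The first bug is easy to reproduce in miniature: when a cache key is built from inputs in arrival order, every permutation of the same inputs mints a new entry. A hedged Python analogue of the sorting fix (the real patch lives in TensorFlow Serving's C++):

```python
def executor_key(input_names, *, sort_inputs: bool) -> str:
    """Build a cache key from signature input names.

    TF Serving keyed an internal executor map by input names; because the
    incoming map was unordered, equivalent requests produced distinct keys
    and the map grew without bound. Sorting makes the key canonical.
    """
    names = sorted(input_names) if sort_inputs else list(input_names)
    return ",".join(names)

# The same two inputs, arriving in different orders across requests.
requests = [("a", "b"), ("b", "a")]
cache_unsorted = {executor_key(p, sort_inputs=False) for p in requests}
cache_sorted = {executor_key(p, sort_inputs=True) for p in requests}
```

With n inputs the unsorted key space grows factorially, which is exactly the slow, unbounded leak the team observed.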
Baidu Geek Talk
Aug 16, 2021 · Artificial Intelligence

Integrating Paddle Serving with Kong Security Gateway for AI Model Deployment

The article demonstrates how to integrate Paddle Serving’s new security‑gateway feature with the open‑source Kong API gateway and its Konga UI, using Docker Compose to create a secure, HTTPS‑encrypted, header‑authenticated AI model serving endpoint that hides internal services while supporting high‑concurrency inference.

AI · API Gateway · Docker
0 likes · 9 min read
DataFunTalk
Mar 16, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's large‑scale multimodal content understanding platform, covering its background, data and model heterogeneity challenges, the end‑to‑end workflow, GPU‑heterogeneous cluster design, resource scheduling, performance optimization for distributed training and online inference, and comprehensive monitoring to ensure stable, low‑latency AI services.

AI infrastructure · GPU clustering · Weibo
0 likes · 17 min read
DataFunSummit
Mar 9, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's multimodal content understanding platform, covering its massive data challenges, heterogeneous model support, standardized pipelines, platformization, workflow architecture, GPU heterogeneous cluster management, resource scheduling, performance optimization, and full‑stack monitoring to achieve stable, low‑latency AI services at scale.

GPU Cluster · Weibo · distributed training
0 likes · 18 min read
360 Smart Cloud
Mar 4, 2021 · Artificial Intelligence

Optimizing BERT Online Service Deployment at 360 Search

This article describes the challenges of deploying a large BERT model as an online service for 360 Search and details engineering optimizations—including framework selection, model quantization, knowledge distillation, stream scheduling, caching, and dynamic sequence handling—that dramatically improve latency, throughput, and resource utilization.

BERT · FP16 quantization · GPU optimization
0 likes · 12 min read
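The dynamic‑sequence idea from the summary can be sketched as padding each batch to its own longest sequence instead of a fixed maximum, a minimal illustration of the technique rather than 360's production code:

```python
def dynamic_pad(batch: list[list[int]], pad_id: int = 0) -> list[list[int]]:
    """Pad a batch of token-id sequences to the batch's own max length.

    Search queries are mostly short, so computing on max-in-batch tokens
    instead of a fixed length (e.g. 512) cuts wasted GPU work; pairing this
    with length-based batching keeps sequences in a batch similar in size.
    """
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

padded = dynamic_pad([[101, 7, 102], [101, 102]])
```

Here the batch is padded to length 3 rather than a model‑wide maximum, so the short query pays for only one pad token.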
DataFunTalk
Mar 1, 2021 · Artificial Intelligence

Online Learning and Real‑Time Model Updating in JD Retail Search Using Flink

The article describes JD's end‑to‑end online learning pipeline for retail search, covering the background, system architecture, real‑time feature collection, sample stitching, Flink‑based incremental training, parameter updates, and full‑link monitoring to achieve low‑latency, high‑accuracy model serving.

Feature Engineering · Flink · model serving
0 likes · 9 min read
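The sample‑stitching step can be sketched as a keyed join: features are buffered when a request is served, and the delayed label arrives later on a separate stream. A minimal single‑process analogue of what Flink does with keyed state (windowing and state TTL omitted):

```python
def stitch(features_by_id: dict, labels: list[tuple[str, int]]):
    """Join streamed labels onto buffered features by request id.

    Labels whose features are no longer buffered (e.g. the join window
    expired) are dropped; matched features are consumed exactly once.
    """
    samples = []
    for req_id, label in labels:
        feats = features_by_id.pop(req_id, None)
        if feats is not None:
            samples.append((feats, label))
    return samples

buffered = {"r1": [0.2, 0.8], "r2": [0.5, 0.1]}
samples = stitch(buffered, [("r2", 1), ("r9", 0)])
```

The stitched `(features, label)` pairs are what feed the incremental trainer; label "r9" is dropped because its features were never buffered.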
DataFunSummit
Feb 4, 2021 · Artificial Intelligence

Full-Stack Machine Learning Platform: Architecture, Key Factors, and Implementation Details

This article examines the evolution of user data, computing power, and models, and presents the design principles, key architectural factors, and practical implementation techniques for building a full‑stack machine learning platform that supports large‑scale data processing, distributed training, and low‑latency online serving.

Resource Scheduling · big data integration · data pipelines
0 likes · 15 min read
JD Tech Talk
Dec 18, 2020 · Artificial Intelligence

Model Online Inference System: Architecture, Components, and Deployment Strategies

This article examines the challenges of moving machine‑learning models from offline training to online serving, and proposes a modular architecture—including model gateway, data source gateway, business service center, monitoring, and RPC components—that enables rapid model deployment, version management, traffic mirroring, gray release, and real‑time monitoring.

Deployment · Online Inference · machine learning
0 likes · 10 min read
JD Tech Talk
Sep 17, 2020 · Artificial Intelligence

Design and Implementation of a High‑Availability Distributed Machine Learning Model Online Inference System

This article presents a comprehensive technical solution for a distributed online inference system that packages machine‑learning models in Docker containers, orchestrates them with Kubernetes for fault‑tolerant, elastic scaling, and integrates model repositories, image registries, monitoring, and automated model selection to streamline deployment, updates, and resource management.

AI · Docker · Kubernetes
0 likes · 15 min read