Tag: Online Inference


JD Tech Talk
Feb 20, 2025 · Artificial Intelligence

Multi‑Agent Architecture for an E‑Commerce Business Assistant: Design, Planning, Evaluation, and Sample Generation

The document describes the evolution, design principles, key technologies, online inference workflow, evaluation methods, and sample‑generation techniques of a large‑language‑model‑based multi‑agent system that powers a 24/7 e‑commerce merchant assistant, highlighting its benefits, challenges, and future work.

AI planning · LLM · Online Inference
21 min read
Tencent Advertising Technology
Jan 9, 2025 · Artificial Intelligence

Applying Large Language Models to Search Advertising: End‑to‑End Generative Recall and System Optimizations

This report details how large language models (LLMs) were integrated into Tencent's search advertising pipeline—from early extraction‑distillation experiments in 2023 to a 2024 end‑to‑end generative recall architecture—showing significant improvements in relevance, diversity, and revenue through knowledge injection, supervised fine‑tuning, constrained beam‑search decoding, and high‑performance inference services.
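The constrained beam-search decoding mentioned above is not shown in detail; a minimal sketch of the general idea, assuming a trie of allowed token sequences restricts each decoding step so the model can only emit valid catalogue entries (all names and the uniform scorer are hypothetical, not Tencent's implementation):

```python
import heapq

def build_trie(sequences):
    """Trie over allowed token sequences; only these may be generated."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node["<end>"] = {}
    return root

def constrained_beam_search(score_fn, trie, beam_width=2, max_len=8):
    """Beam search that only expands tokens present in the current trie
    node, so every finished hypothesis is a valid allowed sequence."""
    beams = [(0.0, [], trie)]  # (cumulative neg score, tokens, trie node)
    finished = []
    for _ in range(max_len):
        candidates = []
        for cost, toks, node in beams:
            for tok, child in node.items():
                if tok == "<end>":
                    finished.append((cost, toks))
                else:
                    candidates.append((cost - score_fn(toks, tok), toks + [tok], child))
        beams = heapq.nsmallest(beam_width, candidates, key=lambda c: c[0])
        if not beams:
            break
    return [toks for _, toks in sorted(finished)]
```

With a real model, `score_fn` would return the LLM's log-probability for `tok` given the prefix; the trie guarantees the generated ad keywords stay inside the valid inventory.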

AI · Knowledge Injection · LLM
11 min read
JD Retail Technology
Jan 25, 2024 · Artificial Intelligence

Optimizing High‑Concurrency Online Inference for Recommendation Models with Distributed Heterogeneous Computing and GPU Acceleration

This article describes how JD Retail's advertising technology team tackled the high‑compute demands of modern recommendation models by designing a distributed graph‑partitioned heterogeneous computing framework, introducing TensorBatch request aggregation, leveraging deep‑learning compiler bucketing and asynchronous compilation, and implementing a multi‑stream GPU architecture to dramatically improve online inference throughput and latency.
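The TensorBatch aggregation and compiler-bucketing ideas named above can be sketched roughly as follows; this is a simplified illustration under assumed names, not JD's framework:

```python
def aggregate_requests(requests, max_batch_size):
    """Group individual inference requests into batches so one fused
    model call amortizes per-request launch and transfer overhead."""
    return [requests[i:i + max_batch_size]
            for i in range(0, len(requests), max_batch_size)]

def run_batched(model_fn, requests, max_batch_size=32):
    """Score all requests with one model call per aggregated batch."""
    results = []
    for batch in aggregate_requests(requests, max_batch_size):
        results.extend(model_fn(batch))  # one fused forward pass per batch
    return results

BUCKETS = (8, 16, 32)  # hypothetical pre-compiled batch sizes

def pad_to_bucket(batch, pad_item):
    """Pad a batch up to the nearest pre-compiled bucket size so the
    deep-learning compiler reuses a cached kernel instead of
    recompiling for every distinct batch shape."""
    for b in BUCKETS:
        if len(batch) <= b:
            return batch + [pad_item] * (b - len(batch))
    raise ValueError("batch larger than largest bucket")
```

In a real system the aggregation happens across concurrent requests under a timeout, and the padded rows are masked out of the results.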

GPU Acceleration · Online Inference · Recommendation systems
13 min read
Bilibili Tech
Dec 29, 2023 · Artificial Intelligence

Performance Optimization of Bilibili's Online Inference Service for the Effect Advertising Engine

To cope with soaring traffic on Bilibili's effect-advertising engine, the team systematically measured latency, eliminated redundant Redis calls, switched serialization from JSON to Protobuf, applied branch-prediction hints, loop unrolling, and AVX256 SIMD, and introduced object pooling and an inverted-index request format, cutting CPU usage by 21% and boosting peak throughput by 13%.
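The object-pooling technique from the summary above can be sketched language-agnostically (the article's implementation is C++; this Python sketch under assumed names only shows the reuse pattern, not Bilibili's code):

```python
class ObjectPool:
    """Reuse expensive request/response objects instead of allocating
    and freeing one per request, trading a reset call for allocator
    pressure and cache misses."""

    def __init__(self, factory, size):
        self._factory = factory
        self._free = [factory() for _ in range(size)]

    def acquire(self):
        # Reuse a pooled object when available; allocate only on miss.
        return self._free.pop() if self._free else self._factory()

    def release(self, obj):
        obj.clear()              # reset state before returning to the pool
        self._free.append(obj)
```

Usage: `pool = ObjectPool(dict, 64)`, then `buf = pool.acquire()` per request and `pool.release(buf)` when the response is sent.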

C++ · Memory Management · Online Inference
21 min read
DataFunTalk
Dec 13, 2022 · Artificial Intelligence

End-to-End Machine Learning Application Using OpenMLDB and Alibaba Cloud MaxCompute

This article demonstrates how to build a complete end-to-end machine-learning workflow for taxi trip duration prediction by integrating OpenMLDB with Alibaba Cloud MaxCompute’s serverless services, covering environment setup, offline data ingestion, feature extraction, model training, deployment, and real-time online inference within 20 ms.

Feature Store · MaxCompute · Online Inference
13 min read
JD Tech Talk
Nov 24, 2022 · Artificial Intelligence

Design and Implementation of an Online Inference Service for Risk‑Control Algorithms

This article describes the architecture, key features, dynamic deployment, performance optimizations, and real‑world results of a high‑throughput online inference platform that serves deep‑learning models for JD.com’s risk‑control decision engine, achieving near‑hundred‑fold latency improvements.

AI · Online Inference · Performance Optimization
11 min read
NetEase LeiHuo UX Big Data Technology
Jul 14, 2022 · Artificial Intelligence

Evolution of Real‑Time Game Recommendation System at NetEase Leihuo

The article reviews the development of NetEase Leihuo's game recommendation system, covering the shift from offline batch recommendation to real‑time feature engineering and online inference, detailing architecture design, practical experiences, performance optimizations, and future directions such as real‑time training.

AI · Online Inference · Performance Optimization
8 min read
NetEase Cloud Music Tech Team
May 19, 2022 · Artificial Intelligence

Performance Evaluation of Cloud Music Online Estimation System on NUMA Architecture

Evaluating the Cloud Music online estimation system on NUMA‑based servers revealed that CPU pinning across both memory nodes dramatically boosts throughput on high‑end 96‑core machines—up to 75% for complex models—while low‑end servers gain only modestly, confirming NUMA‑aware scheduling’s critical role for CPU‑intensive inference workloads.
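The NUMA-aware scheduling idea above amounts to partitioning worker processes across memory nodes so each worker's threads and allocations stay node-local; a rough sketch assuming a hypothetical two-node, 96-CPU topology (the planning logic only; on Linux the mask would be applied with `os.sched_setaffinity`):

```python
def numa_affinity(worker_id, numa_nodes):
    """Map a worker to the CPU set of one NUMA node (round-robin),
    so its memory and threads stay on the same node and avoid
    cross-node memory latency."""
    node = worker_id % len(numa_nodes)
    return node, set(numa_nodes[node])

# Hypothetical 2-node topology: 96 logical CPUs split evenly.
NODES = [list(range(0, 48)), list(range(48, 96))]

def pin_worker(worker_id):
    node, cpus = numa_affinity(worker_id, NODES)
    # os.sched_setaffinity(0, cpus) would apply the mask on Linux;
    # here we just return the planned placement.
    return node, cpus
```

The 75% throughput gain the summary cites came from exactly this kind of placement on the 96-core machines, where unpinned threads otherwise bounce across nodes.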

CPU architecture · NUMA · Online Inference
8 min read
Alimama Tech
Dec 8, 2021 · Artificial Intelligence

Dual Vector Foil (DVF): Decoupled Index and Model Retrieval for Large-Scale Recall

The Dual Vector Foil (DVF) system decouples index construction from model training by building a post‑training HNSW graph, enabling any complex model to score candidates, which yields a 5.7 % recall boost, cuts latency from ~40 ms to 6.5 ms, and raises QPS over tenfold while simplifying maintenance.
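DVF's decoupling can be illustrated as a two-stage flow: a cheap vector index proposes candidates, then an arbitrarily complex model re-scores only that shortlist. A minimal sketch with a brute-force stand-in for the HNSW graph (all names hypothetical, not Alimama's code):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def retrieve_then_rescore(query_vec, index, complex_model,
                          k_candidates=100, k_final=10):
    """Stage 1: the index (HNSW in DVF; brute force here) proposes
    candidates by vector similarity. Stage 2: any complex model
    re-scores only the shortlist, so index construction and model
    training can evolve independently."""
    candidates = sorted(index,
                        key=lambda item: -cosine(query_vec, item[1]))[:k_candidates]
    rescored = sorted(candidates,
                      key=lambda item: -complex_model(query_vec, item[1]))
    return [item_id for item_id, _ in rescored[:k_final]]
```

Because the shortlist is small, `complex_model` can be far heavier than anything an index-coupled scheme could afford, which is the source of the recall and latency gains cited above.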

Indexing · Online Inference · deep learning
27 min read
360 Smart Cloud
Mar 4, 2021 · Artificial Intelligence

Optimizing BERT Online Service Deployment at 360 Search

This article describes the challenges of deploying a large BERT model as an online service for 360 Search and details engineering optimizations—including framework selection, model quantization, knowledge distillation, stream scheduling, caching, and dynamic sequence handling—that dramatically improve latency, throughput, and resource utilization.
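The dynamic sequence handling mentioned above generally means padding each batch only to its own longest sequence rather than a fixed global maximum; a minimal sketch of that idea (rounding to a kernel-friendly multiple is an assumption, not necessarily 360's exact scheme):

```python
def pad_batch_dynamic(token_id_seqs, pad_id=0, multiple=8):
    """Pad each batch to its own longest sequence, rounded up to a
    multiple that keeps GPU kernels efficient, instead of always
    padding to the model's global max length."""
    longest = max(len(s) for s in token_id_seqs)
    target = -(-longest // multiple) * multiple  # ceil to multiple
    return [s + [pad_id] * (target - len(s)) for s in token_id_seqs]
```

For short queries this cuts the wasted compute on padding tokens roughly in proportion to how far typical lengths fall below the global maximum.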

BERT · FP16 quantization · GPU optimization
12 min read
360 Tech Engineering
Mar 1, 2021 · Artificial Intelligence

Deploying BERT as an Online Service: Challenges and Optimizations at 360 Search

This article details the engineering challenges of serving a large BERT model in real‑time for 360 Search and describes a series of optimizations—including TensorRT‑based kernel fusion, model quantization, knowledge distillation, multi‑stream execution, caching, and dynamic sequence handling—that together achieve low latency, high throughput, and stable deployment on GPU clusters.
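The caching optimization named above can be sketched as memoizing scores for hot (query, document) pairs so repeated traffic skips the GPU entirely; the scorer below is a hypothetical stand-in for the BERT forward pass:

```python
from functools import lru_cache

CALLS = {"model": 0}

def bert_score(query, doc):
    """Stand-in for the expensive GPU forward pass (hypothetical):
    here just token overlap, so the cache behavior is observable."""
    CALLS["model"] += 1
    return float(len(set(query.split()) & set(doc.split())))

@lru_cache(maxsize=65536)
def cached_score(query, doc):
    # Hot (query, doc) pairs hit the cache and never reach the model.
    return bert_score(query, doc)
```

Because search traffic is heavily head-skewed, even a modest cache can absorb a large fraction of requests before they reach the GPU cluster.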

BERT · GPU · Online Inference
10 min read
iQIYI Technical Product Team
Feb 26, 2021 · Artificial Intelligence

Optimization of Coarse Ranking Models for Short‑Video Recommendation at iQIYI

iQIYI’s short‑video recommendation team replaced a GBDT coarse‑ranking model with a lightweight dual‑tower DNN, applied knowledge distillation, sparse‑aware embedding optimization, and inference merging, then introduced a cascade MMOE architecture, achieving comparable accuracy with half the memory, ~19 ms latency reduction, and measurable gains in watch time, CTR and engagement.

Online Inference · cascade model · coarse ranking
15 min read
JD Tech Talk
Dec 18, 2020 · Artificial Intelligence

Model Online Inference System: Architecture, Components, and Deployment Strategies

This article examines the challenges of moving machine-learning models from offline training to online serving, and proposes a modular architecture (model gateway, data-source gateway, business service center, monitoring, and RPC components) that enables rapid model deployment, version management, traffic mirroring, gray release, and real-time monitoring.
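The gray-release and traffic-mirroring mechanisms above can be sketched with stable hash-based routing plus a fire-and-forget shadow call; version labels and function names here are illustrative assumptions, not the article's API:

```python
import hashlib

def route_model_version(user_id, canary_percent):
    """Gray release: deterministically send a fixed slice of traffic to
    the new model version, keyed on a stable hash of the user id so a
    given user always sees the same version."""
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
    return "v2-canary" if bucket < canary_percent else "v1-stable"

def serve(request, primary_fn, shadow_fn=None):
    """Traffic mirroring: optionally duplicate the request to a shadow
    model for offline comparison; only the primary result is returned
    to the caller (the shadow call would be async in production)."""
    if shadow_fn is not None:
        shadow_fn(request)
    return primary_fn(request)
```

Ramping `canary_percent` from 1 to 100 while comparing primary and shadow outputs gives the gradual, monitored rollout the article describes.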

Deployment · Online Inference · machine learning
10 min read
TAL Education Technology
Dec 17, 2020 · Artificial Intelligence

Web Front‑End Intelligent Computing: Concepts, Implementation, and Applications

This article explains how AI technologies are transitioning from labs to the web, covering neural network fundamentals, the distinction between cloud and edge intelligence, implementation pipelines, offline model optimization, online inference backends like WebGL and WASM, and practical web front‑end AI use cases.

Online Inference · Web AI · frontend
10 min read
Ctrip Technology
Aug 13, 2020 · Artificial Intelligence

Hotel Recommendation System Architecture, Models, and Evaluation at Ctrip

This article presents a comprehensive overview of Ctrip's hotel recommendation system, covering its technical architecture, data processing pipelines, various ranking and embedding models—including FM, Wide&Deep, DeepFM, and FTRL—deployment methods such as PMML and TensorFlow Serving, offline and online evaluation results, and challenges like cold‑start and diversity.
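Of the ranking models listed, FM has a compact closed form worth recalling: y = w0 + Σᵢ wᵢxᵢ + ΣᵢΣⱼ⟨vᵢ,vⱼ⟩xᵢxⱼ, where the pairwise term reduces to an O(k·n) computation. A minimal sketch of that scoring step (plain Python, not Ctrip's implementation):

```python
def fm_score(x, w0, w, V):
    """Factorization Machine score:
        y = w0 + sum_i w[i]*x[i]
            + 0.5 * sum_f [ (sum_i V[i][f]*x[i])^2 - sum_i (V[i][f]*x[i])^2 ]
    which evaluates all pairwise feature interactions in O(k*n)
    instead of O(n^2), with k the latent factor dimension."""
    n, k = len(V), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pair = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pair += s * s - s_sq
    return w0 + linear + 0.5 * pair
```

With two active features and k=1, the pairwise term is simply v₁·v₂·x₁·x₂, which is the interaction the deeper models (DeepFM, Wide&Deep) build on.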

Ctrip · Online Inference · deep learning
24 min read
58 Tech
Dec 20, 2019 · Artificial Intelligence

Deep Learning Platform on Kubernetes: Architecture, Resource Management, Offline Training and Online Inference

The article presents a comprehensive overview of 58.com’s AI platform built on Kubernetes, detailing its layered architecture, resource scheduling, offline training pipelines, debugging environment, distributed TensorFlow/PyTorch training, performance benchmarks, and online inference services, highlighting how the system empowers various business units with scalable AI capabilities.

AI Platform · Kubernetes · Online Inference
11 min read
DataFunTalk
Oct 11, 2019 · Artificial Intelligence

Building an End-to-End Federated Learning Pipeline Production Service with FATE-Flow

This article explains how to construct a high‑elastic, high‑performance end‑to‑end federated learning pipeline—including task scheduling, visual modeling, model management, version control, and online inference—using the FATE‑Flow platform to move from experimental ML to production deployment.

AI · FATE-Flow · Federated Learning
14 min read