Tagged articles

online inference

22 articles · Page 1 of 1

Aug 7, 2025 · Artificial Intelligence

Balancing Personalization and Safety: LaD Model Boosts Real‑Time Search Query Generation

This article presents the LaD model, an end‑to‑end generative and detoxification framework for search query auto‑completion that hierarchically captures long‑ and short‑term user interests, achieves large AB test gains on Kuaishou, and has been accepted at KDD 2025.

KDD2025LaDdetoxification

0 likes · 11 min read

Balancing Personalization and Safety: LaD Model Boosts Real‑Time Search Query Generation

Kuaishou Tech

Jul 17, 2025 · Artificial Intelligence

How DHPS Boosted Online Inference Throughput by 270% with RDMA

This article details the design and evolution of DHPS, Kuaishou's load‑balanced, RDMA‑based high‑performance service architecture, explaining its network, storage, and traffic‑scheduling innovations that deliver over 270% query‑throughput improvement, lower latency, reduced CPU usage, and near‑five‑nine availability for large‑scale AI inference workloads.

RDMAStorage Enginedistributed systems

0 likes · 17 min read

How DHPS Boosted Online Inference Throughput by 270% with RDMA

JD Tech Talk

Feb 20, 2025 · Artificial Intelligence

Multi‑Agent Architecture for an E‑Commerce Business Assistant: Design, Planning, Evaluation, and Sample Generation

The document describes the evolution, design principles, key technologies, online inference workflow, evaluation methods, and sample‑generation techniques of a large‑language‑model‑based multi‑agent system that powers a 24/7 e‑commerce merchant assistant, highlighting its benefits, challenges, and future work.

AI planningLLMReward Model

0 likes · 21 min read

Multi‑Agent Architecture for an E‑Commerce Business Assistant: Design, Planning, Evaluation, and Sample Generation

JD Cloud Developers

Jan 14, 2025 · Artificial Intelligence

How Generative Recommendation Systems Transform E‑Commerce with LLMs

This article explains how large language models reshape recommendation systems by simplifying pipelines, integrating world knowledge, and leveraging scaling laws, and details the engineering steps for deploying generative recall models—including product encoding, user prompting, model training, TensorRT‑LLM optimization, and continuous performance improvements.

LLMRecommendation SystemsTensorRT-LLM

0 likes · 13 min read

How Generative Recommendation Systems Transform E‑Commerce with LLMs

Tencent Advertising Technology

Jan 9, 2025 · Artificial Intelligence

Applying Large Language Models to Search Advertising: End‑to‑End Generative Recall and System Optimizations

This report details how large language models (LLMs) were integrated into Tencent's search advertising pipeline—from early extraction‑distillation experiments in 2023 to a 2024 end‑to‑end generative recall architecture—showing significant improvements in relevance, diversity, and revenue through knowledge injection, supervised fine‑tuning, constrained beam‑search decoding, and high‑performance inference services.

AIBeam SearchLLM

0 likes · 11 min read

Applying Large Language Models to Search Advertising: End‑to‑End Generative Recall and System Optimizations

Meituan Technology Team

Mar 28, 2024 · Artificial Intelligence

Large-Scale Heterogeneous Graph Modeling and GraphET Engine for Meituan Food Delivery Search Advertising

The paper describes how Meituan’s food‑delivery search advertising uses a heterogeneous billion‑node graph and the GraphET engine to boost weak‑supply recall, detailing a progression from fine‑grained modeling to GPT‑enhanced pre‑training, and presenting a scalable training and low‑latency inference architecture that handles hundreds of billions of edges.

GraphETLarge-Scale GraphMeituan

0 likes · 27 min read

Large-Scale Heterogeneous Graph Modeling and GraphET Engine for Meituan Food Delivery Search Advertising

JD Retail Technology

Jan 25, 2024 · Artificial Intelligence

Optimizing High‑Concurrency Online Inference for Recommendation Models with Distributed Heterogeneous Computing and GPU Acceleration

This article describes how JD Retail's advertising technology team tackled the high‑compute demands of modern recommendation models by designing a distributed graph‑partitioned heterogeneous computing framework, introducing TensorBatch request aggregation, leveraging deep‑learning compiler bucketing and asynchronous compilation, and implementing a multi‑stream GPU architecture to dramatically improve online inference throughput and latency.

Deep Learning CompilerDistributed ComputingGPU Acceleration

0 likes · 13 min read

Optimizing High‑Concurrency Online Inference for Recommendation Models with Distributed Heterogeneous Computing and GPU Acceleration

Bilibili Tech

Dec 29, 2023 · Artificial Intelligence

Performance Optimization of Bilibili's Online Inference Service for the Effect Advertising Engine

To cope with soaring traffic on Bilibili’s effect‑advertising engine, the team systematically measured latency, eliminated redundant Redis calls, switched JSON to Protobuf, applied branch‑prediction hints, loop‑unrolling and AVX256 SIMD, introduced object‑pooling and an inverted‑index request format, cutting CPU usage by 21 % and boosting peak throughput 13 %.

C#CPUMemory Management

0 likes · 21 min read

Performance Optimization of Bilibili's Online Inference Service for the Effect Advertising Engine

DataFunTalk

Dec 13, 2022 · Artificial Intelligence

End-to-End Machine Learning Application Using OpenMLDB and Alibaba Cloud MaxCompute

This article demonstrates how to build a complete end-to-end machine-learning workflow for taxi trip duration prediction by integrating OpenMLDB with Alibaba Cloud MaxCompute’s serverless services, covering environment setup, offline data ingestion, feature extraction, model training, deployment, and real-time online inference within 20 ms.

Feature StoreMaxComputeOpenMLDB

0 likes · 13 min read

End-to-End Machine Learning Application Using OpenMLDB and Alibaba Cloud MaxCompute

JD Tech Talk

Nov 24, 2022 · Artificial Intelligence

Design and Implementation of an Online Inference Service for Risk‑Control Algorithms

This article describes the architecture, key features, dynamic deployment, performance optimizations, and real‑world results of a high‑throughput online inference platform that serves deep‑learning models for JD.com’s risk‑control decision engine, achieving near‑hundred‑fold latency improvements.

AIMicroservicesPerformance Optimization

0 likes · 11 min read

Design and Implementation of an Online Inference Service for Risk‑Control Algorithms

NetEase LeiHuo UX Big Data Technology

Jul 14, 2022 · Artificial Intelligence

Evolution of Real‑Time Game Recommendation System at NetEase Leihuo

The article reviews the development of NetEase Leihuo's game recommendation system, covering the shift from offline batch recommendation to real‑time feature engineering and online inference, detailing architecture design, practical experiences, performance optimizations, and future directions such as real‑time training.

AIPerformance Optimizationgame industry

0 likes · 8 min read

Evolution of Real‑Time Game Recommendation System at NetEase Leihuo

NetEase Cloud Music Tech Team

May 19, 2022 · Artificial Intelligence

Performance Evaluation of Cloud Music Online Estimation System on NUMA Architecture

Evaluating the Cloud Music online estimation system on NUMA‑based servers revealed that CPU pinning across both memory nodes dramatically boosts throughput on high‑end 96‑core machines—up to 75% for complex models—while low‑end servers gain only modestly, confirming NUMA‑aware scheduling’s critical role for CPU‑intensive inference workloads.

CPU architecturenumaonline inference

0 likes · 8 min read

Performance Evaluation of Cloud Music Online Estimation System on NUMA Architecture

Alimama Tech

Dec 8, 2021 · Artificial Intelligence

Dual Vector Foil (DVF): Decoupled Index and Model Retrieval for Large-Scale Recall

The Dual Vector Foil (DVF) system decouples index construction from model training by building a post‑training HNSW graph, enabling any complex model to score candidates, which yields a 5.7 % recall boost, cuts latency from ~40 ms to 6.5 ms, and raises QPS over tenfold while simplifying maintenance.

Indexingdual vector foillarge-scale retrieval

0 likes · 27 min read

Dual Vector Foil (DVF): Decoupled Index and Model Retrieval for Large-Scale Recall

360 Smart Cloud

Mar 4, 2021 · Artificial Intelligence

Optimizing BERT Online Service Deployment at 360 Search

This article describes the challenges of deploying a large BERT model as an online service for 360 Search and details engineering optimizations—including framework selection, model quantization, knowledge distillation, stream scheduling, caching, and dynamic sequence handling—that dramatically improve latency, throughput, and resource utilization.

BERTFP16 quantizationGPU Optimization

0 likes · 12 min read

Optimizing BERT Online Service Deployment at 360 Search

360 Tech Engineering

Mar 1, 2021 · Artificial Intelligence

Deploying BERT as an Online Service: Challenges and Optimizations at 360 Search

This article details the engineering challenges of serving a large BERT model in real‑time for 360 Search and describes a series of optimizations—including TensorRT‑based kernel fusion, model quantization, knowledge distillation, multi‑stream execution, caching, and dynamic sequence handling—that together achieve low latency, high throughput, and stable deployment on GPU clusters.

BERTGPUKnowledge Distillation

0 likes · 10 min read

Deploying BERT as an Online Service: Challenges and Optimizations at 360 Search

iQIYI Technical Product Team

Feb 26, 2021 · Artificial Intelligence

Optimization of Coarse Ranking Models for Short‑Video Recommendation at iQIYI

iQIYI’s short‑video recommendation team replaced a GBDT coarse‑ranking model with a lightweight dual‑tower DNN, applied knowledge distillation, sparse‑aware embedding optimization, and inference merging, then introduced a cascade MMOE architecture, achieving comparable accuracy with half the memory, ~19 ms latency reduction, and measurable gains in watch time, CTR and engagement.

Knowledge Distillationcascade modelcoarse ranking

0 likes · 15 min read

Optimization of Coarse Ranking Models for Short‑Video Recommendation at iQIYI

JD Tech Talk

Dec 18, 2020 · Artificial Intelligence

Model Online Inference System: Architecture, Components, and Deployment Strategies

This article examines the challenges of moving machine‑learning models from offline training to online serving, proposes a modular architecture—including model gateway, data source gateway, business service center, monitoring, and RPC components—to enable rapid model deployment, version management, traffic mirroring, gray‑release, and real‑time monitoring.

Monitoringmachine learningmodel serving

0 likes · 10 min read

Model Online Inference System: Architecture, Components, and Deployment Strategies

TAL Education Technology

Dec 17, 2020 · Artificial Intelligence

Web Front‑End Intelligent Computing: Concepts, Implementation, and Applications

This article explains how AI technologies are transitioning from labs to the web, covering neural network fundamentals, the distinction between cloud and edge intelligence, implementation pipelines, offline model optimization, online inference backends like WebGL and WASM, and practical web front‑end AI use cases.

Web AIfrontendmachine learning

0 likes · 10 min read

Web Front‑End Intelligent Computing: Concepts, Implementation, and Applications

Ctrip Technology

Aug 13, 2020 · Artificial Intelligence

Hotel Recommendation System Architecture, Models, and Evaluation at Ctrip

This article presents a comprehensive overview of Ctrip's hotel recommendation system, covering its technical architecture, data processing pipelines, various ranking and embedding models—including FM, Wide&Deep, DeepFM, and FTRL—deployment methods such as PMML and TensorFlow Serving, offline and online evaluation results, and challenges like cold‑start and diversity.

CtripDeep LearningEmbedding

0 likes · 24 min read

Hotel Recommendation System Architecture, Models, and Evaluation at Ctrip

Meituan Technology Team

Jul 16, 2020 · Artificial Intelligence

Augur: An Online Model Inference Framework and Poker Platform for Meituan Search

Meituan’s AI‑driven search combines the Augur online inference framework—offering stateless, distributed feature operators, transformers, and a DSL for rapid, high‑throughput model scoring—with the Poker platform for model training, versioning, and experimentation, together accelerating iteration, improving performance, and enabling advanced model‑as‑feature ensembles.

AI platformSearch Enginefeature engineering

0 likes · 26 min read

Augur: An Online Model Inference Framework and Poker Platform for Meituan Search

58 Tech

Dec 20, 2019 · Artificial Intelligence

Deep Learning Platform on Kubernetes: Architecture, Resource Management, Offline Training and Online Inference

The article presents a comprehensive overview of 58.com’s AI platform built on Kubernetes, detailing its layered architecture, resource scheduling, offline training pipelines, debugging environment, distributed TensorFlow/PyTorch training, performance benchmarks, and online inference services, highlighting how the system empowers various business units with scalable AI capabilities.

KubernetesPyTorchTensorFlow

0 likes · 11 min read

Deep Learning Platform on Kubernetes: Architecture, Resource Management, Offline Training and Online Inference

DataFunTalk

Oct 11, 2019 · Artificial Intelligence

Building an End-to-End Federated Learning Pipeline Production Service with FATE-Flow

This article explains how to construct a high‑elastic, high‑performance end‑to‑end federated learning pipeline—including task scheduling, visual modeling, model management, version control, and online inference—using the FATE‑Flow platform to move from experimental ML to production deployment.

AIFATE-FlowModel Management

0 likes · 14 min read

Building an End-to-End Federated Learning Pipeline Production Service with FATE-Flow