Tag: inference optimization

JD Retail Technology
Apr 22, 2025 · Artificial Intelligence

Generative Large‑Model Architecture for JD Advertising: Practices, Challenges, and Optimization

JD’s advertising platform replaces rule‑based recall with a generative large‑model pipeline that unifies e‑commerce knowledge, multimodal user intent, and semantic IDs across recall, coarse ranking, fine ranking, and creative optimization. Quantization, parallelism, caching, and joint generative‑discriminative inference keep latency under 100 ms and cost under ¥1 per million tokens, delivering double‑digit performance gains and paving the way for domain‑specific foundation models.

Distributed Systems · Large Models · advertising
20 min read

58 Tech
Apr 11, 2025 · Artificial Intelligence

Optimization of Multimodal Visual Large Model Inference: Pre‑processing, ViT TensorRT, CUDA Graphs, Tokenization, Prefix Cache, and Quantization

This report details a comprehensive set of optimizations for multimodal visual large‑model (VLM) inference—including image pre‑processing acceleration, TensorRT integration for the ViT module, CUDA‑Graph replay, token‑count reduction, prefix‑cache handling, and weight quantization—demonstrating up to three‑fold throughput gains while maintaining accuracy.
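
For readers unfamiliar with CUDA‑Graph replay, the sketch below shows the general capture‑and‑replay pattern in PyTorch; the `vit` module and shapes are placeholders, not 58 Tech's actual TensorRT‑compiled encoder.

```python
import torch

# Stand-in for a TensorRT-compiled ViT encoder; shapes are illustrative.
vit = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.GELU(), torch.nn.Linear(1024, 1024)
).cuda().eval()

static_input = torch.randn(16, 1024, device="cuda")

# Warm up on a side stream so capture sees steady-state allocations.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        vit(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture one forward pass; subsequent replays skip per-kernel launch overhead.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = vit(static_input)

# Serve a new request: copy into the captured buffer, replay, read the output.
static_input.copy_(torch.randn(16, 1024, device="cuda"))
graph.replay()
print(static_output.shape)
```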

CUDA Graph · TensorRT · Visual Language Model
19 min read

Java Architecture Diary
Mar 7, 2025 · Artificial Intelligence

Boost Inference Efficiency with QwQ-32B: Benchmarks, Resource Savings, and Java Integration

QwQ-32B, Alibaba’s new inference‑optimized large language model built on the Qwen2.5 architecture, outperforms DeepSeek‑R1 across math reasoning, code generation, and safety benchmarks while requiring only 24 GB of VRAM. The article provides detailed performance data, a resource‑efficiency analysis, and step‑by‑step Java and Ollama integration instructions.
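
As a taste of the integration flow, here is a minimal Python call against Ollama's local HTTP API (the article itself walks through Java); the model tag `qwq:32b` is assumed to match a local `ollama pull qwq:32b`.

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwq:32b",                      # assumed local model tag
        "prompt": "How many prime numbers are there below 20?",
        "stream": False,                         # one JSON object, no token stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```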

Function Calling · Java integration · benchmark
7 min read

DeWu Technology
Feb 12, 2025 · Artificial Intelligence

Edge Intelligence for Intelligent Video Cover Recommendation

The article describes an edge‑based video‑cover recommendation system for DeWu that leverages the MNN SDK and a lightweight MobileNetV3 model, performing on‑device inference with quantization and parallel processing to automatically select high‑quality covers, achieving sub‑second latency and boosting click‑through rates by up to 18%.
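
The selection logic can be sketched roughly as follows, using torchvision's MobileNetV3 as a stand‑in for DeWu's MNN‑deployed model; the max‑logit quality score is a hypothetical placeholder for their trained quality head.

```python
import torch
from torchvision.models import mobilenet_v3_small, MobileNet_V3_Small_Weights

weights = MobileNet_V3_Small_Weights.DEFAULT
backbone = mobilenet_v3_small(weights=weights).eval()
preprocess = weights.transforms()

@torch.no_grad()
def pick_cover(frames):
    """frames: list of HxWx3 uint8 tensors (candidate video frames)."""
    best_idx, best_score = 0, float("-inf")
    for i, frame in enumerate(frames):
        x = preprocess(frame.permute(2, 0, 1)).unsqueeze(0)  # HWC -> CHW
        score = backbone(x).max().item()  # proxy score; a real system would
        # use a trained aesthetic/quality head instead of raw class logits
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

frames = [torch.randint(0, 256, (480, 640, 3), dtype=torch.uint8)
          for _ in range(4)]
print("chosen cover:", pick_cover(frames))
```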

Model Deployment · Video Cover · edge AI
12 min read

JD Retail Technology
Feb 12, 2025 · Artificial Intelligence

Accelerating Generative Recommendation with NVIDIA TensorRT‑LLM in JD Advertising

JD Advertising accelerates its generative‑recall recommendation system by integrating NVIDIA TensorRT‑LLM, which simplifies the pipeline, injects LLM knowledge, and scales to billions of parameters, delivering over five‑fold throughput gains at one‑fifth the cost alongside significant CTR improvements in both recommendation and search.
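
For orientation, here is a minimal sketch of serving a generative model through TensorRT‑LLM's high‑level Python LLM API (available in recent releases); the model name is a placeholder, not JD's recall model, and a production deployment builds dedicated engines.

```python
from tensorrt_llm import LLM, SamplingParams

# Builds/loads a TensorRT engine behind the scenes; model name is illustrative.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(max_tokens=32, temperature=0.8, top_p=0.95)

outputs = llm.generate(["Recommend accessories for a new phone:"], params)
for out in outputs:
    print(out.outputs[0].text)
```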

LLM · Recommendation systems · TensorRT-LLM
13 min read

DataFunTalk
Jan 26, 2025 · Artificial Intelligence

58.com’s LingXi Large Language Model Platform: Development, Deployment, and Performance Optimizations

Since the launch of ChatGPT, 58.com has built LingXi, a Model‑as‑a‑Service platform that trains and serves domain‑specific large language models and supports over a hundred internal scenarios with more than ten million inference calls per day. Performance is improved continuously through quantization, GPU optimization, and model miniaturization, powering applications such as interview assistants, voice agents, and RAG‑enabled agents.

AI Platform · AI applications · LLM
9 min read

JD Tech Talk
Jan 14, 2025 · Artificial Intelligence

Advantages and Engineering Implementation of Generative Recommendation Systems Using Large Language Models

This article explains how generative recommendation systems powered by large language models simplify the recommendation pipeline, integrate world knowledge, benefit from scaling laws, and require specialized engineering optimizations such as TensorRT‑LLM deployment, inference acceleration, and hybrid model strategies to achieve low latency and high throughput in real‑world e‑commerce scenarios.

AI · LLM · TensorRT-LLM
10 min read

DataFunSummit
Dec 31, 2024 · Artificial Intelligence

How Momo Leverages Large Model Technology to Transform Business and R&D Processes

This article explains how Momo utilizes large language model technologies to revamp its AI application paradigm, achieve efficient inference through quantization and prefix caching, build a workflow‑based model platform, and outline future plans for framework optimization and multimodal support.
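
Prefix caching, one of the techniques named above, can be illustrated with vLLM's flag for it (vLLM is used here purely as an illustration; the article does not say Momo serves with it):

```python
from vllm import LLM, SamplingParams

# With prefix caching enabled, the KV cache computed for the shared system
# prompt is reused across requests instead of being recomputed each time.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", enable_prefix_caching=True)

system = "You are a customer-service assistant for a social app. "
prompts = [system + q for q in ["How do I reset my password?",
                                "How do I delete my account?"]]
for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
    print(out.outputs[0].text)
```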

AI Platform · MOMO · inference optimization
16 min read

DataFunSummit
Dec 4, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference with the YiNian LLM Framework

This article presents the YiNian LLM framework, detailing how KVCache, prefill/decoding separation, continuous batching, PagedAttention, and multi‑hardware scheduling are used to speed up large language model inference while managing GPU memory and latency.
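
A quick back‑of‑the‑envelope calculation shows why KV‑cache management dominates these designs; the shapes below assume a LLaMA‑2‑7B‑like model (32 layers, 32 KV heads, head dim 128, fp16), not YiNian's actual workloads.

```python
# Per-token cache size is 2 (K and V) x layers x kv_heads x head_dim x bytes.
def kv_cache_bytes(batch, seq_len, layers=32, kv_heads=32, head_dim=128,
                   bytes_per_elem=2):  # fp16
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * batch * seq_len

gib = kv_cache_bytes(batch=32, seq_len=4096) / 2**30
print(f"{gib:.1f} GiB")  # ~64 GiB: why cache blocks must be paged and reclaimed
```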

AI acceleration · GPU · KVCache
20 min read

DataFunSummit
Nov 22, 2024 · Artificial Intelligence

EasyRec Recommendation Algorithm Training and Inference Optimization

This article presents a comprehensive overview of EasyRec’s recommendation system architecture, detailing training and inference optimizations, embedding parallelism, CPU/GPU placement strategies, online learning pipelines, and network compression techniques that together improve scalability, latency, and cost efficiency.
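
Embedding parallelism of this kind can be sketched as hash‑sharded lookup tables; the names below are illustrative, not EasyRec's actual API.

```python
import numpy as np

NUM_WORKERS = 4
EMB_DIM = 16

# Each worker owns a local shard of the huge embedding table: key -> row.
shards = [dict() for _ in range(NUM_WORKERS)]

def owner(key: int) -> int:
    return hash(key) % NUM_WORKERS       # route each feature id to one shard

def lookup(key: int) -> np.ndarray:
    shard = shards[owner(key)]
    if key not in shard:                 # lazily admit unseen ids (online learning)
        shard[key] = np.random.randn(EMB_DIM).astype(np.float32) * 0.01
    return shard[key]

batch_keys = [101, 7, 101, 4242]
emb = np.stack([lookup(k) for k in batch_keys])
print(emb.shape)  # (4, 16); the duplicate key 101 hits the same cached row
```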

Distributed Systems · EasyRec · Training Optimization
15 min read

DataFunSummit
Nov 4, 2024 · Artificial Intelligence

Performance Optimization Techniques for Large Model Inference Frameworks

This article outlines four key optimization areas for large model inference frameworks—quantization, speculative sampling, TTFT/TPOT improvements, and communication optimization—detailing specific techniques, experimental results, and practical benefits such as reduced memory usage, lower latency, and higher throughput.
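
Of the four areas, speculative sampling is the most algorithmic; the sketch below shows the standard accept/reject rule that keeps the target model's distribution exact (a generic formulation, not the article's specific implementation).

```python
import numpy as np

rng = np.random.default_rng(0)

def verify(p: np.ndarray, q: np.ndarray, draft_token: int) -> int:
    """Draft proposes a token from q; accept with prob min(1, p/q)."""
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token              # accepted: target forward amortized
    residual = np.maximum(p - q, 0.0)   # rejected: sample the correction
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)

p = np.array([0.5, 0.3, 0.2])   # target model distribution
q = np.array([0.2, 0.6, 0.2])   # cheap draft model distribution
draft = rng.choice(3, p=q)
print("emitted token:", verify(p, q, draft))
```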

AI · inference optimization · large model
12 min read

Sohu Tech Products
Aug 28, 2024 · Artificial Intelligence

EasyRec Recommendation Algorithm Training and Inference Optimization

EasyRec, Alibaba Cloud’s modular recommendation framework, unifies configurable data, embedding, dense, and output layers on MaxCompute, EMR, and DLC. Training is accelerated with deduplication, EmbeddingParallel sharding, lock‑free hash tables, GPU embeddings, and AMX BF16, while inference benefits from operator fusion, low‑precision AVX/AMX kernels, compact caches, batch merging, and network compression, enabling real‑time online learning and higher recommendation quality at lower cost in e‑commerce.

Alibaba Cloud · EasyRec · Training Optimization
14 min read

DataFunTalk
Aug 26, 2024 · Artificial Intelligence

EasyRec Recommendation Algorithm Training and Inference Optimization

This article presents a comprehensive overview of EasyRec's recommendation system architecture, detailing training and inference optimizations, distributed deployment strategies, operator fusion techniques, online learning pipelines, and network-level improvements to enhance performance and scalability.

AI · Distributed Systems · Training Optimization
15 min read

Baidu Tech Salon
May 15, 2024 · Artificial Intelligence

Accelerating Large Model Training and Inference with Baidu Baige AIAK‑LLM

Baidu Baige’s AIAK‑LLM suite accelerates large‑model training and inference by boosting Model FLOPS Utilization through techniques such as TP communication overlap, hybrid recompute, zero‑offload, automatic parallel‑strategy search, multi‑chip support, and inference‑specific optimizations, achieving over 60% speedup and seamless Hugging Face integration.
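
Model FLOPS Utilization itself is simple arithmetic, as the worked example below shows; the hardware and throughput figures are illustrative, not Baidu's.

```python
# MFU = achieved FLOPs/s over hardware peak, with roughly 6 FLOPs per
# parameter per trained token (2N forward + 4N backward).
def mfu(params: float, tokens_per_sec: float, peak_flops: float) -> float:
    achieved = 6.0 * params * tokens_per_sec
    return achieved / peak_flops

# e.g. a 7B model at 3,000 tokens/s/GPU on a 312 TFLOPs (A100 BF16) card
print(f"MFU = {mfu(7e9, 3000, 312e12):.1%}")  # ~40.4%
```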

AI infrastructure · AIAK-LLM · Baidu Baige
26 min read

Baidu Geek Talk
May 15, 2024 · Artificial Intelligence

Accelerating Large Model Training and Inference with Baidu Baige AIAK‑LLM: Challenges, Techniques, and Optimizations

The talk outlines how Baidu’s Baige AIAK‑LLM suite tackles the exploding compute demands of trillion‑parameter models by boosting Model FLOPS Utilization through advanced parallelism, memory‑saving recompute, zero‑offload, adaptive scheduling, and cross‑chip orchestration, delivering 30–60% training and inference speedups as a unified cloud product.

AI infrastructure · Baidu · MFU
25 min read

iQIYI Technical Product Team
Mar 15, 2024 · Artificial Intelligence

Optimizing GPU Inference for CTR Models: Kernel Fusion, Multi‑Stream Execution, and Batch Merging

By fusing sparse‑feature operators, enabling multi‑stream execution, consolidating data copies, and merging inference batches, iQIYI reduced GPU CTR‑model latency to CPU‑level, boosted throughput more than sixfold, and cut operational costs by over 40%, overcoming kernel‑launch‑overhead bottlenecks.
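
The multi‑stream idea can be sketched in a few lines of PyTorch; the two linear branches stand in for iQIYI's independent sparse‑feature sub‑graphs.

```python
import torch

a = torch.nn.Linear(512, 512).cuda()
b = torch.nn.Linear(512, 512).cuda()
x = torch.randn(256, 512, device="cuda")

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
s1.wait_stream(torch.cuda.current_stream())
s2.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(s1):
    y1 = a(x)          # branch 1 launches on stream 1
with torch.cuda.stream(s2):
    y2 = b(x)          # branch 2 overlaps on stream 2

# Rejoin before any downstream op consumes both results.
torch.cuda.current_stream().wait_stream(s1)
torch.cuda.current_stream().wait_stream(s2)
out = y1 + y2
print(out.shape)
```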

GPU · Kernel Fusion · TensorFlow
10 min read

Baidu Geek Talk
Jan 15, 2024 · Artificial Intelligence

Qianfan Large Model Platform: Making Large Models Accessible - Baidu's Latest Work on Model Fine-tuning and Deployment

Baidu’s Qianfan Large Model Platform provides a one‑stop enterprise solution with 54 pre‑installed models, advanced fine‑tuning, comprehensive evaluation metrics, and optimized deployment that cuts costs up to 90% and boosts throughput 3‑5×, enabling rapid, affordable AI application development.

AI Native Applications · Baidu Qianfan · Large Model Platform
12 min read

Alimama Tech
Nov 2, 2022 · Artificial Intelligence

Optimizing GPU Utilization for Multimedia AI Services with high_service

The article presents high_service, a high‑performance inference framework that boosts GPU utilization in multimedia AI services by separating CPU‑heavy preprocessing from GPU inference and employing priority‑based auto‑scaling, multi‑tenant sharing, and TensorRT‑accelerated models to eliminate GIL bottlenecks, reduce waste, and adapt to fluctuating traffic. Future work targets automated bottleneck detection and further CPU‑GPU offloading.
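
The CPU/GPU decoupling can be sketched with worker processes feeding a single inference loop through a queue (sidestepping the GIL); all names here are illustrative, not high_service's API.

```python
import multiprocessing as mp

def preprocess_worker(raw_q: mp.Queue, ready_q: mp.Queue):
    while True:
        item = raw_q.get()
        if item is None:                    # poison pill shuts the worker down
            break
        ready_q.put([x * 0.5 for x in item])  # stand-in for decode/resize/normalize

def main():
    raw_q, ready_q = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=preprocess_worker, args=(raw_q, ready_q))
               for _ in range(4)]
    for w in workers:
        w.start()
    for i in range(8):                      # enqueue raw "images"
        raw_q.put([float(i)] * 3)
    for _ in range(8):                      # GPU loop consumes ready tensors
        batch = ready_q.get()
        print("infer on", batch)            # model(batch) would run here
    for _ in workers:
        raw_q.put(None)
    for w in workers:
        w.join()

if __name__ == "__main__":
    main()
```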

GPU utilization · High Performance Computing · TensorRT
19 min read

DataFunSummit
Apr 19, 2022 · Artificial Intelligence

DeepSpeed‑MoE: End‑to‑End Training and Inference Solutions for Mixture‑of‑Experts Models

This article reviews DeepSpeed‑MoE, an end‑to‑end system that introduces new MoE architectures, model‑compression techniques, and highly optimized inference pipelines, detailing its motivation, design of PR‑MoE (Pyramid‑MoE and Residual‑MoE), distributed parallel strategies, communication and kernel optimizations, and performance gains over dense baselines.
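
The Residual‑MoE component of PR‑MoE can be sketched as a dense MLP plus a top‑1‑gated expert correction; dimensions and module choices below are illustrative, not DeepSpeed's implementation.

```python
import torch
import torch.nn as nn

class ResidualMoE(nn.Module):
    def __init__(self, d_model=64, num_experts=4):
        super().__init__()
        self.dense = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                   nn.Linear(d_model, d_model))
        self.experts = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(num_experts))
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):                      # x: (tokens, d_model)
        base = self.dense(x)                   # dense path sees every token
        scores = self.gate(x).softmax(dim=-1)
        top_w, top_i = scores.max(dim=-1)      # top-1 expert per token
        correction = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_i == e                  # tokens routed to expert e
            if mask.any():
                correction[mask] = top_w[mask, None] * expert(x[mask])
        return base + correction               # residual combination

print(ResidualMoE()(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```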

AI · DeepSpeed · Mixture of Experts
11 min read

DataFunTalk
Dec 25, 2020 · Artificial Intelligence

Exploring Pretraining Model Optimization and Deployment Challenges in NLP

This article reviews the evolution of pretraining models in NLP, discusses the practical challenges of deploying large models such as inference latency, knowledge integration, and task adaptation, and presents Xiaomi’s optimization techniques including knowledge distillation, low‑precision inference, operator fusion, and multi‑granularity segmentation for dialogue systems.
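
Knowledge distillation, the first of those techniques, reduces to a simple loss; the sketch below shows the standard temperature‑scaled formulation rather than Xiaomi's exact recipe.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # KL divergence between softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                        # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 10, requires_grad=True)   # student logits
t = torch.randn(8, 10)                       # teacher logits (frozen)
y = torch.randint(0, 10, (8,))
print(distill_loss(s, t, y).item())
```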

BERT · NLP · Pretraining
15 min read