Tag

Inference Acceleration


DataFunTalk
Apr 2, 2025 · Artificial Intelligence

Trends, Applications, and Future Directions of Large Models and Inference Acceleration

This article examines the current state and future prospects of large AI models and inference acceleration, covering technology trends, diverse application scenarios from research to industry, and the challenges and opportunities that lie ahead for intelligent data governance, multimodal agents, and AGI.

AGI · AI · Data Governance
11 min read
Bilibili Tech
Jan 21, 2025 · Artificial Intelligence

Accelerating Large Model Inference: Challenges and Multi‑Level Optimization Strategies

The article outlines how exploding LLM sizes create compute, memory, and latency bottlenecks and proposes a full‑stack solution (operator fusion, high‑performance libraries, quantization, speculative decoding, sharding, continuous batching, PagedAttention, and specialized frameworks such as MindIE‑LLM) to dramatically boost inference throughput and reduce latency, while highlighting future ultra‑low‑bit and heterogeneous‑hardware directions.
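As a concrete illustration of the quantization technique listed above, here is a minimal symmetric per‑tensor INT8 weight quantization sketch in plain numpy; the scheme is a generic one for illustration, not MindIE‑LLM's actual implementation:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: w ≈ scale * q."""
    scale = float(np.abs(w).max()) / 127.0   # map the largest |weight| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
# INT8 storage is 4x smaller than FP32; rounding error is at most half a step.
assert np.abs(w - dequantize(q, scale)).max() <= scale / 2 + 1e-6
```

Production schemes typically use per‑channel scales and calibration data to pick clipping ranges rather than the raw maximum.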

Hardware Optimization · Inference Acceleration · continuous batching
21 min read
360 Tech Engineering
Oct 15, 2024 · Artificial Intelligence

Implementation and Optimization of 360 AI Compute Center: Infrastructure, Network, Kubernetes, and Training/Inference Acceleration

The article details the design and deployment of 360's AI Compute Center, covering GPU server selection, high‑performance networking, Kubernetes‑based cluster management, advanced scheduling, training and inference acceleration techniques, and a comprehensive AI development platform with visualization and fault‑tolerance features.

AI infrastructure · GPU Cluster · Inference Acceleration
21 min read
Xiaohongshu Tech REDtech
Oct 11, 2024 · Artificial Intelligence

Harmonized Speculative Sampling (HASS): Aligning Training and Decoding for Efficient Large Language Model Inference

HASS aligns training and decoding contexts and objectives for speculative sampling, using harmonized objective distillation and multi-step context alignment, achieving 2.81–4.05× speedup and 8%–20% improvement over EAGLE‑2 while preserving generation quality in real-world deployments at Xiaohongshu.
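For readers unfamiliar with speculative sampling, the standard single‑token acceptance rule that methods like HASS and EAGLE build on can be sketched as follows (a toy version of the rejection scheme from the original speculative‑decoding papers, not Xiaohongshu's HASS code):

```python
import numpy as np

def accept_or_resample(p, q, drafted, rng):
    """Standard speculative-sampling rule: keep the draft model's token with
    probability min(1, p[x]/q[x]); otherwise resample from the residual
    distribution max(p - q, 0), renormalized. This preserves the target
    distribution p exactly while letting the cheap draft model do most work."""
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted, True
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # target model's next-token distribution
q = np.array([0.3, 0.5, 0.2])   # draft model's distribution
token, accepted = accept_or_resample(p, q, drafted=1, rng=rng)
```

The speedup comes from verifying several drafted tokens in one target‑model forward pass; HASS additionally aligns the draft model's training context and objective with this decoding procedure.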

AI · HASS · Inference Acceleration
11 min read
Alimama Tech
May 15, 2024 · Artificial Intelligence

EcomXL: Optimizing SDXL for Large‑Scale E‑commerce Image Generation

EcomXL enhances SDXL for large‑scale e‑commerce image generation by leveraging tens of millions of curated images, a two‑stage fine‑tuning with denoising‑weighted distillation and layer‑wise fusion, specialized ControlNets for inpainting and soft‑edge consistency, and the SLAM inference accelerator to achieve sub‑second generation while boosting visual quality and adoption metrics.

AIGC · ControlNet · EcomXL
15 min read
DeWu Technology
May 15, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference: Techniques and Framework Recommendations

Deploying a dedicated inference cluster and applying four key optimizations (FlashAttention‑based attention computation, PagedAttention KV‑cache management, Mixture‑of‑Experts parameter reduction, and tensor parallelism) can accelerate large language model inference by up to 50% for models as large as 70B parameters while cutting deployment costs.
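The KV‑cache idea behind PagedAttention (which additionally manages the cache in fixed‑size pages to avoid memory fragmentation) can be sketched for a single attention head; the dimensions and data here are made up for illustration:

```python
import numpy as np

def attend(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

class KVCache:
    """Append-only cache: each decode step adds one (k, v) pair instead of
    re-projecting the whole sequence from scratch every step."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))
    def append(self, k, v):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])

d = 8
rng = np.random.default_rng(0)
cache = KVCache(d)
for step in range(4):                        # four decode steps
    k, v = rng.normal(size=d), rng.normal(size=d)
    cache.append(k, v)                       # O(1) work per step, not O(t)
    out = attend(rng.normal(size=d), cache.K, cache.V)
assert out.shape == (d,)
```

The cache trades memory for compute, which is why systems page it rather than keep one contiguous buffer per request.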

FlashAttention · Inference Acceleration · Mixture of Experts
17 min read
DataFunTalk
May 8, 2024 · Artificial Intelligence

Intelligent NPCs: Infusing Soul into Game Characters with AI and the Art and Science of Deep Model Inference Acceleration

This talk explores how large‑model AI can give game NPCs personality, outlines the opportunities and challenges of intelligent NPCs, presents a case study of the "Jue Zhi An Nuan" NPC, and discusses future directions, safety compliance, and real‑time multimodal interaction solutions.

AI · Game NPC · Inference Acceleration
3 min read
Tencent Cloud Developer
Dec 12, 2022 · Artificial Intelligence

Performance Optimization of Tencent Cloud OCR Service: Reducing Latency and Improving Throughput

Tencent Cloud’s OCR team cut average response time from 1.8 seconds to under one second and boosted throughput by over 50% by redesigning the model with self‑attention, accelerating inference with a Tensor‑Network accelerator, shrinking RPC payloads, enabling asynchronous logging, and optimizing multi‑region GPU memory utilization.

AI model · Cloud Services · Inference Acceleration
13 min read
DataFunSummit
Jun 14, 2022 · Artificial Intelligence

Practical Acceleration of Deep Model Inference: Case Studies and Optimization Techniques

This talk presents practical methods for accelerating deep model inference, detailing two case studies—text QA and speech QA—along with their technical challenges, and outlines optimization strategies such as model compression, multi‑operator fusion, matrix multiplication tuning, quantization, and dynamic batching.
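Dynamic batching, one of the strategies listed, can be sketched as a simple flush policy; this toy version uses integer ticks instead of wall‑clock timeouts and is not tied to any particular serving framework:

```python
from collections import deque

class DynamicBatcher:
    """Toy dynamic batching: flush when the batch is full or the oldest
    request has waited max_wait ticks. Real servers use wall-clock
    timeouts; integer ticks keep this sketch deterministic."""
    def __init__(self, max_batch=4, max_wait=3):
        self.max_batch, self.max_wait = max_batch, max_wait
        self.queue = deque()                 # (request, arrival_tick)

    def submit(self, request, tick):
        self.queue.append((request, tick))

    def maybe_flush(self, tick):
        if not self.queue:
            return None
        waited = tick - self.queue[0][1]
        if len(self.queue) >= self.max_batch or waited >= self.max_wait:
            n = min(self.max_batch, len(self.queue))
            return [self.queue.popleft()[0] for _ in range(n)]
        return None

b = DynamicBatcher(max_batch=2, max_wait=3)
b.submit("a", tick=0)
assert b.maybe_flush(tick=1) is None         # not full, not timed out yet
b.submit("b", tick=1)
assert b.maybe_flush(tick=1) == ["a", "b"]   # full batch flushes immediately
```

The max_wait bound caps tail latency for lone requests, while the size trigger keeps the accelerator saturated under load.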

Dynamic batching · Inference Acceleration · model compression
12 min read
58 Tech
Jan 10, 2022 · Artificial Intelligence

Resource Utilization Optimization Practices for the 58.com Machine Learning Platform (WPAI)

This article details the 58.com WPAI machine learning platform's architecture and the optimizations applied to training task scheduling, elastic scaling of inference services, and offline/online resource colocation, showing how these techniques significantly improve resource utilization and inference performance in both GPU and CPU environments.

AI · Inference Acceleration · Kubernetes
27 min read
Ctrip Technology
Sep 16, 2021 · Artificial Intelligence

Automated AI Model Optimization Platform for Travel Services

This article describes the design, automated workflow, functional modules, and performance results of a comprehensive AI model optimization platform built for Ctrip's travel business, covering operator libraries, graph optimization, model compression techniques such as distillation, quantization, and pruning, and deployment integration.
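Of the compression techniques mentioned, pruning is the easiest to sketch; below is a generic unstructured magnitude‑pruning example in numpy, for illustration only and not code from Ctrip's platform:

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction `sparsity` of the weights.
    Unstructured pruning like this needs sparse kernels to pay off at
    inference time; structured variants prune whole channels instead."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    pruned = w.copy()
    pruned[np.abs(w) <= threshold] = 0.0
    return pruned

w = np.array([[0.1, -0.9], [0.05, 0.7]])
assert np.count_nonzero(magnitude_prune(w, 0.5)) == 2
```

In practice pruning is followed by fine‑tuning to recover accuracy, which is why platforms automate the prune/retrain loop.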

AI optimization · Inference Acceleration · autoML
16 min read
HomeTech
Sep 4, 2019 · Artificial Intelligence

Accelerating TensorFlow Model Inference with NVIDIA TensorRT: Methods, Experiments, and Results

This article explains how to use NVIDIA TensorRT to accelerate TensorFlow model inference: it describes the TensorRT architecture and optimization techniques such as layer fusion and precision calibration, details the conversion of frozen_graph and saved_model formats, presents the experimental setup and performance comparisons, and summarizes the achieved speed‑up.

Inference Acceleration · TensorFlow · TensorRT
13 min read